From 405f133efb88de526e453ea6a0b370413bb0f83a Mon Sep 17 00:00:00 2001
From: Duncan
Date: Mon, 27 Apr 2020 12:00:51 -0400
Subject: [PATCH] Small fixes, edits, and adding keywords

---
 trackingPaper.tex | 130 ++++++++++++++++++++++++----------------------
 1 file changed, 67 insertions(+), 63 deletions(-)

diff --git a/trackingPaper.tex b/trackingPaper.tex
index 79e40fb..7a38f78 100644
--- a/trackingPaper.tex
+++ b/trackingPaper.tex
@@ -90,6 +90,8 @@
title=\lstname % show the filename of files included with \lstinputlisting; also try caption instead of title
}
+\providecommand{\keywords}[1]{\textbf{\textit{Index terms---}} #1}
+
\ifCLASSINFOpdf
% \usepackage[pdftex]{graphicx}
% declare the path(s) where your graphic files are
@@ -128,12 +130,14 @@
\begin{abstract}
Storage system traces are important for examining real-world applications, studying potential bottlenecks, as well as driving benchmarks in the evaluation of new system designs. While file system traces have been well-studied in earlier work, it has been some time since the last examination of the SMB network file system.
-The purpose of this work is to continue previous SMB studies to better understand the use of the protocol in a real-world production system in use at \textcolor{green}{a major research university}. %\textcolor{red}{the University of Connecticut}.
+The purpose of this work is to continue previous SMB studies to better understand the use of the protocol in a real-world production system at a major research university. %\textcolor{red}{the University of Connecticut}.
The main contribution of our work is the exploration of I/O behavior in modern file system workloads as well as new examinations of the inter-arrival times and run times for I/O events. We further investigate if the recent standard models for traffic remain accurate.
-Our findings reveal interesting data relating to the number of read and write events. We notice that the number of read and write events is significantly less than creates and \textcolor{green}{the} \textcolor{blue}{average number of bytes exchanged per I/O} \textcolor{green}{is much smaller than what has been seen in previous studies}.
+Our findings reveal interesting data relating to the number of read and write events. We notice that the number of read and write events is significantly smaller than the number of creates, and the average number of bytes exchanged per I/O is much smaller than what has been seen in previous studies.
%the average of bytes transferred over the wire is much smaller than what has been seen in previous studies.
Furthermore, we find an increase in the proportion of metadata operations in overall network communication, which can be taken advantage of through the use of smart storage devices.
+\keywords{Server Message Block,
+Network Benchmark, Storage Systems, Distributed I/O, Network Communication Analysis.}
\end{abstract}

\section{Introduction}
@@ -162,7 +166,7 @@
Since an SMB-based trace study has not been undertaken recently, we examined its current implementation and use in a large university network.
%Due to the sensitivity of the captured information, we ensure that all sensitive information is hashed and that the original network captures are not saved.
-Our study is based on network packet traces collected on \textcolor{green}{a major research university}'s
+Our study is based on network packet traces collected on a major research university's
%\textcolor{red}{the University of Connecticut}'s
centralized storage facility over a period of three weeks in May 2019.
This trace-driven analysis can help in the design of future storage products as well as provide data for future performance benchmarks.
%Benchmarks are important for the purpose of developing technologies as well as taking accurate metrics.
The motivation behind this trace capture work is to eventually develop more accurate benchmarks for network protocol evaluation.
@@ -184,7 +188,7 @@ Benchmarks allow for the stress testing of various aspects of a system (e.g. net
% \end{enumerate}
%\end{itemize}
-We created a new tracing system to collect data from the \textcolor{green}{university}
+We created a new tracing system to collect data from the university
%\textcolor{red}{UConn}
storage network system. The tracing system was built around the high-speed PF\_RING packet capture system and required the use of proper hardware and software to handle incoming data%\textcolor{blue}{; however interaction with later third-party code did require re-design for processing of the information}
.
We also created a new trace capture format based on the DataSeries structured data format developed by HP~\cite{DataSeries}.
@@ -233,14 +237,14 @@ This paper & 2020 & SMB & x & Dynamic &
%In this section we discuss previous studies examining traces and testing that has advanced benchmark development.
We summarize major works in trace study in Table~\ref{tbl:studySummary}.
%In addition we examine issues that occur with traces and the assumptions in their study.
-Tracing collection and analysis \textcolor{green}{from previous studies have provided important insights and lessons such as an observations of read/write event changes}, overhead concerns originating in system implementation, bottlenecks in communication, and other revelations found in the traces.
-Previous tracing work has shown that one of the largest and broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account but these generalizations can also be a major source of error \textcolor{blue}{(e.g. timing, accuracy, resource usage)} ~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}.
+Trace collection and analysis from previous studies have provided important insights and lessons, such as observations of read/write event changes, overhead concerns originating in system implementation, bottlenecks in communication, and other revelations found in the traces.
+Previous tracing work has shown that one of the largest and broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account, but these generalizations can also be a major source of error (e.g. timing, accuracy, resource usage)~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}.
To produce a benchmark with high fidelity one needs to understand not only the technology being used but also how it is implemented within the system~\cite{roselli2000comparison,traeger2008nine,ruemmler1992unix}.
All of these aspects contribute to the behavior of the system, from timing and resource elements to how the managing software governs actions~\cite{douceur1999large,malkani2003passive,seltzer2003nfs}.
Furthermore, in pursuing this work one may find unexpected results and learn new things through examination~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}.
These studies are required to evaluate the development of technologies and methodologies, along with furthering knowledge of different system aspects and capabilities. As has been pointed out by past work, the design of systems is usually guided by an understanding of the file system workloads and user behavior~\cite{leung2008measurement}.
%It is for that reason that new studies are constantly performed by the science community, from large scale studies to individual protocol studies~\cite{leung2008measurement,vogels1999file,roselli2000comparison,seltzer2003nfs,anderson2004buttress}.
Even within these studies, the information gleaned is only as meaningful as the considerations of how the data is handled.
%The work done by
-Leung et al.~\cite{leung2008measurement} found \textcolor{green}{that}
+Leung et al.~\cite{leung2008measurement} found that
%observations related to the infrequency of files to be shared by more than one client.
over 67\% of files were never opened by more than one client,
%Work by Leung \textit{et al.} led to a series of observations, from the fact that files are rarely re-opened to finding
and that read-write access patterns are more frequent~\cite{leung2008measurement}.
%Concerns of the accuracy achieved of the trace data was due to using standard system calls as well as errors in issuing I/Os leading to substantial I/O statistical errors.
% Anderson Paper
%The 2004 paper by
-Anderson et al.~~\cite{anderson2004buttress} \textcolor{green}{found that a }
+Anderson et al.~\cite{anderson2004buttress} found that a
%has the following observations. A
source of decreased precision came from the kernel overhead for providing timestamp resolution. This would introduce substantial errors in the observed system metrics due to the use of inaccurate tools when benchmarking I/O systems. These errors in perceived I/O response times can range from +350\% to -15\%.
%I/O benchmarking widespread practice in storage industry and serves as basis for purchasing decisions, performance tuning studies and marketing campaigns.
@@ -257,7 +261,7 @@ Issues of inaccuracies in scheduling I/O can result in as much as a factor 3.5 d
Orosz and Skopko examined the effect of the kernel on packet loss~\cite{Orosz2013}.
%in their 2013 paper~\cite{Orosz2013}.
Their work
-showed that when taking network measurements the precision of the timestamping of packets is a more important criterion than low clock offset, especially when measuring packet inter-arrival times and round-trip delays at a single point of the network. One \textcolor{blue}{solution for network capture is the tool Dumpcap. However the} concern \textcolor{blue}{with} Dumpcap is \textcolor{blue}{that it is a} single threaded application and was suspected to be unable to handle new arriving packets due to \textcolor{green}{the} small size of the kernel buffer. Work by
+showed that when taking network measurements the precision of the timestamping of packets is a more important criterion than low clock offset, especially when measuring packet inter-arrival times and round-trip delays at a single point of the network. One solution for network capture is the tool Dumpcap. However, the concern with Dumpcap is that it is a single-threaded application and was suspected to be unable to handle newly arriving packets due to the small size of the kernel buffer.
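+To make this capture-loss concern concrete, the following is a minimal sketch of how a capture with an enlarged kernel buffer and a ring buffer of fixed-size files might be launched. The interface name, buffer sizes, and output path are hypothetical; the flags themselves are standard \texttt{tshark} options.
+\begin{lstlisting}[language=Python]
+import subprocess
+
+# Hypothetical sketch: a larger kernel capture buffer (-B, in MiB)
+# reduces drops from a single-threaded reader, while a ring buffer
+# of fixed-size files (-b filesize, in kB) bounds disk usage.
+capture_cmd = [
+    "tshark",
+    "-i", "eth0",            # capture interface (hypothetical)
+    "-f", "tcp port 445",    # BPF filter for SMB/SMB2 traffic
+    "-B", "512",             # kernel capture buffer size in MiB
+    "-b", "filesize:64000",  # rotate capture files at 64000 kB
+    "-w", "smb_trace.pcap",  # base name for the ring-buffer files
+]
+subprocess.run(capture_cmd, check=True)
+\end{lstlisting}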
Work by
Dabir and Matrawy%, in 2008
~\cite{dabir2007bottleneck} attempted to overcome this limitation by using two semaphores to buffer incoming strings and improve the writing of packet information to disk.
%Narayan and Chandy examined the concerns of distributed I/O and the different models of parallel application I/O.
@@ -266,7 +270,7 @@
%Observations from Skopk\'o
%in a 2012 paper
-~\cite{skopko2012loss} examined the concerns of software based capture solutions \textcolor{green}{and observed that }
+Skopk\'o~\cite{skopko2012loss} examined the concerns of software-based capture solutions and observed that
%. The main observation was
software solutions relied heavily on OS packet processing mechanisms. Furthermore, depending on the mode of operation (e.g. interrupt or polling), the timestamping of packets would change.
@@ -328,12 +332,12 @@ Some nuances of the SMB protocol I/O to note are that SMB/SMB2 write requests ar
% and on the python dissection code we wrote for performing traffic analysis.
-\subsection{\textcolor{green}{University Storage} System Overview}
-We collected traces from \textcolor{green}{the university}
+\subsection{University Storage System Overview}
+We collected traces from the university
%\textcolor{red}{the University of Connecticut University Information Technology Services (UITS)}
centralized storage server%The \textcolor{red}{UITS system}
-\textcolor{green}{, which} consists of five Microsoft file server cluster nodes. These blade servers are used to host SMB file shares for various departments at
-\textcolor{green}{the university}
+, which consists of five Microsoft file server cluster nodes. These blade servers are used to host SMB file shares for various departments at
+the university
%\textcolor{red}{UConn}
as well as personal drive share space for faculty, staff and students, along with at least one small group of users. Each server is capable of handling 1~Gb/s of traffic in each direction (i.e. outbound and inbound traffic). Altogether, the five-blade server system can in theory handle 5~Gb/s of data traffic in each direction.
%Some of these blade servers have local storage but the majority do not have any.
@@ -362,7 +366,7 @@ The filesize used was in a ring buffer where each file captured was 64000 kB.
The \texttt{.pcap} files from \texttt{tshark} do not lend themselves to easy data analysis, so we translate these files into the DataSeries~\cite{DataSeries} format, an XML-based structured data format designed to be self-descriptive, storage and access efficient, and highly flexible.
%The system for taking captured \texttt{.pcap} files and writing them into the DataSeries format (i.e. \texttt{.ds}) does so by first creating a structure (based on a pre-written determination of the data desired to capture). Once the code builds this structure, it then reads through the capture traffic packets while dissecting and filling in the prepared structure with the desired information and format.
-Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only that information which illuminates the behavior of the clients and servers that interact over the network (i.e. I/O transactions). It should also be noted that all sensitive information being captured by the tracing system is hashed to protect the users whose information is examined by the tracing system.
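+As an illustration of this anonymization step, the sketch below shows one way a record could be sanitized before being written out: sensitive names are one-way hashed and the packet payload is truncated. The record layout and field names are hypothetical, not the actual DataSeries extent schema.
+\begin{lstlisting}[language=Python]
+import hashlib
+
+SMB_CAPTURE_LIMIT = 512  # retain only the first 512 bytes per packet
+
+def anonymize(field: str) -> str:
+    # One-way hash so usernames, share names, and file names are
+    # never stored in the clear.
+    return hashlib.sha256(field.encode("utf-8")).hexdigest()
+
+def to_record(packet_bytes: bytes, filename: str, username: str) -> dict:
+    # Hypothetical record layout; it only illustrates the privacy
+    # policy of hashing names and keeping the SMB header region.
+    return {
+        "filename_hash": anonymize(filename),
+        "user_hash": anonymize(username),
+        "header_bytes": packet_bytes[:SMB_CAPTURE_LIMIT],
+    }
+\end{lstlisting}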
Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet - enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure \textcolor{green}{the privacy} of the university's network communications. \textcolor{blue}{The reasoning for this limit was to allow for capture of longer SMB AndX message chains due to negotiated \textit{MaxBufferSize}.} It is worth noting that in the case of larger SMB headers, some information is lost, however this is a trade-off by the university to provide, on average, the correct sized SMB header but does lead to scenarios where some information may be captured incompletely. \textcolor{blue}{This scenario only occurs in the cases of large AndX Chains in the SMB protocol, since the SMB header for SMB 2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header with the rest of the AndX Chain attached in a series of block pairs.}
+Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only that information which illuminates the behavior of the clients and servers that interact over the network (i.e. I/O transactions). It should also be noted that all sensitive information captured by the tracing system is hashed to protect the users whose information is examined. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet - enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure the privacy of the university's network communications. The reasoning for this limit was to allow for the capture of longer SMB AndX message chains due to the negotiated \textit{MaxBufferSize}. It is worth noting that in the case of larger SMB headers, some information is lost; however, this is a trade-off made by the university to provide, on average, a correctly sized SMB header, though it does lead to scenarios where some information may be captured incompletely. This scenario only occurs in the case of large AndX Chains in the SMB protocol, since the SMB header for SMB2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header, with the rest of the AndX Chain attached in a series of block pairs.

\subsection{DataSeries Analysis}
@@ -429,15 +433,15 @@ The latter two relate to metadata information of shares and files accessed.
Howe
\begin{tabular}{|l|c|c|c|}
\hline
I/O Operation & SMB & SMB2 & Both \\ \hline
-General Operations & 2418980 & 208286887 & 210705867 \\
+General Operations & 2,418,980 & 208,286,887 & 210,705,867 \\
General \% & 99.91\% & 74.66\% & 74.87\% \\ %\hline
-Create Operations & 0 & 54486043 & 54486043 \\
+Create Operations & 0 & 54,486,043 & 54,486,043 \\
Create \% & 0.00\% & 19.53\% & 19.36\% \\
-Read Operations & 1931 & 8353626 & 8355557 \\
+Read Operations & 1,931 & 8,353,626 & 8,355,557 \\
Read \% & 0.08\% & 2.99\% & 2.97\% \\
-Write Operations & 303 & 7871916 & 7872219 \\
+Write Operations & 303 & 7,871,916 & 7,872,219 \\
Write \% & 0.01\% & 2.82\% & 2.80\% \\ \hline
-Combine Protocol Operations & 2421214 & 278998472 & 281419686 \\
+Combined Protocol Operations & 2,421,214 & 278,998,472 & 281,419,686 \\
Combined Protocols \% & 0.86\% & 99.14\% & 100\% \\ \hline
\end{tabular}
\caption{\label{tbl:SMBCommands}Percentage of SMB and SMB2 Protocol Commands on March 15th}
@@ -446,23 +450,23 @@
\begin{table}[]
\centering
\begin{tabular}{|l|c|c|c|}
-\hline \hline
+\hline
SMB2 General Operation & \multicolumn{2}{|c|}{Occurrences} & Percentage of Total \\ \hline
-Close & \multicolumn{2}{|c|}{80114256} & 28.71\% \\
-Tree Connect & \multicolumn{2}{|c|}{48414491} & 17.35\% \\
-Query Info & \multicolumn{2}{|c|}{27155528} & 9.73\% \\
-Negotiate & \multicolumn{2}{|c|}{25276447} & 9.06\% \\
-Tree Disconnect & \multicolumn{2}{|c|}{9773361} & 3.5\% \\
-IOCtl & \multicolumn{2}{|c|}{4475494} & 1.6\% \\
-Set Info & \multicolumn{2}{|c|}{4447218} & 1.59\% \\
-Query Directory & \multicolumn{2}{|c|}{3443491} & 1.23\% \\
-Session Setup & \multicolumn{2}{|c|}{2041208} & 0.73\%\\
-Lock & \multicolumn{2}{|c|}{1389250} & 0.5\% \\
-Flush & \multicolumn{2}{|c|}{972790} & 0.35\% \\
-Change Notify & \multicolumn{2}{|c|}{612850} & 0.22\% \\
-Logoff & \multicolumn{2}{|c|}{143592} & 0.05\% \\
-Oplock Break & \multicolumn{2}{|c|}{22397} & 0.008\% \\
-Echo & \multicolumn{2}{|c|}{4715} & 0.002\% \\
+Close & \multicolumn{2}{|c|}{80,114,256} & 28.71\% \\
+Tree Connect & \multicolumn{2}{|c|}{48,414,491} & 17.35\% \\
+Query Info & \multicolumn{2}{|c|}{27,155,528} & 9.73\% \\
+Negotiate & \multicolumn{2}{|c|}{25,276,447} & 9.06\% \\
+Tree Disconnect & \multicolumn{2}{|c|}{9,773,361} & 3.5\% \\
+IOCtl & \multicolumn{2}{|c|}{4,475,494} & 1.6\% \\
+Set Info & \multicolumn{2}{|c|}{4,447,218} & 1.59\% \\
+Query Directory & \multicolumn{2}{|c|}{3,443,491} & 1.23\% \\
+Session Setup & \multicolumn{2}{|c|}{2,041,208} & 0.73\% \\
+Lock & \multicolumn{2}{|c|}{1,389,250} & 0.5\% \\
+Flush & \multicolumn{2}{|c|}{972,790} & 0.35\% \\
+Change Notify & \multicolumn{2}{|c|}{612,850} & 0.22\% \\
+Logoff & \multicolumn{2}{|c|}{143,592} & 0.05\% \\
+Oplock Break & \multicolumn{2}{|c|}{22,397} & 0.008\% \\
+Echo & \multicolumn{2}{|c|}{4,715} & 0.002\% \\
Cancel & \multicolumn{2}{|c|}{0} & 0.00\% \\ \hline
\end{tabular}
@@ -486,10 +490,10 @@
%\end{figure}
Each SMB Read and Write command is associated with a data request size that indicates how many bytes are to be read or written as part of that command. Figure~\ref{fig:SMB-Bytes-IO}
%and~\ref{fig:PDF-Bytes-Write}
-shows the probability density function (PDF) of the different sizes of bytes transferred for read and write I/O operations respectively. The most noticeable aspect of these graphs are that the majority of bytes transferred for read and write operations is around 64 bytes.
It is worth noting that write I/Os also have a larger number of very small transfer amounts. This is unexpected in terms of the amount of data passed in a frame. \textcolor{green}{Part of the reason} is due to a large number of long term %calculations/
-scripts that only require small but frequent updates, \textcolor{green}{as we observed several}
+shows the probability density function (PDF) of the different sizes of bytes transferred for read and write I/O operations, respectively. The most noticeable aspect of these graphs is that the majority of bytes transferred for read and write operations is around 64 bytes. It is worth noting that write I/Os also have a larger number of very small transfer amounts. This is unexpected in terms of the amount of data passed in a frame. Part of the reason is a large number of long-term %calculations/
+scripts that only require small but frequent updates, as we observed several
%. This assumption was later validated in part when examining the files transferred, as some were related to
-running scripts creating a large volume of files. \textcolor{green}{A more significant reason was because we noticed} Microsoft Word would perform a large number of small reads at ever growing offsets. This was interpreted as when a user is viewing a document over the network and Word would load the next few lines of text as the user scrolled down the document; causing ``loading times'' amid use. \textcolor{green}{Finally,} a large degree of small writes were observed to be related to application cookies or other such smaller data communications.
+running scripts creating a large volume of files. A more significant reason is that we noticed Microsoft Word would perform a large number of small reads at ever-growing offsets. We interpreted this as a user viewing a document over the network, with Word loading the next few lines of text as the user scrolled down the document, causing ``loading times'' during use. Finally, a large number of small writes were observed to be related to application cookies or other similarly small data communications.
%This could also be attributed to simple reads relating to metadata\textcolor{red}{???}
%\textcolor{blue}{Reviewing of the SMB and SMB2 leads to some confusion in understanding this behavior. According to the specification the default ``MaxBuffSize'' for reads and writes should be between 4,356 bytes and 16,644 bytes depending on the use of either a client version of server version of Windows; respectively. In the SMB2 protocol specification, specific version of Windows (e.g. Vista SP1, Server 2008, 7, Server 2008 R2, 8, Server 2012, 8.1, Server 2012 R2) disconnect if the ``MaxReadSize''/``MaxWriteSize'' value is less than 4096. However, further examination of the specification states that for SMB2 the read length and write length can be zero. Thus, this seems to conflict that the size has to be greater than 4096 but allows for it to also be zero. It is due to this protocol specification of allowing zero that supports the smaller read/write sizes seen in the captured traffic. The author's assumption here is that the university's configuration allows for smaller traffic to be exchanged without the disconnection for sizes smaller than 4096.}
@@ -680,8 +684,8 @@ We also observe that the number of writes is very close to the number of reads.
\hline
 & Reads & Writes & Creates & General \\ \hline
I/O \% & 2.97 & \multicolumn{1}{r|}{2.80} & \multicolumn{1}{r|}{19.36} & \multicolumn{1}{r|}{74.87} \\ \hline
-Avg RT ($\mu$s) & 59819.7 & \multicolumn{1}{r|}{519.7} & \multicolumn{1}{r|}{698.1} & \multicolumn{1}{r|}{7013.4} \\ \hline
-Avg IAT ($\mu$s) & 33220.8 & \multicolumn{1}{r|}{35260.4} & \multicolumn{1}{r|}{5094.5} & \multicolumn{1}{r|}{1317.4} \\ \hline
+Avg RT ($\mu$s) & 59,819.7 & \multicolumn{1}{r|}{519.7} & \multicolumn{1}{r|}{698.1} & \multicolumn{1}{r|}{7,013.4} \\ \hline
+Avg IAT ($\mu$s) & 33,220.8 & \multicolumn{1}{r|}{35,260.4} & \multicolumn{1}{r|}{5,094.5} & \multicolumn{1}{r|}{1,317.4} \\ \hline
%\hline
%Total RT (s) & 224248 & \multicolumn{1}{l|}{41100} & \multicolumn{1}{l|}{342251} & \multicolumn{1}{l|}{131495} \\ \hline
%\% Total RT & 30.34\% & \multicolumn{1}{l|}{5.56\%} & \multicolumn{1}{l|}{46.3\%} & \multicolumn{1}{l|}{17.79\%} \\ \hline
@@ -908,7 +912,7 @@ Originally we expected that these common file extensions would be a much larger
Furthermore, the majority of extensions are not readily identified. Upon closer examination of the tracing system it was determined that
%these file extensions are an artifact of how Windows interprets file extensions. The Windows operating system merely guesses the file type based on the assumed extension (e.g. whatever characters follow after the final `.').
-many files simply do not have a valid extension. These range from :inux-based library files, man pages, odd naming schemes as part of scripts or back-up files, as well as date-times and IPs as file names. There are undoubtedly more, but exhaustive determination of all variations is seen as out of scope for this work.
+many files simply do not have a valid extension. These include Linux-based library files, man pages, odd naming schemes used by scripts or back-up files, as well as date-times and IPs used as file names. There are undoubtedly more, but an exhaustive determination of all variations is out of scope for this work.
%\textcolor{red}{Add in information stating that the type of OS in use in the university environment range from Windows, Unix, BSD, as well as other odd operating systems used by the engineering department.}
@@ -917,16 +921,16 @@ many files simply do not have a valid extension.
These range from :inux-based l
\begin{tabular}{|l|l|l|}
\hline
SMB2 Filename Extension & Occurrences & Percentage of Total \\ \hline
--Travel & 33396147 & 15.26 \\
-o & 28670784 & 13.1 \\
-e & 28606421 & 13.07 \\
-N & 27639457 & 12.63 \\
-one & 27615505 & 12.62 \\
-\textless{}No Extension\textgreater{} & 27613845 & 12.62 \\
-d & 2799799 & 1.28 \\
-l & 2321338 & 1.06 \\
-x & 2108279 & 0.96 \\
-h & 2019714 & 0.92 \\ \hline
+-Travel & 33,396,147 & 15.26 \\
+o & 28,670,784 & 13.1 \\
+e & 28,606,421 & 13.07 \\
+N & 27,639,457 & 12.63 \\
+one & 27,615,505 & 12.62 \\
+\textless{}No Extension\textgreater{} & 27,613,845 & 12.62 \\
+d & 2,799,799 & 1.28 \\
+l & 2,321,338 & 1.06 \\
+x & 2,108,279 & 0.96 \\
+h & 2,019,714 & 0.92 \\ \hline
\end{tabular}
\caption{Top 10 File Extensions Seen Over Three Week Period}
\label{tab:top10SMB2FileExts}
@@ -937,16 +941,16 @@
\begin{tabular}{|l|l|l|}
\hline
SMB2 Filename Extension & Occurrences & Percentage of Total \\ \hline
-doc & 352958 & 0.16 \\
-docx & 291047 & 0.13 \\
-ppt & 46706 & 0.02 \\
-pptx & 38604 & 0.02 \\
-xls & 218031 & 0.1 \\
-xlsx & 180676 & 0.08 \\
odt & 28 & 0.000013 \\
-pdf & 375601 & 0.17 \\
-xml & 1192840 & 0.54 \\
-txt & 167827 & 0.08 \\ \hline
+doc & 352,958 & 0.16 \\
+docx & 291,047 & 0.13 \\
+ppt & 46,706 & 0.02 \\
+pptx & 38,604 & 0.02 \\
+xls & 218,031 & 0.1 \\
+xlsx & 180,676 & 0.08 \\
odt & 28 & 0.000013 \\
+pdf & 375,601 & 0.17 \\
+xml & 1,192,840 & 0.54 \\
+txt & 167,827 & 0.08 \\ \hline
\end{tabular}
\caption{Common File Extensions Seen Over Three Week Period}
\label{tab:commonSMB2FileExts}
@@ -1026,7 +1030,7 @@ $\mu$, $\sigma$, $k$, and $\lambda$ Values for Curve Fitting Equations on CDF Gr
%Examination of the Response Time (RT) and Inter Arrival Times (IAT) revealed the speed and frequency with which metadata operations are performed, as well as the infrequency of individual users and sessions to interact with a given share.
%% NEED: Run the matlab curve fitting to complete this section of the writing
-Our comparison of the existing standard use of a exponential distribution to model network interarrival and response times is still valid. One should notice that the Gaussian distributions
+Our comparison of the existing standard use of an exponential distribution to model network inter-arrival and response times is still valid. One should notice that the Gaussian distributions
% had better $R^2$ result than the exponential equivalent for write operations. This is not surprising due to the step-function shape of the Figure~\ref{fig:CDF-RT-Write} CDF. Examining the $R^2$ results for the read + write I/O operations we find that the exponential distribution is far more accurate at modeling this combined behavior.
for write and create operations are similar, while those for read operations are not. Furthermore, there is less similarity between the modeled behavior of general operation inter-arrival times and their response times, showing the need for a more refined model for each aspect of the network filesystem interactions. One should also notice that the general operation model is more similar to that of the creates.
@@ -1072,7 +1076,7 @@ Because the data is being collected from an active network, there will be differ
%Normally, one could simply re-perform the conversion process to a DataSeries file, but due to the rate of the packets being captured and security concerns of the data being captured, we are unable to re-run any captured information.
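+For reference, the model CDFs discussed in this section take the following standard forms. The exponential and Gaussian forms follow directly from the text; pairing the remaining $k$ and $\lambda$ parameters with a Weibull form is our assumption based on the parameter list in the table caption, not something the text states.
+\[
+F_{\mathrm{exp}}(x) = 1 - e^{-\lambda x},\quad
+F_{\mathrm{norm}}(x) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right],\quad
+F_{\mathrm{weib}}(x) = 1 - e^{-(x/\lambda)^{k}}.
+\]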
\section{Conclusions and Future Work}
-Our analysis of this university network filesystem illustrated the current implementation and use of the CIFS/SMB protocol in a large academic setting. We notice the effect of caches on the ability of the filesystem to limit the number of accesses to persistant storage. The effect of enterprise storage disks access time can be seen in the response time for read and write I/O. Metadata operations dominate the neThe majority of network communication is dominated by metadata operation, which is of less surprise since SMB is a known chatty protocol. We do notice that the CIFS/SMB protocol continues to be chatty with metadata I/O operations regardless of the version of SMB being implemented; $74.66$\% of I/O being metadata operations for SMB2.
+Our analysis of this university network filesystem illustrated the current implementation and use of the CIFS/SMB protocol in a large academic setting. We notice the effect of caches on the ability of the filesystem to limit the number of accesses to persistent storage. The effect of enterprise storage disk access times can be seen in the response times for read and write I/O. Metadata operations dominate the majority of network communication, which is less of a surprise since SMB is known to be a chatty protocol. We do notice that the CIFS/SMB protocol continues to be chatty with metadata I/O operations regardless of the version of SMB being implemented; $74.66$\% of I/O operations are metadata operations for SMB2.
We also find that read and write transfer sizes are significantly smaller than would be expected, which requires further study as to the impact on current storage systems.
%operations happen in greater number than write operations (at a ratio of 1.06) and the size of their transfers are is also greater by a factor of about 2.
%However, the average write operation includes a larger number of relatively smaller writes.