Commit

Additional edits to writing
Duncan committed Apr 20, 2020
1 parent e4c688a commit 7658bbb
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions trackingPaper.tex
@@ -131,7 +131,9 @@ While file system traces have been well-studied in earlier work, it has been som
The purpose of this work is to continue previous SMB studies to better understand the use of the protocol in a real-world production system in use at the University of Connecticut.
The main contribution of our work is the exploration of I/O behavior in modern file system workloads as well as new examinations of the inter-arrival times and run times for I/O events.
We further investigate whether the recent standard models for traffic remain accurate.
Our findings reveal interesting data relating to the number of read and write events. We notice that the number of read and write events is significantly lower than the number of creates and that the average number of bytes transferred over the wire is much smaller than what has been seen in previous studies. Furthermore, we find an increase in the use of metadata for overall network communication that can be taken advantage of through the use of smart storage devices.
Our findings reveal interesting data relating to the number of read and write events. We notice that the number of read and write events is significantly lower than the number of creates and \textcolor{blue}{that the average number of bytes exchanged per I/O has decreased.}
%the average of bytes transferred over the wire is much smaller than what has been seen in previous studies.
Furthermore, we find an increase in the use of metadata for overall network communication that can be taken advantage of through the use of smart storage devices.
\end{abstract}

\section{Introduction}
@@ -193,6 +195,8 @@ The DataSeries data format allowed us to create data analysis code that focuses
%Focus should be about analysis and new traces
The contributions of this work are the new traces of SMB traffic over a larger university network as well as new analysis of this traffic. Our new examination of the captured data reveals that despite the streamlining of the CIFS/SMB protocol to be less ``chatty'', the majority of SMB communication is still metadata-based I/O rather than actual data I/O. We found that read operations occur in greater numbers and cause a larger overall number of bytes to pass over the network. Additionally, the average number of bytes transferred for each write I/O is smaller than that of the average read operation. We also find that the current standard for modeling network I/O holds for the majority of operations, while a more representative model needs to be developed for reads.

\textcolor{red}{Add information about releasing the code?}

\subsection{Related Work}
In this section, we discuss previous studies examining traces and testing that have advanced benchmark development. We summarize major works in trace study in Table~\ref{tbl:studySummary}. In addition, we examine issues that occur with traces and the assumptions in their study.
\begin{table*}[]
@@ -223,7 +227,8 @@ This paper & 2020 & SMB & x & Dynamic &
\end{table*}
\label{Previous Advances Due to Testing}
Trace collection and analysis have proved their worth over time: previous studies yielded important lessons, including changes in the behavior of read/write events, overhead concerns originating in system implementation, bottlenecks in communication, and other revelations found in the traces. \\
Previous tracing work has shown that one of the largest and broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account, but these generalizations can also be a major source of error~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}. To produce a benchmark with high fidelity, one needs to understand not only the technology being used but also how it is being implemented within the system~\cite{roselli2000comparison,traeger2008nine,ruemmler1992unix}. All of these aspects contribute to the behavior of the system, from timing and resource elements to how the managing software governs actions~\cite{douceur1999large,malkani2003passive,seltzer2003nfs}. Furthermore, in pursuing this work one may find unexpected results and learn new things through examination~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}. \\
Previous tracing work has shown that one of the largest and broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account, but these generalizations can also be a major source of error \textcolor{blue}{(e.g., timing, accuracy, resource usage)}~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}.
To produce a benchmark with high fidelity, one needs to understand not only the technology being used but also how it is being implemented within the system~\cite{roselli2000comparison,traeger2008nine,ruemmler1992unix}. All of these aspects contribute to the behavior of the system, from timing and resource elements to how the managing software governs actions~\cite{douceur1999large,malkani2003passive,seltzer2003nfs}. Furthermore, in pursuing this work one may find unexpected results and learn new things through examination~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}. \\
These studies are required in order to evaluate the development of technologies and methodologies along with furthering knowledge of different system aspects and capabilities. As has been pointed out by past work, the design of systems is usually guided by an understanding of the file system workloads and user behavior~\cite{leung2008measurement}. It is for that reason that new studies are constantly performed by the science community, from large scale studies to individual protocol studies~\cite{leung2008measurement,vogels1999file,roselli2000comparison,seltzer2003nfs,anderson2004buttress}. Even within these studies, the information gleaned is only as meaningful as the considerations of how the data is handled.

Leung et al.~\cite{leung2008measurement} observed that files are rarely shared by more than one client: over 67\% of files were never opened by more than one client.
@@ -235,7 +240,7 @@ The 2004 paper by Anderson et al.~~\cite{anderson2004buttress} has the following
%I/O benchmarking widespread practice in storage industry and serves as basis for purchasing decisions, performance tuning studies and marketing campaigns.
Inaccuracies in scheduling I/O can result in as much as a factor of 3.5 difference in measured response time and a factor of 26 in measured queue sizes. These inaccuracies are too large to ignore.

Orosz and Skopko examined the effect of the kernel on packet loss in their 2013 paper~\cite{Orosz2013}. Their work showed that, when taking network measurements, the precision of packet timestamping is a more important criterion than low clock offset, especially when measuring packet inter-arrival times and round-trip delays at a single point of the network. One concern is that Dumpcap is a single-threaded application and was suspected to be unable to handle newly arriving packets due to the small size of the kernel buffer. Work by Dabir and Matrawy in 2008~\cite{dabir2007bottleneck} attempted to overcome this limitation by using two semaphores to buffer incoming strings and improve the writing of packet information to disk.
Orosz and Skopko examined the effect of the kernel on packet loss in their 2013 paper~\cite{Orosz2013}. Their work showed that, when taking network measurements, the precision of packet timestamping is a more important criterion than low clock offset, especially when measuring packet inter-arrival times and round-trip delays at a single point of the network. One \textcolor{blue}{solution for network capture is the tool Dumpcap; however, the} concern \textcolor{blue}{with} Dumpcap is \textcolor{blue}{that it is a} single-threaded application and was suspected to be unable to handle newly arriving packets due to the small size of the kernel buffer. Work by Dabir and Matrawy in 2008~\cite{dabir2007bottleneck} attempted to overcome this limitation by using two semaphores to buffer incoming strings and improve the writing of packet information to disk.

Narayan and Chandy examined the concerns of distributed I/O and the different models of parallel application I/O.
%There are five major models of parallel application I/O. (1) Single output file shared by multiple nodes. (2) Large sequential reads by a single node at the beginning of computation and large sequential writes by a single node at the end of computation. (3) Checkpointing of states. (4) Metadata and read intensive (e.g. small data I/O and frequent directory lookups for reads).
@@ -325,7 +330,7 @@ The filesize used was in a ring buffer where each file captured was 64000 kB.

The \texttt{.pcap} files from \texttt{tshark} do not lend themselves to easy data analysis, so we translate these files into the DataSeries~\cite{DataSeries} format. HP developed DataSeries, an XML-based structured data format designed to be self-descriptive, storage and access efficient, and highly flexible.
The system for taking captured \texttt{.pcap} files and writing them into the DataSeries format (i.e., \texttt{.ds}) does so by first creating a structure (based on a pre-written specification of the data to be captured). Once the code builds this structure, it reads through the captured traffic packets, dissecting them and filling in the prepared structure with the desired information and format.
Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only the information that illuminates the behavior of the clients and servers that interact over the network (i.e., I/O transactions). It should also be noted that all sensitive information captured by the tracing system is hashed to protect the users whose information is examined. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet, enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure the security of the university's network communications. It is worth noting that in the case of larger SMB headers some information is lost; this is a trade-off made by the university to provide, on average, a correctly sized SMB header, but it does lead to scenarios where some information may be captured incompletely. \textcolor{blue}{This scenario only occurs in the cases of large AndX Chains in the SMB protocol, since the SMB header for SMB 2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header with the rest of the AndX Chain attached in a series of block pairs.}
Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only the information that illuminates the behavior of the clients and servers that interact over the network (i.e., I/O transactions). It should also be noted that all sensitive information captured by the tracing system is hashed to protect the users whose information is examined. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet, enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure the security of the university's network communications. \textcolor{blue}{The reasoning for this limit was to allow for capture of longer SMB AndX message chains due to negotiated \textit{MaxBufferSize}.} It is worth noting that in the case of larger SMB headers some information is lost; this is a trade-off made by the university to provide, on average, a correctly sized SMB header, but it does lead to scenarios where some information may be captured incompletely. \textcolor{blue}{This scenario only occurs in the cases of large AndX Chains in the SMB protocol, since the SMB header for SMB 2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header with the rest of the AndX Chain attached in a series of block pairs.}
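
As a minimal sketch of the record preparation described above, and not the tracing system's actual code, the following Python fragment truncates each SMB packet to its first 512 bytes and hashes sensitive fields before storage. The helper names (\texttt{anonymize}, \texttt{make\_record}), the use of salted SHA-256, and the choice of user name and file name as the hashed fields are assumptions made for illustration; the text states only that sensitive information is hashed and that 512 bytes are retained.

```python
import hashlib

# Number of bytes of each SMB packet that is retained (stated in the text).
SNAP_LEN = 512


def anonymize(value: str, salt: bytes = b"hypothetical-salt") -> str:
    """Return a salted SHA-256 digest of a sensitive field.

    The hash function and the salt are assumptions; the paper does not
    specify which scheme the tracing system uses.
    """
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def make_record(smb_packet: bytes, username: str, filename: str) -> dict:
    """Build one trace record from a captured SMB packet.

    Only the first SNAP_LEN bytes are kept; the rest of the packet
    (the body of the SMB traffic) is discarded.
    """
    return {
        "user": anonymize(username),
        "file": anonymize(filename),
        "retained": smb_packet[:SNAP_LEN],
        # True when the packet was longer than the retained prefix, e.g. a
        # long AndX chain whose tail would be captured incompletely.
        "truncated": len(smb_packet) > SNAP_LEN,
    }
```

Keeping a \texttt{truncated} flag in each record is one possible way to let later analysis count how often the 512-byte limit cut off part of an AndX chain.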

\subsection{DataSeries Analysis}

@@ -826,6 +831,8 @@ Upon closer examination of the tracing system it was determined that
%these file extensions are an artifact of how Windows interprets file extensions. The Windows operating system merely guesses the file type based on the assumed extension (e.g. whatever characters follow after the final `.').
many files simply do not have a valid extension. These include Linux-based library files, manual pages, odd naming schemes used by scripts or backup files, and date-times and IP addresses used as file names. There are undoubtedly many more, but an exhaustive determination of all variations is out of scope for this work.

\textcolor{red}{Add in information stating that the type of OS in use in the university environment range from Windows, Unix, BSD, as well as other odd operating systems used by the engineering department.}

\begin{table}[]
\centering
\begin{tabular}{|l|l|l|}