Basic edits and fixes. Still have handful of points to address

Duncan committed Jun 24, 2020
1 parent 2d23c69 commit 614c1b9e1ad1991aa821c182e292d369da6b66b0
Showing with 10 additions and 10 deletions.
  1. +10 −10 trackingPaper.tex
@@ -153,7 +153,7 @@ \section{Introduction}
storage area networks (SAN), network attached storage (NAS), clustered file systems,
hybrid storage, amongst others. However, the front-end client-facing network file
system protocol in most enterprise IT settings tends to be, for the most part, solely
-SMB (Server Message Block) because of the preponderance of MS Windows clients.
+SMB (Server Message Block) because of the preponderance of Microsoft (MS) Windows clients.
While there are other network file systems such as Network File System (NFS) and
clustered file systems such as Ceph, PanFS, and OrangeFS, they tend to be used less
extensively in most non-research networks.
@@ -200,7 +200,7 @@ \section{Introduction}
% DataSeries + Code section
DataSeries was modified to filter specific SMB protocol fields along with the writing of analysis tools to parse and dissect the captured packets. Specific fields were chosen to be the interesting fields kept for analysis.
%It should be noted that this was done originally arbitrarily and changes/additions have been made as the value of certain fields were determined to be worth examining; e.g. multiple runs were required to refine the captured data for later analysis.
-The DataSeries data format allowed us to create data analysis code that focuses on I/O events and ID tracking (TID/UID). The future vision for this information is to combine ID tracking with the OpLock information in order to track resource sharing of the different clients on the network. As well as using IP information to recreate communication in a larger network trace to establish a better benchmark.
+The DataSeries data format allowed us to create data analysis code that focuses on I/O events and ID tracking: e.g. Tree ID (TID) and User ID (UID). The future vision for this information is to combine ID tracking with the OpLock information in order to track resource sharing of the different clients on the network, as well as using IP information to recreate communication in a larger network trace to establish a better benchmark.

%Focus should be aboiut analysis and new traces
The contributions of this work are the new traces of SMB traffic over a large university network as well as new analysis of this traffic. Our new examination of the captured data reveals that despite the streamlining of the CIFS/SMB protocol to be less "chatty", the majority of SMB communication is still metadata based I/O rather than actual data I/O. We found that read operations occur in greater numbers and cause a larger overall number of bytes to pass over the network. Additionally, the average number of bytes transferred for each write I/O is smaller than that of the average read operation. We also find that the current standard for modeling network I/O holds for the majority of operations, while a more representative model needs to be developed for reads.
@@ -247,7 +247,7 @@ \section{Related Work}
%The work done by
Leung et al.~\cite{leung2008measurement} found that
%observations related to the infrequency of files to be shared by more than one client.
-over 67\% of files were never opened by more than one client.
+over 67\% of files were never opened by more than one client
%Work by Leung \textit{et al.} led to a series of observations, from the fact that files are rarely re-opened to finding
and that read-write access patterns are more frequent.
%If files were shared it was rarely concurrently and usually as read-only; where 5\% of files were opened by multiple clients concurrently and 90\% of the file sharing was read only.
@@ -260,7 +260,7 @@ \section{Related Work}
%I/O benchmarking widespread practice in storage industry and serves as basis for purchasing decisions, performance tuning studies and marketing campaigns.
Issues of inaccuracies in scheduling I/O can result in as much as a factor 3.5 difference in measured response time and factor of 26 in measured queue sizes. These inaccuracies pose too much of an issue to ignore.

-Orosz and Skopko examined the effect of the kernel on packet loss and
+Orosz and Skopko~\cite{Orosz2013} examined the effect of the kernel on packet loss and
%in their 2013 paper~\cite{Orosz2013}. Their work
showed that when taking network measurements, the precision of the timestamping of packets is a more important criterion than low clock offset, especially when measuring packet inter-arrival times and round-trip delays at a single point of the network. One solution for network capture is the tool Dumpcap. However, the concern with Dumpcap is that it is a single threaded application and was suspected to be unable to handle new arriving packets due to the small size of the kernel buffer. Work by
Dabir and Matrawy%, in 2008
@@ -344,7 +344,7 @@ \subsection{University Storage System Overview}
%\textcolor{red}{UConn}
as well as personal drive share space for faculty, staff and students, along with at least one small group of users. Each server is capable of handling 1~Gb/s of traffic in each direction (e.g. outbound and inbound traffic). Altogether, the five-blade server system can in theory handle 5~Gb/s of data traffic in each direction.
%Some of these blade servers have local storage but the majority do not have any.
-The blade servers serve as SMB heads, but the actual storage is served by SAN storage nodes that sit behind them. This system does not currently implement load balancing. Instead, the servers are set up to spread the load with a static distribution across four of the active cluster nodes while the passive fifth node takes over in the case any of the other nodes go down.% (e.g. become inoperable or crash).
+The blade servers serve as SMB heads, but the actual storage is served by SAN storage nodes that sit behind them. This system does not currently implement load balancing. Instead, the servers are set up to spread the load with a static distribution across four of the active cluster nodes while the passive fifth node takes over in the case of any other nodes going down.% (e.g. become inoperable or crash).

The actual tracing was performed with a tracing server connected to a switch outfitted with a packet duplicating element as shown in the topology diagram in Figure~\ref{fig:captureTopology}. A 10~Gbps network tap was installed in the file server switch, allowing our storage server to obtain a copy of all network traffic going to the 5 file servers. The reason for using 10~Gbps hardware is to help ensure that the system is able to capture information on the network at peak theoretical throughput.

@@ -421,15 +421,15 @@ \section{Data Analysis}
shows a summary of the SMB traffic captured, statistics of the I/O operations, and read/write data exchange observed for the network filesystem. This information is further detailed in Table~\ref{tbl:SMBCommands}, which illustrates that the majority of I/O operations are general (74.87\%). As shown in %the bottom part of
Table~\ref{tbl:SMBCommands2}, general I/O includes metadata commands such as connect, close, query info, etc.

-Our examination of the collected network filesystem data revealed interesting patterns for the current use of CIFS/SMB in a large academic setting. The first is that there is a major shift away from read and write operations towards more metadata-based ones. This matches the last CIFS observations made by Leung et.~al.~that files were being generated and accessed infrequently. The change in operations are due to a movement of use activity from reading and writing data to simply checking file and directory metadata. However, since the earlier study, SMB has transitioned to the SMB2 protocol which was supposed to be less "chatty". As a result, we would expect fewer general SMB operations. Table~\ref{tbl:SMBCommands} shows a breakdown of SMB and SMB2 usage over the time period of May. From this table, one can see that the SMB2 protocol makes up $99.14$\% of total network operations compared to just $0.86$\% for SMB, indicating that most clients have upgraded to SMB2. However, $74.66$\% of SMB2 I/O are still general operations. Contrary to the purpose of implementing the SMB2 protocol, there is still a large amount of general I/O.
+Our examination of the collected network filesystem data revealed interesting patterns for the current use of CIFS/SMB in a large academic setting. The first is that there is a major shift away from read and write operations towards more metadata-based ones. This matches the last CIFS observations made by Leung et.~al.~\cite{leung2008measurement} that files were being generated and accessed infrequently. The change in operations are due to a movement of use activity from reading and writing data to simply checking file and directory metadata. However, since the earlier study, SMB has transitioned to the SMB2 protocol which was supposed to be less "chatty". As a result, we would expect fewer general SMB operations. Table~\ref{tbl:SMBCommands} shows a breakdown of SMB and SMB2 usage over the time period of May. From this table, one can see that the SMB2 protocol makes up $99.14$\% of total network operations compared to just $0.86$\% for SMB, indicating that most clients have upgraded to SMB2. However, $74.66$\% of SMB2 I/O are still general operations. Contrary to the purpose of implementing the SMB2 protocol, there is still a large amount of general I/O.
%While CIFS/SMB protocol has less metadata operations, this is due to a depreciation of the SMB protocol commands, therefore we would expect to see less total operations (e.g. $0.04$\% of total operations).
%The infrequency of file activity is further strengthened by our finding that within a week long window of time there are no Read or Write inter arrival times that can be calculated.
%\textcolor{red}{XXX we are going to get questioned on this. its not likely that there are no IATs for reads and writes}
%General operations happen at very high frequency with inter arrival times that were found to be relatively short (1317$\mu$s on average), as shown in Table~\ref{tbl:PercentageTraceSummary}.

Taking a deeper look at the SMB2 operations, shown in %the bottom half of
Table~\ref{tbl:SMBCommands2}, we see that $9.06$\% of the general operations are negotiate commands. These are commands sent by the client to notify the server which dialects of the SMB2 protocol the client can understand. The three most common commands are close, tree connect, and query info.
-The latter two relate to metadata information of shares and files accessed. However, the close operation corresponds to the create operations. Note that the create command is also used as an open file. Notice is that the number of closes is greater than the total number of create operations by $9.35$\%. These extra close operations are most likely due to applications doing multiple closes that do not need to be performed.
+The latter two relate to metadata information of shares and files accessed. However, the close operation corresponds to the create operations. Note that the create command is also used as an open file. Notice that the number of closes is greater than the total number of create operations by $9.35$\%. These extra close operations are most likely due to applications doing multiple closes that do not need to be performed.

\begin{table}
\centering
@@ -447,7 +447,7 @@ \section{Data Analysis}
Combine Protocol Operations & 2,421,214 & 278,998,472 & 281,419,686 \\
Combined Protocols \% & 0.86\% & 99.14\% & 100\% \\ \hline
\end{tabular}
-\caption{\label{tbl:SMBCommands}Percentage of SMB and SMB2 Protocol Commands on March 15th}
+\caption{\label{tbl:SMBCommands}Percentage of SMB and SMB2 Protocol Commands for the Time of April 30th, 2019 to May 20th, 2019}
\vspace{-1em}
\end{table}

@@ -585,7 +585,7 @@ \subsection{I/O Data Request Sizes}
\end{table}

In comparison of the read, write, and create operations we found that the vast majority
-of these type of I/O belong to creates. By the fact that there are so many creates, it
+of I/O belong to creates. By the fact that there are so many creates, it
seems apparent that many applications create new files rather than updating existing
files when files are modified. Furthermore, read operations account for the largest aggregate of bytes transferred over the network. However, the number of bytes transferred by write commands is not far behind, although, non-intuitively, including a larger number of standardized relatively smaller writes. The most unexpected finding of the data is that all the read and writes are performed using much smaller buffers than expected; about an order of magnitude smaller (e.g. bytes instead of kilobytes).

@@ -919,7 +919,7 @@ \subsection{File Extensions}
Furthermore, the majority of extensions are not readily identified.
Upon closer examination of the tracing system it was determined that
%these file extensions are an artifact of how Windows interprets file extensions. The Windows operating system merely guesses the file type based on the assumed extension (e.g. whatever characters follow after the final `.').
-many files simply do not have a valid extension. These range from Linux-based library files, man pages, odd naming schemes as part of scripts or back-up files, as well as date-times and IPs as file names. There are undoubtedly more, but exhaustive determination of all variations is seen as out of scope for this work.
+many files simply do not have a valid extension. These range from Linux-based library files, manual pages, odd naming schemes as part of scripts or back-up files, as well as date-times and IPs as file names. There are undoubtedly more, but exhaustive determination of all variations is seen as out of scope for this work.

%\textcolor{red}{Add in information stating that the type of OS in use in the university environment range from Windows, Unix, BSD, as well as other odd operating systems used by the engineering department.}
