minor edits
joc02012 committed May 22, 2020
1 parent daedb0e commit 2d23c69
Showing 1 changed file (trackingPaper.tex) with 8 additions and 8 deletions.
@@ -136,7 +136,7 @@ The main contribution of our work is the exploration of I/O behavior in modern f
We further investigate if the recent standard models for traffic remain accurate.
Our findings reveal interesting data relating to the number of read and write events. We notice that the number of read and write events is significantly smaller than the number of creates, and the average number of bytes exchanged per I/O is much smaller than what has been seen in previous studies.
%the average of bytes transferred over the wire is much smaller than what has been seen in previous studies.
-Furthermore we find an increase in the use of metadata for overall network communication that can be taken advantage of through the use of smart storage devices.
+Furthermore, we find an increase in the use of metadata for overall network communication that can be taken advantage of through the use of smart storage devices.
\keywords{Server Message Block, Storage System Tracing,
Network Benchmark, Storage Systems, Distributed I/O.}
\end{abstract}
@@ -367,13 +367,13 @@ The file size used was in a ring buffer where each file captured was 64000 kB.
% This causes tshark to switch to the next file after it reaches a determined size.
%To simplify this aspect of the capturing process, the entirety of the capturing, dissection, and permanent storage was all automated through watch-dog scripts.

-The \texttt{.pcap} files from \texttt{tshark} do not lend themselves to easy data analysis, so we translate these files into the DataSeries~\cite{DataSeries} format, an XML-based structured data format designed to be self-descriptive, storage and access efficient, and highly flexible.
+The \texttt{.pcap} files from \texttt{tshark} do not lend themselves to easy data analysis, so we translate these files into \texttt{.ds} files using the DataSeries~\cite{DataSeries} format, an XML-based structured data format designed to be self-descriptive, storage and access efficient, and highly flexible.
%The system for taking captured \texttt{.pcap} files and writing them into the DataSeries format (i.e. \texttt{.ds}) does so by first creating a structure (based on a pre-written determination of the data desired to capture). Once the code builds this structure, it then reads through the capture traffic packets while dissecting and filling in the prepared structure with the desired information and format.
-Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only that information which illuminates the behavior of the clients and servers that interact over the network (i.e. I/O transactions). It should also be noted that all sensitive information being captured by the tracing system is hashed to protect the users whose information is examined by the tracing system. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet - enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure the privacy of the university's network communications. The reasoning for this limit was to allow for capture of longer SMB AndX message chains due to negotiated \textit{MaxBufferSize}. It is worth noting that in the case of larger SMB headers, some information is lost, however this is a trade-off by the university to provide, on average, the correct sized SMB header but does lead to scenarios where some information may be captured incompletely. This scenario only occurs in the cases of large AndX Chains in the SMB protocol, since the SMB header for SMB 2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header with the rest of the AndX Chain attached in a series of block pairs.
+For our purposes, there is no need to track all data that is exchanged, only information that illuminates the behavior of the clients and servers that interact over the network (i.e., I/O transactions). It should also be noted that all sensitive information captured by the tracing system is hashed to protect the privacy of the users of the storage system. Furthermore, the DataSeries file retains only the first 512 bytes of each SMB packet --- enough to capture the SMB header information that contains the I/O information we seek --- while the body of the SMB traffic is not retained in order to better ensure privacy. The reasoning for this limit was to allow for capture of longer SMB AndX message chains due to the negotiated \textit{MaxBufferSize}. It is worth noting that in the case of larger SMB headers some information is lost; however, this is a trade-off made by the university to provide, on average, the correctly sized SMB header, though it does lead to scenarios where some information may be captured incompletely. This scenario only occurs in the case of large AndX chains in the SMB protocol, since the SMB header for SMB 2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header, with the rest of the AndX chain attached in a series of block pairs.
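The sanitization policy described above lends itself to a short sketch. This is purely illustrative: the field names, the SHA-256 choice, and the record layout are assumptions, not details taken from the actual tracing code.

```python
import hashlib

HEADER_KEEP = 512  # bytes of each SMB packet retained by the capture policy

def anonymize(value: str) -> str:
    # Hash a sensitive field (e.g. a username) so it is never stored
    # in the clear. SHA-256 is an assumed choice; the paper does not
    # name the hash function actually used.
    return hashlib.sha256(value.encode()).hexdigest()

def truncate_packet(packet: bytes) -> bytes:
    # Keep only the first 512 bytes: enough for the SMB header (fixed
    # at 72 bytes in SMB 2) plus room for chained AndX messages.
    return packet[:HEADER_KEEP]

# Hypothetical trace record built from one captured packet.
record = {
    "user": anonymize("jdoe"),
    "header": truncate_packet(b"\xfeSMB" + b"\x00" * 2000),
}
```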

\subsection{DataSeries Analysis}

-Building upon existing code for the interpretation and dissection of the captured \texttt{.ds} files, we developed C/C++ code for examining the captured traffic information. From this analysis, we are able to capture read, write, create and general I/O information at both a global scale and individual tracking ID (UID/TID) level. In addition, read and write buffer size information is tracked, as well as the inter-arrival and response times. Also included in this data is oplock information and IP addresses. The main contribution of this step is to aggregate seen information for later interpretation of the results.
+Building upon existing code for the interpretation and dissection of the captured \texttt{.ds} files, we developed C/C++ code to examine the captured traffic information. From this analysis, we are able to capture read, write, create and general I/O information at both a global scale and individual tracking ID (UID/TID) level. In addition, read and write buffer size information is tracked, as well as the inter-arrival and response times. Also included in this data is oplock information and IP addresses. The main contribution of this step is to aggregate observed data for later interpretation of the results.
This step also creates an easily digestible output that can be used to re-create all tuple information for SMB/SMB2 sessions that are witnessed over the entire time period.
Sessions are any communication in which a valid UID and TID are used.
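As a rough sketch of this aggregation step, sessions can be keyed on the (UID, TID) pair and statistics accumulated per key. The record layout and field names below are invented for illustration; the real \texttt{.ds} schema is far richer.

```python
from collections import defaultdict

# Toy trace records: (uid, tid, command, byte_count). These values are
# fabricated; real records come from the dissected .ds files.
events = [
    (100, 1, "read",  4096),
    (100, 1, "write",   63),
    (100, 2, "close",    0),
    (200, 1, "read",  1024),
]

# A session is any communication under a valid (UID, TID) pair.
sessions = defaultdict(lambda: {"ops": 0, "bytes": 0})
for uid, tid, cmd, nbytes in events:
    stats = sessions[(uid, tid)]
    stats["ops"] += 1
    stats["bytes"] += nbytes
```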

@@ -418,7 +418,7 @@ Average Write Size (B) & 63 \\ \hline
% NOTE: Not sure but this reference keeps referencing the WRONG table

Table~\ref{tbl:TraceSummaryTotal}
-show a summary of the SMB traffic captured, statistics of the I/O operations, and read/write data exchange observed for the network filesystem. This information is further detailed in Table~\ref{tbl:SMBCommands}, which illustrates that the majority of I/O operations are general (74.87\%). As shown in %the bottom part of
+shows a summary of the SMB traffic captured, statistics of the I/O operations, and read/write data exchange observed for the network filesystem. This information is further detailed in Table~\ref{tbl:SMBCommands}, which illustrates that the majority of I/O operations are general (74.87\%). As shown in %the bottom part of
Table~\ref{tbl:SMBCommands2}, general I/O includes metadata commands such as connect, close, query info, etc.

Our examination of the collected network filesystem data revealed interesting patterns for the current use of CIFS/SMB in a large academic setting. The first is that there is a major shift away from read and write operations towards more metadata-based ones. This matches the last CIFS observations made by Leung et~al.\ that files were being generated and accessed infrequently. The change in operations is due to a movement of user activity from reading and writing data to simply checking file and directory metadata. However, since the earlier study, SMB has transitioned to the SMB2 protocol, which was supposed to be less ``chatty''. As a result, we would expect fewer general SMB operations. Table~\ref{tbl:SMBCommands} shows a breakdown of SMB and SMB2 usage over the time period of May. From this table, one can see that the SMB2 protocol makes up $99.14$\% of total network operations compared to just $0.86$\% for SMB, indicating that most clients have upgraded to SMB2. However, $74.66$\% of SMB2 I/O operations are still general operations. Contrary to the purpose of implementing the SMB2 protocol, there is still a large amount of general I/O.
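The protocol share computation is simple bookkeeping; below is a sketch with made-up raw counts chosen only so that the ratio reproduces the reported $99.14$\%/$0.86$\% split:

```python
# Hypothetical operation counts per protocol; only the ratio matters.
counts = {"SMB": 860, "SMB2": 99140}
total = sum(counts.values())
share = {proto: 100.0 * n / total for proto, n in counts.items()}
# share["SMB2"] is 99.14 and share["SMB"] is 0.86 for these counts.
```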
@@ -623,7 +623,7 @@ files when files are modified. Furthermore, read operations account for the lar
\subsection{I/O Response Times}

%~!~ Addition since Chandy writing ~!~%
-Most previous tracing work has not reported I/O response times or command latency, which is generally proportional to data request size, but under load, the response times give an indication of server load. In
+Most previous tracing work have not reported I/O response times or command latency, which is generally proportional to data request size, but under load, the response times give an indication of server load. In
Table~\ref{tbl:PercentageTraceSummary} we show a summary of the response times for read, write, create, and general commands. We note that most general (metadata) operations run relatively slowly and occur at high frequency.
We also observe that the number of writes is very close to the number of reads. The response time for write operations is very small --- most likely because the storage server caches the write without actually committing it to disk. Reads, on the other hand, are in most cases probably not going to hit in the cache and require an actual read from the storage media. Although read operations are only a small percentage of all operations, they have the highest average response time. As noted above, creates happen more frequently but have a slightly slower response time, because of the extra metadata operations required for a create as opposed to a simple write.
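The per-command averages reported in the table reduce to a running sum and count per command type; a minimal sketch with fabricated response times:

```python
from collections import defaultdict

# (command, response_time_in_microseconds) samples; values are invented.
samples = [
    ("read", 900.0), ("read", 1100.0),
    ("write", 50.0), ("write", 70.0),
    ("create", 200.0),
]

totals = defaultdict(lambda: [0.0, 0])  # command -> [sum, count]
for cmd, rt in samples:
    totals[cmd][0] += rt
    totals[cmd][1] += 1

avg_rt = {cmd: s / n for cmd, (s, n) in totals.items()}
```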

@@ -786,7 +786,7 @@ Avg IAT ($\mu$s) & 33,220.8 & \multicolumn{1}{r|}{35,260.4} & \multicolumn{
%% \end{itemize}
%%\end{enumerate}
%
-Figures~\ref{fig:CDF-IAT-SMB} and~\ref{fig:PDF-IAT-SMB} shows the inter arrival times CDFs and PDFs. As can be seen, SMB commands happen very frequently - $85$\% of commands are issued less than 1000~$\mu s$ apart. As mentioned above, SMB is known to be very chatty, and it is clear that servers must spend a lot of time dealing with these commands. For the most part, most of these commands are also serviced fairly quickly as
+Figures~\ref{fig:CDF-IAT-SMB} and~\ref{fig:PDF-IAT-SMB} show the inter-arrival time CDFs and PDFs. As can be seen, SMB commands happen very frequently --- $85$\% of commands are issued less than 1000~$\mu s$ apart. As mentioned above, SMB is known to be very chatty, and it is clear that servers must spend a significant amount of time dealing with these commands. For the most part, these commands are also serviced fairly quickly, as
seen in Figures~\ref{fig:CDF-RT-SMB} and~\ref{fig:PDF-RT-SMB}. Interestingly, the response time for the general metadata operations follows a similar curve to the inter-arrival times.
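The $85$\% figure is read off the empirical CDF of inter-arrival times; computing such a CDF point is straightforward. The sample values below are invented for illustration:

```python
def empirical_cdf(samples, threshold):
    # Fraction of samples at or below the threshold -- the value an
    # empirical CDF curve reports at that x position.
    if not samples:
        raise ValueError("empty sample set")
    at_or_below = sum(1 for s in samples if s <= threshold)
    return at_or_below / len(samples)

# Toy inter-arrival times in microseconds.
iats = [10, 50, 120, 300, 800, 950, 2500, 40000]
frac_under_1ms = empirical_cdf(iats, 1000)  # 0.75 for this toy data
```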

%Next we examine the response time (RT) of the read, write, and create I/O operations that occur over the SMB network filesystem.
@@ -1107,7 +1107,7 @@ Another concern was whether or not the system would be able to function optimall
%One glaring challenge with building this tracing system was using code written by others; tshark and DataSeries. While these programs are used within the tracing structure there are some issues when working with them. These issues ranged from data type limitations of the code to hash value and checksum miscalculations due to encryption of specific fields/data. Attempt was made to dig and correct these issues, but they were so inherent to the code being worked with that hacks and workarounds were developed to minimize their effect. Other challenges centralize around selection, interpretations and distribution scope of the data collected. Which fields should be filtered out from the original packet capture? What data is most prophetic to the form and function of the network being traced? What should be the scope, with respect to time, of the data being examined? Where will the most interesting information appear? As each obstacle was tackled, new information and ways of examining the data reveal themselves and with each development different alterations and corrections are made.

%Even when all the information is collected and the most important data has been selected, there is still the issue of what lens should be used to view this information.
-Because the data is being collected from an active network, there will be differing activity depending on the time of day, week, and scholastic year. For example, although the first week or so of the year may contain a lot of traffic, this does not mean that trends of that period of time will occur for every week of the year (except perhaps the final week of the semester). The trends and habits of the network will change based on the time of year, time of day, and even depend on the exam schedule. A comprehensive examination requires looking at all different periods of time to see how all these factors play into the storage system utilization.
+Because the data is being collected from an active network, there will be differing activity depending on the time, the day, the week, and the academic calendar. For example, although the first week or so of the academic year may contain a large amount of traffic, this does not mean that trends of that period of time will occur for every week of the year (except perhaps the final week of the semester). The trends and habits of the network will change based on the time of year, time of day, and even depend on the exam schedule. A comprehensive examination requires looking at all different periods of time to see how all these factors play into the storage system utilization.
% DataSeries Challenge
%A complication of this process is that the DataSeries code makes use of a push-pop stack for iterating through packet information. This means that if information can not be re-read then errors occur. This can manifest in the scenario where a produced \texttt{.ds} file is corrupted or incomplete, despite the fact that the original \texttt{.pcap} being fine.
%This manifested as an approximate loss of \textbf{????} out of every 100,000 files.
