diff --git a/images/README.commands b/images/README.commands new file mode 100644 index 0000000..eed4fee --- /dev/null +++ b/images/README.commands @@ -0,0 +1,13 @@ +# The following are the most basic commands used to produce the gnuplots that can then be saved as PDFs +# Note: This is for line graphs NOT box ones +set size ratio 0.6 +set xlabel "X-Axis Value" +unset key +set ytics nomirror +set xtics nomirror +set logscale x 2 +set xtics ("1" 1, "32" 32, "1024" 1024, "32768" 32768, "1s" 1e6, "10s" 1e7, "100s" 1e8) +plot "-" with lines + +# For Bytes +set xtics ("1" 1, "32" 32, "1K" 1024, "8K" 8192, "16K" 16384, "32K" 32768, "256K" 262144, "512K" 524288, "1M" 1048576, "32M" 33554432, "512M" 536870912) diff --git a/images/smb_create_iats_cdf.pdf b/images/smb_create_iats_cdf.pdf index b3e1066..1ec0ddd 100644 Binary files a/images/smb_create_iats_cdf.pdf and b/images/smb_create_iats_cdf.pdf differ diff --git a/images/smb_create_iats_pdf.pdf b/images/smb_create_iats_pdf.pdf index b392ccd..d687909 100644 Binary files a/images/smb_create_iats_pdf.pdf and b/images/smb_create_iats_pdf.pdf differ diff --git a/images/smb_create_rts_cdf.pdf b/images/smb_create_rts_cdf.pdf index 5a448f5..461e8eb 100644 Binary files a/images/smb_create_rts_cdf.pdf and b/images/smb_create_rts_cdf.pdf differ diff --git a/images/smb_create_rts_pdf.pdf b/images/smb_create_rts_pdf.pdf index 85395c4..2e1db7b 100644 Binary files a/images/smb_create_rts_pdf.pdf and b/images/smb_create_rts_pdf.pdf differ diff --git a/images/smb_general_iats_cdf.pdf b/images/smb_general_iats_cdf.pdf index 913afea..bca2fa2 100644 Binary files a/images/smb_general_iats_cdf.pdf and b/images/smb_general_iats_cdf.pdf differ diff --git a/images/smb_general_iats_pdf.pdf b/images/smb_general_iats_pdf.pdf index 8f222cd..b4fa575 100644 Binary files a/images/smb_general_iats_pdf.pdf and b/images/smb_general_iats_pdf.pdf differ diff --git a/images/smb_general_rts_cdf.pdf b/images/smb_general_rts_cdf.pdf index 4131dcc..0f0e11d 100644 Binary files a/images/smb_general_rts_cdf.pdf and b/images/smb_general_rts_cdf.pdf differ diff --git a/images/smb_general_rts_pdf.pdf b/images/smb_general_rts_pdf.pdf index 6c193d1..b25eb70 100644 Binary files a/images/smb_general_rts_pdf.pdf and b/images/smb_general_rts_pdf.pdf differ diff --git a/images/smb_read_bytes_cdf.pdf b/images/smb_read_bytes_cdf.pdf index 9ea7c1d..43e717c 100644 Binary files a/images/smb_read_bytes_cdf.pdf and b/images/smb_read_bytes_cdf.pdf differ diff --git a/images/smb_read_bytes_pdf.pdf b/images/smb_read_bytes_pdf.pdf index 07c0382..fecdcec 100644 Binary files a/images/smb_read_bytes_pdf.pdf and b/images/smb_read_bytes_pdf.pdf differ diff --git a/images/smb_read_iats_cdf.pdf b/images/smb_read_iats_cdf.pdf index 865772f..5d6cc02 100644 Binary files a/images/smb_read_iats_cdf.pdf and b/images/smb_read_iats_cdf.pdf differ diff --git a/images/smb_read_iats_pdf.pdf b/images/smb_read_iats_pdf.pdf index 09fdf86..659514b 100644 Binary files a/images/smb_read_iats_pdf.pdf and b/images/smb_read_iats_pdf.pdf differ diff --git a/images/smb_read_rts_cdf.pdf b/images/smb_read_rts_cdf.pdf index b30bcee..d468f58 100644 Binary files a/images/smb_read_rts_cdf.pdf and b/images/smb_read_rts_cdf.pdf differ diff --git a/images/smb_read_rts_pdf.pdf b/images/smb_read_rts_pdf.pdf index f15535c..fb58d8a 100644 Binary files a/images/smb_read_rts_pdf.pdf and b/images/smb_read_rts_pdf.pdf differ diff --git a/images/smb_write_bytes_cdf.pdf b/images/smb_write_bytes_cdf.pdf index cb23d4b..fcea747 100644 Binary files 
a/images/smb_write_bytes_cdf.pdf and b/images/smb_write_bytes_cdf.pdf differ diff --git a/images/smb_write_bytes_pdf.pdf b/images/smb_write_bytes_pdf.pdf index 86cbd06..295a857 100644 Binary files a/images/smb_write_bytes_pdf.pdf and b/images/smb_write_bytes_pdf.pdf differ diff --git a/images/smb_write_iats_cdf.pdf b/images/smb_write_iats_cdf.pdf index 178cc5e..028f361 100644 Binary files a/images/smb_write_iats_cdf.pdf and b/images/smb_write_iats_cdf.pdf differ diff --git a/images/smb_write_iats_pdf.pdf b/images/smb_write_iats_pdf.pdf index 792bbd7..5796945 100644 Binary files a/images/smb_write_iats_pdf.pdf and b/images/smb_write_iats_pdf.pdf differ diff --git a/images/smb_write_rts_cdf.pdf b/images/smb_write_rts_cdf.pdf index e08a9f7..9fc038b 100644 Binary files a/images/smb_write_rts_cdf.pdf and b/images/smb_write_rts_cdf.pdf differ diff --git a/images/smb_write_rts_pdf.pdf b/images/smb_write_rts_pdf.pdf index a1be8b1..123a318 100644 Binary files a/images/smb_write_rts_pdf.pdf and b/images/smb_write_rts_pdf.pdf differ diff --git a/trackingPaper.dvi b/trackingPaper.dvi new file mode 100644 index 0000000..25e3eec Binary files /dev/null and b/trackingPaper.dvi differ diff --git a/trackingPaper.tex b/trackingPaper.tex index 67cc4fc..cb9d555 100644 --- a/trackingPaper.tex +++ b/trackingPaper.tex @@ -223,15 +223,15 @@ This paper & 2020 & SMB & x & Dynamic & \end{table*} \label{Previous Advances Due to Testing} Tracing collection and analysis has proved its worth in time from previous studies where one can see important lessons pulled from the research; change in behavior of read/write events, overhead concerns originating in system implementation, bottlenecks in communication, and other revelations found in the traces. \\ -Previous tracing work has shown that one of the largest \& broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account but these generalizations can also be a major source of error~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}. To produce a benchmark with high fidelity one needs to understand not only the technology being used but how it is being implemented within the system~\cite{roselli2000comparison,traeger2008nine,ruemmler1992unix}. All of these aspects will lend to the behavior of the system; from timing \& resource elements to how the managing software governs actions~\cite{douceur1999large,malkani2003passive,seltzer2003nfs}. Furthermore, in pursuing this work one may find unexpected results and learn new things through examination~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}. \\ +Previous tracing work has shown that one of the largest and broadest hurdles to tackle is that traces (and benchmarks) must be tailored to the system being tested. There are always some generalizations taken into account but these generalizations can also be a major source of error~\cite{vogels1999file,malkani2003passive,seltzer2003nfs,anderson2004buttress,Orosz2013,dabir2007bottleneck,skopko2012loss,traeger2008nine,ruemmler1992unix}. To produce a benchmark with high fidelity one needs to understand not only the technology being used but how it is being implemented within the system~\cite{roselli2000comparison,traeger2008nine,ruemmler1992unix}. 
All of these aspects will lend to the behavior of the system; from timing and resource elements to how the managing software governs actions~\cite{douceur1999large,malkani2003passive,seltzer2003nfs}. Furthermore, in pursuing this work one may find unexpected results and learn new things through examination~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}. \\ These studies are required in order to evaluate the development of technologies and methodologies along with furthering knowledge of different system aspects and capabilities. As has been pointed out by past work, the design of systems is usually guided by an understanding of the file system workloads and user behavior~\cite{leung2008measurement}. It is for that reason that new studies are constantly performed by the science community, from large scale studies to individual protocol studies~\cite{leung2008measurement,vogels1999file,roselli2000comparison,seltzer2003nfs,anderson2004buttress}. Even within these studies, the information gleaned is only as meaningful as the considerations of how the data is handled. -The work done by Leung et. al.~\cite{leung2008measurement} found observations related to the infrequency of files to be shared by more than one client. Over 67\% of files were never open by more than one client. -Leung's \textit{et. al.} work led to a series of observations, from the fact that files are rarely re-opened to finding that read-write access patterns are more frequent ~\cite{leung2008measurement}. +The work done by Leung et al.~\cite{leung2008measurement} made observations related to the infrequency with which files are shared by more than one client: over 67\% of files were never opened by more than one client. +The work of Leung et al. led to a series of observations, from the fact that files are rarely re-opened to the finding that read-write access patterns have become more frequent~\cite{leung2008measurement}. %If files were shared it was rarely concurrently and usually as read-only; where 5\% of files were opened by multiple clients concurrently and 90\% of the file sharing was read only. %Concerns of the accuracy achieved of the trace data was due to using standard system calls as well as errors in issuing I/Os leading to substantial I/O statistical errors. % Anderson Paper -The 2004 paper by Anderson et. al.~~\cite{anderson2004buttress} has the following observations. A source of decreased precision came from the Kernel overhead for providing timestamp resolution. This would introduce substantial errors in the observed system metrics due to the use inaccurate tools when benchmarking I/O systems. These errors in perceived I/O response times can range from +350\% to -15\%. +The 2004 paper by Anderson et al.~\cite{anderson2004buttress} makes the following observations: a source of decreased precision came from the kernel overhead involved in providing timestamp resolution. This overhead introduces substantial errors in the observed system metrics due to the use of inaccurate tools when benchmarking I/O systems. These errors in perceived I/O response times can range from +350\% to -15\%. %I/O benchmarking widespread practice in storage industry and serves as basis for purchasing decisions, performance tuning studies and marketing campaigns. Issues of inaccuracies in scheduling I/O can result in as much as a factor 3.5 difference in measured response time and factor of 26 in measured queue sizes. These inaccuracies pose too much of an issue to ignore.
@@ -239,7 +239,7 @@ Orosz and Skopko examined the effect of the kernel on packet loss in their 2013 Narayan and Chandy examined the concerns of distributed I/O and the different models of parallel application I/O. %There are five major models of parallel application I/O. (1) Single output file shared by multiple nodes. (2) Large sequential reads by a single node at the beginning of computation and large sequential writes by a single node at the end of computation. (3) Checkpointing of states. (4) Metadata and read intensive (e.g. small data I/O and frequent directory lookups for reads). -Due to the striping of files across multiple nodes, this can cause any read or write to access all the nodes; which does not decrease the inter-arrival times (IATs) seen. As the number of I/O operations increase and the number of nodes increase, the IAT times decreased. +Because files are striped across multiple nodes, any read or write may access all of the nodes, which does not decrease the observed inter-arrival times (IATs). As the number of I/O operations and the number of nodes increase, the IATs decrease. Observations from Skopko in a 2012 paper~\cite{skopko2012loss} examined the nuance concerns of software based capture solutions. The main observation was software solutions relied heavily on OS packet processing mechanisms. Further more, depending on the mode of operation (e.g. interrupt or polling), the timestamping of packets would change. As seen in previous trace work~\cite{leung2008measurement,roselli2000comparison,seltzer2003nfs}, the general perceptions of how computer systems are being used versus their initial purpose have allowed for great strides in eliminating actual bottlenecks rather than spending unnecessary time working on imagined bottlenecks. Without illumination of these underlying actions (e.g. read-write ratios, file death rates, file access rates) these issues can not be readily tackled. @@ -265,6 +265,10 @@ Some nuances of SMB protocol I/O to note are that SMB/SMB2 write requests are th %\end{itemize} % Make sure to detail here how exactly IAT/RT are each calculated +\textcolor{red}{Add writing about the type of packets used by SMB. Include information about the response time of R/W/C/General (to introduce them formally; not sure what this means...). Also can bring up the relation between close and other requests.} + +\textcolor{blue}{It is worth noting that for the SMB2 protocol, the close request packet is used by clients to close instances of a file that were opened with a previous create request packet.} + \begin{figure} \includegraphics[width=0.5\textwidth]{./images/smbPacket.jpg} \caption{Visualization of SMB Packet} @@ -275,10 +279,10 @@ Some nuances of SMB protocol I/O to note are that SMB/SMB2 write requests are th \label{Issues with Tracing} There are three general approaches to creating a benchmark based on a trade-off between experimental complexity and resemblance to the original application. (1) Connect the system to a production test environment, run the application, and measure the application metrics. (2) Collect traces from running the application and replay them (after possible modification) back on the test I/O system. (3) Generate a synthetic workload and measure the system performance. -The majority of benchmarks attempt to represent a known system and structure on which some ``original'' design/system was tested.
While this is all well and good, there are many issues with this sort of approach; temporal \& spatial scaling concerns, timestamping and buffer copying, as well as driver operation for capturing packets~\cite{Orosz2013,dabir2007bottleneck,skopko2012loss}. Each of these aspects contribute to the initial problems with dissection and analysis of the captured information. For example, inaccuracies in scheduling I/Os may result in as much as a factor of 3.5 differences in measured response time and factor of 26 in measured queue sizes; differences that are too large to ignore~\cite{anderson2004buttress}. +The majority of benchmarks attempt to represent a known system and structure on which some ``original'' design/system was tested. While this is all well and good, there are many issues with this sort of approach: temporal and spatial scaling concerns, timestamping and buffer copying, as well as driver operation for capturing packets~\cite{Orosz2013,dabir2007bottleneck,skopko2012loss}. Each of these aspects contributes to the initial problems with dissection and analysis of the captured information. For example, inaccuracies in scheduling I/Os may result in as much as a factor of 3.5 difference in measured response time and a factor of 26 in measured queue sizes; differences that are too large to ignore~\cite{anderson2004buttress}. Dealing with timing accuracy and high throughput involves three challenges. (1) Designing for dealing with peak performance requirements. (2) Coping with OS timing inaccuracies. (3) Working around unpredictable OS behavior; e.g. mechanisms to keep time and issue I/Os or performance effects due to interrupts. -Temporal scaling refers to the need to account for the nuances of timing with respect to the run time of commands; consisting of computation, communication \& service. A temporally scalable benchmarking system would take these subtleties into account when expanding its operation across multiple machines in a network. While these temporal issues have been tackled for a single processor (and even somewhat for cases of multi-processor), these same timing issues are not properly handled when dealing with inter-network communication. Inaccuracies in packet timestamping can be caused due to overhead in generic kernel-time based solutions, as well as use of the kernel data structures ~\cite{PFRINGMan,Orosz2013}. +Temporal scaling refers to the need to account for the nuances of timing with respect to the run time of commands, consisting of computation, communication, and service. A temporally scalable benchmarking system would take these subtleties into account when expanding its operation across multiple machines in a network. While these temporal issues have been tackled for a single processor (and even somewhat for cases of multi-processor), these same timing issues are not properly handled when dealing with inter-network communication. Inaccuracies in packet timestamping can be caused by overhead in generic kernel-time based solutions, as well as by use of the kernel data structures~\cite{PFRINGMan,Orosz2013}. Spatial scaling refers to the need to account for the nuances of expanding a benchmark to incorporate a number of machines over a network. A system that properly incorporates spatial scaling is one that would be able to incorporate communication (even in varying intensities) between all the machines on a system, thus stress testing all communicative actions and aspects (e.g. resource locks, queueing) on the network.
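As a concrete illustration of how timestamp granularity propagates into derived inter-arrival times, the following sketch is offered. It is not part of the paper's tooling; the function names, the synthetic exponential arrivals, and the 1 ms clock resolution are assumptions chosen purely to show the effect of a coarse kernel clock on measured IATs.

# Illustrative sketch (not part of the paper's tooling): how inter-arrival
# times are derived from per-packet timestamps, and how coarse kernel
# timestamp resolution distorts the resulting distribution. Timestamps are
# assumed to be epoch times in seconds.
import random

def inter_arrival_times(timestamps):
    """Return the deltas between consecutive packet timestamps."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def quantize(timestamps, resolution):
    """Round timestamps down to a fixed clock resolution (in seconds)."""
    return [int(t / resolution) * resolution for t in timestamps]

if __name__ == "__main__":
    random.seed(1)
    # Synthetic arrivals with ~1 ms mean spacing (exponential inter-arrivals).
    t = 0.0
    arrivals = []
    for _ in range(10000):
        t += random.expovariate(1000.0)
        arrivals.append(t)

    fine = inter_arrival_times(arrivals)
    coarse = inter_arrival_times(quantize(arrivals, 0.001))  # assumed 1 ms clock

    print("mean IAT (fine):   %.6f s" % (sum(fine) / len(fine)))
    print("mean IAT (coarse): %.6f s" % (sum(coarse) / len(coarse)))
    print("zero IATs after quantization: %d" % sum(1 for d in coarse if d == 0))

Quantizing to a coarse clock collapses many inter-arrival gaps to zero even though the mean barely changes, which is the kind of distortion the kernel-timestamping concerns above describe.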
@@ -298,7 +302,7 @@ We collected traces from the University of Connecticut University Information Te %Some of these blade servers have local storage but the majority do not have any. The blade servers serve as SMB heads, but the actual storage is served by SAN storage nodes that sit behind them. This system does not currently implement load balancing. Instead, the servers are set up to spread the traffic load with a static distribution among four of the active cluster nodes while the fifth node is passive and purposed to take over in the case that any of the other nodes go down (e.g. become inoperable or crash). -The actual tracing was performed with a tracing server connected to a switch outfitted with a packet duplicating element as shown in the topology diagram in Figure~\ref{fig:captureTopology}. A 10~Gbps network tap was installed in the file server switch, allowing our storage server to obtain a copy of all network traffic going to the 5 file servers. The reason for using 10~Gbps hardware is to help ensure that the system is able to capture and all information on the network at peak theoretical throughput. +The actual tracing was performed with a tracing server connected to a switch outfitted with a packet duplicating element as shown in the topology diagram in Figure~\ref{fig:captureTopology}. A 10~Gbps network tap was installed in the file server switch, allowing our storage server to obtain a copy of all network traffic going to the 5 file servers. The reason for using 10~Gbps hardware is to help ensure that the system is able to capture information on the network at peak theoretical throughput. \subsection{High-speed Packet Capture} \label{Capture} @@ -321,7 +325,7 @@ The filesize used was in a ring buffer where each file captured was 64000 kB. The \texttt{.pcap} files from \texttt{tshark} do not lend themselves to easy data analysis, so we translate these files into the DataSeries~\cite{DataSeries} format. HP developed DataSeries, an XML-based structured data format, that was designed to be self-descriptive, storage and access efficient, and highly flexible. The system for taking captured \texttt{.pcap} files and writing them into the DataSeries format (i.e. \texttt{.ds}) does so by first creating a structure (based on a pre-written determination of the data desired to capture). Once the code builds this structure, it then reads through the capture traffic packets while dissecting and filling in the prepared structure with the desired information and format. -Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only that information which illuminates the behavior of the clients and servers that interact over the network (i.e. I/O transactions). It should also be noted that all sensitive information being captured by the tracing system is hashed to protect the users whose information is examined by the tracing system. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet - enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure security of the university's network communications. It is worth noting that in the case of larger SMB headers, some information is lost, however this is a trade-off by the university to provide, on average, the correct sized SMB header but does lead to scenarios where some information may be captured incompletely. 
+Due to the fundamental nature of this work, there is no need to track every piece of information that is exchanged, only that information which illuminates the behavior of the clients and servers that interact over the network (i.e. I/O transactions). It should also be noted that all sensitive information being captured by the tracing system is hashed to protect the users whose information is examined by the tracing system. Furthermore, the DataSeries file retains only the first 512 bytes of the SMB packet - enough to capture the SMB header information that contains the I/O information we seek, while the body of the SMB traffic is not retained in order to better ensure security of the university's network communications. It is worth noting that in the case of larger SMB headers, some information is lost; however, this is a trade-off accepted by the university to provide, on average, the correctly sized SMB header, though it does lead to scenarios where some information may be captured incompletely. \textcolor{blue}{This scenario only occurs in the case of large AndX Chains in the SMB protocol, since the SMB header for SMB2 is fixed at 72 bytes. In those scenarios the AndX messages specify only a single SMB header, with the rest of the AndX Chain attached in a series of block pairs.} \subsection{DataSeries Analysis} @@ -442,6 +446,8 @@ Each SMB Read and Write command is associated with a data request size that indi Figures~\ref{fig:PDF-Bytes-Read} and~\ref{fig:PDF-Bytes-Write} show the probability density function (PDF) of the different sizes of bytes transferred for read and write I/O operations respectively. The most noticeable aspect of these graphs are that the majority of bytes transferred for read and write operations is around 64 bytes. It is worth noting that write I/O also have a larger number of very small transfer amounts. This is unexpected in terms of the amount of data passed in a frame. Our belief is that this is due to a large number of long term calculations/scripts being run that only require small but frequent updates. This assumption was later validated in part when examining the files transferred, as some were related to running scripts creating a large volume of files, however the more affirming finding was the behavior observed with common applications. For example, it was seen that Microsoft Word would perform a large number of small reads at ever growing offsets. This was interpreted as when a user is viewing a document over the network and Word would load the next few lines of text as the user scrolled down the document; causing ``loading times'' amid use. A large degree of small writes were observed to be related to application cookies or other such smaller data communications. %This could also be attributed to simple reads relating to metadata\textcolor{red}{???} +\textcolor{blue}{Reviewing the SMB and SMB2 specifications leads to some confusion in understanding this behavior. According to the specification, the default ``MaxBuffSize'' for reads and writes should be between 4,356 bytes and 16,644 bytes depending on the use of either a client or server version of Windows, respectively. In the SMB2 protocol specification, specific versions of Windows (e.g. Vista SP1, Server 2008, 7, Server 2008 R2, 8, Server 2012, 8.1, Server 2012 R2) disconnect if the ``MaxReadSize''/``MaxWriteSize'' value is less than 4096. However, further examination of the specification states that for SMB2 the read length and write length can be zero.
Thus, the specification appears to conflict with itself: the size must be greater than 4096, yet it is also allowed to be zero. It is this allowance of zero-length reads and writes in the protocol specification that supports the smaller read/write sizes seen in the captured traffic. Our assumption here is that the university's configuration allows for smaller traffic to be exchanged without disconnection for sizes smaller than 4096.} + %\begin{figure} % \includegraphics[width=0.5\textwidth]{./images/aggAvgBytes.pdf} % \caption{Average Bytes by I/O} @@ -554,7 +560,7 @@ files when files are modified. Furthermore, read operations account for the lar %~!~ Addition since Chandy writing ~!~% Most previous tracing work has not reported I/O response times or command latency which is generally proportional to data request size, but under load, the response times give an indication of server load. In Table~\ref{tbl:PercentageTraceSummary} we show a summary of the response times for read, write, create, and general commands. We note that most general (metadata) operations occur fairly frequently, run relatively slowly, and happen at high frequency. -Other observations of the data show that the number of writes is very close to the number of reads, although the write response time for their operations is very small - most likely because the storage server caches the write without actually committing to disk. Reads on the other hand are in most cases probably not going to hit in the cache and require an actual read from the storage media. Although read operations are only a few percentage of the total operations they have a the greatest average response time; more than general I/O. As noted above, creates happen more frequently, but have a slightly slower response time, because of the extra metadata operations required for a create as opposed to a simple write. +Other observations of the data show that the number of writes is very close to the number of reads, although the write response time for these operations is very small, most likely because the storage server caches the write without actually committing to disk. Reads, on the other hand, are in most cases probably not going to hit in the cache and require an actual read from the storage media. Although read operations are only a small percentage of the total operations, they have the greatest average response time; more than general I/O. As noted above, creates happen more frequently, but have a slightly slower response time, because of the extra metadata operations required for a create as opposed to a simple write. % Note: RT + IAT time CDFs exist in data output @@ -870,7 +876,7 @@ txt & 167827 & 0.08 \\ \hline For simulations and analytic modeling, it is often useful to have models that describe storage systems I/O behavior. In this section, we attempt to map traditional probabilistic distributions to the data that we have observed. Specifically, taking the developed CDF graphs, we perform curve fitting to determine the applicability of Gaussian and Weibull distributions to the the network filesystem I/O behavior. Note that an exponential distribution, typically used to model interarrival times and response times, is a special case of a Weibull distribution where $k=1$. -Table~\ref{tbl:curveFitting} shows best-fit parametrized distributions for the measured data. The error bounds give an indication of how well the model fits the CDF. % along with $R^2$ fitness values. +Table~\ref{tbl:curveFitting} shows best-fit parametrized distributions for the measured data. % along with $R^2$ fitness values.
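To make the curve-fitting step concrete, a minimal sketch follows. It is not the authors' pipeline (the source comments reference MATLAB curve fitting); the scipy-based fitting, the synthetic Weibull stand-in data, and its 0.9/750 parameters are assumptions used only to illustrate estimating the Gaussian ($\mu$, $\sigma$) and Weibull ($k$, $\lambda$) parameters reported in Table~\ref{tbl:curveFitting}.

# Hedged illustration only: one way to obtain Gaussian and Weibull parameters
# comparable to Table "curveFitting" from a list of measured inter-arrival or
# response times (in microseconds). The input data below is synthetic.
import numpy as np
from scipy import stats

def fit_distributions(samples):
    """Fit Gaussian (mu, sigma) and two-parameter Weibull (k, lambda)."""
    samples = np.asarray(samples, dtype=float)
    mu, sigma = stats.norm.fit(samples)
    # Fix the Weibull location at zero so only shape k and scale lambda remain.
    k, _, lam = stats.weibull_min.fit(samples, floc=0)
    return (mu, sigma), (k, lam)

def empirical_cdf(samples):
    """Return sorted samples and their empirical CDF values."""
    xs = np.sort(np.asarray(samples, dtype=float))
    return xs, np.arange(1, len(xs) + 1) / len(xs)

if __name__ == "__main__":
    # Synthetic stand-in for measured write IATs (microseconds); assumed values.
    rng = np.random.default_rng(0)
    data = rng.weibull(0.9, 50000) * 750.0

    (mu, sigma), (k, lam) = fit_distributions(data)
    xs, ecdf = empirical_cdf(data)
    weibull_cdf = 1.0 - np.exp(-(xs / lam) ** k)

    print(f"Gaussian: mu={mu:.2f}, sigma={sigma:.2f}")
    print(f"Weibull:  k={k:.4f}, lambda={lam:.4f}")
    print(f"max |ECDF - Weibull CDF| = {np.max(np.abs(ecdf - weibull_cdf)):.4f}")

Fixing the Weibull location at zero keeps the fit in the two-parameter form used in the table, and comparing the fitted CDF against the empirical CDF gives a quick sense of goodness of fit.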
%Based on the collected IAT and RT data, the following are the best fit curve representation equations with supporting $R^{2}$ values. In the case of each, it was found that the equation used to model the I/O behavior was a Gaussian equation with a single term. %\begin{equation} f(x) = a_1 * e^{-((x-b_1)/c_1)^2)} \end{equation} @@ -893,8 +899,8 @@ Model & \multicolumn{3}{|c|}{Gaussian} CDF & \multicolumn{3}{|c|}{$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{x-\mu}{\sigma}}e^{\frac{-t^2}{2}}dt$} & \multicolumn{3}{|c|}{$1 - e^{(-x/\lambda)^k}$} \\ \hline \hline I/O Operation & $\mu$ & \multicolumn{2}{|c|}{$\sigma$} & $k$ & \multicolumn{2}{|c|}{$\lambda$} \\ \hline -General IAT & 786.72$\pm$2.79 & \multicolumn{2}{|c|}{10329.6$\pm$2} & 0.9031$\pm$0.0002 & \multicolumn{2}{|c|}{743.2075$\pm$0.2341} \\ General RT & 3606.66$\pm$742.44 & \multicolumn{2}{|c|}{2.74931e+06$\pm$530} & 0.5652$\pm$0.0001 & \multicolumn{2}{|c|}{980.9721$\pm$0.4975} \\ +General IAT & 786.72$\pm$2.79 & \multicolumn{2}{|c|}{10329.6$\pm$2} & 0.9031$\pm$0.0002 & \multicolumn{2}{|c|}{743.2075$\pm$0.2341} \\ Read RT & 44718.5$\pm$11715 & \multicolumn{2}{|c|}{1.72776e+07$\pm$8300} & 0.0004$\pm$0.0 & \multicolumn{2}{|c|}{1.5517$\pm$0.0028} \\ Read IAT & 24146$\pm$8062 & \multicolumn{2}{|c|}{1.189e+07$\pm$5700} & 0.0005$\pm$0.0 & \multicolumn{2}{|c|}{3.8134$\pm$0.0057} \\ Write RT & 379.823$\pm$2.809 & \multicolumn{2}{|c|}{4021.72$\pm$1.99} & 0.8569$\pm$0.0004 & \multicolumn{2}{|c|}{325.2856$\pm$0.2804} \\ @@ -934,9 +940,9 @@ $\mu$, $\sigma$, $k$, and $\lambda$ Values for Curve Fitting Equations on CDF Gr %Examination of the Response Time (RT) and Inter Arrival Times (IAT) revealed the speed and frequency with which metadata operations are performed, as well as the infrequency of individual users and sessions to interact with a given share. %% NEED: Run the matlab curve fitting to complete this section of the writing -The curve-fitting data shows that the use of an exponential distribution to model network interarrival and response times is still valid. One should notice that the Gaussian distributions +Our comparison shows that the existing standard use of an exponential distribution to model network interarrival and response times is still valid. One should notice that the Gaussian distributions % had better $R^2$ result than the exponential equivalent for write operations. This is not surprising due to the step-function shape of the Figure~\ref{fig:CDF-RT-Write} CDF. Examining the $R^2$ results for the read + write I/O operations we find that the exponential distribution is far more accurate at modeling this combined behavior. -for write and create operations are similar, while those for read operations are not. Furthermore there is less similarity between the modeled behavior of general operation inter arrival times and their response times, showing the need for a more refined model for each aspect of the network filesystem interactions. +for write and create operations are similar, while those for read operations are not. Furthermore, there is less similarity between the modeled behavior of general operation inter-arrival times and their response times, showing the need for a more refined model for each aspect of the network filesystem interactions. One should also notice that the general operation model most closely resembles that of the creates.
This makes sense since the influence of create operations is found to dominate the I/O behavior of the network filesystem, which aligns well with the number of existing close operations. %improves the ability of a exponential distribution to model the combined behavior.} @@ -969,7 +975,7 @@ An other concern was whether or not the system would be able to function optimal %About Challenges of system While the limitations of the system were concerns, there were other challenges that were tackled in the development of this research. -One glaring challenge with building this tracing system was using code written by others; tshark \& DataSeries. While these programs are used within the tracing structure there are some issues when working with them. These issues ranged from data type limitations of the code to hash value and checksum miscalculations due to encryption of specific fields/data. Attempt was made to dig and correct these issues, but they were so inherent to the code being worked with that hacks and workarounds were developed to minimize their effect. Other challenges centralize around selection, interpretations and distribution scope of the data collected. Which fields should be filtered out from the original packet capture? What data is most prophetic to the form and function of the network being traced? What should be the scope, with respect to time, of the data being examined? Where will the most interesting information appear? As each obstacle was tackled, new information and ways of examining the data reveal themselves and with each development different alterations \& corrections are made. +One glaring challenge in building this tracing system was using code written by others: tshark and DataSeries. While these programs are used within the tracing structure, there are some issues when working with them. These issues ranged from data type limitations of the code to hash value and checksum miscalculations due to encryption of specific fields/data. Attempts were made to identify and correct these issues, but they were so inherent to the code being worked with that hacks and workarounds were developed to minimize their effect. Other challenges centered around the selection, interpretation, and distribution scope of the data collected. Which fields should be filtered out from the original packet capture? What data is most indicative of the form and function of the network being traced? What should be the scope, with respect to time, of the data being examined? Where will the most interesting information appear? As each obstacle was tackled, new information and ways of examining the data revealed themselves, and with each development different alterations and corrections were made. Even when all the information is collected and the most important data has been selected, there is still the issue of what lens should be used to view this information. Because the data being collected is from an active network, there will be differing activity depending on the time of day, week, and scholastic year. For example, although the first week or so of the year may contain a lot of traffic, this does not mean that trends of that period of time will occur for every week of the year (except perhaps the final week of the semester). The trends and habits of the network will change based on the time of year, time of day, and even depend on the exam schedule.
Truly interesting examination of data requires looking at all different periods of time to see how all these factors play into the communications of the network. % DataSeries Challenge
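As a small, hypothetical illustration of the period-by-period comparison described above (not part of the tracing system; the event timestamps below are made up), the sketch buckets per-event epoch timestamps by hour of day and weekday so that activity in different parts of a day, week, or semester can be contrasted.

# Illustrative sketch only: bucket I/O event timestamps into hour-of-day and
# day-of-week counters to compare activity across different periods of time.
from collections import Counter
from datetime import datetime, timezone

def bucket_by_period(timestamps):
    """Count events per hour of day (0-23) and per weekday (Mon=0 .. Sun=6)."""
    by_hour, by_weekday = Counter(), Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        by_hour[dt.hour] += 1
        by_weekday[dt.weekday()] += 1
    return by_hour, by_weekday

if __name__ == "__main__":
    # Hypothetical event timestamps (seconds since the epoch), assumed values.
    events = [1580000000 + i * 137 for i in range(5000)]
    hours, weekdays = bucket_by_period(events)
    print("busiest hour of day:", max(hours, key=hours.get))
    print("events per weekday:", dict(sorted(weekdays.items())))

Comparing such per-period counts (and the corresponding IAT/RT distributions within each bucket) is one straightforward way to check whether trends seen in one week of the semester hold for the rest of the year.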