diff --git a/TracingPaper.tex b/TracingPaper.tex
index 2ab4ca3..1038094 100644
--- a/TracingPaper.tex
+++ b/TracingPaper.tex
@@ -144,8 +144,7 @@ The trace analysis in performed by an analysis module code that both processes t
 \subsection{ID Tracking}
 \label{ID Tracking}
-\textit{\textbf{Note:} It should be noted that this system is currently not implemented due to the poorly written way in which it was implemented. The new concept for this ID tracking is to combine the MID/PID/TID/UID tuple tracking along with FID tracking to know what files are opened, by whom (i.e. tuple identification), and tracking of file sizes for files that are created with an initial size of zero. The purpose for this tracking will be to track the habits of individual users. While initially simplistic (drawing a connection between FIDs and tuple IDs) this aspect of the research will be developed in future work to be move inclusive.} \\
-All comands sent over the network are coupled to an identifying MID/PID/TID/UID tuple. Since the only commands being examined are read or write commands, the identifying characteristic distinguishing a request command packet from a reponse command packet is the addition of an FID field with the sent packet. It is examination of the packets for this FID field that allows the analysis code to distinguish between request \& response command pakets. The pairing is done by examining the identifying tuple and assuming that each tuple-identified system will only send one command at a time (awaiting a response before sending the next command of that same type).
+All commands sent over the network are coupled to an identifying MID/PID/TID/UID tuple. This identifying tuple is coupled with FID information in order to track the different files that are exchanged or written over the network. The purpose of this tracking is to profile the habits of individual users. While initially simplistic (drawing a connection between FIDs and tuple IDs), this aspect of the research is easily expandable to other facets of communication. \\One could also track the amount of sharing that occurs between users. The PID is the process identifier, and the MID is the multiplex identifier, which is set by the client and is used to identify groups of commands belonging to the same logical thread of operation on the client node. \\The per-client process ID can be used to map the activity of given programs, thus allowing for finer granularity in the produced benchmark (e.g. control down to the process types run by individual clients). Other features of interest are the time between an open \& a close, or how many opens/closes occurred in a given window of time. This information could be used as a gauge of current-day trends in filesystem usage \& their consequent taxation on the surrounding network. It would also allow for greater insight into the read/write habits of users on a network, along with a rough comparison against other registered events that occur on the network. Lastly, though no less important, it would allow us to look at how often files are shared between different users. One must note that the type of sharing may differ, and there can be issues of resource locking (e.g. on shared files) that need to be taken into account. This is preliminarily addressed by monitoring any oplock flags that are sent for reads \& writes. This information also helps provide a preliminary mapping of how the network is used and what sort of traffic populates the communication.
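+To make the intended pairing concrete, the following is a minimal sketch of the bookkeeping described above; it assumes per-packet records that already expose the tuple and FID fields, and the record and method names used here are illustrative assumptions rather than the analysis module's actual interface.
+\begin{verbatim}
+# Minimal sketch: map each MID/PID/TID/UID tuple to the FIDs it touches so
+# that per-user file activity (and zero-size file creations) can be profiled.
+from collections import defaultdict
+
+class IDTracker:
+    def __init__(self):
+        self.files_by_tuple = defaultdict(set)  # (mid, pid, tid, uid) -> {fid}
+        self.created_empty = set()              # FIDs created with size zero
+
+    def observe(self, mid, pid, tid, uid, fid, command, size=None):
+        key = (mid, pid, tid, uid)
+        self.files_by_tuple[key].add(fid)
+        if command == "create" and size == 0:
+            self.created_empty.add(fid)
+
+    def shared_fids(self):
+        # FIDs seen under more than one tuple, i.e. candidate shared files.
+        owners = defaultdict(set)
+        for key, fids in self.files_by_tuple.items():
+            for fid in fids:
+                owners[fid].add(key)
+        return {fid for fid, keys in owners.items() if len(keys) > 1}
+\end{verbatim}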
@@ -170,9 +169,7 @@ While the limitations of the system were concerns, there were other challenges t
 One glaring challenge with building this tracing system was using code written by others; tshark \& DataSeries. While these programs are used within the tracing structure there are some issues when working with them. These issues were unfortunately so inherent to the code that hacks and workarounds were developed to minimize their effect. Other challenges centralize around selection, interpretations and distribution scope of the data collected. A researcher must select which information is important or which information will give the greatest insight to the workings on the network. Too little information can lead to incorrect conclusions being drawn about the workings on the system, while too much information (and not knowing which information is pertinent) can lead to erroneous conclusions as well. There is a need to strike a balance between what information is important enough to capture (so as not to slow down the capturing process through needless processing) while still obtaining enough information to acquire the bigger picture of what is going on. Every step of the tracing process requires a degree of human input to decide what network information will end up providing the most complete picture of the network communication and how to interpret that data into meaningful graphs and tables. This can lead to either finds around the focus of the work being done, or even lead to discoveries of other phenomena that end up having far more impact on the overall performance of the system~\cite{Ellard2003}.
 %About interpretation of data
-To some degree these interpretations are easy to make (e.g. file system behavior \& user behavior~\cite{Leung2008}) while others are more complicated (e.g. temporal scaling of occurances of read/write), but in all scenarios there is still the requirment for human interpretation of the data. While having humans do the interpretations can be adventageous, a lack of all the "background" information can also lead to incorrectly interpreting the information.
-
-Because the data being collected is from an active network, there will be differing activity depending on the time of day, week, and scholastic year. For example, although the first week or so of the year may contain a lot of traffic, this does not mean that trends of that period of time will occur for every week of the year (except perhaps the final week of the semester). The trends and habits of the network will change based on the time of year, time of day, and even depend on the exam schedule. Truly interesting examination of data requires looking at all different periods of time to see how all these factors play into the communications of the network.
+To some degree these interpretations are easy to make (e.g. file system behavior \& user behavior~\cite{Leung2008}) while others are more complicated (e.g. temporal scaling of occurrences of reads/writes), but in all scenarios there is still the requirement for human interpretation of the data. While having humans do the interpretations can be advantageous, a lack of all the ``background'' information can also lead to incorrectly interpreting the data. The trends and habits of the network will change based on the time of year, the time of day, and even the exam schedule. A truly interesting examination of the data requires looking at all of these different periods to see how each factor plays into the communications of the network.
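+As a concrete illustration of the selection problem discussed above, the sketch below pulls only a handful of SMB header fields out of a capture with tshark; the particular field list is an assumption chosen for illustration rather than the set used by the capture module.
+\begin{verbatim}
+# Illustrative only: extract a small, chosen subset of SMB header fields
+# from a capture file with tshark (assumed to be installed and on PATH).
+import subprocess
+
+FIELDS = ["frame.time_epoch", "smb.cmd", "smb.pid", "smb.mid",
+          "smb.tid", "smb.uid", "smb.fid"]
+
+def extract_fields(pcap_path):
+    cmd = ["tshark", "-r", pcap_path, "-Y", "smb", "-T", "fields",
+           "-E", "separator=,"]
+    for field in FIELDS:
+        cmd += ["-e", field]
+    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
+    # One line per SMB packet; an empty column means the field was absent.
+    return [line.split(",") for line in out.stdout.splitlines()]
+\end{verbatim}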
 \section{Trace Analysis}
 \label{Trace Analysis}
@@ -182,13 +179,16 @@ The trace analysis is performed by an AnalysisModule code that both processes th
 \label{System Information and Predictions}
 It is important to detail out any benchmakring system so that when the results of one's research are being examined, they can be properly understood with the correct background information and understanding that lead the originating author to those results~\cite{Traeger2008}. The following is an explination the UITS system from which Trace1 pulls it's packet traffic information along with predicitions of how the data will look along with the reasoning behind the shape of the information.
-The UITS system consisnts of five Microsoft file server cluster nodes. These blade servers are used to host home directories for all UConn users within a list of 88 departments. These home directories are used to provide personal drive share space to facultiy, staff and students, along with at lest one small group of users. Each server is capable of handling 1Gb/s of traffic in each direction (e.g. outbound and inbound traffic). All together the five blade server system can in theory handle 10Gb/s of recieving and tranmitting data. Some of these blade servers have local storage but the majority do not have any. To the understanding of this paper, the blade servers are purposed purely for dealing with incoming traffic to the SAN storage nodes that sit behind them. This system does not currently implement load balancing, instead the servers are set up to spread the traffic load among four of the active cluster nodes while the fifth node is passive and purposed to take over in the case that any of the other nodes go down (e.g. become inoperable or crash). \\
+The UITS system consists of five Microsoft file server cluster nodes. These blade servers are used to host home directories for all UConn users within a list of 88 departments. These users include faculty, staff and students, along with at least one small group of users. Each server is capable of handling 1Gb/s of traffic in each direction (i.e. outbound and inbound traffic). Altogether the five blade server system can in theory handle 10Gb/s of receiving and transmitting data, but only four of the servers should be active at any given time. Some of these blade servers have local storage but the majority do not have any. To the understanding of this paper, the blade servers are purposed purely for dealing with incoming traffic to the SAN storage nodes that sit behind them. This system does not currently implement load balancing; instead, the servers are set up to spread the traffic load among the four active cluster nodes while the fifth node is passive and purposed to take over in the case that any of the other nodes go down (e.g. become inoperable or crash). \\
 From the information at hand I theorized the following behaviors and attributes that would be seen in the system. First are the predictions based on what was learned from talking to people within UITS, after that are my general predictions.
-From conversations with UITS, the understanding of the file server system behavior is that there are spikes of traffic that tend to happen during the night time. Our assumption is that the majority of this traffic will occur between 2am and 6am because this is when backups occur to the SAN system.
-It is important to note that, however, it is not expected that we would see any of this traffic as the traffic would be encrypted; therefore the protocol used is not the SMB/CIFS protocol that is being analyzed by this paper.
+From conversations with UITS, the understanding of the file server system behavior is that there are spikes of traffic that tend to happen during the night. Our assumption is that the majority of this traffic will occur between 2am and 6am, because this is when backups to the SAN system occur, while additional back-up traffic for application servers will occur between 12am and 4am, though the starting time can vary. These application server back-ups are performed incrementally, starting with a random 50 servers, and are performed with device-specific configurations.
+It is important to note, however, that we do not expect to see any of this SAN back-up traffic, as it would be encrypted; therefore the protocol used is not the SMB/CIFS protocol being analyzed by this paper. That said, while the SAN back-up may be encrypted, other application server back-ups may still appear in the captured communications.
 Furthermore, any traffic that does occur during the duration of "day time hours" (i.e. 9am to 5pm) would be solely due to the actions taken by the users of this system (e.g faculty, staff, students). If there is an automatic backup manager we would expect to see traffic caused by it pulling cached information from the machine of users across the network. \\
-One general assumption is that these blade servers should not ever fail, thus the greatest transfer rate observed should be 8Gb/s (i.e. 1Gb/s per server). If there is a greater rate than that this means that the fifth server is aiding rather than acting as backup should another server fail.
+One general assumption is that these blade servers should never fail; thus the greatest transfer rate observed should be 8Gb/s (i.e. 1Gb/s in each direction per active server). If a greater rate than that is observed, it means that the fifth server is aiding rather than acting as a backup should another server fail. \\
+
+During the three weeks of data observation there was a moratorium on changes/alterations to the system; therefore all observed traffic is assumed to be only regular usage and the software managing network communications.
 \textit{\textbf{SMB Expectations}}: SMB will be heavily used by students to access their network accounts from any networked computer, along with network access to shared file systems and connected printers. Oplocks will be in heavy use and cause a slowdown of the system for multiuser shared storage space. Authentication of network computers could bottleneck during moments of high traffic (e.g. students all logging in for a class).
@@ -204,7 +204,7 @@ When examinging the data produced from this research, one has to look for a limi
 \section{Intuition Confirm/Change}
 \label{Intuition Confirm/Change}
-In order ot interpret the data being analyzed and dissected, the first step was to understand how to pair the byte throughput and IO event frequency into an understanding of the system. This was achieved by including examination of the data relative to the surrounding behavior.
-Pairing the information in this manner shows not only the bytes \& IO behavior but preliminary understanding of how much throughput is being generated by each IO event; giving an outline of client behvaior on the system.\\
+In order to interpret the data being analyzed and dissected, the first step was to understand how to pair the byte throughput and IO event frequency into an understanding of the system. This was achieved by examining the data relative to the surrounding behavior. Pairing the information in this manner shows not only the bytes \& IO behavior but also a preliminary understanding of how much throughput is being generated by each IO event, giving an outline of client behavior on the system. This information, compared with the observed buffer sizes, allows for improved interpretation of the network communication.\\
 \begin{figure*}
 \includegraphics[width=\textwidth,height=4cm]{./images/IOWeek.png}
@@ -238,15 +238,14 @@ Different behavioral situations (seen using pairing of bytes and IO graphs/behav
 \item Bytes - Amount of data being passed
 \item IOs - Number of interactions occurring on network
 \end{itemize}
-Three scenarios observed were (1) a large number of IOs and small number of bytes, (2) a small number of IOs and large number of bytes, and (3) similar number of IOs and bytes transferred. This data is interpreted as follows, the number of bytes directly correlates to the data being transferred over the wire and the number of IOs correlates to the number of clients interacting over the network. Scenario (1) has high client interaction where expected bottlenecks would be due to management software dealing with the large volume of client interactions. Race conditions and age of requests could be an issue, but ideally OpLocks and SMB’s internal management tackle these issues. Scenario (2) has high traffic throughput, where any bottleneck would most likely be due to physical limitations of the system; wires, switches, etc. The last scenario (3) has equal interactions of each, where if both are high the system will taxed and bottlenecks could occur in multiple aspects or both are low and the system will be in a relaxed state. \textbf{ADD FINDINGS FROM BUFFER SIZES; e.g. ~4K size.}
-%Test test of ref~\ref{fig:IOWeek} but not of the other ref \ref{fig:BytesWeek}.
+Three scenarios were observed: (1) a large number of IOs and a small number of bytes, (2) a small number of IOs and a large number of bytes, and (3) a similar number of IOs and bytes transferred. This data is interpreted as follows: the number of bytes directly correlates to the data being transferred over the wire, and the number of IOs correlates to the number of clients interacting over the network. Scenario (1) has high client interaction, where the expected bottlenecks would be due to management software dealing with the large volume of client interactions. Race conditions and the age of requests could be an issue, but ideally OpLocks and SMB’s internal management tackle these issues. Scenario (2) has high traffic throughput, where any bottleneck would most likely be due to physical limitations of the system (wires, switches, etc.). The last scenario (3) has equal proportions of each: if both are high the system will be taxed and bottlenecks could occur in multiple aspects, while if both are low the system will be in a relaxed state.
+It was discovered that the median buffer size exchanged over the network was 4096 bytes (i.e. 4K).
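+The sketch below illustrates how such per-interval byte/IO pairings and this median could be derived from the processed trace records; the record layout (a timestamp and a byte count per read/write IO) is an assumption made for illustration.
+\begin{verbatim}
+# Sketch: pair per-interval byte totals with IO counts and compute the
+# median per-IO buffer size. The (timestamp, nbytes) record layout is assumed.
+import statistics
+from collections import defaultdict
+
+def summarize(records, window=3600):
+    """records: iterable of (epoch_seconds, nbytes), one per read/write IO."""
+    bytes_per_slot = defaultdict(int)
+    ios_per_slot = defaultdict(int)
+    sizes = []
+    for ts, nbytes in records:
+        slot = int(ts) // window          # bucket IOs into fixed windows
+        bytes_per_slot[slot] += nbytes
+        ios_per_slot[slot] += 1
+        sizes.append(nbytes)
+    pairing = {slot: (bytes_per_slot[slot], ios_per_slot[slot],
+                      bytes_per_slot[slot] / ios_per_slot[slot])
+               for slot in ios_per_slot}  # bytes, IOs, bytes-per-IO
+    return pairing, statistics.median(sizes)
+\end{verbatim}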
+This small median size leads to the understanding that the majority of any network-based bottlenecks will be due to a large number of small communications. Paired with the knowledge that the interacting systems are individually configured, this supports the idea that a good deal of ``back-and-forth'' traffic is being created during back-up processes.
 \section{Conclusion}
 \label{Conclusion}
-\textit{Do the results show a continuation in the trend of traditional computer science workloads?}
-On the outset of this work it was believed that the data collected and analyzed would follow similar behavior patterns seen in previous papers ~\cite{Douceur1999, RuemmlerWilkes1993, Bolosky2007, EllardLedlie2003}. The expectation is that certain aspect of the data, such as transfer/buffer sizes, will produce a bell-shape and be centralized around a larger size than previous papers' findings. The number of I/O operations was expected to peak during nocturnal hours and fall during day time hours. On top of that the expectation is that a greater number of reads will be seen over the course of a day, where the majority of writes will occur near the expected times of UITS' backup (e.g. 2am to 6am). Granted, one must recall that the expectation is that any backup traffic that is seen will be due to a fetching of user's caches in order to preserve fidelity of any shared data.\\
+At the outset of this work it was believed that the data collected and analyzed would follow behavior patterns similar to those seen in previous papers~\cite{Douceur1999, RuemmlerWilkes1993, Bolosky2007, EllardLedlie2003}. The expectation was that certain aspects of the data, such as transfer/buffer sizes, would produce a bell shape and be centered around a larger size than previous papers' findings. The number of I/O operations was expected to peak during nocturnal hours and fall during daytime hours. On top of that, the expectation was that a greater number of reads would be seen over the course of a day, while the majority of writes would occur near the expected times of UITS' backups (e.g. 2am to 6am). Granted, one must recall that the expectation is that any backup traffic that is seen will be due to a fetching of users' caches in order to preserve fidelity of any shared data (e.g. traffic produced by incremental back-ups).\\
 One oddity was that during the day one would see a greater increase in writes instead of reads. The first assumption was that this is due to the system and how users interact with everything.
-I believe that the greater number of writes comes from students doing intro work for different classes in which students are constantly saving their work while reading instructions from a single source. The early traffic is most likely due to professors preparing for classes. One must also recall that this data itself has limited interpretation because only a small three week windows of information is being examined. A better, and far more complete, image can be constructed using data captured from the following months, or more ideally, from an entire year's worth of data. Another limitation of the results is the scope of the analysis is curbed and does not yet fully dissect all of the fields being passed in network communication.
+I believe that the greater number of writes comes from students doing intro work for different classes, in which students are constantly saving their work while reading instructions from a single source. The early traffic is most likely due to professors preparing for classes.
+A majority of the late night/early morning write traffic is due to the back-ups occurring, though these can occur throughout the day. One must also recall that this data itself has limited interpretation because only a small three-week window of information is being examined. A better, and far more complete, image can be constructed using data captured from the following months, or more ideally, from an entire year's worth of data. Another limitation of the results is that the scope of the analysis is curbed and does not yet fully dissect all of the fields being passed in network communication.
 The future work of this project would be to incorporate more fields from the SMB/CIFS specifications in order to depict a fuller picture of the network system. Additional alterations would be a work-around allowing for DataSeries files that incorporate an entire day's worth of trafafic, for the purpose of distribution of the code, modulation of the captruing software to incorporate additional protocols, better automation of code elements in order to reduce the potential for human error causing loss of data, and incorporation of DataSeries version changes to tackle data primitive translation issues. The hope is to further the progress made with benchmarks \& tracing in the hope that it too will lend to improving and deepening the knowledge and understanding of these systems so that as a result the technology and methodology is bettered as a whole.