%
% flow design
%

\documentclass[12pt]{article} % FORMAT CHANGE
\usepackage[dvips]{graphicx}
\usepackage{times}

\graphicspath{{./}{figs/}}

%
% GET THE MARGINS RIGHT, THE UGLY WAY
%
% \topmargin 0.2in
% \textwidth 6.5in
% \textheight 8.75in
% \columnsep 0.25in
% \oddsidemargin 0.0in
% \evensidemargin 0.0in
% \headsep 0.0in
% \headheight 0.0in

\pagestyle{plain}

\addtolength{\hoffset}{-2cm}
\addtolength{\textwidth}{4cm}

\addtolength{\voffset}{-1.5cm}
\addtolength{\textheight}{3cm}

\setlength{\parindent}{0pt}
\setlength{\parskip}{12pt}

\title{Flow Design Document}
\author{PVFS Development Team}
\date{July 2002}

\begin{document}

\maketitle

\section{TODO}

\begin{itemize}
\item point to some other document for explanation of concepts common to all
pvfs2 I/O interfaces (contexts, max idle time, test semantics, etc.)
\end{itemize}

\section{Concepts and Motivation}

Flows are a high-level model for how PVFS2 system components
will perform I/O. They are designed to abstractly but efficiently move
data from source to destination, where source and destination may be
defined as storage devices, network devices, or memory regions.

Features include:

\begin{itemize}
\item \emph{Combining I/O mechanisms} The flow interface combines
network I/O and disk I/O into a single framework with a scheduler that takes
both into account.
\item \emph{Multiple protocols}
Actual I/O is carried out underneath the flow interface by
\emph{flow protocols}. We may implement several different
protocols (using different I/O or buffering techniques, for example)
which can be switched at runtime.
\item \emph{Simple interface} The application interface to
this system will be as high level and simple as possible. Device
selection, scheduling, buffer management, and request pattern
processing will be transparent to the flow user.
\item \emph{Datatypes} Flows allow the user to specify both memory
and file datatypes (similar to those used in MPI), and will handle
breaking down these datatypes into a format that can be used by lower
level I/O interfaces.
\end{itemize}

\section{Flows}

\subsection{Overview}

A flow describes a movement of data. The data always moves from a single
source to a single destination. There may be (and almost always will be)
multiple flows in progress at the same time for different locations,
in particular for clients that are talking simultaneously to several
servers, or servers that are handling simultaneous I/O requests.

At the highest level of abstraction, it is important that a flow describes
a movement of data in terms of ``what to do'' rather than ``how to do
it''. For example, when a user sets up a flow, it may indicate that the
first 100 bytes of a file on a local disk should be sent to
a particular host on the network. It will not specify what protocols to
use, how to buffer the data, or how to schedule the I/O. All of this will be
handled underneath the flow interface. The user just requests that a
high level I/O task be performed and then checks for completion until it
is done.

Note that the ``user'' in the above example is most likely a
system interface or server implementer in pvfs2. End users will
be unaware of this API.

A single flow created on a server will match exactly one flow on a
client. For example, if a single client performs a PVFS2 read, the
server will create a storage to network flow, and the client will
create a network to memory flow. If a client communicates with N servers
to complete an I/O operation, then it will issue N flows simultaneously.

Flows will not be used for exchanging request protocol messages
between the client and server (requests or acknowledgements).
They will only be used for data transfer. It is assumed that request
messages will be used for handshaking before or after the flow as
needed.

\subsection{Architecture}
\label{sec:arch}

There are two major parts of the flow architecture, as seen in figure
\ref{fig:flow-arch}. The first is the
\emph{flow interface}. Applications (i.e., PVFS components) interact with
this interface. It provides a consistent API regardless of what
protocols are in use, what scheduling is being performed, etc.

The second major component of the architecture is the \emph{flow
protocol}.
There may be many flow protocols active within one flow interface. Each
flow protocol implements communication between a different pair of data
endpoint types. For example, one flow protocol may link TCP/IP to
asynchronous unix I/O, while another may link VIA to memory regions. For
two separate hosts to communicate, they must share compatible flow
protocols (as indicated by the dotted line at the bottom of figure
\ref{fig:flow-arch}).

Flow protocols all adhere to a strict interface
and must provide the same expected functionality (which will be
described later). Flow protocols take care of details such as buffering
and flow control if necessary.

\begin{figure}
\begin{center}
\caption{Basic flow architecture \label{fig:flow-arch}}
\includegraphics[scale=0.6]{flow-arch.eps}
\end{center}
\end{figure}

\subsection{Describing flows}

Individual flows are represented using structures called \emph{flow
descriptors}. The source and destination of a given flow are represented
by structures called \emph{endpoints}. A flow descriptor may serve
many roles. First of all, when created by a flow interface user, it
describes an I/O task that needs to be performed. Once it is submitted
to the flow interface, it may track state or progress information.
When the descriptor is finally completed and returned to the user, it will
indicate the status of the completed flow, whether successful or in error.

Flow endpoints describe the memory, storage, or network locations for
the movement of data. All flow descriptors must have both a
source and a destination endpoint.

\subsection{Usage assumptions}

It is assumed that all flows in PVFS2 will be \emph{preceded} by a
PVFS2 request protocol exchange between the client and server. In
a file system read case, the client will send a read request, and
the server will send an acknowledgement (which among other things
indicates how much data is available to be read). In a file
system write case, the client will send a write request, and the
server will send an acknowledgement that indicates when it is safe
to begin the flow to send data to the server. Once the flow is completed,
a trailing acknowledgement alerts the client that the server has
completed the write operation.

The request protocol will transmit information such as file size and
distribution parameters that may be needed to coordinate remote flows.

\section{Data structures}

\subsection{Flow descriptor}
\label{sec:flow-desc}

Flow descriptors are created by the flow interface user. At this time,
the caller may edit the descriptor's fields directly. Once the flow has
been posted for service, however, the caller may only interact with the
descriptor through functions defined in the flow interface. It is not safe to
directly edit a flow descriptor while it is in progress.
Once a flow is complete, it is again safe to examine fields within
the descriptor (for example, to determine the status of the
completed flow).

Note that there is an endpoint specific to each type supported by
the flow interface (currently memory, BMI (network), and Trove
(storage)).

The following fields may be set by the caller prior to posting:
\begin{itemize}
\item src: source endpoint (BMI, memory, or Trove addressing information)
\item dest: destination endpoint (BMI, memory, or Trove addressing information)
\item tag: tag used to match up flows with particular operation sequences
\item user\_ptr: void* pointer reserved for use by the caller (may associate a
flow with some higher level state structure, for example)
\item type: specifies what kind of flow protocol to use for this flow
\item file\_req: file datatype (similar to MPI datatype)
\item file\_req\_offset: offset into the file datatype
\item mem\_req: memory datatype (similar to MPI datatype) (optional)
\item aggregate\_size: total amount of data the flow should transfer (optional)
\item file\_data: struct containing state information about the file to
access, used by the distribution subsystem
\end{itemize}

Special notes: Both the mem\_req and the aggregate\_size fields are optional.
However, at least one of them \emph{must} be set. Otherwise the flow has
no way to calculate how much data must be transferred.

The following fields may be read by the caller after completion of a flow:
\begin{itemize}
\item state: final state of flow (see enumerated values in flow.h)
\item error\_code: error code (nonzero if state indicates an error)
\item total\_transfered: amount of data moved by the flow
\end{itemize}

The following fields are reserved for use within the flow code:
\begin{itemize}
\item context\_id: specifies which flow level context the descriptor belongs to
\item flowproto\_id: internal identifier for the flow protocol used
\item priority: priority level of flow (unused as of yet)
\item sched\_queue\_link: for use by internal scheduler
\item flow\_protocol\_data: void* reserved for flow protocol use
\item file\_req\_state: current state of file datatype processing
\item mem\_req\_state: current state of memory datatype processing
\item result: result of each datatype processing iteration
\end{itemize}
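
As a rough illustration of the caller-visible fields listed above, the
following C sketch lays them out in a structure and encodes the two rules
stated in this section: the size of the transfer must be derivable from
mem\_req or aggregate\_size, and the caller may not edit a descriptor that is
in progress. The field names come from the lists above; every type, the
\texttt{sketch\_} prefix, the state names, and the use of a negative
aggregate\_size to mean ``unset'' are assumptions made for this example and
are not taken from flow.h.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real definitions live in flow.h. */
enum sketch_endpoint_type { SKETCH_EP_MEM, SKETCH_EP_BMI, SKETCH_EP_TROVE };

struct sketch_endpoint {
    enum sketch_endpoint_type type;
    void *addr_info;            /* opaque addressing information */
};

enum sketch_flow_state {
    SKETCH_FLOW_INITIAL,        /* allocated, not yet posted */
    SKETCH_FLOW_IN_PROGRESS,    /* posted, owned by the flow code */
    SKETCH_FLOW_COMPLETE,
    SKETCH_FLOW_ERROR
};

struct sketch_flow_descriptor {
    /* Set by the caller prior to posting. */
    struct sketch_endpoint src;
    struct sketch_endpoint dest;
    int32_t tag;
    void *user_ptr;
    int type;                   /* which flow protocol to use */
    const void *file_req;       /* file datatype */
    int64_t file_req_offset;
    const void *mem_req;        /* optional; NULL means unset */
    int64_t aggregate_size;     /* optional; negative means unset (assumed) */
    void *file_data;

    /* Read by the caller after completion. */
    enum sketch_flow_state state;
    int error_code;
    int64_t total_transfered;
};

/* Returns 1 if the flow can work out how much data to move. */
int sketch_flow_size_is_known(const struct sketch_flow_descriptor *fd)
{
    assert(fd != NULL);
    return fd->mem_req != NULL || fd->aggregate_size >= 0;
}

/* Returns 1 if the caller may touch the descriptor fields directly. */
int sketch_flow_caller_may_edit(const struct sketch_flow_descriptor *fd)
{
    assert(fd != NULL);
    return fd->state != SKETCH_FLOW_IN_PROGRESS;
}
```
\end{verbatim}

A caller following this sketch would check sketch\_flow\_size\_is\_known()
before posting, and would avoid touching the descriptor while
sketch\_flow\_caller\_may\_edit() returns zero.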

\section{Flow interface}

The flow interface is the set of functions that the flow user is allowed
to interact with. These functions allow you to do such things as create
flows, post them for service, and check for completion.

\begin{itemize}
\item \emph{PINT\_flow\_initialize()}: performs initial setup of flow
interface; must be called once before any other flow interface functions
\item \emph{PINT\_flow\_finalize()}: shuts down the flow interface
\item \emph{PINT\_flow\_alloc()}: creates a new flow descriptor
\item \emph{PINT\_flow\_free()}: frees up a flow descriptor that is
no longer needed
\item \emph{PINT\_flow\_reset()}: resets a previously used flow
descriptor to its initial state and values.
\item \emph{PINT\_flow\_set\_priority()}: sets the priority of a
particular flow descriptor. May be called even when a flow is in
service.
\item \emph{PINT\_flow\_get\_priority()}: reads the priority of a
particular flow descriptor. May be called even when a flow is in
service.
\item \emph{PINT\_flow\_post()}: submits a flow descriptor for
service
\item \emph{PINT\_flow\_setinfo()}: used to set optional
interface parameters.
\item \emph{PINT\_flow\_getinfo()}: used to read optional
interface parameters.
\end{itemize}

Three functions are provided to test for completion of posted flows:

\begin{itemize}
\item \emph{PINT\_flow\_test()}: tests for completion of a single
flow
\item \emph{PINT\_flow\_testsome()}: tests for completion of any
flows from a specified set of flows
\item \emph{PINT\_flow\_testcontext()}: tests for completion of any
flows that are in service in the interface
\end{itemize}

\section{Flow protocol interface}

The flow protocols are modular components capable of moving data between
particular types of endpoints. (See section \ref{sec:arch} for an
overview). Any flow protocol implementation must conform to a
predefined flow protocol interface in order to interoperate with the
flow system.

\begin{itemize}
\item \emph{flowproto\_initialize()}: initializes the flow
protocol (called exactly once before posting any flows)
\item \emph{flowproto\_finalize()}: shuts down the flow
protocol (forcefully terminating any pending flows)
\item \emph{flowproto\_post()}: posts a flow descriptor
\item \emph{flowproto\_find\_serviceable()}: returns an array of active
flows from the flow protocol that are either completed or in need of
service
\item \emph{flowproto\_service()}: performs work on a single
flow descriptor that is ready for service (as indicated by
flowproto\_find\_serviceable())
\item \emph{flowproto\_getinfo()}: reads optional parameters
from the protocol
\item \emph{flowproto\_setinfo()}: sets optional protocol
parameters
\end{itemize}

The following section describing the interaction between the flow component
and the flow protocols may be helpful in clarifying how the above functions
will be used.

\section{Interaction between flow component and flow protocols}

The flow code that resides above the flow protocols serves two primary
functions: multiplexing between the various flow protocols, and scheduling
work.

The multiplexing is handled by simply tracking all active flow protocols and
directing flow descriptors to the appropriate one.

The scheduling functionality is the more complicated of the two
responsibilities of the flow code. This responsibility leads to the
design of the flow protocol interface and the states of the flow descriptors.
In order to understand these states, it is important to understand that
flow protocols typically operate with a certain granularity that is
defined by the flow protocol implementation. For example, a flow protocol
may transfer 128 KB of data at a time. A simple implementation of a memory to
network flow may post a network send of 128 KB, wait for it to complete, then
post the next send of 128 KB, and so on. Each of these iterations is driven by
the top level flow component. In other words, the flow protocol is not
autonomous. Rather than work continuously once it receives a flow descriptor,
it only performs one iteration of work at a time, and then waits for the
flow component to tell it to continue. This provides the flow interface
with an opportunity to schedule flows and choose which ones to service at
each iteration.

When a flow descriptor is waiting for the flow component to allow it to
continue, then it is ``ready for service''. The flow component may
then call the flowproto\_service() function to allow it to continue. In
the above example, this would cause the flow protocol to post another network
send.

In order to discover which flow descriptors are ``ready for service'' (and
therefore must be scheduled), the flow component calls
flowproto\_find\_serviceable() for each active flow protocol. Thus, the
service loop of the flow component looks something like this:

\begin{enumerate}
\item call flowproto\_find\_serviceable() for each active flow protocol to
generate a list of flows to service
\item run scheduling algorithm to build list of scheduled flows
\item call flowproto\_service() for each scheduled flow (in order)
\item if a flow descriptor reaches the completed or error state (at any time),
then move it to a list of completed flow descriptors to be returned to the
caller
\end{enumerate}
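
The four steps above can be sketched in C with a toy flow protocol that moves
128 KB per iteration, matching the granularity used in the example earlier in
this section. This is only an illustration under stated assumptions: the
\texttt{sketch\_} functions are hypothetical stand-ins for flowproto\_post(),
flowproto\_find\_serviceable(), and flowproto\_service(), their signatures are
invented for the example, no real I/O is performed, and the error state is
omitted for brevity.

\begin{verbatim}
```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CHUNK (128L * 1024L)  /* bytes moved per service call */
#define SKETCH_MAX_FLOWS 8           /* active flows per toy protocol */
#define SKETCH_MAX_LIST 32           /* flows gathered per pass */

enum sketch_state { SKETCH_READY, SKETCH_COMPLETE };

struct sketch_flow {
    long total;                      /* bytes the flow must move */
    long moved;                      /* bytes moved so far */
    int iterations;                  /* number of service calls so far */
    enum sketch_state state;
};

struct sketch_proto {                /* toy protocol: a list of flows */
    struct sketch_flow *active[SKETCH_MAX_FLOWS];
    size_t n_active;
};

/* Stand-in for flowproto_post(): hand a flow to the protocol. */
int sketch_proto_post(struct sketch_proto *p, struct sketch_flow *f)
{
    if (p->n_active == SKETCH_MAX_FLOWS)
        return -1;
    f->moved = 0;
    f->iterations = 0;
    f->state = (f->total > 0) ? SKETCH_READY : SKETCH_COMPLETE;
    p->active[p->n_active++] = f;
    return 0;
}

/* Stand-in for flowproto_find_serviceable(): report active flows that
 * are completed or ready for service (every flow, in this toy). */
size_t sketch_proto_find_serviceable(struct sketch_proto *p,
                                     struct sketch_flow **out, size_t max)
{
    size_t n = 0;
    for (size_t i = 0; i < p->n_active && n < max; i++)
        out[n++] = p->active[i];
    return n;
}

/* Stand-in for flowproto_service(): exactly one iteration of work. */
void sketch_proto_service(struct sketch_flow *f)
{
    long left = f->total - f->moved;
    assert(f->state == SKETCH_READY);
    f->moved += (left < SKETCH_CHUNK) ? left : SKETCH_CHUNK;
    f->iterations++;
    if (f->moved == f->total)
        f->state = SKETCH_COMPLETE;
}

/* One pass of the service loop. Completed flows are moved into done[];
 * the return value is the number of flows completed in this pass. */
size_t sketch_service_pass(struct sketch_proto *protos, size_t n_protos,
                           struct sketch_flow **done, size_t max_done)
{
    struct sketch_flow *list[SKETCH_MAX_LIST];
    size_t n_list = 0, n_done = 0;

    /* Step 1: gather serviceable flows from every protocol. */
    for (size_t p = 0; p < n_protos; p++)
        n_list += sketch_proto_find_serviceable(&protos[p], list + n_list,
                                                SKETCH_MAX_LIST - n_list);

    /* Step 2: scheduling. Flows are simply kept in order here. */

    /* Step 3: one iteration of work per scheduled flow. */
    for (size_t i = 0; i < n_list; i++)
        if (list[i]->state == SKETCH_READY)
            sketch_proto_service(list[i]);

    /* Step 4: move completed flows to the caller's completion list. */
    for (size_t p = 0; p < n_protos; p++) {
        size_t keep = 0;
        for (size_t i = 0; i < protos[p].n_active; i++) {
            struct sketch_flow *f = protos[p].active[i];
            if (f->state == SKETCH_COMPLETE && n_done < max_done)
                done[n_done++] = f;
            else
                protos[p].active[keep++] = f;
        }
        protos[p].n_active = keep;
    }
    return n_done;
}
```
\end{verbatim}

With one 300 KB flow and one 128 KB flow posted to the same toy protocol, the
smaller flow completes on the first pass and the larger one on the third,
after three calls to the service stand-in.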

The scheduling filter (at the time of this writing) does nothing but service
all flows in order. More advanced schedulers will be added later.

\subsection{Example flow protocol (implementation)}

The default flow protocol is called ``flowproto\_bmi\_trove'' and is capable of
handling the following endpoint combinations:
\begin{itemize}
\item BMI to memory
\item memory to BMI
\item BMI to Trove
\item Trove to BMI
\end{itemize}

The following summarizes what the principal flow protocol interface functions
do in this protocol:
\begin{itemize}
\item flowproto\_post(): allocates any intermediate buffers that may be needed,
begins datatype processing
\item flowproto\_service(): posts the next necessary BMI and Trove operations
\item flowproto\_find\_serviceable(): tests for completion of pending BMI and
Trove operations, drives the state of any flow descriptors affected by
completion, and returns flow descriptors ready for service to caller
\end{itemize}

The flow protocol performs double buffering to keep both the Trove and BMI
interfaces as busy as possible when transferring between the two.
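
The general idea of double buffering can be sketched as follows. This is a
sequential simulation written for this document, not the code used by
flowproto\_bmi\_trove: the function name, the tiny buffer size, and the use of
memcpy() in place of Trove reads and BMI sends are all assumptions. In each
step one buffer is filled from the source while the other, filled in the
previous step, is drained to the destination; the two buffers then swap roles.

\begin{verbatim}
```c
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 4   /* deliberately tiny to make the steps visible */

/* Copies len bytes from src to dst through two intermediate buffers.
 * Returns the number of steps in which a fill and a drain were both
 * issued, i.e. steps where both sides had work to do. */
size_t sketch_double_buffer_copy(const char *src, char *dst, size_t len)
{
    char buf[2][SKETCH_BUF_SIZE];
    size_t filled[2] = {0, 0};  /* valid bytes held in each buffer */
    size_t read_off = 0, write_off = 0, overlapped = 0;
    int fill = 0;               /* buffer to fill in the current step */

    while (write_off < len) {
        int drain = 1 - fill;
        int did_fill = 0, did_drain = 0;

        /* Stand-in for a storage read: fill one buffer. */
        if (read_off < len) {
            size_t n = len - read_off;
            if (n > SKETCH_BUF_SIZE)
                n = SKETCH_BUF_SIZE;
            memcpy(buf[fill], src + read_off, n);
            filled[fill] = n;
            read_off += n;
            did_fill = 1;
        }

        /* Stand-in for a network send: drain the other buffer. */
        if (filled[drain] > 0) {
            memcpy(dst + write_off, buf[drain], filled[drain]);
            write_off += filled[drain];
            filled[drain] = 0;
            did_drain = 1;
        }

        if (did_fill && did_drain)
            overlapped++;
        fill = drain;           /* swap the roles of the two buffers */
    }
    return overlapped;
}
```
\end{verbatim}

In a real protocol the fill and the drain would be asynchronous operations
that are in flight at the same time; the simulation only shows how the two
buffers alternate.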

The flow protocol does not have an internal thread. However, if it detects
that the job interface is using threads (through the
\_\_PVFS2\_JOB\_THREADED\_\_ define), then it will use the job interface's
thread manager to make progress on BMI and
Trove operations. The find\_serviceable() function then just checks for
completion notifications from the thread callback functions, rather than
testing the BMI or Trove interfaces directly.

Trove support is compiled out if the \_\_PVFS2\_TROVE\_SUPPORT\_\_ define is
not detected. This is mainly done for client libraries, which do not
need Trove, in order to reduce library dependencies.

\section{Implementation note: avoiding flows}

The flow interface will introduce overhead for small operations
that would not otherwise be present. It may therefore be helpful
to eventually introduce an optimization to avoid the use of flows
for small read or write operations.

\begin{verbatim}
text of an email discussion on this topic (> part by Phil, non >
part by Rob):

> Yeah, we need to get these ideas documented somewhere.  There may actually
> be a couple of eager modes.  By default, BMI only allows unexpected
> messages < 16K or so.  That places a cap on the eager write size,
> unless we had a second eager mode that consists of a) send write request
> b) send write data c) receive ack...

Yes.  These two modes are usually differentiated by the terms "short" and
"eager", where the "short" one puts the data actually into the same
packet/message (depending on the network layer at which we are working).

> Of course all of this would need to be tunable so that we can see what
> works well.  Maybe rules like:
>
> contig writes < 15K : simple eager write
> 15K < contig writes < 64K : two part eager write
> writes > 64K && noncontig writes : flow
>
> contig reads < 64K : eager read
> contig reads > 64K && noncontig reads : flow

Yeah, something like that.
\end{verbatim}
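
The rules proposed in the email could be expressed along the following lines.
This is a sketch only: the enum and function names are invented for
illustration, ``K'' is assumed to mean 1024 bytes, and the email does not say
what happens at exactly 15K or 64K, so the sketch assigns those boundary sizes
to the larger mode. The ``\&\&'' in the flow rules is read as ``large
operations as well as noncontiguous ones''. As the email notes, the thresholds
would need to be tunable.

\begin{verbatim}
```c
#include <stddef.h>

/* Hypothetical names; none of these appear in the PVFS2 sources. */
enum sketch_io_mode {
    SKETCH_EAGER_SIMPLE,    /* data sent along with the write request */
    SKETCH_EAGER_TWO_PART,  /* request, then data, then ack */
    SKETCH_EAGER_READ,
    SKETCH_FLOW
};

#define SKETCH_SIMPLE_EAGER_LIMIT (15u * 1024u)  /* "15K" in the email */
#define SKETCH_EAGER_LIMIT        (64u * 1024u)  /* "64K" in the email */

/* Chooses how to carry out a write of the given size. */
enum sketch_io_mode sketch_choose_write_mode(size_t size, int contiguous)
{
    if (!contiguous)
        return SKETCH_FLOW;
    if (size < SKETCH_SIMPLE_EAGER_LIMIT)
        return SKETCH_EAGER_SIMPLE;
    if (size < SKETCH_EAGER_LIMIT)
        return SKETCH_EAGER_TWO_PART;
    return SKETCH_FLOW;
}

/* Chooses how to carry out a read of the given size. */
enum sketch_io_mode sketch_choose_read_mode(size_t size, int contiguous)
{
    if (!contiguous || size >= SKETCH_EAGER_LIMIT)
        return SKETCH_FLOW;
    return SKETCH_EAGER_READ;
}
```
\end{verbatim}

Under these assumptions a contiguous 4 KB write would use the simple eager
mode, a contiguous 32 KB write the two part eager mode, and any noncontiguous
or 1 MB operation a flow.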

\end{document}