\section{An introduction to PVFS2}

Since the mid-1990s we have been in the business of parallel I/O. Our first
parallel file system, the Parallel Virtual File System (PVFS), has been the
most successful parallel file system on Linux clusters to date. This code
base has been used both in production mode at large scientific computing
centers and as a launching point for many research endeavors.

However, the PVFS (or PVFS1) code base is a terrible mess! For the last few
years we have been pushing it well beyond the environment for which it was
originally designed. The core of PVFS1 is no longer appropriate for the
environment in which we now see parallel file systems being deployed.

While we have been keeping PVFS1 relevant, we have also been interacting with
application groups, other parallel I/O researchers, and implementors of system
software such as message passing libraries. As a result we have learned a
great deal about how applications use these file systems and how we might
better leverage the underlying hardware.

Eventually we reached a point where it was obvious to us that a new design was
in order. The PVFS2 design embodies the principles that we believe are key to
a successful, robust, high-performance parallel file system. It is being
implemented primarily by a distributed team at Argonne National Laboratory and
Clemson University. Early collaborations have already begun with the Ohio
Supercomputer Center and Ohio State University, and we look forward to
additional participation by interested and motivated parties.

In this section we discuss the motivation behind and the key characteristics
of our parallel file system, PVFS2.
\subsection{Why rewrite?}

There are lots of reasons why we've chosen to rewrite the code. We were bored
with the old code. We were tired of trying to work around inherent problems
in the design. But mostly we felt hamstrung by the design. It was too
socket-centric, too obviously single-threaded, wouldn't support heterogeneous
systems with different endian-ness, and relied too thoroughly on OS buffering
and file system characteristics.

The new code base is much bigger and more flexible. Definitely there's the
opportunity for us to suffer from second system syndrome here! But we're
willing to risk this in order to position ourselves to use the new code base
for many years to come.
\subsection{What's different?}

The new design has a number of important features, including:
\begin{itemize}
\item modular networking and storage subsystems,
\item powerful request format for structured non-contiguous accesses,
\item flexible and extensible data distribution modules,
\item distributed metadata,
\item stateless servers and clients (no locking subsystem),
\item explicit concurrency support,
\item tunable semantics,
\item flexible mapping from file references to servers,
\item tight MPI-IO integration, and
\item support for data and metadata redundancy.
\end{itemize}
\subsubsection{Modular networking and storage}

One shortcoming of the PVFS1 system is its reliance on the socket networking
interface and local file systems for data and metadata storage.

If we look at most cluster computers today we see a variety of networking
technologies in place. IP, IB, Myrinet, and Quadrics are some of the more
popular ones at the time of writing, but surely more will appear \emph{and
disappear} in the near future.
%
As groups attempt to remotely access data at high data rates across the wide
area, approaches such as reliable UDP protocols might become an important
component of a parallel file system as well.
%
Supporting multiple networking technologies, and supporting them
\emph{efficiently}, is key.

Likewise many different storage technologies are now available. We're still
getting our feet wet in this area, but it is clear that some flexibility on
this front will pay off in terms of our ability to leverage new technologies
as they appear. In the meantime we are certainly going to leverage database
technologies for metadata storage -- that just makes good sense.

In PVFS2 the Buffered Messaging Interface (BMI) and the Trove storage
interface provide APIs to network and storage technologies, respectively.
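
To make the modularity concrete, the sketch below shows one way such an
abstraction can be organized. The structure and names here
(\texttt{net\_method}, \texttt{post\_send}, and so on) are invented purely for
illustration; they are not the actual BMI or Trove APIs.

\begin{verbatim}
/* Hypothetical sketch of a modular network abstraction: each supported
 * interconnect supplies one method table, and the rest of the code is
 * written against the table alone.  These names and signatures are
 * illustrative and do not reproduce the real BMI interface. */
#include <stddef.h>

struct net_method {
    const char *name;                                /* e.g. "tcp", "ib" */
    int (*initialize)(void);
    int (*post_send)(int dest, const void *buf, size_t len, int *op_id);
    int (*post_recv)(int src, void *buf, size_t len, int *op_id);
    int (*test)(int op_id, int *completed);          /* non-blocking poll */
};

/* Adding support for a new interconnect means adding one more entry. */
extern struct net_method tcp_method, ib_method;
static struct net_method *methods[] = { &tcp_method, &ib_method, NULL };
\end{verbatim}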
\subsubsection{Structured non-contiguous data access}

Scientific applications are complicated entities constructed from numerous
libraries and operating on highly structured data. This data is often stored
using high-level I/O libraries that manage access to traditional byte-stream
files.

These libraries allow applications to describe complicated access patterns
that extract subsets of large datasets. These subsets often do not sit in
contiguous regions in the underlying file; however, they are often very
structured (e.g. a block out of a multidimensional array).

It is imperative that a parallel file system natively support structured data
access in an efficient manner. In PVFS2 we perform this with the same types
of constructs used in MPI datatypes, allowing for the description of
structured data regions with strides, common block sizes, and so on. This
capability can then be leveraged by higher-level libraries such as the MPI-IO
implementation.
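
As a small illustration of the kind of access involved, the following MPI-IO
fragment reads one column of a two-dimensional array as a single structured
request. The file name and array dimensions are made up for the example.

\begin{verbatim}
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Datatype column;
    double buf[1024];

    MPI_Init(&argc, &argv);

    /* One column of a 1024x1024 array of doubles stored in row-major
     * order: 1024 blocks of 1 element, each 1024 elements apart. */
    MPI_Type_vector(1024, 1, 1024, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    MPI_File_open(MPI_COMM_WORLD, "/pvfs2/array.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, column, "native", MPI_INFO_NULL);

    /* The whole strided pattern reaches the file system as one request
     * rather than 1024 small reads. */
    MPI_File_read_all(fh, buf, 1024, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
\end{verbatim}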
\subsubsection{Flexible data distribution}

The tradition of striping data across I/O servers in round-robin fashion has
been in place for quite some time, and it seems as good a default as any in
the absence of further information about how a file is going to be accessed.
However, in many cases we \emph{do} know more about how a file is going to be
accessed.
%
Applications have many opportunities to give the file system information about
access patterns through various high-level interfaces. Armed with this
information we can make informed decisions on how to better distribute data to
match expected access patterns. More complicated systems could redistribute
data to better match patterns that are seen in practice.

PVFS2 includes a modular system for adding new data distributions to the
system and using these for new files. We're starting with the same old
round-robin scheme that everyone is accustomed to, but we expect to see this
mechanism used to better access multidimensional datasets. It might play a
role in data redundancy as well.
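
One existing channel for this kind of information is MPI-IO info hints. The
fragment below is a minimal sketch that passes the reserved
\texttt{striping\_factor} and \texttt{striping\_unit} hints at open time;
whether a given MPI-IO implementation and distribution module act on them is
implementation-dependent, and the path and values are made up.

\begin{verbatim}
#include <mpi.h>

/* Open a file for writing while suggesting a data distribution through
 * MPI-IO info hints. */
void open_with_hints(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;

    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");     /* use 16 I/O servers */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB strip size */

    MPI_File_open(comm, "/pvfs2/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}
\end{verbatim}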
\subsubsection{Distributed metadata}

One of the biggest complaints about PVFS1 is the single metadata server.
There are actually two grounds on which this complaint is typically made.
The first is that this is a single point of failure -- we'll address that in a
bit when we talk about data and metadata redundancy. The second is that it is
a potential performance bottleneck.

In PVFS1 the metadata server steps aside for I/O operations, so in practice it
is rarely a bottleneck for large parallel applications, which are busy writing
data rather than, say, creating a billion files. However, as systems continue
to scale it becomes ever more likely that any such single point of contact
might become a bottleneck even for well-behaved applications.

PVFS2 allows for configurations where metadata is distributed to some subset
of I/O servers (which might or might not also serve data). This allows
metadata for different files to be placed on different servers, so that
applications accessing different files tend to impact each other less.
Distributed metadata is a relatively tricky problem, but we're going to
provide it in early releases anyway.
\subsubsection{Stateless servers and clients}

Parallel file systems (and more generally distributed file systems) are
potentially complicated systems. As the number of entities participating in
the system grows, so does the opportunity for failures. Anything that can be
done to minimize the impact of such failures should be considered.

NFS, for all its faults, does provide a concrete example of the advantage of
removing shared state from the system. Clients can disappear, and an NFS
server just keeps happily serving files to the remaining clients.

In stark contrast are a number of distributed file systems deployed today.
In order to meet certain design constraints they provide coherent caches on
clients enforced via locking subsystems. As a result a client failure is a
significant event requiring a complex recovery sequence to reclaim locks and
ensure that the system is in an appropriate state before operations can
continue.

We have decided to build PVFS2 as a stateless system and do not use locks as
part of the client-server interaction. This vastly simplifies the process of
recovering from failures and facilitates the use of off-the-shelf
high-availability solutions for providing server failover. This does impact
the semantics of the file system, but we believe that the resulting semantics
are very appropriate for parallel I/O.
\subsubsection{Explicit support for concurrency}

Clearly concurrent processing is key in this type of system. The PVFS2 server
and client designs are based around an explicit state machine system that is
tightly coupled with a component for monitoring completion of operations
across all devices. Threads are used where necessary to provide non-blocking
access to all device types. This combination of threads, state machines, and
completion notification allows us to quickly identify opportunities to make
progress on particular operations and avoids serialization of independent
operations within the client or server.

This design has a further side-effect of giving us native support for
asynchronous operations on the client side. Native support for asynchronous
operations makes nonblocking operations under MPI-IO both easy to implement
and advantageous to use.
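
The following is a minimal sketch of the state-machine style described above;
the states, structures, and names are invented for exposition and do not
reproduce the actual PVFS2 state-machine implementation.

\begin{verbatim}
/* Each operation records where it is in its processing, so a single
 * progress loop can service many independent operations; no operation
 * ever blocks the others. */
enum op_state { ST_RECV_REQ, ST_DO_IO, ST_SEND_REPLY, ST_DONE };

struct op {
    enum op_state state;
    int io_complete;        /* set by the completion-monitoring layer */
};

/* Advance one operation by at most one state transition; never block. */
static void advance(struct op *o)
{
    switch (o->state) {
    case ST_RECV_REQ:       /* request decoded: post non-blocking I/O */
        o->state = ST_DO_IO;
        break;
    case ST_DO_IO:          /* I/O done: post non-blocking reply */
        if (o->io_complete)
            o->state = ST_SEND_REPLY;
        break;
    case ST_SEND_REPLY:     /* reply posted: operation is finished */
        o->state = ST_DONE;
        break;
    case ST_DONE:
        break;
    }
}

/* Progress loop: advance whichever operations have become ready. */
static void progress(struct op ops[], int n)
{
    for (int i = 0; i < n; i++)
        advance(&ops[i]);
}
\end{verbatim}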
\subsubsection{Tunable semantics}

Most distributed file systems in use for cluster systems provide POSIX (or
very close to POSIX) semantics. These semantics are very strict, arguably
more strict than necessary for a usable parallel I/O system.

NFS does not provide POSIX semantics because it does not guarantee that client
caches are coherent. This actually results in a system that is often unusable
for parallel I/O, but is very useful for home directories and such.

Storage systems being applied in the Grid environment, such as those being
used in conjunction with some physics experiments, have still different
semantics. These tend to assume that files are added atomically and are never
subsequently modified.

All these very different approaches have their merits and applications, but
also have their disadvantages. PVFS2 will not support POSIX semantics
(although one is welcome to build such a system on top of PVFS2). However, we
do intend to give users a great degree of flexibility in terms of the
coherency of the view of file data and of the file system hierarchy. Users in
a tightly coupled parallel machine will opt for more strict semantics that
allow for MPI-IO to be implemented. Other groups might go for looser
semantics allowing for access across the wide area. The key here is allowing
for the possibility of different semantics to match different needs.
\subsubsection{Flexible mapping from file references to servers}

Administrators appreciate the ability to reconfigure systems to adapt to
changes in policy or available resources. In parallel file systems, the
mapping from a file reference to its location on devices can help or hinder
reconfiguration of the file system.

In PVFS2 file data is split into \emph{datafiles}. Each datafile has its own
reference, and clients identify the server that owns a datafile by checking a
table loaded at configuration time. A server can be added to the system by
allocating a new range of references to that server and restarting clients
with an updated table. Likewise, servers can be removed by first stopping
clients, next moving datafiles off the server, then restarting with a new
table. It is not difficult to imagine providing this functionality while the
system is running, and we will be investigating this possibility once basic
functionality is stable.
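
A minimal sketch of such a table lookup follows. The table layout, ranges, and
server names are invented for illustration and are not the actual PVFS2
configuration format.

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Each server owns a contiguous range of datafile references, so finding
 * the owner of a reference is a simple lookup in a table loaded at
 * configuration time. */
struct handle_range {
    uint64_t first, last;        /* inclusive reference range */
    const char *server;          /* server owning this range */
};

static const struct handle_range handle_table[] = {
    { 1ULL,       1000000ULL, "server0" },
    { 1000001ULL, 2000000ULL, "server1" },
    /* adding a server means appending a new, unused range here and
       distributing the updated table to clients */
};

static const char *server_for_handle(uint64_t handle)
{
    size_t n = sizeof(handle_table) / sizeof(handle_table[0]);

    for (size_t i = 0; i < n; i++)
        if (handle >= handle_table[i].first && handle <= handle_table[i].last)
            return handle_table[i].server;
    return NULL;                 /* no server owns this reference */
}
\end{verbatim}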
\subsubsection{Tight MPI-IO coupling}

The UNIX interface is a poor building block for an MPI-IO implementation. It
does not provide the rich API necessary to communicate structured I/O accesses
to the underlying file system. It has a lot of internal state stored as part
of the file descriptor. It implies POSIX semantics, but does not provide them
for some file systems (e.g. NFS, many local file systems when writing large
data regions).

Rather than building MPI-IO support for PVFS2 through a UNIX-like interface,
we have started with something that exposes more of the capabilities of the
file system. This interface does not maintain file descriptors or internal
state regarding such things as file positions, and in doing so allows us to
better leverage the capabilities of MPI-IO to perform efficient access.

We've already discussed rich I/O requests. ``Opening'' a file is another good
example. \texttt{MPI\_File\_open()} is a collective operation that gathers
information on a file so that MPI processes may later access it. If we were
to build this on top of a UNIX-like API, we would have each process that might
access the file call \texttt{open()}. In PVFS2 we instead resolve the
filename into a handle using a single file system operation, then broadcast
the resulting handle to the remainder of the processes. Operations that
determine file size, truncate files, and remove files may all be performed in
this same $O(1)$ manner, scaling as well as the MPI broadcast call.
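
The pattern looks roughly like the sketch below. Here
\texttt{resolve\_handle()} is a stand-in for the single name-to-handle lookup
through the file system's own interface; the real PVFS2 system interface is
not reproduced.

\begin{verbatim}
#include <mpi.h>
#include <stdint.h>

/* Stand-in for one name-to-handle lookup against the file system;
 * only rank 0 ever calls it. */
uint64_t resolve_handle(const char *path);

/* Collective "open": one process resolves the name, everyone else
 * learns the handle at the cost of an MPI broadcast. */
uint64_t collective_open(MPI_Comm comm, const char *path)
{
    int rank;
    uint64_t handle = 0;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        handle = resolve_handle(path);   /* O(1) file system operations */

    MPI_Bcast(&handle, 1, MPI_UINT64_T, 0, comm);
    return handle;
}
\end{verbatim}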
\subsubsection{Data and metadata redundancy}

Another common (and valid) complaint regarding PVFS1 is its lack of support
for redundancy at the server level. RAID approaches are usable to provide
tolerance of disk failures, but if a server disappears, all files with data on
that server are inaccessible until the server is recovered.

Traditional high-availability solutions may be applied to both metadata and
data servers in PVFS2 (they're actually the same server). This option
requires shared storage between the two machines on which file system data is
stored, so it may be prohibitively expensive for some users.

A second option that is being investigated is what we are calling \emph{lazy
redundancy}. The lazy redundancy approach is our response to the failure of
RAID-like approaches to scale well for large parallel I/O systems when applied
across servers. The failure of this approach at this scale is primarily due
to the significant change in environment (latency and number of devices across
which data is striped). Providing the atomic read/modify/write capability
necessary to implement RAID-like protocols in a distributed file system
requires a great deal of performance-hampering infrastructure.

With lazy redundancy we expose the creation of redundant data as an explicit
operation. This places the burden of enforcing consistent access on the
user. However, it also opens up the opportunity for a number of
optimizations, such as creating redundant data in parallel. Further, because
this can be applied at a coarser grain, more compute-intensive algorithms may
be used in place of simple parity, providing higher reliability.

Lazy redundancy is still at the conceptual stage. We're still trying to
determine how best to fit this into the system as a whole. However,
traditional failover solutions may be put in place for the existing system.
\subsubsection{And more...}

There are so many things that we feel we could have done better in PVFS1 that
it is really a little embarrassing. Better heterogeneous system support, a
better build system, a solid infrastructure for testing concurrent access to
the file system, an inherently concurrent approach to servicing operations on
both the client and server, better management tools, even symlinks; we've
tried to address most if not all of the major concerns that our PVFS1 users
have had over the years.

It's a big undertaking for us, which leads to the obvious next question.
\subsection{When will this be available?}

Believe it or not, right now. At SC2004 we released PVFS2 1.0. We would be
foolish to claim PVFS2 has no bugs and will work for everyone 100\% of the
time, but we feel PVFS2 is in pretty good shape. Early testing has flushed
out a lot of bugs, and we feel PVFS2 is ready for wider use.

Note that we're committed to supporting PVFS1 for some time after PVFS2 is
available and stable. We feel that PVFS1 is a good solution for many groups
already, and we would prefer that people use PVFS1 for a little while longer
rather than have a sour first experience with PVFS2.

We announce updates frequently on the PVFS2 mailing lists. We encourage users
to subscribe -- it's the best way to keep abreast of PVFS2 developments. All
code is being distributed under the LGPL license to facilitate use under
arbitrarily licensed high-level libraries.