\section{An introduction to PVFS2}
Since the mid-1990s we have been in the business of parallel I/O. Our first
parallel file system, the Parallel Virtual File System (PVFS), has been the
most successful parallel file system on Linux clusters to date. This code
base has been used both in production mode at large scientific computing
centers and as a launching point for many research endeavors.
However, the PVFS (or PVFS1) code base is a terrible mess! For the last few
years we have been pushing it well beyond the environment for which it was
originally designed. The core of PVFS1 is no longer appropriate for the
environment in which we now see parallel file systems being deployed.
While we have been keeping PVFS1 relevant, we have also been interacting with
application groups, other parallel I/O researchers, and implementors of system
software such as message passing libraries. As a result we have learned a great
deal about how applications use these file systems and how we might better
leverage the underlying hardware.
Eventually we reached a point where it was obvious to us that a new design was
in order. The PVFS2 design embodies the principles that we believe are key to
a successful, robust, high-performance parallel file system. It is being
implemented primarily by a distributed team at Argonne National Laboratory and
Clemson University. Early collaborations have already begun with Ohio
Supercomputer Center and Ohio State University, and we look forward to
additional participation by interested and motivated parties.
In this section we discuss the motivation behind and the key characteristics
of our parallel file system, PVFS2.
\subsection{Why rewrite?}
There are lots of reasons why we've chosen to rewrite the code. We were bored
with the old code. We were tired of trying to work around inherent problems
in the design. But mostly we felt hamstrung by the design. It was too
socket-centric, too obviously single-threaded, wouldn't support heterogeneous
systems with different endian-ness, and relied too thoroughly on OS buffering
and file system characteristics.
The new code base is much bigger and more flexible. Definitely there's the
opportunity for us to suffer from second system syndrome here! But we're
willing to risk this in order to position ourselves to use the new code base
for many years to come.
\subsection{What's different?}
The new design has a number of important features, including:
\begin{itemize}
\item modular networking and storage subsystems,
\item powerful request format for structured non-contiguous accesses,
\item flexible and extensible data distribution modules,
\item distributed metadata,
\item stateless servers and clients (no locking subsystem),
\item explicit concurrency support,
\item tunable semantics,
\item flexible mapping from file references to servers,
\item tight MPI-IO integration, and
\item support for data and metadata redundancy.
\end{itemize}
\subsubsection{Modular networking and storage}
One shortcoming of the PVFS1 system is its reliance on the socket networking
interface and local file systems for data and metadata storage.
If we look at most cluster computers today we see a variety of networking
technologies in place. IP, IB, Myrinet, and Quadrics are some of the more
popular ones at the time of writing, but surely more will appear \emph{and
disappear} in the near future.
%
As groups attempt to remotely access data at high data rates across the wide
area, approaches such as reliable UDP protocols might become an important
component of a parallel file system as well.
%
Supporting multiple networking technologies, and supporting them
\emph{efficiently}, is key.
Likewise many different storage technologies are now available. We're still
getting our feet wet in this area, but it is clear that some flexibility on
this front will pay off in terms of our ability to leverage new technologies
as they appear. In the meantime we are certainly going to leverage database
technologies for metadata storage -- that just makes good sense.
In PVFS2 the Buffered Messaging Interface (BMI) and the Trove storage
interface provide APIs to network and storage technologies respectively.
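To give a flavor of what ``modular'' means here, the sketch below shows how a
pluggable network module might be structured as a table of function pointers.
This is not the actual BMI API; the names and signatures are made up for
illustration, but the principle is the same: the rest of the file system is
written against a common interface and never talks to sockets or interconnect
libraries directly.
\begin{verbatim}
/* Hypothetical sketch of a pluggable network module; not the real
 * BMI interface.  Each network technology fills in this table. */
#include <stddef.h>

struct net_method {
    const char *name;                        /* e.g. "tcp", "ib" */
    int (*initialize)(void);
    int (*finalize)(void);
    int (*post_send)(int dest, const void *buf, size_t len, int *op_id);
    int (*post_recv)(int src, void *buf, size_t len, int *op_id);
    int (*test)(int op_id, int *completed);  /* poll for completion */
};

/* The core server and client code call through a table like this,
 * so adding a new interconnect means writing one new module. */
extern const struct net_method tcp_method;
extern const struct net_method ib_method;
\end{verbatim}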
\subsubsection{Structured non-contiguous data access}
Scientific applications are complicated entities constructed from numerous
libraries and operating on highly structured data. This data is often stored
using high-level I/O libraries that manage access to traditional byte-stream
files.
These libraries allow applications to describe complicated access patterns
that extract subsets of large datasets. These subsets often do not sit in
contiguous regions in the underlying file; however, they are often very
structured (e.g. a block out of a multidimensional array).
It is imperative that a parallel file system natively support structured data
access in an efficient manner. In PVFS2 we perform this with the same types
of constructs used in MPI datatypes, allowing for the description of
structured data regions with strides, common block sizes, and so on. This
capability can then be leveraged by higher-level libraries such as the MPI-IO
implementation.
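As a concrete example of the kind of access we want to support natively, the
MPI-IO fragment below reads one block out of a large two-dimensional array
with a single collective call. The file name, array dimensions, and block
offsets are purely illustrative; in a real program each process would compute
its own \texttt{starts} from its rank.
\begin{verbatim}
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File     fh;
    MPI_Datatype subarray;
    int sizes[2]    = {1000, 1000};   /* global array dimensions  */
    int subsizes[2] = { 100,  100};   /* block this process wants */
    int starts[2]   = { 200,  300};   /* offset of the block      */
    double block[100][100];

    MPI_Init(&argc, &argv);

    /* Describe the block (non-contiguous in the file) as a datatype. */
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &subarray);
    MPI_Type_commit(&subarray);

    MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/array.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* The file view tells the file system exactly which regions
     * this process will touch, in one structured description. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, subarray, "native",
                      MPI_INFO_NULL);
    MPI_File_read_all(fh, block, 100 * 100, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&subarray);
    MPI_Finalize();
    return 0;
}
\end{verbatim}
Because PVFS2 accepts a similarly structured description of the request, the
MPI-IO implementation can hand the whole access to the file system at once
rather than breaking it into many small contiguous operations.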
\subsubsection{Flexible data distribution}
The tradition of striping data across I/O servers in round-robin fashion has
been in place for quite some time, and it seems as good a default as any when
nothing more is known about how a file is going to be accessed. However, in
many cases we \emph{do} know more about how a file is going to be accessed.
%
Applications have many opportunities to give the file system information about
access patterns through various high-level interfaces. Armed with this
information we can make informed decisions on how to better distribute data to
match expected access patterns. More complicated systems could redistribute
data to better match patterns that are seen in practice.
PVFS2 includes a modular system for adding new data distributions to the
system and using these for new files. We're starting with the same old
round-robin scheme that everyone is accustomed to, but we expect to see this
mechanism used to better access multidimensional datasets. It might play a
role in data redundancy as well.
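For concreteness, the mapping performed by a simple round-robin distribution
looks roughly like the sketch below. The 64 KB strip size and the names are
made up; a different distribution module would simply supply a different
mapping function.
\begin{verbatim}
#include <stdint.h>

/* Minimal sketch of round-robin striping: map a logical file offset
 * to a server index and an offset within that server's datafile. */
#define STRIP_SIZE (64 * 1024)

struct target { uint32_t server; uint64_t local_offset; };

static struct target round_robin_map(uint64_t logical_offset,
                                     uint32_t num_servers)
{
    uint64_t strip  = logical_offset / STRIP_SIZE;  /* which strip   */
    uint64_t within = logical_offset % STRIP_SIZE;  /* offset inside */
    struct target t;

    t.server = (uint32_t)(strip % num_servers);
    /* Each server holds every num_servers-th strip, packed together. */
    t.local_offset = (strip / num_servers) * STRIP_SIZE + within;
    return t;
}
\end{verbatim}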
\subsubsection{Distributed metadata}
One of the biggest complaints about PVFS1 is the single metadata server.
There are actually two grounds on which this complaint is typically raised.
The first is that this is a single point of failure -- we'll address that in a
bit when we talk about data and metadata redundancy. The second is that it is
a potential performance bottleneck.
In PVFS1 the metadata server steps aside for I/O operations, so in practice it
is rarely a bottleneck for large parallel applications, which are busy writing
data rather than creating a billion files or some such thing.
However, as systems continue to scale it becomes ever more likely that any
such single point of contact might become a bottleneck for even well-behaved
applications.
PVFS2 allows for configurations where metadata is distributed to some subset
of I/O servers (which might or might not also serve data). This allows for
metadata for different files to be placed on different servers, so that
applications accessing different files tend to impact each other less.
Distributed metadata is a relatively tricky problem, but we're going to
provide it in early releases anyway.
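One simple way to picture metadata distribution is to hash a file's name to
choose a metadata server, as in the sketch below. This is only an illustration
of the idea that different files land on different servers, not a description
of PVFS2's actual placement policy.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Illustration only: choose a metadata server by hashing the path.
 * The real PVFS2 placement policy may differ; the point is simply
 * that metadata for different files ends up on different servers. */
static uint32_t meta_server_for(const char *path, uint32_t num_meta_servers)
{
    uint32_t h = 2166136261u;                  /* FNV-1a hash */
    for (size_t i = 0; path[i] != '\0'; i++) {
        h ^= (unsigned char)path[i];
        h *= 16777619u;
    }
    return h % num_meta_servers;
}
\end{verbatim}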
\subsubsection{Stateless servers and clients}
Parallel file systems (and more generally distributed file systems) are
potentially complicated systems. As the number of entities participating in
the system grows, so does the opportunity for failures. Anything that can be
done to minimize the impact of such failures should be considered.
NFS, for all its faults, does provide a concrete example of the advantage of
removing shared state from the system. Clients can disappear, and an NFS
server just keeps happily serving files to the remaining clients.
In stark contrast to this are a number of distributed file systems deployed
today. In order to meet certain design constraints they provide
coherent caches on clients enforced via locking subsystems. As a result a
client failure is a significant event requiring a complex sequence of steps
to recover locks and ensure that the system is in the appropriate state before
operations can continue.
We have decided to build PVFS2 as a stateless system and do not use locks as
part of the client-server interaction. This vastly simplifies the process of
recovering from failures and facilitates the use of off-the-shelf
high-availability solutions for providing server failover. This does impact
the semantics of the file system, but we believe that the resulting semantics
are very appropriate for parallel I/O.
\subsubsection{Explicit support for concurrency}
Clearly concurrent processing is key in this type of system. The PVFS2 server
and client designs are based around an explicit state machine system that is
tightly coupled with a component for monitoring completion of operations
across all devices. Threads are used where necessary to provide non-blocking
access to all device types. This combination of threads, state machines, and
completion notification allows us to quickly identify opportunities to make
progress on particular operations and avoids serialization of independent
operations within the client or server.
This design has a further side-effect of giving us native support for
asynchronous operations on the client side. Native support for asynchronous
operations makes nonblocking operations under MPI-IO both easy to implement
and advantageous to use.
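The sketch below illustrates the state machine pattern: each operation is a
small table of states, each state posts some non-blocking work and names the
state to run when that work completes, so a single thread can drive many
operations concurrently. This is an illustration of the pattern only, not the
actual PVFS2 state machine infrastructure.
\begin{verbatim}
/* Pattern sketch only; not the real PVFS2 state machine code. */
struct op;                               /* one in-flight operation     */
typedef int (*state_fn)(struct op *o);   /* returns index of next state */

struct op {
    int   current_state;
    void *scratch;                       /* per-operation working data  */
};

/* Example: a write request broken into three states. */
enum { ST_SEND_REQUEST, ST_WAIT_ACK, ST_COMPLETE, ST_DONE = -1 };

static int st_send_request(struct op *o) { /* post non-blocking send     */
                                            return ST_WAIT_ACK; }
static int st_wait_ack(struct op *o)     { /* ack arrived; post the I/O  */
                                            return ST_COMPLETE; }
static int st_complete(struct op *o)     { /* hand result back to caller */
                                            return ST_DONE; }

static state_fn write_sm[] = { st_send_request, st_wait_ack, st_complete };

/* Whenever the completion component reports progress on an operation,
 * run its current state and remember where to resume next time. */
static void advance(struct op *o)
{
    o->current_state = write_sm[o->current_state](o);
}
\end{verbatim}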
\subsubsection{Tunable semantics}
Most distributed file systems in use for cluster systems provide POSIX (or
very close to POSIX) semantics. These semantics are very strict, arguably
more strict than necessary for a usable parallel I/O system.
NFS does not provide POSIX semantics because it does not guarantee that client
caches are coherent. This actually results in a system that is often unusable
for parallel I/O, but is very useful for home directories and such.
Storage systems being applied in the Grid environment, such as those being
used in conjunction with some physics experiments, have still different
semantics. These tend to assume that files are added atomically and are never
subsequently modified.
All these very different approaches have their merits and applications, but
also have their disadvantages. PVFS2 will not support POSIX semantics
(although one is welcome to build such a system on top of PVFS2). However, we
do intend to give users a great degree of flexibility in terms of the
coherency of the view of file data and of the file system hierarchy. Users in
a tightly coupled parallel machine will opt for more strict semantics that
allow for MPI-IO to be implemented. Other groups might go for looser
semantics allowing for access across the wide area. The key here is allowing
for the possibility of different semantics to match different needs.
\subsubsection{Flexible mapping from file references to servers}
Administrators appreciate the ability to reconfigure systems to adapt to
changes in policy or available resources. In parallel file systems, the
mapping from a file reference to its location on devices can help or hinder
reconfiguration of the file system.
In PVFS2 file data is split into \emph{datafiles}. Each datafile has its own
reference, and clients identify the server that owns a datafile by checking a
table loaded at configuration time. A server can be added to the system by
allocating a new range of references to that server and restarting clients
with an updated table. Likewise, servers can be removed by first stopping
clients, next moving datafiles off the server, then restarting with a new
table. It is not difficult to imagine providing this functionality while the
system is running, and we will be investigating this possibility once basic
functionality is stable.
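The table consulted by clients can be pictured as a list of handle ranges, one
per server; the ranges and server names below are invented, but the lookup
really is this simple.
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Sketch of the handle-to-server table loaded at configuration time.
 * Ranges and names are invented; the real configuration format differs. */
struct handle_range {
    uint64_t    first, last;    /* inclusive range of datafile handles */
    const char *server;         /* server that owns this range         */
};

static const struct handle_range table[] = {
    {          1ULL, 1000000000ULL, "pvfs2-io0" },
    { 1000000001ULL, 2000000000ULL, "pvfs2-io1" },
    { 2000000001ULL, 3000000000ULL, "pvfs2-io2" },
};

static const char *server_for_handle(uint64_t handle)
{
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (handle >= table[i].first && handle <= table[i].last)
            return table[i].server;
    return NULL;                /* handle not mapped by this table */
}
\end{verbatim}
Adding a server amounts to adding a row; removing one means migrating its
datafiles into ranges owned by the remaining servers before the row is
dropped.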
\subsubsection{Tight MPI-IO coupling}
The UNIX interface is a poor building block for an MPI-IO implementation. It
does not provide the rich API necessary to communicate structured I/O accesses
to the underlying file system. It has a lot of internal state stored as part
of the file descriptor. It implies POSIX semantics, but does not provide them
for some file systems (e.g. NFS, many local file systems when writing large
data regions).
Rather than building MPI-IO support for PVFS2 through a UNIX-like interface,
we have started with something that exposes more of the capabilities of the
file system. This interface does not maintain file descriptors or internal
state regarding such things as file positions, and in doing so allows us to
better leverage the capabilities of MPI-IO to perform efficient access.
We've already discussed rich I/O requests. ``Opening'' a file is another good
example. \texttt{MPI\_File\_open()} is a collective operation that gathers
information on a file so that MPI processes may later access it. If we were
to build this on top of a UNIX-like API, we would have each process that might
access the file call
\texttt{open()}. In PVFS2 we instead resolve the filename into a handle using
a single file system operation, then broadcast the resulting handle to the
remainder of the processes. Operations that determine file size, truncate
files, and remove files may all be performed in this same $O(1)$ manner,
scaling as well as the MPI broadcast call.
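The open case can be sketched as follows: one process resolves the name, and
the handle travels to everyone else in a single \texttt{MPI\_Bcast}. Here
\texttt{pvfs\_lookup()} is a hypothetical stand-in for the PVFS2 client call
that resolves a path to a handle.
\begin{verbatim}
#include <mpi.h>
#include <stdint.h>

/* Hypothetical stand-in for the PVFS2 client call that resolves a
 * path to a handle with a single file system operation. */
uint64_t pvfs_lookup(const char *path);

/* Collective open: one file system operation plus one broadcast,
 * instead of every process performing its own open(). */
static uint64_t collective_open(const char *path, MPI_Comm comm)
{
    int      rank;
    uint64_t handle = 0;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        handle = pvfs_lookup(path);   /* the single lookup */
    MPI_Bcast(&handle, 1, MPI_UINT64_T, 0, comm);
    return handle;
}
\end{verbatim}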
\subsubsection{Data and metadata redundancy}
Another common (and valid) complaint regarding PVFS1 is its lack of support
for redundancy at the server level. RAID approaches can be used to tolerate
disk failures, but if a server disappears, all files with data on
that server are inaccessible until the server is recovered.
Traditional high-availability solutions may be applied to both metadata and
data servers in PVFS2 (they're actually the same server). This option
requires shared storage between the two machines on which file system data is
stored, so this may be prohibitively expensive for some users.
A second option that is being investigated is what we are calling \emph{lazy
redundancy}. The lazy redundancy approach is our response to the failure of
RAID-like approaches to scale well for large parallel I/O systems when applied
across servers. The failure of this approach at this scale is primarily due
to the significant change in environment (latency and number of devices across
which data is striped). Providing the atomic read/modify/write capability
necessary to implement RAID-like protocols in a distributed file system
requires a great deal of performance-hampering infrastructure.
With lazy redundancy we expose the creation of redundant data as an explicit
operation. This places the burden of enforcing consistent access on the
user. However, it also opens up the opportunity for a number of
optimizations, such as creating redundant data in parallel. Further, because
this can be applied at a coarser grain, more compute-intensive algorithms
may be used in place of simple parity, providing higher reliability than
simple parity schemes.
Lazy redundancy is still at the conceptual stage. We're still trying to
determine how to best fit this into the system as a whole. However,
traditional failover solutions may be put in place for the existing system.
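For reference, the ``simple parity'' mentioned above is nothing more than an
XOR across data blocks, as in the sketch below; with lazy redundancy, a
computation like this (or a more sophisticated code) would run as an explicit,
user-triggered operation rather than inside every write.
\begin{verbatim}
#include <stddef.h>

/* XOR parity over nblocks equal-sized data blocks: the simple baseline
 * that more compute-intensive redundancy schemes would improve on. */
static void compute_parity(const unsigned char *const blocks[],
                           size_t nblocks, size_t block_len,
                           unsigned char *parity)
{
    for (size_t i = 0; i < block_len; i++) {
        unsigned char p = 0;
        for (size_t b = 0; b < nblocks; b++)
            p ^= blocks[b][i];
        parity[i] = p;
    }
}
\end{verbatim}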
\subsubsection{And more...}
There are so many things that we feel we could have done better in PVFS1 that
it is really a little embarrassing. Better heterogeneous system support, a
better build system, a solid infrastructure for testing concurrent access to
the file system, an inherently concurrent approach to servicing operations on
both the client and server, better management tools, even symlinks; we've
tried to address most, if not all, of the major concerns that our PVFS1 users have
had over the years.
It's a big undertaking for us, which leads to the obvious next question.
\subsection{When will this be available?}
Believe it or not, right now. At SC2004 we released PVFS2 1.0. We
would be foolish to claim PVFS2 has no bugs and will work for everyone
100\% of the time, but we feel PVFS2 is in pretty good shape. Early
testing has shaken out a lot of bugs, and we feel PVFS2 is ready for wider
use.
Note that we're committed to supporting PVFS1 for some time after PVFS2
is available and stable. We feel like PVFS1 is a good solution for many
groups already, and we would rather people use PVFS1 for a little while longer
than have a sour first experience with PVFS2.
We announce updates frequently on the PVFS2 mailing lists. We encourage
users to subscribe -- it's the best way to keep abreast of PVFS2
developments. All code is being distributed under the LGPL license to
facilitate use under arbitrarily licensed high-level libraries.