\section{PVFS2 User APIs and Semantics}
\label{sec:apis}

Because PVFS2 is designed specifically for performance in systems where
concurrent access from many processes is commonplace, there are some
differences between the PVFS2 interfaces and traditional file system
interfaces.
%
In this section we discuss the two interfaces provided for applications
to use when accessing PVFS2 file systems. We start with the traditional
UNIX I/O interface, which nearly all file systems implement. We then
cover the MPI-IO interface.

\subsection{UNIX I/O Interface}

We provide an implementation of the UNIX I/O interface for clients running
Linux versions 2.4 (2.4.19 and later) and 2.6. This interface implements the
traditional \texttt{open}, \texttt{read}, \texttt{write}, and
\texttt{close} calls and provides the directory operations
necessary for applications such as \texttt{ls} to work.
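
As a minimal sketch of what this means for application code, the following
helper uses only the standard calls listed above. The function name
\texttt{write\_then\_read} is hypothetical and is not part of PVFS2; the
\texttt{path} argument is assumed to name a file on any mounted file system,
which could be a PVFS2 mount point.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes of data to path, then read them back into out.
 * out must have room for len bytes. Returns 0 on success, -1 on failure. */
int write_then_read(const char *path, const char *data, size_t len, char *out)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {                  /* write() may be partial */
        ssize_t n = write(fd, data + done, len - done);
        if (n <= 0) { close(fd); return -1; }
        done += (size_t)n;
    }

    if (lseek(fd, 0, SEEK_SET) < 0) {     /* rewind before reading back */
        close(fd);
        return -1;
    }

    done = 0;
    while (done < len) {                  /* read() may be partial too */
        ssize_t n = read(fd, out + done, len - done);
        if (n <= 0) { close(fd); return -1; }
        done += (size_t)n;
    }

    return close(fd);
}
```

Nothing in this sketch is specific to PVFS2, which is the point: code written
against the UNIX I/O API should not need changes, although the semantic
differences described in the rest of this section still apply.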

It is important to note that there is a difference between implementing the
UNIX I/O API and implementing the POSIX semantics for this API. File systems
exported via NFS (versions 2 and 3) do not exhibit many of the POSIX
semantics, and even local file systems may not guarantee atomicity of writes
that cross disk block boundaries. PVFS2 likewise does not implement the full
POSIX semantics. The following subsections document the aspects of the POSIX
semantics that we do not implement.

\subsubsection{Permission Checking}

To understand why PVFS2 permission checking behaves differently from the POSIX
standard, it is useful to discuss how PVFS2 performs permission checking.
PVFS2 does not implement \texttt{open} and the related calls as stateful
operations. Instead it uses a stateless approach in which the client performs
a \texttt{lookup} on a file name to convert it into a \emph{handle} that can
be used for subsequent read and write accesses. This handle may be used for
many read and write accesses and may be cached under certain guidelines. The
lookup operation is performed at file open time.

Permission checking is performed in two places. First, the VFS checks
permissions on the client and will prevent users from performing invalid
operations. Second, the server performs rudimentary checks for specific
operations; however, it (currently) relies on the client to provide accurate
user and group information to be used for this purpose. For example, the
servers do not check permissions recursively along the directory path.

\subsubsection{Permissions and File Access}

POSIX semantics dictate that once a file has been opened it may
continue to be accessed by the process until closed, regardless of changes to
permissions.

In PVFS2, the effect of permission changes on a file may or may not be
immediately apparent to a client holding an open file descriptor. Because
of the manner in which PVFS2 performs permission checking and file lookup, a
client may lose the ability to access a file that it has previously opened
after a permission change, for example if the cached handle is lost and the
lookup has to be performed again.

\subsubsection{Access to Removed Files}

POSIX semantics dictate that a file opened by a process may continue to be
accessed until the subsequent close, even if the file permissions are changed
or the file is deleted. This requires that the file system or clients somehow
keep track of the files that are open, which adds unacceptable state to a
distributed file system. In NFS, for example, this is implemented via the
``sillyrename'' approach, in which clients rename a deleted but open file to
hide it in the directory tree, then delete the renamed file when the file is
finally closed.

In PVFS2 a file that is deleted is removed immediately regardless of open file
descriptors. Subsequent attempts to access the file will find that the file
no longer exists.

\emph{Neill: Is this completely true, or do clients delay removal in
pvfs2-client if someone is still accessing?}
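
The POSIX behavior described above can be probed with a small test. The
following is a sketch only; the helper name \texttt{read\_after\_unlink} is
hypothetical. It creates a file, removes its name while the descriptor is
still open, and then tries to read the data back through that descriptor.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create path, unlink it while it is still open, then read it back through
 * the open descriptor. Returns the number of bytes read, or -1 on failure. */
ssize_t read_after_unlink(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = -1;
    if (write(fd, "data", 4) == 4 &&      /* put four bytes in the file    */
        unlink(path) == 0 &&              /* remove the name while open    */
        lseek(fd, 0, SEEK_SET) == 0)      /* rewind the open descriptor    */
        n = read(fd, buf, len);           /* POSIX: data is still readable */

    close(fd);
    return n;
}
```

On a local file system that follows the POSIX semantics, the call is expected
to return 4 for a buffer of at least four bytes. On PVFS2, according to the
description above, the read would be expected to fail instead; which error is
reported is not specified here, so callers should treat any negative return
as ``file no longer available''.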

\subsubsection{Overlapping I/O Operations}

POSIX semantics dictate sequential consistency for overlapping I/O operations.
This means that I/O operations must be atomic with respect to each other: if
one process performs a read spanning a collection of servers while another
performs a write in the same region, the read must see either all or none of
the changes. In a parallel file system this requires communication to
coordinate access, and because of the number of clients we wish
to support, we are unwilling to implement this functionality (at least as part
of the core file system).

In PVFS2 we instead implement what we call the \emph{nonconflicting writes}
semantic. This semantic states that all I/O operations that do not access the
same bytes (in other words, are nonconflicting) will be sequentially
consistent. Write operations that conflict will result in some undefined
combination of the bytes being written to storage, while read operations that
conflict with write operations will see some undefined combination of original
data and write data.
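
To make the notion of a conflict concrete, the following sketch tests whether
two byte ranges overlap and shows one simple way to give each of
\texttt{nprocs} processes a disjoint region of a file. The type
\texttt{byte\_range} and the functions \texttt{ranges\_conflict} and
\texttt{region\_for\_rank} are hypothetical names chosen for illustration;
they are not part of PVFS2.

```c
#include <stdint.h>

/* A half-open byte range [offset, offset + length) within a file. */
typedef struct {
    uint64_t offset;
    uint64_t length;
} byte_range;

/* Returns 1 if the two ranges share at least one byte, 0 otherwise.
 * Empty ranges never conflict with anything. */
int ranges_conflict(byte_range a, byte_range b)
{
    if (a.length == 0 || b.length == 0)
        return 0;
    return a.offset < b.offset + b.length &&
           b.offset < a.offset + a.length;
}

/* Split total bytes into nprocs contiguous, disjoint regions and return the
 * region for the given rank (0 <= rank < nprocs, nprocs > 0). The first
 * (total % nprocs) ranks receive one extra byte each. */
byte_range region_for_rank(uint64_t total, uint64_t nprocs, uint64_t rank)
{
    uint64_t base = total / nprocs;
    uint64_t extra = total % nprocs;
    byte_range r;
    r.offset = rank * base + (rank < extra ? rank : extra);
    r.length = base + (rank < extra ? 1 : 0);
    return r;
}
```

If every process restricts its writes to the region returned for its own
rank, no two writes conflict, so under the nonconflicting writes semantic the
result is well defined without any locking.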
% TODO: ADD DIAGRAMS IF NEEDED.

\subsubsection{Locks}

BSD provides the \texttt{flock} mechanism for locking whole files, and POSIX
provides byte-range locking through options to \texttt{fcntl}. Both are used
to perform atomic modifications to files, and both are advisory locks, which
means that processes not using the locks can still access the locked data.

PVFS2 does not implement a locking infrastructure as part of the file system.
At this time there is no add-on advisory locking component either. Thus
neither the \texttt{flock} function nor the \texttt{fcntl} advisory locks
are supported by PVFS2.
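
Applications that may run on both PVFS2 and other file systems can check the
result of a lock request rather than assume that it succeeded. The following
is a minimal sketch using the standard \texttt{fcntl} interface; the helper
names \texttt{try\_lock\_region} and \texttt{unlock\_region} are hypothetical.
It assumes that an unsupported lock request is reported as a failed
\texttt{fcntl} call; the exact error code returned on PVFS2 is not specified
here.

```c
#include <fcntl.h>
#include <unistd.h>

/* Apply a non-blocking advisory lock operation of the given type
 * (F_WRLCK to lock, F_UNLCK to unlock) to bytes [start, start + len). */
static int set_region_lock(int fd, short type, off_t start, off_t len)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl) == 0 ? 0 : -1;
}

/* Returns 0 if the write lock was acquired, -1 otherwise (errno is set by
 * fcntl). A failure may mean contention or that locks are unsupported. */
int try_lock_region(int fd, off_t start, off_t len)
{
    return set_region_lock(fd, F_WRLCK, start, len);
}

/* Returns 0 if the lock was released, -1 otherwise. */
int unlock_region(int fd, off_t start, off_t len)
{
    return set_region_lock(fd, F_UNLCK, start, len);
}
```

A caller that receives $-1$ from \texttt{try\_lock\_region} should fall back
to an access pattern that does not need locks, such as giving each process
its own nonconflicting region of the file.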

\subsection{MPI-IO Interface}

We provide an implementation of the MPI-IO interface via an implementation of
the ROMIO ADIO interface. This implementation is included as part of MPICH1
and MPICH2 and is also available as an independent package. The MPI-IO
interface is part of the MPI-2 standard and defines an API for file access
from within MPI programs.

Our PVFS2 implementation does not include all the functionality of the MPI-IO
specification. In this section we discuss the components of MPI-IO that are
missing for PVFS2.

\subsubsection{MPI-IO Atomic Mode}

Atomic mode is enabled by calling \texttt{MPI\_File\_set\_atomicity} and
setting the atomicity to true. Atomic mode guarantees that data written on
one process is immediately visible to another process (as in the POSIX default
semantics).

ROMIO currently uses file locking to implement the MPI-IO atomic mode
functionality. Because we do not support locks in PVFS2, atomic mode is not
currently supported.

\subsubsection{MPI-IO Shared Pointer Mode}

Shared pointers are used in the \texttt{\_shared} and \texttt{\_ordered}
families of functions.

The ROMIO implementation relies on locking support from the file
system to implement both of these families of functions, so they are not
currently supported. We are researching alternative implementations.