\documentclass[11pt,letterpaper]{article}
\usepackage{html}
\usepackage{charter}
\pagestyle{empty}
%
% GET THE MARGINS RIGHT, THE UGLY WAY
%
\topmargin 0.0in
\textwidth 6.5in
\textheight 9.0in
\columnsep 0.25in
\oddsidemargin 0.0in
\evensidemargin 0.0in
\headsep 0.0in
\headheight 0.0in
\title{Frequently Asked Questions about PVFS}
\author{ PVFS Development Team }
% \date{Last Updated: September 2004}
%
% BEGINNING OF DOCUMENT
%
\begin{document}
\maketitle
\tableofcontents
\thispagestyle{empty}
%
% BASICS
%
\section{Basics}
This section covers some basic questions for people who are unfamiliar with PVFS.
\subsection{What is PVFS?}
PVFS is an open-source, scalable parallel file system targeted at production
parallel computation environments. It is designed specifically to scale to
very large numbers of clients and servers. The architecture is very modular,
allowing for easy inclusion of new hardware support and new algorithms. This
makes PVFS a perfect research testbed as well.
\subsection{What is the history of PVFS?}
PVFS was first developed at Clemson University in 1993
by Walt Ligon and Eric Blumer as a parallel file system for
Parallel Virtual Machine (PVM). It was developed as part of a
NASA grant to study the I/O patterns of parallel programs. PVFS version
0 was based on Vesta, a parallel file system developed at IBM T. J.
Watson Research Center. Starting in 1994 Rob Ross re-wrote PVFS to
use TCP/IP and departed from many of the original Vesta design points.
PVFS version 1 was targeted to a cluster of DEC Alpha workstations
networked using switched FDDI. Like Vesta, PVFS striped data across
multiple servers and allowed I/O requests based on a file view that
described a strided access pattern. Unlike Vesta, the striping and view
were not dependent on a common record size. Ross' research focused on
scheduling of disk I/O when multiple clients were accessing the same
file. Previous results had shown that scheduling according to the best
possible disk access pattern was preferable. Ross showed that this
depended on a number of factors including the relative speed of the
network and the details of the file view. In some cases scheduling
based on network traffic was preferable; thus a dynamically
adaptable schedule provided the best overall performance.
In late 1994 Ligon met with Thomas Sterling and John Dorband at Goddard
Space Flight Center (GSFC) and discussed their plans to build the first
Beowulf computer. It was agreed that PVFS would be ported to Linux
and be featured on the new machine. Over the next several years Ligon
and Ross worked with the GSFC group including Donald Becker, Dan Ridge,
and Eric Hendricks. In 1997 at a cluster meeting in Pasadena, CA
Sterling asked that PVFS be released as an open source package.
In 1999 Ligon proposed the development of a new version of PVFS
initially dubbed PVFS2000 and later PVFS2. The design was initially
developed by Ligon, Ross, and Phil Carns. Ross completed his PhD in 2000
and moved to Argonne National Laboratory and the design and
implementation was carried out by Ligon, Carns, Dale Witchurch, and
Harish Ramachandran at Clemson University, Ross, Neil Miller, and Rob
Lathrum at Argonne National Laboratory, and Pete Wyckoff at Ohio
Supercomputer Center. The new file system was released in 2003. The
new design featured object servers, distributed metadata, views based on
MPI, support for multiple network types, and a software architecture for
easy experimentation and extensibility.
PVFS version 1 was retired in 2005. PVFS version 2 is still supported by
Clemson and Argonne. Carns completed his PhD in 2006 and joined Axicom,
Inc. where PVFS was deployed on several thousand nodes for data mining.
In 2008 Carns moved to Argonne and continues to work on PVFS along with
Ross, Latham, and Sam Lang. Brad Settlemyer developed a mirroring
subsystem at Clemson, and later a detailed simulation of PVFS used for
researching new developments. Settlemyer is now at Oak Ridge National
Laboratory. In 2007 Argonne began porting PVFS for use on an IBM Blue
Gene/P. In 2008 Clemson began developing extensions for supporting
large directories of small files, security enhancements, and redundancy
capabilities. As many of these goals conflicted with development for
Blue Gene, a second branch of the CVS source tree was created and dubbed
"Orange" and the original branch was dubbed "Blue." PVFS and OrangeFS
tracked each other very closely, but represent two different groups of
user requirements.
\subsection{What is OrangeFS?}
Simply put, OrangeFS is PVFS. OrangeFS is a branch of PVFS created by
the Clemson PVFS development team to investigate new features and
implementations of PVFS. As of fall 2010 OrangeFS has become the main
branch of PVFS. So why the name change? PVFS was originally conceived
as a research parallel file system and later developed for production on
large high performance machines such as the BG/P at Argonne National
Lab. OrangeFS takes a slightly different approach, supporting a
broader range of large and medium systems and addressing a number of
issues PVFS was not originally concerned with, including security,
redundancy, and a broader range of applications. The new name reflects
this new focus, but for
now at least, OrangeFS is PVFS.
The PVFS web site is still maintained. The PVFS mailing lists for
users and developers have not changed and will be used for OrangeFS.
At some point in the future
another group may decide to branch from the main but the PVFS site will
remain the home for the community.
\subsection{What is Omnibond?}
Omnibond is a software company that for years has worked with Clemson
University to market software developed at the university. As of fall
2010 Omnibond is offering commercial support for OrangeFS/PVFS.
OrangeFS is open source and will always be free; and the code, as
always, is developed and maintained by the PVFS community. Omnibond is
offering professional services to those who are interested in them, and
directly supports the PVFS community. Omnibond offers its customers the
option of dedicated support services and the opportunity to support the
development of new features that they feel are critical. Omnibond gives
back to the community through their support and development.
\subsection{What does the ``V'' in PVFS stand for?}
The ``V'' in PVFS stands for virtual. This is a holdover from the original
(PVFS1) project that built a parallel file system on top of local file
systems, which we still do now. It isn't meant to imply virtualization of
storage, although that is sort of what the file system does.
\subsection{Is PVFS an attempt to parallelize the *NIX VFS?}
No, and we're not even sure what that means! The design of PVFS does
not depend on the design of the traditional *NIX Virtual Filesystem
Switch (VFS) layer, although we provide a compatibility layer that
allows access to the file system through it.
\subsection{What are the components of PVFS that I should know about?}
The PVFS Guide (\url{http://www.pvfs.org/pvfs2-guide.html}) has more
information on all of these components, plus a discussion of the system as a
whole, the code tree, and more.
\subsection{What is the format of the PVFS version string?}
\label{sec:version-string}
PVFS uses a three-number version string: X.Y.Z. The first number (X)
represents the high level design version of PVFS. The current design
version is 2, and will likely remain there. The second number (Y) refers
to the major version of the release. Major versions are incremented with
new features, protocol changes, public API changes, and storage format
changes. The third number (Z) refers to the minor version of the release,
and is incremented primarily for bug fix releases.
With our 2.6.0 release,
we changed the release version and name from PVFS2 1.x.x, to PVFS 2.x.x.
Users familiar with `PVFS2' who had been using PVFS2 1.5.1
will find the same software in PVFS version 2.6.0 or
later (with updates and new features of course).
Users of PVFS version 1 can still go to:
\url{http://www.parl.clemson.edu/pvfs}, although we highly
encourage you to upgrade to PVFS version 2, if you are still using
version 1.
%
% SUPPORTED ARCHITECTURES
%
\section{Supported Architectures and Hardware}
This section covers questions related to particular system architectures,
operating systems, and other hardware.
\subsection{Does PVFS require any particular hardware?}
Other than hardware supported by the Linux OS, no. PVFS uses
existing network infrastructure for communication and can currently
operate over TCP, Myrinet, and InfiniBand. Disk local to servers is
used for PVFS storage, so no storage area network (SAN) is required
either (although it can be helpful when setting up fault tolerant solutions;
see Section~\ref{sec:fault-tolerance}).
\subsection{What architectures does PVFS support?}
\label{sec:supported-architectures}
The majority of PVFS is POSIX-compliant C code that runs in user
space. As such, much of PVFS can run on most available systems. See
Question~\ref{sec:supported-hw} for more information on particular
hardware.
The (optional) part of PVFS that hooks to the operating system on
clients must be written specifically for the particular operating
system. Question~\ref{sec:kernel-version} covers this issue.
\subsection{Does PVFS work across heterogeneous architectures?}
Yes! The ``language'' that PVFS uses to talk between clients and
servers is encoded in an architecture-independent format (little-endian
with fixed byte length parameters). This allows different PVFS
components to interact seamlessly regardless of architecture.
\subsection{Does running PVFS require a particular kernel or kernel
version?}
\label{sec:kernel-version}
You can run the userspace PVFS servers and administration tools on
every major GNU/Linux distribution out of the box, and we intend to
keep it that way.
%
However, the kernel module that allows client access to the PVFS system
does depend on particular kernel versions because it builds against
the running one (in the same manner as every other Linux module).
The kernel dependent PVFS client support has been written for Linux
kernel versions 2.4.19 (and greater) and 2.6.0 (and greater). At this
time only Linux clients have this level of support.
\subsection{What specific hardware architectures are supported by the
PVFS kernel module?}
\label{sec:supported-hw}
To our knowledge, PVFS has been verified to be working on x86/IA-32,
IA-64, AMD64, PowerPC (ppc), and Alpha based GNU/Linux distributions.
\subsection{Does the PVFS client require a patched Linux kernel?}
No. The kernel module source included with PVFS is generally
targeted toward the official ``Linus'' kernels (found at kernel.org).
Patches for the PVFS kernel module code may be provided for major
distributions that have modified their kernel to be incompatible with
the officially released kernels. The best place to find out more
information about support for a kernel tied to a particular
distribution is on the PVFS2-developers mailing list.
\subsection{Can I build the PVFS kernel code directly into the kernel,
rather than as a module?}
No, this is currently not supported nor recommended.
\subsection{Is there a MacOS X/Cygwin/Windows client for PVFS?}
At this time we have no plans for porting the code to operating
systems other than Linux. However, we do encourage porting efforts of
PVFS to other operating systems, and will likely aid in the
development.
%
% INSTALLATION
%
\section{Installation}
This section covers issues related to installing and configuring PVFS.
\subsection{How do I install PVFS?}
The PVFS Quick Start Guide
(\url{http://www.pvfs.org/pvfs2/pvfs2-quickstart.html}) provides an overview
of both a simple, single-server installation, and a more complicated,
multi-server configuration.
\subsection{How can I store PVFS data on multiple disks on a single node?}
\label{sec:multiple-disks}
There are at least two ways to do this.
In general the best solution to this problem is going to be to get the disks
logically organized into a single unit by some other OS component, then build
a file system on that single logical unit for use by the PVFS server on that
node.
There are a wide array of hardware RAID controllers that are capable of
performing this task.
%
The Multiple Devices (MD) driver is a software component of Linux that can be
used to combine multiple disk drives into a single logical unit, complete with
RAID for fault tolerance.
%
Using the Logical Volume Management (LVM) component of the Linux OS is another
option for this (see the HOWTO at
\url{http://www.tldp.org/HOWTO/LVM-HOWTO.html}). LVM would also allow you to
add or remove drives at a later time, which can be quite convenient. You
can of course combine the MD and LVM components in interesting ways as well,
but that's outside the scope of this FAQ.
%
There's an EVMS program that can be used for managing local storage; this
might be useful for setting up complicated configurations of local storage
prior to starting up PVFS servers.
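As a rough illustration, here is a minimal LVM sketch that combines two disks
into one logical unit for a server's storage; the device names, volume names,
and the choice of ext3 are assumptions, so adapt them to your system:
\begin{verbatim}
# combine two disks into a single logical volume (illustrative device names)
prompt# pvcreate /dev/sdb /dev/sdc
prompt# vgcreate pvfs_vg /dev/sdb /dev/sdc
prompt# lvcreate -l 100%FREE -n pvfs_lv pvfs_vg
# build a local file system on it and mount it where the PVFS server
# will keep its storage space
prompt# mkfs.ext3 /dev/pvfs_vg/pvfs_lv
prompt# mount /dev/pvfs_vg/pvfs_lv /pvfs-storage
\end{verbatim}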
A second solution would be to use more than one server on the same node, each
using a different file system to store its data. This might lead to resource
contention issues, so we suggest trying other options first.
\subsection{How can I run multiple PVFS servers on the same node?}
If you do decide to run more than one PVFS server on the same node,
setting things up is as simple as setting up servers on different
nodes. Each will need its own entry in the list of Aliases and its
own server-specific configuration file, as described in the Quick Start
(\url{http://www.pvfs.org/pvfs2/pvfs2-quickstart.html}).
\subsection{Can I use multiple metadata servers in PVFS?}
Absolutely! Any PVFS server can store either metadata, data, or both.
Simply allocate unique MetaHandleRanges for each server that you would like to
store metadata; the clients will handle the rest.
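For illustration only, a configuration with two metadata servers might contain
entries along these lines; the alias names, port, and handle ranges shown here
are made up, and the authoritative syntax is whatever \texttt{pvfs2-genconfig}
produces for your installation:
\begin{verbatim}
<Aliases>
    Alias server1 tcp://server1:3334
    Alias server2 tcp://server2:3334
</Aliases>

<Filesystem>
    ...
    <MetaHandleRanges>
        Range server1 4-1000000
        Range server2 1000001-2000000
    </MetaHandleRanges>
    ...
</Filesystem>
\end{verbatim}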
\subsection{Does using multiple metadata servers reduce the chance of
file system corruption during hardware failures?}
Unfortunately, no. While using multiple metadata servers distributes
metadata, it does not replicate or store redundant information across
these servers. For information on better handling failures, see
Section~\ref{sec:fault-tolerance}.
\subsection{How many servers should I run?}
\label{sec:howmany-servers}
Really, the answer is ``it depends'', but here are some factors you
should take into account.
Running multiple metadata servers might help if you expect to have
a lot of small files. The metadata servers are not involved in data
access (file contents) but do have a role in file creation and lookup.
Multiple clients accessing different files will likely access different
metadata servers, so you could see a load balancing effect.
A good rule of thumb is you should run as many data servers as possible.
One common configuration is to have some nodes with very
high-performance disks acting as servers to the larger cluster. As you
use more servers in this configuration, the theoretical peak performance
of PVFS increases. The clients, however, have to make very large
requests in order to stripe the I/O across all the servers. If your
clients will never write large files, use a smaller number of servers.
If your clients are writing out gigantic checkpoint files or reading in
huge datasets, then use more servers.
It is entirely possible to run PVFS servers on the same nodes doing
computation. In most cases, however, you will see better performance
if you have some portion of your cluster dedicated to IO and another
portion dedicated to computation.
\subsection{Can PVFS servers listen on two network interfaces simultaneously (i.e. multihome)?}
Yes! PVFS servers can listen on more than one interface at a time.
Multihome support was added shortly before the PVFS2 1.0 release.
\subsection{How can I automount PVFS volumes?}
The Linux automounter needs some help dealing with PVFS's resource
strings. A typical mount command (on Linux 2.6) would look like this:
\begin{verbatim}
mount -t pvfs2 tcp://server0:3334/pvfs2-fs /mnt/pvfs2
\end{verbatim}
The entry in the automount config file should look like this:
\begin{verbatim}
pvfs -fstype=pvfs2 tcp://server0\:3334/pvfs2-fs
\end{verbatim}
Note the backslash-escape of the colon before the port number. Without that
escape, the automounter will get confused and replace \texttt{'tcp://'} with
\texttt{'tcp:///'}.
\subsection{Can I mount more than one PVFS file system on the same client?}
\label{sec:multiple-mounts}
Yes. However, when setting up the two file systems it is important that both
file systems have unique \texttt{Name} and \texttt{ID} values (in the
file system configuration file). This means that you can't simply make a copy
of the \texttt{fs.conf} generated by \texttt{pvfs2-genconfig}; you will need
to edit the files a bit. This editing needs to be performed \emph{before} you
create the storage spaces!
\subsection{How can I upgrade from PVFS v1 to PVFS v2?}
Hans Reiser summarized the upgrade approach from reiserfs V3 to V4 with the following:
\begin{quote}
To upgrade from reiserfs V3 to V4, use tar, or sponsor us to write a convertfs.
\end{quote}
Similarly, there are no tools currently provided by the PVFS team to
upgrade from PVFS1 to PVFS2, so tar is your best bet.
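A minimal sketch of the tar approach, assuming the old PVFS1 volume is mounted
at \texttt{/mnt/pvfs1} and the new PVFS2 volume at \texttt{/mnt/pvfs2} (both
mount points are assumptions):
\begin{verbatim}
prompt# cd /mnt/pvfs1
prompt# tar cf - . | (cd /mnt/pvfs2 && tar xf -)
\end{verbatim}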
%
% REPORTING PROBLEMS
%
\section{Reporting Problems}
This section outlines some steps that will help the developers figure out what
has happened when you have a problem.
\subsection{Where can I find documentation?}
The best place to look for documentation on PVFS is the PVFS web site at
\url{http://www.pvfs.org/}. Documentation (including this FAQ) is also
available in the \texttt{doc} subdirectory of the PVFS source distribution.
Please reference \texttt{pvfs2-logging.txt} to understand more about PVFS'
informational messages, where the logs exist, and how to turn logging
on and off.
\subsection{What should I do if I have a problem?}
The first thing to do is to check out the existing documentation and see if it
addresses your problem. We are constantly updating documentation to clarify
sections that users have found confusing and to add to this document answers
to questions that we have seen.
The next thing to do is to check out the PVFS mailing list archives at
\url{http://www.pvfs.org/pvfs2/lists.html}. It is likely that you are not
the first person to see a particular problem, so searching this list will
often result in an immediate answer.
If you still haven't found an answer, the next thing to do is to mail the
mailing list and report your problem.
If you enjoy using IRC, you can also join us on irc.freenode.net in
the \#pvfs2 channel.
\subsection{How do I report a problem with PVFS?}
First you will need to join the PVFS2 Users Mailing list at
\url{http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users}. You
must be a user to post to the list; this is necessary to keep down the amount
of spam on the list.
Next you should gather up some information regarding your system:
\begin{itemize}
\item Version of PVFS
\item Version of MPI and MPI-IO (if you're using them)
\item Version of Linux kernel (if you're using the VFS interface)
\item Hardware architecture, including CPU, network, storage
\item Any logs that might be useful to the developers
\end{itemize}
Including this information in your first message will help the developers most
quickly help you. You are almost guaranteed that if you do not include this
information in your first message, you will be asked to provide it in the
first reply, slowing down the process.
You should be aware that you are also likely to be asked to try the newest
stable version if you are not running that version. We understand that this
is not always possible, but if it is, please do.
\emph{Note:} Please do not send your message to both the PVFS2 Users List and
the PVFS2 Developers List; the lists serve different purposes. Also, please
do not send your message directly to particular developers. By keeping
discussion of problems on the mailing lists we ensure that the discussion is
archived and that everyone has a chance to respond.
%
% Problems and Solutions
%
\section{Problems and Solutions}
This section covers error conditions you might encounter, what they might mean,
and how to fix them.
\subsection{When I try to mount, I get 'wrong fs type, bad option, bad
superblock...'}
First, make 100\% sure you typed the mount command correctly. As discussed in
the PVFS quickstart, different mount commands are needed for linux-2.4 and
linux-2.6. A linux-2.6 mount command will look like this:
\begin{verbatim}
prompt# mount -t pvfs2 tcp://testhost:3334/pvfs2-fs /mnt/pvfs2
\end{verbatim}
Under linux-2.4, the mount command looks slightly different:
\begin{verbatim}
prompt# mount -t pvfs2 pvfs2 /mnt/pvfs2 -o tcp://testhost:3334/pvfs2-fs
\end{verbatim}
This error could also mean a pvfs2-client process is not running,
either because it was not started before the mount command, or was
terminated at some point. If you can reliably (or even intermittently)
cause the pvfs2-client to exit abnormally, please send a report to the
developers.
This error can also occur if you attempt to mount a second PVFS file system
on a client, where the new file system has the same name or ID as one
that is already mounted. If you are trying to mount more than one file system
on the same client and have problems, please see question
\ref{sec:multiple-mounts}.
Finally, be sure there are no typos in your command line, as this is
commonly the case!
\subsection{PVFS server consumes 100\% of the CPU}
\label{sec:server_100pct_cpu}
On some systems, the pvfs2-server will start consuming 100\% of the CPU
after you try to read or write a file to PVFS. gdb indicates that the
server is spending a lot of time in the glibc routine
\texttt{'.handle\_kernel\_aio'}. Please check to see if your
distribution has an updated glibc package. RHEL3, for example, will
exhibit this behavior with glibc-2.3.2-95.6, but not with the updated
glibc-2.3.2-95.20 package. We have also seen this behavior on ppc64
systems running glibc-2.3.3-18.ydl.4. If you encounter this problem
and your distribution does not have an updated glibc package, you can
configure pvfs2 with \texttt{--disable-aio-threaded-callbacks}, though
this will result in a performance hit. An alternate workaround is to
set \texttt{LD\_ASSUME\_KERNEL} to 2.4.1 before running pvfs2-server.
pvfs2-server will then use an older (and not as optimized) thread
library that does not have this bug.
At this time we do not know which of the two suggested workarounds is
better from a performance standpoint. The \texttt{LD\_ASSUME\_KERNEL}
method might make more sense: when/if the system's glibc is
upgraded, you will only have to restart pvfs2-server with the
environment variable unset. You would not have to rebuild pvfs2 to take
advantage of the fix.
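For example, the environment-variable workaround might look like the sketch
below; the install path and configuration file arguments are assumptions, so
use whatever you normally pass when starting \texttt{pvfs2-server}:
\begin{verbatim}
prompt# export LD_ASSUME_KERNEL=2.4.1
prompt# /usr/local/sbin/pvfs2-server /etc/pvfs2/fs.conf \
            /etc/pvfs2/server.conf-`hostname`
\end{verbatim}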
\subsection{PVFS write performance slows down dramatically}
\label{sec:write_slowdown}
Phil Carns noticed that on some kernels, write-heavy workloads can trigger a
kernel bug. The symptoms are that the PVFS server will only be able to
deliver a few KB/s, and the CPU utilization will be close to 100\%. The cause
appears to be related to ext3's ``reservation'' code (designed to reduce
fragmentation). The solution is to either mount the filesystem with the
'noreservation' option, or upgrade your kernel.
For more information, including URLs to several other reports of this issue, see Phil's original post:
\url{http://www.beowulf-underground.org/pipermail/pvfs2-developers/2006-March/001885.html}
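If your kernel supports it, the mount-option workaround can be applied without
reformatting; a sketch, with an assumed device and mount point for the local
file system that holds the PVFS storage space:
\begin{verbatim}
prompt# mount -o remount,noreservation /dev/sda3 /pvfs-storage
\end{verbatim}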
\subsection{I get ``error while loading shared libraries'' when starting PVFS programs}
PVFS needs several libraries. If those libraries aren't in the default
locations, you might need to add flags when running PVFS's configure script.
At configure time you can, for example, pass \texttt{--with-db=/path/to/db
--with-gm=/path/to/gm} to compile with Berkeley DB and Myricom GM libraries.
The configure options let the compiler know where to find the libraries at
compile time.
Those compile-time options, however, aren't enough to find the libraries at
run-time. There are two ways to teach the system where to find libraries:
\begin{itemize}
\item add /usr/local/BerkeleyDB.4.3/lib to the /etc/ld.so.conf config file
and re-run 'ldconfig' OR
\item add /usr/local/BerkeleyDB.4.3/lib to the \texttt{LD\_LIBRARY\_PATH}
environment variable.
\end{itemize}
We suggest the ld.so.conf approach, since that will work for all users on
your system.
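Concretely, the two approaches look like this (the Berkeley DB path is the
example used above; substitute your own library locations):
\begin{verbatim}
# system-wide: add the path to ld.so.conf and refresh the cache
prompt# echo /usr/local/BerkeleyDB.4.3/lib >> /etc/ld.so.conf
prompt# ldconfig

# per-user: extend LD_LIBRARY_PATH before running PVFS programs
prompt$ export LD_LIBRARY_PATH=/usr/local/BerkeleyDB.4.3/lib:$LD_LIBRARY_PATH
\end{verbatim}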
\subsection{PVFS performance gets really bad once a day, then gets
better again}
\label{sec:cron-indexing}
Several sites have reported poor PVFS performance early in the day that
eventually goes away until the next day, when the cycle begins again. Daily
cron jobs might be the culprit in these cases. In particular, most Linux
distributions have a daily cron job (maybe called 'slocate', 'locate' or
'updatedb') that indexes the entire file system. Networked file systems such
as NFS are often excluded from this indexing.
The exact steps to remove PVFS from this indexing process vary among
distributions. Generally speaking, there should be a cron script in
\texttt{/etc/cron.daily} called 'slocate' or 'updatedb'. That script should
have a list of excluded file systems (like /var/run and /tmp) and file types
(like 'proc' and 'nfs'). Either add 'pvfs2' to the list of file types or add
the pvfs2 mount point to the list of excluded file systems. Be sure to do
this on all machines in your cluster.
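On many distributions this boils down to a one-line change in
\texttt{/etc/updatedb.conf}; the exact file name and the list contents vary
between distributions, so treat this as a sketch:
\begin{verbatim}
# add pvfs2 to the list of file system types that updatedb skips
PRUNEFS="nfs proc tmpfs pvfs2"
\end{verbatim}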
\subsection{Make kmod24 fails with ``structure has no member...'' errors}
On some Redhat and Redhat-derived distributions, ``make kmod24'' might
fail with errors like this:
\begin{verbatim}
console]:make kmod24
CC [M] /usr/src/pvfs2/src/kernel/linux-2.4/pvfs2-utils.o
pvfs2-utils.c: In function `mask_blocked_signals':
pvfs2-utils.c:1063: structure has no member named `sig'
pvfs2-utils.c:1070: structure has no member named `sigmask_lock'
pvfs2-utils.c:1073: too many arguments to function `recalc_sigpending'
pvfs2-utils.c: In function `unmask_blocked_signals':
pvfs2-utils.c:1082: structure has no member named `sigmask_lock'
pvfs2-utils.c:1084: too many arguments to function `recalc_sigpending'
make[1]: *** [pvfs2-utils.o] Error 1
make: *** [kmod24] Error 2
\end{verbatim}
Redhat, and derived distributions, have a linux-2.4 based kernel with many
linux-2.6 features backported. These backported features change the
interface to the kernel fairly significantly. PVFS versions newer than
1.0.1 have a new configure option \texttt{--enable-redhat24}. With this
option, we will be able to accommodate the backported features (and the
associated interface changes).
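In other words, on such a system the build might be reconfigured along these
lines, keeping whatever other configure options you normally use (the
bracketed placeholder stands for those options):
\begin{verbatim}
prompt# ./configure --enable-redhat24 [your usual options]
prompt# make
prompt# make kmod24
\end{verbatim}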
\subsection{When I try to mount a pvfs2 file system, something goes wrong.}
\begin{itemize}
\item First, are all the userspace components running? If \texttt{pvfs2-ping}
doesn't work, the VFS interface won't, either.
\item Make sure the pvfs2 kernel module is loaded
\item Make sure pvfs2-client and pvfs2-client core are running
\item Take a look at dmesg. \texttt{pvfs2\_get\_sb -- wait timed out} could
indicate a problem with \texttt{pvfs2-client-core}. See the next
question.
\end{itemize}
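A quick way to run through the checks above from a shell; the mount point is
an assumption, and the remaining commands are generic Linux diagnostics:
\begin{verbatim}
prompt# pvfs2-ping -m /mnt/pvfs2        # are the servers reachable?
prompt# lsmod | grep pvfs2              # is the kernel module loaded?
prompt# ps ax | grep pvfs2-client       # are the client daemons running?
prompt# dmesg | tail                    # any pvfs2 errors from the kernel?
\end{verbatim}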
\subsection{I did all three of the above steps and I still can't mount pvfs2}
\label{sec:nptl_and_mounting}
There's one last thing to check. Are you using a Redhat or Fedora
distribution, but running with a stock kernel.org 2.4 kernel? If so, you need
to set the environment variable \texttt{LD\_ASSUME\_KERNEL} to 2.4.1 or
\texttt{pvfs2-client-core} will try to use the NPTL thread library. NPTL
requires a 2.6 kernel (or a heavily backported 2.4 kernel, which Redhat
provides). Redhat systems expect to have such a kernel, so running a stock
kernel.org 2.4 kernel can cause issues with any multi-threaded application. In
this particular case, the \texttt{pvfs2-client-core} failure is hidden and can
be tricky to diagnose.
\subsection{I'm running Redhat and the pvfs2-server can't be killed! What's wrong?}
On some Redhat systems, for compatibility reasons, the pvfs2-server
program is actually a script that wraps the installed pvfs2-server
binary. We do this ONLY if we detect that PVFS is being installed on
a system with an NPTL implementation that we're incompatible with.
Specifically, the script exports the LD\_ASSUME\_KERNEL=2.2.5
environment variable and value to avoid using the NPTL at run-time.
The script quite literally exports this variable and then runs the
installed pvfs2-server binary which is named
\texttt{pvfs2-server.bin}. So to properly shutdown or kill the
pvfs2-server application once it's running, you need to issue a
\texttt{killall pvfs2-server.bin} command instead of the more common
\texttt{killall pvfs2-server} command.
\subsection{Why do you single out Redhat users? What's so different
about Redhat than other distributions?}
Some Redhat versions (and probably some other less popular
distributions) use a heavily modified Linux 2.4.x kernel. Due to the
changes made in the memory manager and signal handling, our default
Linux 2.4.x kernel module will not even compile! We have
compatibility code that can mend the differences in place, but we have
to be able to detect that you're running such a system. Our configure
script tries hard to determine which version you're running and
matches it against a known list. If you suspect you need this fix and
our script does not properly detect it, please send mail to the
mailing list and include the contents of your /etc/redhat-release
file.
In addition, some Redhat versions ship with an NPTL (threading
library) implementation that PVFS is not compatible with. We cannot
explain why the errors we're seeing are occurring, as they appear to be
in glibc and the threading library itself. In short, we disable the
use of the NPTL on these few Redhat systems. It should be noted that
we are fully compatible with other distributions that ship NPTL
libraries (such as Gentoo and Debian/unstable).
\subsection{Where is the kernel source on a Fedora system?}
Older systems used to split up the kernel into several packages
(\texttt{kernel}, \texttt{kernel-headers}, \texttt{kernel-source}).
Fedora kernels are not split up that way. Everything you need to build a
kernel module is in /lib/modules/`uname -r`/build. For example, Fedora
Core 3 ships with linux-2.6.9-1.667. When configuring PVFS, you would
pass \texttt{--with-kernel=/lib/modules/2.6.9-1.667/build} to the
configure script.
In Fedora Core 4 things changed a little bit. In order to build the pvfs2
kernel module, make sure you have both a \texttt{kernel} and
\texttt{kernel-devel} package installed. If you have an SMP box, then you'll
need to install the -smp versions of both -- i.e. \texttt{kernel-smp} and
\texttt{kernel-smp-devel}. After both packages are installed,
/lib/modules/`uname -r`/build will once again contain a correctly configured
kernel source tree.
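For example, on a Fedora Core 4 system with a uniprocessor kernel, the steps
might look like this (package names differ for SMP kernels, as noted above):
\begin{verbatim}
prompt# yum install kernel kernel-devel
prompt# ./configure --with-kernel=/lib/modules/`uname -r`/build
\end{verbatim}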
\subsection{What are extended attributes? How do I use them with PVFS?}
Extended attributes are name:value pairs associated with objects (files and directories
in the case of PVFS). They are extensions to the
normal attributes which are associated with all objects in the system (i.e. the stat data).
A complete overview of the extended attribute concepts can be found in section 5 of the man pages (\texttt{man 5 attr}).
On supported 2.4 kernels and all 2.6 kernels, PVFS allows users to store extended attributes
on file-system objects through the VFS as well as through the system interface. Example
usage scenarios are shown below:
To set an extended attribute ("key1", "val1") on a PVFS file foo,
\begin{verbatim}
prompt# setfattr -n key1 -v val1 /path/to/mounted/pvfs2/foo
\end{verbatim}
To retrieve an extended attribute for a given key ("key1") on a PVFS file foo,
\begin{verbatim}
prompt# getfattr -n key1 /path/to/mounted/pvfs2/foo
\end{verbatim}
To retrieve all attributes of a given PVFS file foo,
\begin{verbatim}
prompt# getfattr -m "" /path/to/mounted/pvfs2/foo
\end{verbatim}
Note that PVFS uses a few standard names for its internal use that prohibit users
from reusing the same names. The list of such keys is as follows at the time
of writing of this document ("dir\_ent", "root\_handle",
"datafile\_handles", "metafile\_dist", "symlink\_target"). Further, Linux also uses
a set of reserved keys to hold extended attributes that begin with the prefix "system.",
thus making them unavailable for regular usage.
\subsection{What are Access Control Lists? How do I enable Access Control Lists on PVFS?}
Recent versions of PVFS support POSIX Access Control Lists (ACL), which are used to define fine-grained
discretionary access rights for files and directories. Every object can be thought of as having
associated with it an ACL that governs the discretionary access to that object; this ACL
is referred to as an access ACL. In addition, a directory may have an associated ACL that
governs the initial access ACL for objects created within that directory; this ACL
is referred to as a default ACL. Each ACL consists of a set of ACL entries. An ACL entry
specifies the access permissions on the associated object for an individual user or a group
of users as a combination of read, write and search/execute permissions.
PVFS supports POSIX ACLs by storing them as extended attributes. However, support
for access control based permission checking does not exist on 2.4 Linux kernels and is hence disabled on them.
Most recent versions of the Linux 2.6 kernel do allow for such permission checks, and PVFS enables
ACLs on such kernels.
However, in order to use and enforce access control lists on 2.6 kernels, one must mount
the PVFS file system by specifying the "acl" option in the mount command line. For example,
\begin{verbatim}
prompt# mount -t pvfs2 tcp://testhost:3334/pvfs2-fs /mnt/pvfs2 -o acl
\end{verbatim}
Please refer to the man pages for "setfacl" and "getfacl", or to \texttt{man 5 acl}, for detailed usage
information.
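Once the file system is mounted with the \texttt{acl} option, the standard
tools apply; for example (the user name and path are made up for
illustration):
\begin{verbatim}
prompt# setfacl -m u:alice:rw /mnt/pvfs2/shared-data
prompt# getfacl /mnt/pvfs2/shared-data
\end{verbatim}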
\subsection{On SLES 9, 'make kmod' complains about \texttt{mmgrab} and
\texttt{flush\_icache\_range} being undefined}
SLES 9 (and possibly other kernels) makes use of internal symbols in some
inlined kernel routines. PVFS2-1.3.2 or newer has the configure option
\texttt{--disable-kernel-aio}. Passing this option to configure results in a pvfs2
kernel module that uses only exported symbols.
\subsection{Everything built fine, but when I try to compile programs that use PVFS, I get undefined references}
\label{sec:undefined_references}
The \texttt{libpvfs2} library requires a few additional libraries. Usually
"-lpthread -lcrypto -lssl" are required. Further, Myrinet and Infiniband have
their own libraries. If you do not link the required libraries, you will
probably get errors such as \texttt{undefined reference to `BIO\_f\_base64'}.
The easiest and most portable way to ensure that you link in all required
libraries when you link \texttt{libpvfs2} is to use the \texttt{pvfs2-config}
utility. \texttt{pvfs2-config --libs} will give you the full set of linker
flags needed. Here's an example of how one might use this tool:
\begin{verbatim}
$ gcc -c $(pvfs2-config --cflags) example.c
$ gcc example.o -o example $(pvfs2-config --libs)
\end{verbatim}
\subsection{Can we run the Apache webserver to serve files off a PVFS volume?}
Sure you can! However, we recommend that you turn off the EnableSendfile option in
httpd.conf before starting the web server. Alternatively, you could configure
PVFS with the option \texttt{--enable-kernel-sendfile}. Passing this option
to configure results in a pvfs2 kernel module that supports the sendfile
callback.
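The corresponding \texttt{httpd.conf} change is a single directive:
\begin{verbatim}
# stop Apache from using sendfile() when serving files from PVFS
EnableSendfile off
\end{verbatim}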
However, unless the files being served are large, this may not be a good idea
in terms of performance. Apache 2.x+ uses the {\tt sendfile} system call, which
normally stages the file data through the page cache. On recent 2.6 kernels,
this can be averted by providing a {\tt sendfile} callback routine in the file
system, which ensures that we don't end up with stale or inconsistent cached
data on such kernels. However, on older 2.4 kernels the {\tt sendfile} system
call streams the data through the page cache, so there is a real possibility
of the data being served stale. Users of the {\tt sendfile} system call should
therefore be wary of this detail.
\subsection{Trove-dbpf metadata format version mismatch!}
\label{sec:trove-migration}
In PVFS2-1.5.0 or newer the format of the metadata storage has changed from
previous versions (1.4.0 or earlier). This affects users that have created
file systems with the earlier versions of pvfs2, and wish to upgrade to the
most recent version. We've provided a migration tool that must be run
(a one-time only procedure) to convert the file system from the old format
to the new one. The migration tool can be used as follows:
\begin{verbatim}
$PVFS_INSTALL/bin/pvfs2-migrate-collection --all fs.conf server.conf-<hostname>
\end{verbatim}
This command finds all the pvfs2 storage collections specified in the
configuration files and migrates them to the new format. Instead of
using {\tt --all}, the option {\tt --fs} can be used to specify the name of
the storage collection that needs to be migrated (usually there's only
one storage collection, with the default name of 'pvfs2-fs').
\subsection{Problems with pre-release kernels}
\label{sec:rc-kernels}
For better or worse, the Linux kernel development process for the 2.6 series
does not make much effort to maintain a stable kernel API. As a result, we
often find we need to make small adjustments to the PVFS kernel module to track
recent kernel additions or changes.
If you are using a pre-release kernel (anything with -rc in the name), you
stand a good chance of running into problems. We are unable to track every
pre-release kernel, but do make an effort to publish necessary patches once a
kernel is officially released.
\subsection{Does PVFS work with Open-MX?}
\label{sec:open-mx}
Yes, PVFS does work with Open-MX. To use Open-MX, configure PVFS with
the same arguments that you would use for a normal MX installation:
``--disable-bmi-tcp'' and ``--with-mx=PATH''. In addition, however, you
must set the ``MX\_IMM\_ACK'' environment variable to ``1'' before starting
the pvfs2-server or pvfs2-client daemons. This is necessary in order to
account for differences in how MX and Open-MX handle message progression by
default.
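For example, in the startup script (or shell) used to launch the daemons; the
install path and trailing arguments are assumptions standing in for your usual
invocation:
\begin{verbatim}
prompt# export MX_IMM_ACK=1
prompt# /usr/local/sbin/pvfs2-server ...   # or pvfs2-client, on client nodes
\end{verbatim}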
%
% PERFORMANCE
%
\section{Performance}
This section covers issues related to the performance of PVFS.
\subsection{I configured PVFS with support for multiple interconnects (e.g. InfiniBand and TCP), but see low performance}
\label{sec:multi-method-badperf}
When multiple interconnects are enabled, PVFS will poll both interfaces. This
gives PVFS maximum flexibility, but does incur a performance penalty when one
interface is not being used. For highest performance, configure PVFS with only
one fast method: use the \texttt{--without-bmi-tcp} option or omit the
\texttt{--with-<METHOD>} option when configuring PVFS.
Note that it can sometimes be useful to have multiple interconnects enabled.
The right choice depends a lot on your situation.
\subsection{I ran Bonnie and/or IOzone and the performance is terrible.
Why? Is there anything I can do?}
\label{sec:badperf}
We designed PVFS to work well for scientific applications in a cluster
environment. In such an environment, a file system must either spend
time ensuring all client-side caches are in sync, or not use a cache
at all (which is how PVFS currently operates). The \texttt{bonnie}
and \texttt{bonnie++} benchmarks read and write very small blocks --
on the order of 1K. These many small requests must travel from the
client to the server and back again. Without client-side caching,
there is no sane way to speed this up.
To improve benchmark performance, specify a bigger block size. PVFS
has several more aggressive optimizations that can be turned on, but
those optimizations require that applications accessing PVFS can cope
with out-of-sync caches.
In the future, PVFS is looking to provide optional semantics for use
through the VFS that will allow some client-side caching to speed these
kinds of serial benchmarks up. By offering a way to explicitly sync
data at any given time or by providing 'close-to-open' semantics, these
kinds of caching improvements become an option for some applications.
Bear in mind that benchmarks such as IOzone and Bonnie were meant to
stress local file systems. They do not accurately reflect the types of
workloads for which we designed PVFS. Furthermore, because of their
serial nature, PVFS will be unable to deliver its full performance.
Instead try running a parallel file system benchmark like IOR
(\url{ftp://ftp.llnl.gov/pub/siop/ior/}).
\subsection{Why is program XXX so slow?}
\label{sec:why_so_slow}
See Question~\ref{sec:badperf}. If the program uses small block sizes to
access a PVFS file, performance will suffer.
Setting both (or either of) the \texttt{TroveSyncMeta} and
\texttt{TroveSyncData} options to \texttt{no} in the config file can
improve performance in some situations. If you set the
value to no and the server is terminated unexpectedly, you will likely
lose data (or access to it). Also, PVFS has a transparent server
side attribute cache (enabled by default), which can speed up
applications which read a lot of attributes (\texttt{ls}, for
example). Playing around with the \texttt{AttrCache*} config file
settings may yield some performance improvements. If you're running a
serial application on a single node, you can also use the client side
attribute cache (disabled by default). This timeout is adjustable as
a command line argument to pvfs2-client.
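A hedged sketch of what those settings might look like in the file system
configuration file; the surrounding \texttt{<StorageHints>} section is an
assumption based on typical \texttt{pvfs2-genconfig} output, so check the
comments in your own generated configuration:
\begin{verbatim}
<StorageHints>
    TroveSyncMeta no
    TroveSyncData no
</StorageHints>
\end{verbatim}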
\subsection{NFS outperforms PVFS for application XXX. Why?}
\label{sec:nfs_vs_pvfs2}
In an environment where there is one client accessing a file on one
server, NFS will outperform PVFS in many benchmarks. NFS has
completely different consistency semantics, which work very well when
just one process accesses a file. There is some ongoing work that
will optionally offer similar consistency semantics for PVFS, at
which point we will be playing on a level field, so to speak.
However, if you insist on benchmarking PVFS and NFS in a
single-client test, there are some immediate adjustments you can make.
The easiest way to improve PVFS performance is to increase the block
size of each access. Large block sizes help most file systems, but
for PVFS they make a much larger difference in performance than they
do for other file systems.
Also, if the \texttt{TroveSyncMeta} and \texttt{TroveSyncData} options
are set to \texttt{no} in your PVFS configuration file, the server
will sync data to disk only when a flush or close operation is called.
The \texttt{TroveSyncMeta} option is set to \texttt{yes} by default,
to limit the amount of
data that could be lost if a server is terminated unexpectedly. With
this option enabled, it is somewhat analogous to mounting your NFS
volume with the \texttt{sync} flag, forcing it to sync data after each
operation.
As a final note on the issue, if you plan on running application XXX,
or a similar workload, and the NFS consistency semantics are adequate
for what you're doing, then perhaps PVFS is not a wise choice of file
system for you. PVFS is not designed for serial workloads,
particularly ones with small accesses.
\subsection{Can the underlying local file system affect PVFS performance?}
\label{sec:local_fs}
Yes! However, the interaction between the PVFS servers and the local
file system hosting the storage space has not been fully explored. No
doubt a great deal of time could be spent on different file systems
and their parameters.
People have looked at sync performance for a variety of file systems.
Some file systems will flush all dirty buffers when \texttt{fsync} is
called. Other file systems will only flush dirty buffers belonging to
the file. See the threads starting at
\url{http://www.parl.clemson.edu/pipermail/pvfs2-developers/2004-July/000740.html}
and at
\url{http://www.parl.clemson.edu/pipermail/pvfs2-developers/2004-July/000741.html}.
These tests demonstrate wide variance in file system behavior.
Interested users are encouraged to experiment and discuss their
findings on the PVFS lists.
If you're looking for a quick suggestion for a local file system type
to use, we suggest ext3 with ``journal data writeback'' option as a
reasonable choice.
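For ext3, that corresponds to the \texttt{data=writeback} mount option; a
sketch, with an assumed device and mount point for the server's storage:
\begin{verbatim}
prompt# mount -t ext3 -o data=writeback /dev/sdb1 /pvfs-storage
\end{verbatim}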
\subsection{Is there any way to tune particular directories for different
workloads?}
\label{sec:dir_tuning}
Yes. This can be done by using extended attributes to set directory
hints. Three hints are currently supported, and they allow you to specify
the distribution, distribution parameters, and number of datafiles to
stripe across. They will not change the characteristics of existing
files, but they will take effect for any newly created files within the
directory. These hints will also be inherited by any new
subdirectories.
\subsubsection{Distribution}
The distribution can be set as follows:
\begin{verbatim}
prompt# setfattr -n "user.pvfs2.dist_name" -v "basic_dist" /mnt/pvfs2/directory
\end{verbatim}
Supported distribution names can be found by looking in the pvfs2-dist-*
header files.
\subsubsection{Distribution parameters}
Some distributions allow you to set parameters that impact how the
distribution behaves. These parameters can be set as follows:
\begin{verbatim}
prompt# setfattr -n "user.pvfs2.dist_params" -v "strip_size:4096" /mnt/pvfs2/directory
\end{verbatim}
You can specify more than one "parameter:value" pair by separating them with
commas.
\subsubsection{Number of datafiles}
You can also specify the number of datafiles to stripe across:
\begin{verbatim}
prompt# setfattr -n "user.pvfs2.num_dfiles" -v "1" /mnt/pvfs2/directory
\end{verbatim}
PVFS defaults to striping files across each server in the file system.
However, you may find that for small files it is advantageous to limit each
file to only a subset of servers (or even just one).
\subsection{My app still runs more slowly than I would like. What can I do?}
\label{sec:tuning}
If you ask the mailing list for help with performance, someone will probably
ask you one or more of the following questions:
\begin{itemize}
\item Are you running servers and clients on the same nodes? We support this
configuration -- sometimes it is required given space or budget
constraints. You will not, however, see the best performance out of this
configuration. See Section~\ref{sec:howmany-servers}.
\item Have you benchmarked your network? A tool like netpipe or ttcp can help
diagnose point-to-point issues. PVFS will tax your bisection bandwidth,
so if possible, run simultaneous instances of these network benchmarks
on multiple machine pairs and see if performance suffers. One user
realized the cluster had a hub (not a switch, a hub) connecting all the
nodes. Needless to say, performance was pretty bad.
\item Have you examined buffer sizes? On Linux, the settings in /proc can make
      a big difference in TCP performance. Try increasing
      \texttt{/proc/sys/net/core/rmem\_default} and
      \texttt{/proc/sys/net/core/wmem\_default}, as shown in the example
      after this list.
\end{itemize}
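For example (the values here are only illustrative starting points, not tuned
recommendations for your network):
\begin{verbatim}
prompt# sysctl -w net.core.rmem_default=262144
prompt# sysctl -w net.core.wmem_default=262144
\end{verbatim}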
Tuning applications can be quite a challenge. You have disks, networks,
operating systems, PVFS, the application, and sometimes MPI. We are
working on a document to better guide the tuning of systems for
IO-intensive workloads.
%
% REDUNDANCY
%
\section{Fault Tolerance}
\label{sec:fault-tolerance}
This section covers issues related to fault tolerance in the context of PVFS.
\subsection{Does PVFS support some form of fault tolerance?}
Systems can be set up to handle many types of failures for PVFS. Given enough
hardware, PVFS can even handle server failure.
\subsection{Can PVFS tolerate client failures?}
Yes. One of the benefits of the PVFS design is that client failures are not a
significant event in the system. Because there is no locking system in PVFS,
and no shared state stored on clients in general, a client failure does not
affect either the servers or other clients.
\subsection{Can PVFS tolerate disk failures?}
Yes, if configured to do so. Multiple disks on each server may be used to
form redundant storage for that server, allowing servers to continue operating
in the event of a disk failure. See section \ref{sec:multiple-disks} for more
information on this approach.
\subsection{Can PVFS tolerate network failures?}
Yes, if your network has redundant links. Because PVFS uses standard
networks, the same approaches for providing multiple network connections to a
server may be used with PVFS. \emph{Need a reference of some sort.}
\subsection{Can PVFS tolerate server failures?}
Yes. We currently have a recipe describing the hardware and software
needed to set up PVFS in a high availability cluster. Our method is
outlined in the `pvfs2-ha.\{ps,pdf\}' file in the doc subdirectory of the
PVFS distribution. This configuration relies on shared storage and
commodity ``heartbeat'' software to provide means for failover.
Software redundancy offers a less expensive alternative,
but usually at a non-trivial cost in performance. We are studying how
to implement software redundancy with lower overhead, but at this time
we provide no software-only server failover solution.
%
% INTERFACES
%
\section{File System Interfaces}
This section covers issues related to accessing PVFS file systems.
\subsection{How do I get MPI-IO for PVFS?}
The ROMIO MPI-IO implementation, as provided with MPICH2 and others, supports
PVFS. You can find more information in the ROMIO section of the
pvfs2-quickstart: \url{http://www.pvfs.org/pvfs2/pvfs2-quickstart.html\#sec:romio}
\subsection{Can I directly manipulate PVFS files on the PVFS servers
without going through some client interface?}
You can, yes, but you probably should not. The PVFS developers are not
likely to help you out if you do this and something gets messed up...
%
% MANAGEMENT
%
\section{Management}
This section covers questions about managing PVFS file systems.
\subsection{How can I back up my PVFS file system?}
The default storage implementation for PVFS (called Trove DBPF for ``DB Plus
Files'') stores all file system data held by a single server in a single
subdirectory. In that subdirectory is a directory tree containing UNIX files
with file data and metadata.
%
This entire directory tree can be backed up in any manner you like and
restored if problems occur.
As a side note, this was not possible in PVFS v1, and is one of the many
improvements present in the new system.
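A minimal sketch of such a backup on one server, assuming the storage space
lives in \texttt{/pvfs2-storage-space}; the path and the choice to stop the
server first are assumptions, but a quiescent server avoids copying files
mid-update:
\begin{verbatim}
# with the local pvfs2-server stopped:
prompt# tar czf pvfs2-backup-`hostname`.tar.gz /pvfs2-storage-space
\end{verbatim}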
\subsection{Can I add, remove, or change the order of the PVFS servers
on an existing PVFS file system?}
You can add and change the order of PVFS servers for an existing PVFS file
system. At this time, you must stop all the servers in order to do so.
To add a new server:
\begin{enumerate}
\item Unmount all clients
\item Stop all servers
\item Edit your config file to:
\begin{enumerate}
\item Add a new Alias for the new server
\item Add a new DataHandleRange for the new server (picking a range you
didn't previously use)
\end{enumerate}
\item Deploy the new config file to all the servers, including the new one
\item Create the storage space on the new server
\item Start all servers
\item Remount clients
\end{enumerate}
To reorder the servers (causing round-robin to occur in a different relative
order):
\begin{enumerate}
\item Unmount all clients
\item Stop all servers
\item Edit your config file to reorder the DataHandleRange entries
\item Deploy the new config file to all the servers
\item Start all servers
\item Remount clients
\end{enumerate}
Note that adding a new server will \emph{not} cause existing datafiles to be
placed on the new server, although new ones will be (by default). Migration
tools are necessary to move existing datafiles (see
Question~\ref{sec:migration}) both in the case of a new server, or if you
wanted to migrate data off a server before removing it.
\subsection{Are there tools for migrating data between servers?}
\label{sec:migration}
Not at this time, no.
\subsection{Why does df show less free space than I think it should? What
can I do about that?}
\label{sec:df-free-space}
PVFS uses a particular algorithm for calculating the free space on a file
system that takes the minimum amount of space free on a single server and
multiplies this value by the number of servers storing file data.
%
This algorithm was chosen because it provides a lower-bound on the amount of
data that could be stored on the system at that point in time.
If this value seems low, it is likely that one of your servers has less space
than the others (either physical space, or because someone has put some other
data on the same local file system on which PVFS data is stored). The
\texttt{pvfs2-statfs} utility, included with PVFS, can be used to check the
amount of free space on each server, as can the \texttt{karma} GUI.
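For example, if three data servers have 100~GB, 80~GB, and 60~GB free
respectively, \texttt{df} will report $3 \times 60~\mathrm{GB} =
180~\mathrm{GB}$ free rather than the 240~GB physically available, because
180~GB is the amount PVFS can guarantee it could store with its default
striping across all three servers.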
\subsection{Does PVFS have a maximum file system size? If so, what is it?}
PVFS uses a 64-bit value for describing the offsets into files, so
theoretically file sizes are virtually unlimited. However, in practice other
system constraints place upper bounds on the size of files and file systems.
To best calculate maximum file and file system sizes, you should determine the
maximum file and file system sizes for the local file system type that you are
using for PVFS server storage and multiply these values by the number of
servers you are using.
\subsection{Mounting PVFS with the interrupt option}
\label{sec:mountintr}
The PVFS kernel module supports the {\tt intr} option provided by
network file systems. This allows applications to be sent kill signals
when a filesystem is unresponsive (due to network failures, etc.). The
option can be specified at mount time:
\begin{verbatim}
mount -t pvfs2 -o intr tcp://hosta:3334/pvfs2-fs /pvfs-storage/
\end{verbatim}
%
% MISSING FEATURES
%
\section{Missing Features}
This section discusses features that are not present in PVFS that are present
in some other file systems.
\subsection{Why don't hardlinks work under PVFS?}
We didn't implement hardlinks, and there is no plan to do so. Symlinks are
implemented.
\subsection{Can I \texttt{mmap} a PVFS file?}
Private, read-only mmapping of files is supported. Shared mmapping of files
is not. Supporting this would force a great deal of additional infrastructure
into PVFS that would compromise the design goals of simplicity and
robustness. This ``feature'' was intentionally left out, and it will remain
so.
\subsection{Will PVFS store new files on servers with more space, allowing
files to be stored when one server runs out of space?}
No. Currently PVFS does not intelligently place new files based on free
space. It's a good idea, and possible, but we have not done this yet. See
Section~\ref{sec:contributing} for notes on how you could help get this
feature in place.
\subsection{Does PVFS have locks?}
No. Locking subsystems add a great deal of shared state to a parallel file
system implementation, and one of the primary design goals was to eliminate
shared state in PVFS. This results in a simpler, more fault tolerant
overall system than would have been possible had we integrated locking into
the file system.
It's possible that an add-on locking subsystem will be developed at some point;
however, there is no plan to build such a system at this time.
%
% HELPING OUT
%
\section{Helping Out}
This section covers ways one could contribute to the PVFS project.
\subsection{How can I contribute to the PVFS project?}
\label{sec:contributing}
There are lots of ways to directly or indirectly contribute to the PVFS
project. Reporting bugs helps us make the system better, and describing your
use of the PVFS system helps us better understand where and how PVFS is
being deployed.
Even better, patches that fix bugs, add features, or support new hardware are
very welcome! The PVFS community has historically been a friendly one, and we
encourage users to discuss issues and exchange ideas on the mailing lists.
If you're interested in this type of exchange, we suggest joining the PVFS2
Developers List, grabbing the newest CVS version of the code, and seeing what
is new in PVFS. See \url{http://www.pvfs.org/pvfs2/developers.html} for more
details.
%
% IMPLEMENTATION DETAILS
%
\section{Implementation Details}
This section answers questions regarding specific components of the
implementation. It is most useful for people interested in augmenting or
modifying PVFS.
\subsection{BMI}
This section specifically covers questions about the BMI interface and
implementations.
\subsubsection{What is the maximum packet size for BMI?}
Each BMI module is allowed to define its own maximum message size. See
\texttt{BMI\_tcp\_get\_info}, \texttt{BMI\_gm\_get\_info}, and
\texttt{BMI\_ib\_get\_info} for examples of the maximum sizes that each of the
existing modules support. The maximum should be reported when you issue a
\texttt{get\_info} call with the option set to \texttt{BMI\_CHECK\_MAXSIZE}.
Higher level components of PVFS perform these checks in order to make sure
that they don't choose buffer sizes that are too large for the underlying
network.
\subsubsection{What happens if I try to match a BMI send with a BMI receive
that has too small a buffer?}
If the receive buffer is too small for the incoming message, then the
communication will fail and an error will be reported if possible. We
don't support any semantics for receiving partial messages or anything like
that. It's okay if the receive buffer is too big, though.
\end{document}