trove-dbpf.tex

%
% DBPF
%

\documentclass[10pt]{article} % FORMAT CHANGE
\usepackage[dvips]{graphicx}
\usepackage{times}

\graphicspath{{./}{figs/}}

%
% GET THE MARGINS RIGHT, THE UGLY WAY
%
% \topmargin 0.2in
% \textwidth 6.5in
% \textheight 8.75in
% \columnsep 0.25in
% \oddsidemargin 0.0in
% \evensidemargin 0.0in
% \headsep 0.0in
% \headheight 0.0in

\pagestyle{plain}

\addtolength{\hoffset}{-2cm}
\addtolength{\textwidth}{4cm}

\addtolength{\voffset}{-1.5cm}
\addtolength{\textheight}{3cm}

\setlength{\parindent}{0pt}
\setlength{\parskip}{11pt}

\title{Trove Database + Files (DBPF) Implementation}
\author{PVFS Development Team}
% \date{December 2000}

\begin{document}

\maketitle

\section{Introduction}

The purpose of this document is to sketch out the Database + Files (DBPF)
implementation of the Trove storage interface.  Included will be discussion of
how one might use the implementation in practice.

DBPF uses UNIX files and Berkeley DB (3 or 4 I think) databases to store file
and directory data and metadata.

\section{Entities on Physical Storage}

The locations of these entities are defined in dbpf.h, although in future
versions they should be relative to the storage name passed in at initialize time.

For a given server there are single instances of each of the following:
\begin{itemize}
\item storage attributes DB -- holds attributes on the storage space as a whole
\item collections DB -- holds information on collections and is used to map
      from a collection name to a collection ID
\end{itemize}

For each collection there are one of each of the following:
\begin{itemize}
\item collection attributes DB -- holds attributes for the collection as a
      whole, including handle usage information and root handle
\item dataspace attributes DB -- holds dbpf\_dspace\_attr structure for each
      dataspace, referenced by handle
\end{itemize}

In addition, dataspaces have a one to one mapping to the following, although
they are created lazily (so they will not exist if not yet used):
\begin{itemize}
\item bstream file -- UNIX files that holds bstream data, just as in the PVFS1 iod
\item keyval DB -- holds keyvals for a specific dataspace
\end{itemize}

At this time both bstream files and keyval DBs are stored in a flat directory
structure.  This may change to be a hashed directory system as in the PVFS1
iod if/when we see that we are incurring substantial overhead from lookups in
a directory with many entries.

\section{Hooking to Upper Trove Layer}

The DBPF implementation hooks to the upper trove ``wrapper'' layer through a
set of structures that contain pointers to the functions that make up the
Trove API.  These structures are defined in the upper-layer trove.h and are
TROVE\_bstream\_ops, TROVE\_keyval\_ops, TROVE\_dspace\_ops, TROVE\_mgmt\_ops,
and TROVE\_fs\_ops.

The TROVE\_mgmt\_ops structure includes a pointer to the function initialize().
This is used to initialize the underlying Trove implementation (e.g. DBPF).
The initialize function takes the storage name as a parameter; this should be
used to locate the various entities that are used for storing the DBPF
collections.

\emph{Additionally a method ID is passed in...why???}

The method name field is filled in by the DBPF implementation to describe what
type of underlying storage is available.

\section{Operation Queue}

The DBPF implementation has its own queue that it uses to hold operations in
progress.  Operations may have one of a number of states (defined in dbpf.h):
\begin{itemize}
\item OP\_UNITIALIZED
\item OP\_NOT\_QUEUED
\item OP\_QUEUED
\item OP\_IN\_SERVICE
\item OP\_COMPLETED
\item OP\_DEQUEUED
\end{itemize}

\section{Dataspaces Implementation}

Dataspaces are stored as:
\begin{itemize}
\item an entry in the dataspace DB
\item zero or one keyval DBs (zero if no keyvals have been set)
\item zero or one bstream UNIX files (zero if no data has been written to the
      bstream)
\end{itemize}

Keyval and bstream files are named by the handle (because they don't really
have names...).

\section{Keyval Spaces Implementation}

Currently all keyval space operations use blocking DB calls.  They queue the
operation when the I/O call is made, and they service the operation when the
service function is called (at test time).

On each call the database is opened and closed, rather than caching the FD.
This will obviously need to change in the near future; probably we'll do
something similar to the bstream fdcache.

The dbpf\_keyval\_iterate\_op\_svc function walks through all the keyval
pairs.  The current implementation uses the btree format with record
numbers to make it easy to pick up where the last call to the iterator
function left off. The iterator function will process \emph{count} items per
call: if fewer than \emph{count} items are left in the database, we set
\emph{count} to how many items were processed.  The caller should check that
\emph{count} has the same value as before the function was called.

\section{Bytestreams Implementation}

The read\_at and write\_at functions are implemented using blocking UNIX file
I/O calls.  They queue the operation when the I/O call is made, and they
service the operation when the service function is called (at test time).

The read\_list and write\_list functions are implemented using lio\_listio.
Both of these functions actually call a function dbpf\_bstream\_rw\_list,
which implements the necessary functionality for both reads and writes.  This
function sets up the queue entry and immediately calls the service function.

The necessary aiocb array is allocated within the service function if one
isn't already allocated for the operation.  The maximum number of aiocb
entries is controlled by a compile-time constant AIOCB\_ARRAY\_SZ; if there
are more list elements than this, the list will be split up and handled by
multiple lio\_listio calls.

If the service function returns to dbpf\_bstream\_rw\_list without completing
the entire operation, the operation is queued.

Both the \_at and \_list functions use a FD cache for getting open FDs.

\section{Creating a ``File System''}

Theoretically Trove in general doesn't know anything about ``file systems''
per se.  However, it's helpful for us to provide functions intended to ease
the creation of FSs in Trove.  These functions are subject to change or
removal as we get a better understanding of how we're going to use Trove.

There's more than one way to build a file system given the infrastructure
provided by Trove; we're just going through a single example here.

At this time, a single file system is associated with one and only one
collection, and a single collection is associated with one and only one file
system.

\subsection{Creating a Storage Space}
First a storage space must be created.  Storage spaces are created with the
storage\_create mgmt operation.  \emph{This function can be used without calling the
initialize function?}  This function creates the storage attributes database
and the collections database.

\subsection{Creating a Collection}
Given that a storage space has been created, the next thing to do is create a
collection.  This is done with the collection\_create mgmt operation.  This
function takes a collection name and an associated collection ID.  First it
looks up the storage space (which currently is done by mapping to the single
possible storage space).  Next it checks to see that a collection with that
name does not already exist (and errors out if it does).  Next it will
create the collection by writing a new entry into the collections DB.
Following this it will create a dataspace attributes database and the
directories for keyval and bstream spaces for the collection.  Then it creates
the collection attributes database and puts a last handle value into it, which
is used in subsequent dataspace creates.

Once a collection is created, it can be looked up with collection\_lookup.
Collections are looked up by name.  This will return collection ID that can be
used to direct subsequent operations to the collection.  Currently the
implementation only supports creation of one collection.

\emph{ The collection lookup routine opens the collection attribute database and
finds the root handle.  This really isn't right...the collection routines
don't need to know about this...the fs routines should instead.}

It also opens the dataspace database and returns the collection ID.

In addition to creating one collection for the file system, a second
``administrative'' collection will be created.  Currently, the handle allocator
uses the admin collection to store handle state in a bstream.

\subsection{Creating a Root Directory}
There is a collection of file system helper functions too...

\subsection{Creating Dataspaces}
At this point dataspaces can be created with dspace\_create.

The dataspace create code checks to see if the handle is in use.
\emph{Currently this code will not try to come up with a new handle if the
proposed one is used...fix?}

An entry is placed in the dataspace attribute DB for the collection.

\emph{What is the ``type'' field used for?}

I think that the basic metadata for a file will be stored in the dataspace
attribute DB by adding members to the dbpf\_dspace\_attr structure (defined in dbpf.h).

\subsubsection{Creating a Directory}

First we create a dataspace.  Then we add a key/value to map the name of the
directory to the handle into the parent directory keyval space.

\subsubsection{Creating a File}

First we create a dataspace.  Then we add a key/value to map the name to the
handle into the parent directory keyval space.

\subsubsection{Deleting a Directory}
We check to make sure the ``directory'' has no elements in it.  Then we remove
the key/value mapping the name of the directory to the handle from the parent
directory keyval space.  After removing the key/value, we mark the dataspace as
ready for deletion.

\subsubsection{Deleting a File}
Given a file name, we retrieve the corresponding handle from the parent
directory keyval space.  Entries in the parent directory keyval space map the
name of the file to a handle -- the handle of the dataspace for that keyval
space. We remove the key/value for the given name.  We then use the handle we
retrieved to find the dataspace and mark it as ready for deletion.

\section{Current Deficiencies}

Only one collection can be used with the current implementation.

Error checking is weak at best.  Assert(0)s are used in many places in error conditions.

\end{document}
	%
	% DBPF
	%

	\documentclass[10pt]{article} % FORMAT CHANGE
	\usepackage[dvips]{graphicx}
	\usepackage{times}

	\graphicspath{{./}{figs/}}

	%
	% GET THE MARGINS RIGHT, THE UGLY WAY
	%
	% \topmargin 0.2in
	% \textwidth 6.5in
	% \textheight 8.75in
	% \columnsep 0.25in
	% \oddsidemargin 0.0in
	% \evensidemargin 0.0in
	% \headsep 0.0in
	% \headheight 0.0in

	\pagestyle{plain}

	\addtolength{\hoffset}{-2cm}
	\addtolength{\textwidth}{4cm}

	\addtolength{\voffset}{-1.5cm}
	\addtolength{\textheight}{3cm}

	\setlength{\parindent}{0pt}
	\setlength{\parskip}{11pt}

	\title{Trove Database + Files (DBPF) Implementation}
	\author{PVFS Development Team}
	% \date{December 2000}

	\begin{document}

	\maketitle

	\section{Introduction}

	The purpose of this document is to sketch out the Database + Files (DBPF)
	implementation of the Trove storage interface. Included will be discussion of
	how one might use the implementation in practice.

	DBPF uses UNIX files and Berkeley DB (3 or 4 I think) databases to store file
	and directory data and metadata.

	\section{Entities on Physical Storage}

	The locations of these entities are defined in dbpf.h, although in future
	versions they should be relative to the storage name passed in at initialize time.

	For a given server there are single instances of each of the following:
	\begin{itemize}
	\item storage attributes DB -- holds attributes on the storage space as a whole
	\item collections DB -- holds information on collections and is used to map
	from a collection name to a collection ID
	\end{itemize}

	For each collection there are one of each of the following:
	\begin{itemize}
	\item collection attributes DB -- holds attributes for the collection as a
	whole, including handle usage information and root handle
	\item dataspace attributes DB -- holds dbpf\_dspace\_attr structure for each
	dataspace, referenced by handle
	\end{itemize}

	In addition, dataspaces have a one to one mapping to the following, although
	they are created lazily (so they will not exist if not yet used):
	\begin{itemize}
	\item bstream file -- UNIX files that holds bstream data, just as in the PVFS1 iod
	\item keyval DB -- holds keyvals for a specific dataspace
	\end{itemize}

	At this time both bstream files and keyval DBs are stored in a flat directory
	structure. This may change to be a hashed directory system as in the PVFS1
	iod if/when we see that we are incurring substantial overhead from lookups in
	a directory with many entries.

	\section{Hooking to Upper Trove Layer}

	The DBPF implementation hooks to the upper trove ``wrapper'' layer through a
	set of structures that contain pointers to the functions that make up the
	Trove API. These structures are defined in the upper-layer trove.h and are
	TROVE\_bstream\_ops, TROVE\_keyval\_ops, TROVE\_dspace\_ops, TROVE\_mgmt\_ops,
	and TROVE\_fs\_ops.

	The TROVE\_mgmt\_ops structure includes a pointer to the function initialize().
	This is used to initialize the underlying Trove implementation (e.g. DBPF).
	The initialize function takes the storage name as a parameter; this should be
	used to locate the various entities that are used for storing the DBPF
	collections.

	\emph{Additionally a method ID is passed in...why???}

	The method name field is filled in by the DBPF implementation to describe what
	type of underlying storage is available.

	\section{Operation Queue}

	The DBPF implementation has its own queue that it uses to hold operations in
	progress. Operations may have one of a number of states (defined in dbpf.h):
	\begin{itemize}
	\item OP\_UNITIALIZED
	\item OP\_NOT\_QUEUED
	\item OP\_QUEUED
	\item OP\_IN\_SERVICE
	\item OP\_COMPLETED
	\item OP\_DEQUEUED
	\end{itemize}

	\section{Dataspaces Implementation}

	Dataspaces are stored as:
	\begin{itemize}
	\item an entry in the dataspace DB
	\item zero or one keyval DBs (zero if no keyvals have been set)
	\item zero or one bstream UNIX files (zero if no data has been written to the
	bstream)
	\end{itemize}

	Keyval and bstream files are named by the handle (because they don't really
	have names...).

	\section{Keyval Spaces Implementation}

	Currently all keyval space operations use blocking DB calls. They queue the
	operation when the I/O call is made, and they service the operation when the
	service function is called (at test time).

	On each call the database is opened and closed, rather than caching the FD.
	This will obviously need to change in the near future; probably we'll do
	something similar to the bstream fdcache.

	The dbpf\_keyval\_iterate\_op\_svc function walks through all the keyval
	pairs. The current implementation uses the btree format with record
	numbers to make it easy to pick up where the last call to the iterator
	function left off. The iterator function will process \emph{count} items per
	call: if fewer than \emph{count} items are left in the database, we set
	\emph{count} to how many items were processed. The caller should check that
	\emph{count} has the same value as before the function was called.

	\section{Bytestreams Implementation}

	The read\_at and write\_at functions are implemented using blocking UNIX file
	I/O calls. They queue the operation when the I/O call is made, and they
	service the operation when the service function is called (at test time).

	The read\_list and write\_list functions are implemented using lio\_listio.
	Both of these functions actually call a function dbpf\_bstream\_rw\_list,
	which implements the necessary functionality for both reads and writes. This
	function sets up the queue entry and immediately calls the service function.

	The necessary aiocb array is allocated within the service function if one
	isn't already allocated for the operation. The maximum number of aiocb
	entries is controlled by a compile-time constant AIOCB\_ARRAY\_SZ; if there
	are more list elements than this, the list will be split up and handled by
	multiple lio\_listio calls.

	If the service function returns to dbpf\_bstream\_rw\_list without completing
	the entire operation, the operation is queued.

	Both the \_at and \_list functions use a FD cache for getting open FDs.

	\section{Creating a ``File System''}

	Theoretically Trove in general doesn't know anything about ``file systems''
	per se. However, it's helpful for us to provide functions intended to ease
	the creation of FSs in Trove. These functions are subject to change or
	removal as we get a better understanding of how we're going to use Trove.

	There's more than one way to build a file system given the infrastructure
	provided by Trove; we're just going through a single example here.

	At this time, a single file system is associated with one and only one
	collection, and a single collection is associated with one and only one file
	system.

	\subsection{Creating a Storage Space}
	First a storage space must be created. Storage spaces are created with the
	storage\_create mgmt operation. \emph{This function can be used without calling the
	initialize function?} This function creates the storage attributes database
	and the collections database.

	\subsection{Creating a Collection}
	Given that a storage space has been created, the next thing to do is create a
	collection. This is done with the collection\_create mgmt operation. This
	function takes a collection name and an associated collection ID. First it
	looks up the storage space (which currently is done by mapping to the single
	possible storage space). Next it checks to see that a collection with that
	name does not already exist (and errors out if it does). Next it will
	create the collection by writing a new entry into the collections DB.
	Following this it will create a dataspace attributes database and the
	directories for keyval and bstream spaces for the collection. Then it creates
	the collection attributes database and puts a last handle value into it, which
	is used in subsequent dataspace creates.

	Once a collection is created, it can be looked up with collection\_lookup.
	Collections are looked up by name. This will return collection ID that can be
	used to direct subsequent operations to the collection. Currently the
	implementation only supports creation of one collection.

	\emph{ The collection lookup routine opens the collection attribute database and
	finds the root handle. This really isn't right...the collection routines
	don't need to know about this...the fs routines should instead.}

	It also opens the dataspace database and returns the collection ID.

	In addition to creating one collection for the file system, a second
	``administrative'' collection will be created. Currently, the handle allocator
	uses the admin collection to store handle state in a bstream.

	\subsection{Creating a Root Directory}
	There is a collection of file system helper functions too...

	\subsection{Creating Dataspaces}
	At this point dataspaces can be created with dspace\_create.

	The dataspace create code checks to see if the handle is in use.
	\emph{Currently this code will not try to come up with a new handle if the
	proposed one is used...fix?}

	An entry is placed in the dataspace attribute DB for the collection.

	\emph{What is the ``type'' field used for?}

	I think that the basic metadata for a file will be stored in the dataspace
	attribute DB by adding members to the dbpf\_dspace\_attr structure (defined in dbpf.h).

	\subsubsection{Creating a Directory}

	First we create a dataspace. Then we add a key/value to map the name of the
	directory to the handle into the parent directory keyval space.

	\subsubsection{Creating a File}

	First we create a dataspace. Then we add a key/value to map the name to the
	handle into the parent directory keyval space.

	\subsubsection{Deleting a Directory}
	We check to make sure the ``directory'' has no elements in it. Then we remove
	the key/value mapping the name of the directory to the handle from the parent
	directory keyval space. After removing the key/value, we mark the dataspace as
	ready for deletion.

	\subsubsection{Deleting a File}
	Given a file name, we retrieve the corresponding handle from the parent
	directory keyval space. Entries in the parent directory keyval space map the
	name of the file to a handle -- the handle of the dataspace for that keyval
	space. We remove the key/value for the given name. We then use the handle we
	retrieved to find the dataspace and mark it as ready for deletion.

	\section{Current Deficiencies}

	Only one collection can be used with the current implementation.

	Error checking is weak at best. Assert(0)s are used in many places in error conditions.

	\end{document}