Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
pvfs2-osd/doc/pvfs2-ha-heartbeat-v2.tex
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
505 lines (402 sloc)
19.7 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
\documentclass[11pt]{article} | |
\usepackage[dvips]{graphicx} | |
\usepackage{times} | |
\graphicspath{{./}{figs/}} | |
% | |
% GET THE MARGINS RIGHT, THE UGLY WAY | |
% | |
\topmargin 0.0in | |
\textwidth 6.5in | |
\textheight 8.75in | |
\columnsep 0.25in | |
\oddsidemargin 0.0in | |
\evensidemargin 0.0in | |
\headsep 0.0in | |
\headheight 0.0in | |
\title{PVFS High Availability Clustering using Heartbeat 2.0} | |
\date{2008} | |
\pagestyle{plain} | |
\begin{document} | |
\maketitle | |
\begin{small} | |
\tableofcontents | |
\end{small} | |
\newpage | |
% These two give us paragraphs with space between, which rross thinks is | |
% the right way to have things. | |
%\setlength{\parindent}{0pt} | |
%\setlength{\parskip}{11pt} | |
\section{Introduction} | |
This document describes how to configure PVFS for high availability | |
using Heartbeat version 2.x from www.linux-ha.org. | |
The combination of PVFS and Heartbeat can support an arbitrary number | |
of active server nodes and an arbitrary number of passive spare nodes. | |
Spare nodes are not required unless you wish to avoid performance | |
degradation upon failure. As configured in this document, PVFS will | |
be able to tolerate $\lceil ((N/2)-1) \rceil$ node failures, where N is | |
the number of nodes present in the Heartbeat cluster including spares. | |
Over half of the nodes must be available in order to reach a quorum and | |
decide if another node has failed. | |
Heartbeat can be configured to monitor IP connectivity, storage hardware | |
connectivity, and responsiveness of the PVFS daemons. Failure of any of | |
these components will trigger a node level failover event. PVFS clients | |
will transparently continue operation following a failover. | |
No modifications of PVFS are required. Example scripts referenced in | |
this document are available in the \texttt{examples/heartbeat} directory of | |
the PVFS source tree. | |
\section{Requirements} | |
\subsection{Hardware} | |
\subsubsection{Nodes} | |
Any number of nodes may be configured, although you need at least three | |
total in order to tolerate a failure. You may also use any number of | |
spare nodes. A spare node is a node that does not run any services until | |
a failover occurs. If you have one or more spares, then they will be | |
selected first to run resources in a failover situation. If you have | |
no spares (or all spares are exhausted), then at least one node will | |
have to run two services simultaneously, which may degrade performance. | |
The examples in this document will use 4 active nodes and 1 spare node, | |
for a total of 5 nodes. Heartbeat has been tested with up to 16 nodes | |
in configurations similar to the one outlined in this document. | |
\subsubsection{Storage} | |
A shared storage device is required. The storage must be configured | |
to allocate a separate block device to each PVFS daemon, and all nodes | |
(including spares) must be capable of accessing all block devices. | |
One way of achieving this is by using a SAN. In the examples used in this | |
document, the SAN has been divided into 4 LUNs. Each of the 5 nodes in | |
the cluster is capable of mounting all 4 LUNs. The Heartbeat software | |
will insure that a given LUN is mounted in only one location at a time. | |
Each block device should be formatted with a local file system. | |
This document assumes the use of ext3. Unique labels should be set on | |
each file system (for example, using the \texttt{-L} argument to | |
\texttt{mke2fs} or \texttt{tune2fs}). This will allow the block devices | |
to be mounted by consistent labels regardless of how Linux assigned | |
device file names to the devices. | |
\subsubsection{Stonith} | |
Heartbeat needs some mechanism to fence or stonith a failed node. | |
Two popular ways to do this are either to use IPMI or a network | |
controllable power strip. Each node needs to have a mechanism available | |
to reset any other node in the cluster. The example configuration in | |
this document uses IPMI. | |
It is possible to configure PVFS and Heartbeat without a power control | |
device. However, if you deploy this configuration for any purpose other | |
than evaluation, then you run a very serious risk of data | |
corruption. Without stonith, there is no way to guarantee that a | |
failed node has completely shutdown and stopped accessing its | |
storage device before failing over. | |
\subsection{Software} | |
This document assumes that you are using Heartbeat version 2.1.3, | |
and PVFS version 2.7.x or greater. You may also wish to use example | |
scripts included in the \texttt{examples/heartbeat} directory of the | |
PVFS source tree. | |
\subsection{Network} | |
There are two special issues regarding the network configuration to be | |
used with Heartbeat. First of all, you must allocate a multicast | |
address to use for communication within the cluster nodes. | |
Secondly, you need to allocate an extra IP address and hostname for each | |
PVFS daemon. In the example that this document uses, we must allocate 4 | |
extra IP addresses, along with 4 hostnames in DNS for those IP addresses. | |
In this document, we will refer to these as ``virtual addresses'' or | |
``virtual hostnames''. Each active PVFS server will be configured | |
to automatically bring up one of these virtual addresses to use for | |
communication. If the node fails, then that IP address is migrated to | |
another node so that clients will appear to communicate with the same | |
server regardless of where it fails over to. It is important that you | |
\emph{not} use the primary IP address of each node for this purpose. | |
In the example in this document, we use 225.0.0.1 as the multicast | |
address, node\{1-5\} as the normal node hostnames, | |
virtual\{1-4\} as the virtual hostnames, and 192.168.0.\{1-4\} as the | |
virtual addresses. | |
Note that the virtual addresses must be on the same subnet as the true | |
IP addresses for the nodes. | |
\section{Configuring PVFS} | |
Download, build, and install PVFS on all server nodes. Configure PVFS | |
for use on each of the active nodes. | |
There are a few points to consider when configuring PVFS: | |
\begin{itemize} | |
\item Use the virtual hostnames when specifying meta servers and I/O | |
servers | |
\item Synchronize file data on every operation (necessary for consistency on | |
failover) | |
\item Synchronize meta data on every operation (necessary for consistency on | |
failover). Coalescing is allowed. | |
\item Use the \texttt{TCPBindSpecific} option (this allows multiple daemons to | |
run on the same node using different virtual addresses) | |
\item Tune retry and timeout values appropriately for your system. This | |
may depend on how long it takes for your power control device to safely | |
shutdown a node. | |
\end{itemize} | |
An example PVFS configuration is shown below including the sections | |
relevant to Heartbeat: | |
\begin{verbatim} | |
<Defaults> | |
... | |
ServerJobBMITimeoutSecs 30 | |
ServerJobFlowTimeoutSecs 30 | |
ClientJobBMITimeoutSecs 30 | |
ClientJobFlowTimeoutSecs 30 | |
ClientRetryLimit 5 | |
ClientRetryDelayMilliSecs 33000 | |
TCPBindSpecific yes | |
... | |
</Defaults> | |
<Aliases> | |
Alias virtual1_tcp3334 tcp://virtual1:3334 | |
Alias virtual2_tcp3334 tcp://virtual2:3334 | |
Alias virtual3_tcp3334 tcp://virtual3:3334 | |
Alias virtual4_tcp3334 tcp://virtual4:3334 | |
</Aliases> | |
<Filesystem> | |
... | |
<MetaHandleRanges> | |
Range virtual1_tcp3334 4-536870914 | |
Range virtual2_tcp3334 536870915-1073741825 | |
Range virtual3_tcp3334 1073741826-1610612736 | |
Range virtual4_tcp3334 1610612737-2147483647 | |
</MetaHandleRanges> | |
<DataHandleRanges> | |
Range virtual1_tcp3334 2147483648-2684354558 | |
Range virtual2_tcp3334 2684354559-3221225469 | |
Range virtual3_tcp3334 3221225470-3758096380 | |
Range virtual4_tcp3334 3758096381-4294967291 | |
</DataHandleRanges> | |
<StorageHints> | |
TroveSyncMeta yes | |
TroveSyncData yes | |
</StorageHints> | |
\end{verbatim} | |
Download, build, and install Heartbeat following the instructions on | |
their web site. No special parameters or options are required. Do not | |
start the Heartbeat service. | |
\section{Configuring storage} | |
Make sure that there is a block device allocated for each active server | |
in the file system. Format each one with ext3. Do not create a PVFS | |
storage space yet, but you can create subdirectories within each file | |
system if you wish. | |
Confirm that each block device can be mounted from every node using the | |
file system label. Do this one node at a time. Never mount | |
the same block device concurrently on two nodes. | |
\section{Configuring stonith} | |
Make sure that your stonith device is accessible and responding from each | |
node in the cluster. For the IPMI stonith example used in this document, | |
this means confirming that \texttt{ipmitool} is capable of monitoring | |
each node. Each node will have its own IPMI IP address, username, and | |
password. | |
\begin{verbatim} | |
$ ipmitool -I lan -U Administrator -P password -H 192.168.0.10 power status | |
Chassis Power is on | |
\end{verbatim} | |
\section{Distributing Heartbeat scripts} | |
The PVFS2 resource script must be installed and set as runnable on every | |
cluster node as follows: | |
\begin{verbatim} | |
$ mkdir -p /usr/lib/ocf/resource.d | |
$ cp examples/heartbeat/PVFS2 /usr/lib/ocf/resource.d/external/ | |
$ chmod a+x /usr/lib/ocf/resource.d/external/PVFS2 | |
\end{verbatim} | |
\section{Base Heartbeat configuration} | |
This section describes how to configure the basic Heartbeat daemon | |
parameters, which include an authentication key and a list of nodes that | |
will participate in the cluster. | |
Begin by generating a random sha1 key, which is used to | |
secure communication between the cluster nodes. Then run the | |
pvfs2-ha-heartbeat-configure.sh script on every node (both active | |
and spare) as shown below. Make sure to use the multicast address, sha1 | |
key, and list of nodes (including spares) appropriate for your environment. | |
\begin{verbatim} | |
$ dd if=/dev/urandom count=4 2>/dev/null | openssl dgst -sha1 | |
dcdebc13c41977eac8cca0023266a8b16d234262 | |
$ examples/heartbeat/pvfs2-ha-heartbeat-configure.sh /etc/ha.d 225.0.0.1 \ | |
dcdebc13c41977eac8cca0023266a8b16d234262 \ | |
node1 node2 node3 node4 node5 | |
\end{verbatim} | |
You can view the configuration file that this generates in | |
/etc/ha.d/ha.cf. An example ha.cf file (with comments) is provided with | |
the Heartbeat package if you wish to investigate how to add or change any settings. | |
\section{CIB configuration} | |
\texttt{Cluster Information Base} (CIB) is the the mechanism that | |
Heartbeat 2.x uses to store information about the resources that are | |
configured for high availability. The configuration is stored in an | |
XML format and automatically synchronized across all of the cluster | |
nodes. | |
It is possible to start the Heartbeat services and then configure the | |
CIB, but it is simpler to begin with a populated XML file on all nodes. | |
\texttt{cib.xml.example} provides an example of a fully populated | |
Heartbeat configuration with 5 nodes and 4 active PVFS servers. Relevant | |
portions of the XML file are outlined below. | |
This file should be modified to reflect your configuration. You can | |
test the validity of the XML with the following commands before | |
installing it. The former checks generic XML syntax, while the latter | |
performs Heartbeat specific checking: | |
\begin{verbatim} | |
$ xmllint --noout cib.xml | |
$ crm_verify -x cib.xml | |
\end{verbatim} | |
Once your XML is filled in correctly, it must be copied into the correct | |
location (with correct ownership) on each node in the cluster: | |
\begin{verbatim} | |
$ mkdir -p /var/lib/heartbeat/crm | |
$ cp cib.xml /var/lib/heartbeat/crm | |
$ chown -R hacluster:haclient /var/lib/heartbeat/crm | |
\end{verbatim} | |
Please note that once Heartbeat has been started, it is no longer | |
legal to modify cib.xml by hand. See the \texttt{cibadmin} command line | |
tool and Heartbeat information on making modifications to | |
existing or online CIB configurations. | |
\subsection{crm\_config} | |
The \texttt{crm\_config} portion of the CIB is used to set global | |
parameters for Heartbeat. This includes behavioral settings | |
(such as how to respond if quorum is lost) as well as tunable parameters | |
(such as timeout values). | |
The options selected in this section should work well as a starting | |
point, but you may refer to the Heartbeat documentation for more | |
details. | |
\subsection{nodes} | |
The \texttt{nodes} section is empty on purpose. This will be filled in | |
dynamically by the Heartbeat daemons. | |
\subsection{resources and groups} | |
The \texttt{resources} section describes all resources that the | |
Heartbeat software needs to manage for failover purposes. This includes | |
IP addresses, SAN mount points, and \texttt{pvfs2-server} processes. The | |
resources are organized into groups, such as \texttt{server0}, to | |
indicate that certain groups of resources should be treated as a single | |
unit. For example, if a node were to fail, you cannot just migrate its | |
\texttt{pvfs2-server} process. You must also migrate the associated IP address | |
and SAN mount point at the same time. Groups also make it easier to | |
start or stop all associated resources for a node with one unified command. | |
In the example \texttt{cib.xml}, there are 4 groups (server0 through | |
server3). These represent the 4 active PVFS servers that will run on | |
the cluster. | |
\subsection{IPaddr} | |
The \texttt{IPaddr} resources, such as \texttt{server0\_address}, are | |
used to indicate what virtual IP address should be used with each group. | |
In this example, all IP addresses are allocated from a private range, but | |
these should be replaced with IP addresses that are appropriate for use | |
on your network. See the network requirements section for more details. | |
\subsection{Filesystem} | |
The \texttt{Filesystem} resources, such as \texttt{server0\_fs}, are used to | |
describe the shared storage block devices that serve as back end storage | |
for PVFS. This is where the PVFS storage space for each server will | |
be created. In this example, the device names are labeled as | |
\texttt{label0} | |
through \texttt{label3}. They are each mounted on directories such | |
as \texttt{/san\_mount0} through \texttt{/san\_mount3}. Please note | |
that each device should be mounted on a different mount point to allow | |
multiple \texttt{pvfs2-server} processes to operate on the same node without | |
collision. The file system type can be changed to reflect the use of | |
alternative underlying file systems. | |
\subsection{PVFS} | |
The \texttt{PVFS2} resources, such as \texttt{server0\_daemon}, are used | |
to describe each \texttt{pvfs2-server} process. This resource is provided by the | |
PVFS2 script in the examples directory. The parameters to this resource | |
are listed below: | |
\begin{itemize} | |
\item \texttt{fsconfig}: location of PVFS fs configuration file | |
\item \texttt{port}: TCP/IP port that the server will listen on (must match server | |
configuration file) | |
\item \texttt{ip}: IP address that the server will listen on (must match both the file | |
system configuration file and the IPAddr resource) | |
\item \texttt{pidfile}: Location where a pid file can be written | |
\item \texttt{alias}: alias to identify this PVFS daemon instance | |
\end{itemize} | |
Also notice that there is a monitor operation associated with the PVFS | |
resource. This will cause the \texttt{pvfs2-check-server} utility to be triggered | |
periodically to make sure that the \texttt{pvfs2-server} process is not only | |
running, but is correctly responding to PVFS protocol requests. This | |
allows problems such as hung \texttt{pvfs2-server} processes to be treated as | |
failure conditions. | |
Please note that the PVFS2 script provided in the examples will attempt | |
to create a storage space on startup for each server if it is not already present. | |
\subsection{rsc\_location} | |
The \texttt{rsc\_location} constraints, such as \texttt{run\_server0}, | |
are used to express a preference for where each resource group should | |
run (if possible). It may be useful for administrative purposes to have | |
the first server group default to run on the first node of your cluster, | |
for example. Otherwise the placement will be left up to Heartbeat. | |
\subsection{rsc\_order} | |
The \texttt{rsc\_order} constraints, such as | |
\texttt{server0\_order\_start\_fs} can be used to dictate the order in | |
which resources must be started or stopped. The resources are already | |
organized into groups, but without ordering constraints, the resources | |
within a group may be started in any order relative to each other. | |
These constraints are necessary because a \texttt{pvfs2-server} process will not | |
start properly until its IP address and storage are available. | |
\subsection{stonith} | |
The \texttt{external/ipmi} stonith device is used in this example. | |
Please see the Heartbeat documentation for instructions on configuring | |
other types of devices. | |
There is one IPMI stonith device for each node. The attributes for that | |
resources specify which node is being controlled, and the username, | |
password, and IP address of corresponding IPMI device. | |
\section{Starting Heartbeat} | |
Once the CIB file is completed and installed in the correct | |
location, then the Heartbeat services can be started on every node. | |
The \texttt{crm\_mon} command, when run with the arguments shown, | |
will provide a periodically updated view of the state of each resource | |
configured within Heartbeat. Check \texttt{/var/log/messages} if any | |
of the groups fail to start. | |
\begin{verbatim} | |
$ /etc/init.d/heartbeat start | |
$ # wait a few minutes for heartbeat services to start | |
$ crm_mon -r | |
\end{verbatim} | |
\section{Mounting the file system} | |
Mounting PVFS with high availability is no different than mounting a | |
normal PVFS file system, except that you must use the virtual hostname | |
for the PVFS server rather than the primary hostname of the node: | |
\begin{verbatim} | |
$ mount -t pvfs2 tcp://virtual1:3334/pvfs2-fs /mnt/pvfs2 | |
\end{verbatim} | |
\section{What happens during failover} | |
The following example illustrates the steps that occur when a node fails: | |
\begin{enumerate} | |
\item Node2 (which is running a \texttt{pvfs2-server} on the virtual2 IP | |
address) suffers a failure | |
\item Client node begins timeout/retry cycle | |
\item Heartbeat services running on remaining nodes notice that node2 | |
is not responding | |
\item After a timeout has elapsed, remaining nodes reach a quorum and | |
vote to treat node2 as a failed node | |
\item Node1 sends a stonith command to reset node2 | |
\item Node2 either reboots or remains powered off (depending on nature | |
of failure) | |
\item Once stonith command succeeds, node5 is selected to replace it | |
\item The virtual2 IP address, mount point, and | |
\texttt{pvfs2-server} service | |
are started on node5 | |
\item Client node retry eventually succeeds, but now the network | |
traffic is routed to node5 | |
\end{enumerate} | |
\section{Controlling Heartbeat} | |
The Heartbeat software comes with a wide variety of tools for managing | |
resources. The following are a few useful examples: | |
\begin{itemize} | |
\item \texttt{cibadmin -Q}: Display the current CIB information | |
\item \texttt{crm\_mon -r -1}: Display the current resource status | |
\item \texttt{crm\_standby}: Used to manually take a node in an out of | |
standby mode. This can be used to take a node offline for maintenance | |
without a true failure event. | |
\item \texttt{crm\_resource}: Modify resource information. For example, | |
\texttt{crm\_resource -r server0 -p target\_role -v stopped} will stop a | |
particular resource group. | |
\item \texttt{crm\_verify}: can be used to confirm if the CIB | |
information is valid and consistent | |
\end{itemize} | |
\section{Additional examples} | |
The \texttt{examples/heartbeat/hardware-specific} directory contains | |
additional example scripts that may be helpful in some scenarios: | |
\begin{itemize} | |
\item \texttt{pvfs2-stonith-plugin}: An example stonith plugin | |
that can use an arbitrary script to power off nodes. May be used (for | |
example) with the \texttt{apc*} and \texttt{baytech*} scripts to control | |
remote controlled power strips if the scripts provided by Heartbeat are | |
not sufficient. | |
\item \texttt{Filesystem-qla-monitor}: A modified version of the | |
standard FileSystem resource that uses the \texttt{qla-monitor.pl} | |
script to provide additional monitoring capability for QLogic fibre | |
channel cards. | |
\item \texttt{PVFS2-notify}: An example of a dummy resource that could | |
be added to the configuration to perform additional logging or | |
notification steps on startup. | |
\end{itemize} | |
\end{document} | |
% vim: tw=72 |