Watchdog Recovery

From AN!Wiki

Note: This tutorial is written using Fedora 16.

The new fence_sanlock and checkquorum.wdmd fence agents provide new fencing options to users who do not have full out-of-band management, switched PDUs or similar traditional fence devices. They aim to provide a critical cluster function to users who would otherwise have no (affordable) options.

Warning: This technology is Tech Preview! There is no support for this fence method yet. Feedback and bug reports are much appreciated.

About Fencing

Traditionally in clustering, all nodes must be in a known state. In practice, this meant that when a node stopped responding, the rest of the cluster could not safely proceed until the silent node was put into a known state.

The action of putting a node into a known state is called "fencing". Typically, this is done by one of the other nodes in the cluster either isolating or forcibly powering off the lost node.

  • With isolation, the lost node itself is not touched, but its network link(s) are disabled. This ensures that even if the node recovers, it will no longer have access to the cluster or its shared storage. This form of fencing is called "fabric fencing".
  • The far more common form of fencing is to forcibly power off the lost node. This is done by using an external device, like a server's out-of-band management card (IPMI, iLO, etc.) or a network-connected power bar, called a PDU.

In either case, the purpose of fencing is to ensure that the lost node will not be able to access clustered resources, like shared storage, or provide clustered services in an uncoordinated manner. Skipping this crucial step could cause data loss, so it is critical to always use fencing in clusters.

Watchdog Timers

Many motherboards have "watchdog" timers built in. These timers will cause the host machine to reboot if the system appears to freeze for a period of time. The new fence_sanlock agent combines these with SAN storage to provide an alternative fence method.

Where "fabric fencing" can be thought of as a form of ostracism and "power fencing" can be thought of as a form of murder, watchdog fencing can be thought of as a form of suicide. Morbid, but accurate.
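The Linux kernel exposes hardware watchdog timers as the /dev/watchdog device: any write to the device resets ("pets") the countdown, and if no write arrives before the timeout expires, the hardware reboots the machine. Below is a minimal sketch of the keep-alive loop that daemons like wdmd perform; the device path is made overridable here purely so the helper can be demonstrated without real hardware.

```shell
#!/bin/sh
# pet_watchdog: reset the watchdog countdown by writing one character to
# the device. WATCHDOG_DEV defaults to /dev/watchdog; the override exists
# only for illustration/testing without real watchdog hardware.
pet_watchdog() {
    dev="${WATCHDOG_DEV:-/dev/watchdog}"
    if [ ! -e "$dev" ]; then
        echo "no watchdog device at $dev" >&2
        return 1
    fi
    printf '.' >> "$dev"
}

# A real keep-alive loop runs forever. Failing to reach this write (a
# kernel hang or crash) is exactly what lets the timer expire and reboot
# the host:
#   while pet_watchdog; do sleep 10; done
```

This is why a frozen machine reliably reboots: no userspace process survives to keep petting the timer.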

Options

There are currently two mechanisms to trigger a node recovery via a watchdog device:

  1. fence_sanlock: Preferred method, but always requires shared storage.
  2. checkquorum.wdmd: Only requires shared storage for 2-node clusters.

Differences Between fence_sanlock And checkquorum.wdmd

When choosing which type of watchdog-based fencing to use, consider the following:

fence_sanlock
  Advantages:
  • Preferred method; it is a real fence method.
  • Can recover the cluster even if a node fails completely.
  Disadvantages:
  • Requires shared storage in all cases.

checkquorum.wdmd
  Advantages:
  • Shared storage is not needed for clusters of three or more nodes.
  Disadvantages:
  • Not a real fence method.
  • The fence action is only considered complete when the lost node rejoins the cluster.
  • A node failure that prevents reboot causes the cluster to remain blocked.
  • A network failure causing all nodes to lose quorum will result in a complete cluster restart.

Important Note On Timing

Watchdog timers work by having a constantly running countdown. The host has to periodically and reliably reset this timer. If the watchdog timer is allowed to expire, the host machine will be reset. This timeout is often measured in minutes.

Traditional fencing methods communicate with external devices and can report success as soon as the target node has been fenced. This process usually takes a small number of seconds.

When a cluster loses contact with a node, it blocks by design. It is not safe to proceed until all nodes are in a known state, so the users of the cluster services will notice a period of interruption until the lost node recovers or is fenced.

This timing difference means that any watchdog-based fencing will be much slower than traditional fencing. Your users will most likely experience an outage of several minutes while fence_sanlock does its work. For this reason, fence_sanlock should be used only when traditional fence methods are unavailable.

In short; watchdog fencing is not a replacement for traditional fencing. It is only a replacement for no fencing at all.

fence_sanlock

The fence_sanlock daemon works by combining locks on shared storage with watchdog devices on each node.

On start-up, each node "unfences", during which time the node takes a lock from the shared storage and begins resetting the watchdog timer at set intervals. So long as the node is healthy, this reset of the watchdog timer repeats indefinitely.

If there is a failure on the node itself, the node will self-fence by allowing its watchdog timer to expire. This will happen either a set period of time after losing quorum or immediately if corosync fails.

When another node in the cluster wants to fence a victim, it will try to take the victim's lock on the shared storage. If the victim is not already in the process of fencing itself, it will detect the attempt on its lock and allow its watchdog timer to expire. Of course, if the victim has crashed entirely, its watchdog device will already be counting down. In either case, the victim will eventually reboot.

The cluster knows the victim is gone when it is finally able to take the victim's lock. This is safe because, so long as the victim is alive, it will maintain its lock.

Requirements

You will need;

  • A hardware watchdog timer.
  • External shared storage, 1 GiB or larger in size.
    • Any shared storage will do. Here is a short tgtd tutorial for creating a SAN on a machine outside of the cluster if you don't have existing shared storage.

On the hardware side, you need a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have one, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.
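A quick way to check for a usable watchdog is to look for the character device the kernel registers. The sketch below only tests for the device node; where available, wdctl (from util-linux) will additionally report the driver identity and timeout.

```shell
#!/bin/sh
# check_watchdog: report whether a watchdog character device exists at
# the given path (default /dev/watchdog).
check_watchdog() {
    dev="${1:-/dev/watchdog}"
    if [ -c "$dev" ]; then
        echo "watchdog device present at $dev"
    else
        echo "no watchdog device at $dev"
    fi
}

check_watchdog
# If present, 'wdctl' (util-linux) shows the driver identity and timeout.
```

If no device appears, check the BIOS settings and confirm the appropriate watchdog kernel module (for example, iTCO_wdt on many Intel boards) is loaded.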

Note: Software watchdog timers exist but they are not supported in production. They rely on the host functioning to at least some degree which is a fatal design flaw. A simple test of issuing echo c > /proc/sysrq-trigger will demonstrate the flaw in using software watchdog timers.

You need to install;

  • cman ver. 3.1.99 +
  • wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
  • fence_sanlock ver. 2.6 +

Installation

To install fence_sanlock, run;

yum install cman fence-sanlock sanlock sanlock-lib

Configuring Shared Storage

Any shared storage device can be used. For the purpose of this document, a SAN LUN, exported by a machine outside of the cluster and made available as /dev/sdb, will be used. If you don't currently have a shared storage device, see the brief tgtd tutorial mentioned earlier for setting up a simple SAN.

You can use a clustered or non-clustered LVM volume, an NFS export or most any other type of shared storage. The only requirement is that it be at least 1 GiB in size.
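A quick way to confirm that a candidate block device meets the 1 GiB minimum is to compare its size in bytes against 2^30. The helper below takes the byte count directly so the arithmetic can be shown without a real block device; on a live system you would feed it the output of blockdev --getsize64.

```shell
#!/bin/sh
# big_enough: succeed if the given size, in bytes, is at least 1 GiB.
big_enough() {
    [ "$1" -ge $((1024 * 1024 * 1024)) ]
}

# On a real device:
#   big_enough "$(blockdev --getsize64 /dev/sdb)" && echo "large enough"
```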

Once you have your shared storage, initialize it;

Note: The "Initializing 128 sanlock host leases" step may take a while to complete; please be patient.
fence_sanlock -o sanlock_init -p /dev/sdb
Initializing fence sanlock lockspace on /dev/sdb: ok
Initializing 128 sanlock host leases on /dev/sdb: ok

Note the path, /dev/sdb in this case. You will use it in the next step.

Configuring cman

Below is an example cman configuration file. The main sections to note are the <fence>, <unfence> and <fencedevice> elements.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="2" name="an-cluster-01">
	<cman expected_votes="1" two_node="1"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="1" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="1" action="on" />
			</unfence>
		</clusternode>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2">
			<fence>
				<method name="wd">
					<device name="watchdog" host_id="2" />
				</method>
			</fence>
			<unfence>
				<device name="watchdog" host_id="2" action="on" />
			</unfence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<fencedevice name="watchdog" agent="fence_sanlock" path="/dev/sdb"/>
	</fencedevices>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>

The key attributes are:

  • host_id="x"; This tells fence_sanlock which host ID the node uses. Generally it matches the corresponding <clusternode ... nodeid="x"> value. It doesn't have to match, but it must be unique.
  • path="/dev/x"; This tells the cluster where each node's lock can be found. Here we used /dev/sdb, which is the device we initialized in the previous step.

Once it passes ccs_config_validate, push it out to your other node(s).
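One way to script the push is to pull the node names out of the validated file and copy it to each peer. This is only a sketch: the grep/sed pattern assumes the attribute layout used in the example configuration above, and a real deployment might use rsync or a configuration-management tool instead.

```shell
#!/bin/sh
# cluster_nodes: list the node names defined in a cluster.conf file.
# Assumes the name="..." attribute comes first on each <clusternode> line,
# as in the example configuration in this document.
cluster_nodes() {
    grep -o 'clusternode name="[^"]*"' "$1" \
        | sed 's/^clusternode name="//; s/"$//'
}

# Validate the local copy, then copy it to every peer node.
push_cluster_conf() {
    ccs_config_validate || return 1
    for node in $(cluster_nodes /etc/cluster/cluster.conf); do
        [ "$node" = "$(hostname)" ] && continue
        scp /etc/cluster/cluster.conf "root@$node:/etc/cluster/"
    done
}
```

Remember to bump config_version in the file before pushing, so cman knows the copy is newer.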

Enabling fence_sanlock

Note: fence_sanlock fencing and unfencing operations can take up to several minutes to complete. This is normal and expected behaviour. Other than this, the fencing operation will work as any other fence device implementation. From a user perspective there are no operational differences.

We need to disable wdmd and sanlock from starting at boot, then enable the fence_sanlockd daemon to start on boot. The fence_sanlockd daemon will start the wdmd and sanlock daemons itself.

Note: If you are using a pre-systemd OS, use chkconfig and the init.d scripts instead of systemctl.
systemctl disable wdmd.service
systemctl disable sanlock.service
systemctl enable fence_sanlockd.service

If you started the daemons, stop them now.

systemctl stop wdmd.service
systemctl stop sanlock.service
systemctl stop fence_sanlockd.service

Now stop cman and then start fence_sanlockd. If you're not running the cluster yet, then stopping cman is not needed.

systemctl stop cman.service
systemctl start fence_sanlockd.service

Now we can re-enable the cman service.

Note: As stated earlier, the start-up time and time-to-fence of the cluster will be much longer with fence_sanlock configured. It could take up to five minutes, or more depending on configuration, for the unfence step to complete. Please be patient.
systemctl start cman.service

Testing fence_sanlock

If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node and restore cluster services.

echo c > /proc/sysrq-trigger

In the system log of the other node, you will see messages similar to those in the example below.

Oct 21 23:25:23 an-c01n02 corosync[1652]:   [TOTEM ] A processor failed, forming new configuration.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [QUORUM] Members[1]: 2
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] New Configuration:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Left:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CLM   ] Members Joined:
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.2) ; members(old:2 left:1)
Oct 21 23:25:25 an-c01n02 corosync[1652]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 21 23:25:25 an-c01n02 kernel: [ 1075.915453] dlm: closing connection to node 1
Oct 21 23:25:25 an-c01n02 fenced[1708]: fencing node an-c01n01.alteeve.ca
Oct 21 23:25:26 an-c01n02 fence_sanlock: 1997 host_id 1 gen 1 ver 1 timestamp 391
Oct 21 23:27:20 an-c01n02 fenced[1708]: fence an-c01n01.alteeve.ca success

This shows that fence_sanlock is working properly.

checkquorum.wdmd

The checkquorum.wdmd script is not really a fence device in the traditional sense. The only way for a fence action to be considered a success is to have the failed node rejoin the cluster in a clean (freshly started) state. If a node fails in such a way that it can not start back up, the cluster will remain blocked until cleared by an administrator issuing fence_ack_manual on a surviving node.

The checkquorum.wdmd script works by having wdmd stop resetting the system's watchdog timer if the node loses quorum. For this reason, checkquorum.wdmd will not work if the two-node (<cman expected_votes="1" two_node="1"/>) option is set in /etc/cluster/cluster.conf, because quorum is effectively disabled. When configured this way, a node will never lose quorum, so wdmd will never stop resetting the watchdog timer.

To get around this 2-node limitation, we can use qdisk on shared storage. This provides a third vote and allows quorum to be used properly.

Note: The qdisk device must be in master_wins mode. Please see man 5 qdisk for more information.
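On some qdiskd versions, master-wins mode may be enabled automatically when no heuristics are defined; where it must be set explicitly, the attribute is master_wins on the <quorumd> element. Treat the fragment below as illustrative only, and confirm the attribute name and default against man 5 qdisk on your version:

```xml
<!-- Illustrative only; confirm against 'man 5 qdisk' for your version. -->
<quorumd label="an-cluster-01.wdmd" master_wins="1"/>
```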

Requirements

You will need;

  • A hardware watchdog timer.
  • Two-Node clusters only; External shared storage, 10 MiB or larger in size, for a qdisk partition.

On the hardware side, you need a hardware watchdog timer. Many workstation and server mainboards offer this as a built-in feature which can be enabled in the BIOS. If your system does not have one, add-in and external watchdog timers can be used and are relatively inexpensive. To further save costs, some open-hardware watchdog timer designs are available for those handy with a soldering iron.

Note: Software watchdog timers exist but they are not supported in production. They rely on the host functioning to at least some degree which is a fatal design flaw. A simple test of issuing echo c > /proc/sysrq-trigger will demonstrate the flaw in using software watchdog timers.

You need to install;

  • cman ver. 3.1.99 +
  • wdmd ver. 2.6 + (available from the sanlock ver. 2.6 + package)
  • sanlock ver. 2.6 +

Installation

To use checkquorum.wdmd, install;

yum install cman sanlock sanlock-lib

Configuring qdisk (2-Node Only)

Note: If you have a cluster with three or more nodes, you can skip this step.
Warning: If you are already using qdisk, do not create a new qdisk device while the cluster is running! It will cause the cluster to malfunction.

In this document, the SAN device is mounted on each node as /dev/sdb. If you do not have an existing SAN, see the brief tgtd tutorial mentioned earlier for setting one up.

Create the quorum disk using the following command;

Note: In the example below, the label an-cluster-01.wdmd is used. This is a free-form label between 1 and 128 characters in length.
mkqdisk -c /dev/sdb -l an-cluster-01.wdmd
mkqdisk v3.1.99
 
Writing new quorum disk label 'an-cluster-01.wdmd' to /dev/sdb.
WARNING: About to destroy all data on /dev/sdb; proceed [N/y] ? y
Initializing status block for node 1...
Initializing status block for node 2...
Initializing status block for node 3...
Initializing status block for node 4...
Initializing status block for node 5...
Initializing status block for node 6...
Initializing status block for node 7...
Initializing status block for node 8...
Initializing status block for node 9...
Initializing status block for node 10...
Initializing status block for node 11...
Initializing status block for node 12...
Initializing status block for node 13...
Initializing status block for node 14...
Initializing status block for node 15...
Initializing status block for node 16...

With this now created, edit your cluster.conf file to use the quorum disk. This involves removing the <cman expected_votes="1" two_node="1"/> entry and replacing it with the quorum configuration. The important value to set is label="an-cluster-01.wdmd", which tells cman which device is its qdisk. For more information on the various available qdisk attributes and values, read man 5 qdisk.

vim /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="an-cluster-01">
	<quorumd label="an-cluster-01.wdmd"/>
	<clusternodes>
		<clusternode name="an-c01n01.alteeve.ca" nodeid="1"/>
		<clusternode name="an-c01n02.alteeve.ca" nodeid="2"/>
	</clusternodes>
	<fencedevices/>
	<fence_daemon post_join_delay="30"/>
	<totem rrp_mode="none" secauth="off"/>
</cluster>
Note: Fence methods and fence devices are not defined under each node when using checkquorum.wdmd.

If you want to configure checkquorum.wdmd, you can do so by creating or editing /etc/sysconfig/checkquorum.

Configuration options:

  • waittime (natural number); The number of seconds to wait after losing quorum before declaring a failure. What this means depends on what action is set to.
  • action; The action taken, either immediately if corosync crashes or after waittime seconds once quorum is lost. The delay is used in case the node is able to rejoin the cluster, thus regaining quorum. Valid values are:
    • autodetect (default); If kdump is running, a crash dump is attempted. If kdump is not running, an error is returned to wdmd, which in turn allows the watchdog to reboot the machine.
    • hardreboot; Triggers a kernel hard-reboot.
    • crashdump; Triggers a kdump action in the kernel.
    • watchdog; Returns an error to wdmd, which in turn allows the watchdog to reboot the machine.
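A minimal /etc/sysconfig/checkquorum might look like the following; the option names are those described above, but the values are illustrative only.

```shell
# /etc/sysconfig/checkquorum -- example settings (illustrative values).
# Wait 60 seconds after losing quorum before declaring a failure.
waittime=60
# Return an error to wdmd so the watchdog reboots the machine.
action=watchdog
```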

If kdump is running, it is advised that you also use fence_kdump. This will allow the failed node to inform the other nodes that it has rebooted, speeding up recovery time. This is optional and should be considered an optimization.

Warning: The fence_kdump agent is totally untested. Please use it cautiously.

There are limitations to using checkquorum.wdmd;

  • Note that checkquorum.wdmd does not work when two_node="1" is set in /etc/cluster/cluster.conf unless it is used in combination with qdiskd's master_wins mode. Given the need for shared storage in that case, however, it is better to just use fence_sanlock.
  • When using checkquorum.wdmd, fencing is considered complete only after the failed node has rebooted and rejoined the cluster. If this fails, or if there is a failure in the cluster's network, the cluster will hang indefinitely. Likewise, if the node has failed completely and does not restart, the cluster will also hang.
  • If all communication is lost between the nodes, such as a failure in the core switch(es), all nodes will lose quorum and reboot.

Using checkquorum.wdmd

Before checkquorum.wdmd can be used, we must first copy the checkquorum.wdmd script into the watchdog script directory and make it executable. Then we will restart wdmd to enable the new script.

Copy checkquorum.wdmd and set up its permissions and ownership.

cp /usr/share/cluster/checkquorum.wdmd /etc/wdmd.d/
chown root:root /etc/wdmd.d/checkquorum.wdmd
chmod u+x /etc/wdmd.d/checkquorum.wdmd

Stop cman if it is running, then restart the wdmd daemon and start cman again.

systemctl stop cman.service
systemctl stop wdmd.service
systemctl start wdmd.service
systemctl start cman.service

Testing checkquorum.wdmd

If everything is working, you should now be able to recover from a hard-crash on a node in the cluster. To test this, you can echo c to /proc/sysrq-trigger and the other node should, eventually, fence the lost node. Remember: the cluster will remain blocked until either;

  • The lost node reboots and rejoins the cluster in a clean state.

or;

  • You manually clear the fence by issuing fence_ack_manual.
Warning: You must be certain that the failed node has been manually powered off before manually clearing the fence. Failing to do so could cause a serious failure in your cluster!
echo c > /proc/sysrq-trigger

In the system log of the other node, you will see messages similar to those in the example below.

Oct 22 22:01:25 an-c01n01 qdiskd[2138]: Writing eviction notice for node 2
Oct 22 22:01:26 an-c01n01 qdiskd[2138]: Node 2 evicted
Oct 22 22:01:28 an-c01n01 corosync[2084]:   [TOTEM ] A processor failed, forming new configuration.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.2) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [QUORUM] Members[1]: 1
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] CLM CONFIGURATION CHANGE
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] New Configuration:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] #011r(0) ip(10.20.10.1) 
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Left:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CLM   ] Members Joined:
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [CPG   ] chosen downlist: sender r(0) ip(10.20.10.1) ; members(old:2 left:1)
Oct 22 22:01:30 an-c01n01 kernel: [13038.313753] dlm: closing connection to node 2
Oct 22 22:01:30 an-c01n01 corosync[2084]:   [MAIN  ] Completed service synchronization, ready to provide service.
Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:33 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:33 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed
Oct 22 22:01:36 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:37 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed

Note the errors;

Oct 22 22:01:30 an-c01n01 fenced[2218]: fencing node an-c01n02.alteeve.ca
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca dev 0.0 agent none result: error no method
Oct 22 22:01:30 an-c01n01 fenced[2218]: fence an-c01n02.alteeve.ca failed

This is expected and can be safely ignored. After a minute or so, the hung node should reboot. Once it's back online and rejoins the cluster, the cluster should go back to normal operation.

Using fence_sanlock And checkquorum.wdmd

At this time, using both fence_sanlock and checkquorum.wdmd is not supported.

Permissions

I give unrestricted permission to Red Hat, Inc. to copy this document in whole or in part.

 
