AN!Wiki :: How To :: RHCS v2 cluster.conf

NOTICE: Do not trust this document until all "Q." are answered and removed.
NOTICE: This is a work in progress and likely contains errors and omissions.
ToDo: Show the command line switches for each argument (or that there is none) and which command uses it.

In RHCS, /etc/cluster/cluster.conf is the "main" configuration file for setting up the cluster and its nodes and resources.

This is the RHCS "Stable 2" version of the cluster.conf document. It started life as a clone of the "Stable 3" document, and may contain traces of options and values not available in Stable 2. Once I feel that this document is accurate, I will remove these warnings.

Warning: I screwed up the names of some syntax. Anything in <...> is a tag, not an element. An element is an opening tag, its matching closing tag and everything between them (or a single self-closing tag). I may have screwed up other syntax as well and will need to check. Officially though, any <foo var="val"> variable/value pair is an attribute.

Format

The cluster.conf file is an XML formatted file that must validate against cluster.ng. If it fails to validate, the cluster will not use your file. Once you finish editing your cluster.conf file, test it via xmllint:

xmllint --relaxng /usr/share/cluster/cluster.ng /etc/cluster/cluster.conf

Change the path to and name of your cluster.ng file above if needed. Do not try to use your new configuration until it validates.

The cluster.conf file should be in the format:

<?xml version="1.0"?>
<cluster name="cluster_name" config_version="1">
	<...>
</cluster>

Tags may or may not have child elements. If a tag does not, then put all of the attributes in one self-closing tag.

	<foo a="x" b="y" c="z" />

If the tag does accept child elements, then use a start and end tag with the child elements inside. The opening tag may or may not have attributes. This example shows two elements.

	<section foo="x" bar="y">
		<baz a="x" b="y" c="z" />
	</section>

Sections

There are multiple sections, most of which are optional and can be omitted if not used.

cluster; The Parent Tag

All tags and elements must be inside the parent cluster tag.

It has only two attributes: name and config_version.

Please see man 5 cluster.conf for more details.

name

This attribute names the cluster. The name you choose will be important, as you will use it elsewhere in your cluster. An example would be when creating a GFS2 partition.

No default.

config_version

This is the current version of the cluster.conf file. Every time you make a change, you must increment this value by one. The cluster software refers to this value when determining which configuration file to use and to push to other nodes.

No default
Must be a natural number

cluster example

This names the cluster an-cluster and sets the version to 1. All other cluster configurations must be contained inside this start and end tag.

<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
	<!-- All cluster configuration options go here. -->
</cluster>
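
As a sketch of the increment rule described under config_version, the same file after one edit would carry the next version number. The single edit is an assumption made for illustration; the cluster name is the one used in the example above.

```xml
<?xml version="1.0"?>
<!-- Previously config_version="1"; one change was made, so it is now 2. -->
<cluster name="an-cluster" config_version="2">
	<!-- All cluster configuration options go here. -->
</cluster>
```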

cman; The Cluster Manager

The cman tag is used to define general cluster configuration options. For example, it sets the number of expected votes, whether the cluster is running in the special two-node state and so forth.

If you have no need for cman attributes, use the self-closing tag.

	<cman />

two_node

This allows you to configure a cluster with only two nodes. Normally, the loss of quorum after one of two nodes fails prevents the remaining node from continuing (if both nodes have one vote). The default is '0'. To enable a two-node cluster, set this to '1'. If this is enabled, you must also set 'expected_votes' to '1'.

Default is 0 (disabled)
Must be set to 0 or 1

expected_votes

This is used by cman to determine quorum. The cluster is "quorate" if the sum of votes of members is over half of the expected votes value. By default, cman sets the expected votes value to be the sum of votes of all nodes listed in cluster.conf. This can be overridden by setting an explicit expected_votes value. When two_node is set to 1, this must be set to 1 as well. Please see clusternode in the cluster section for more info. If you are using a quorum disk, please see the quorumd section as well.

Q. Does the automatic sum also calculate the votes assigned to the quorum disk?

Default is the sum of all node votes.
Must be a natural number
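
A minimal sketch of setting an explicit value, assuming a hypothetical three-node cluster where each node has one vote. The node count and vote values are assumptions chosen only to make the arithmetic easy to follow.

```xml
	<!-- Assumed: three nodes with one vote each, so the automatic sum would also be 3. -->
	<!-- "Over half" of 3 means at least 2 votes must be present for the cluster to be quorate. -->
	<cman expected_votes="3" />
```

With these assumed values, losing one node leaves 2 of 3 votes and the cluster stays quorate. Losing two nodes leaves 1 vote and quorum is lost.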

upgrading

Set this to yes when you are performing a rolling upgrade of the cluster between major releases.

Q. Does this mean cman version, distro version, ...?

Default is no
Must be set to yes or no

disallowed

This option controls cman's "Disallowed" mode. Setting this to 1 may improve backwards compatibility.

Q. How and where exactly?

The default is 0, disabled.
Must be set to 0 or 1

quorum_dev_poll

This is the number of milliseconds after a qdisk poll before a quorum disk is considered dead.

The quorum disk daemon, qdisk, periodically sends "hello" messages to cman and ais, indicating that qdisk is present. If cman doesn't receive a "hello" message in the time set here, cman will declare qdisk dead and generate error messages indicating that the connection to the quorum device has been lost.

Please see the quorumd section for more information on using quorum disks.

Q. Is the default really 50 seconds or is that just the example used?

Default is 50000 (50 seconds)
Must be a natural number

shutdown_timeout

This is the number of milliseconds to wait for a service to respond during a shutdown.

Q. What happens after this time? Q. Does this refer to crm/pacemaker controlled services or any service?

The default is 5000 (5 seconds)
Must be a natural number

ccsd_poll

No info yet.

Default is 1000
Must be a natural number

debug_mask

No info yet.

Unknown Default
Unknown value restrictions

port

No info yet.

Q. Is this for the primary totem ring?

Unknown Default
Unknown value restrictions

cluster_id

No info yet.

Unknown Default
Unknown value restrictions

hash_cluster_id

Enable stronger hashing of cluster ID to avoid collisions.

Q. How? What is an example value?

Unknown Default
Unknown value restrictions

nodename

Local node name; this is set internally by cman-preconfig and should never be set unless you understand the repercussions of doing so. It is here for completeness only.

Unknown Default
Unknown value restrictions

broadcast

Enable cman broadcast. To enable, set this to yes.

Q. Under what conditions would this be enabled?

Default is no, disabled.
Must be yes or no

keyfile

No info yet.

Unknown Default
Unknown value restrictions

disable_openais

No info yet.

Unknown Default
Unknown value restrictions

multicast

This provides the ability for a user to specify a multicast address instead of using the multicast address generated by cman.

By default, cman forms the upper 16 bits of the multicast address with 239.192 and forms the lower 16 bits based on the cluster ID.

Q. Does this have to do with the totem ring? Q. What generates the cluster ID when it's not specified by the user?

See above for the default
Must be a valid IPv4 style multicast address

Madi: Test this, is 'addr' an attribute of 'multicast' or of 'cman'?

This element has one attribute; addr

addr

This is where you can define a multicast address. If you specify a multicast address, ensure that it is in the 239.192.0.0/16 network which cman uses. Using a multicast address outside this range is untested.

Q. Is this for the first totem ring?

Unknown Default
Unknown value restrictions
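
A sketch of a manually defined multicast address, assuming that addr belongs to a multicast child tag of cman, which is how the text above describes it. That placement is still marked as untested in this document, and the address is a hypothetical value inside the 239.192.0.0/16 range.

```xml
	<cman>
		<!-- Hypothetical address, chosen from the 239.192.0.0/16 network that cman uses. -->
		<multicast addr="239.192.0.10" />
	</cman>
```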

cman example

This is a common scenario used in two-node clusters.

	<cman two_node="1" expected_votes="1" />

totem; Totem Ring and the Redundant Ring Protocol

This controls the OpenAIS message transport protocol.

Q. Does this also control corosync? Q. Are there specific arguments for either?

consensus

When the cluster tries to form, totem will wait this many milliseconds for consensus. If this timeout is reached, the cluster will give up and attempt to form a new cluster configuration. If you set this too low, your cluster may fail to form when it otherwise could have. If you set this too high, it will delay error detection and recovery.

The default is 2000 (2 seconds)
This must be a natural number

join

This tells the totem protocol how long to wait, in milliseconds, for JOIN messages to come from each node. This must be lower than the consensus time. Setting this too low could cause a healthy node to fail to join the cluster. Setting it higher will slow down the assembly of the cluster when a node has failed.

Q. Is this really in milliseconds?

The default is 60 (0.06 seconds).
This must be a natural number

token

This sets the maximum amount of time, in milliseconds, the totem protocol will wait for a token. If this time elapses, the cluster will be reformed, which takes approximately 50 milliseconds. The total reconfiguration time is, then, this value multiplied by token_retransmits_before_loss_const, plus the reconfigure time. With the defaults, that is (4 * 1000) + 50 = 4050 ms.

The default value is 1000 (1 second).
This must be a natural number

fail_recv_const

No info

The default is 2500
Unknown values

token_retransmits_before_loss_const

This controls how many times the totem protocol will attempt to retransmit a token before giving up and forming a new configuration. If this is set, retransmit and hold will be calculated automatically using retransmits_before_loss and token.

Default is 4
This must be a natural number
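
To show how the timing attributes described so far fit together, here is a sketch that simply writes out the defaults listed in this section. Setting them explicitly like this is an assumption made for illustration, not a recommended configuration.

```xml
	<!-- join (60 ms) is lower than consensus (2000 ms), as required. -->
	<!-- token is 1000 ms, with 4 retransmits before a new configuration is formed. -->
	<totem consensus="2000" join="60" token="1000" token_retransmits_before_loss_const="4" />
```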

rrp_mode

This attribute specifies the redundant ring protocol mode. It can be set to active, passive, or none. Active replication offers slightly lower latency from transmit to delivery in faulty network environments but with less performance. Passive replication may nearly double the speed of the totem protocol if the protocol doesn't become cpu bound. The final option is 'none', in which case only one network interface is used to operate the totem protocol.

If only one interface directive is specified, none is automatically chosen. If multiple interface directives are specified, only active or passive may be chosen.

NOTE: Be sure to set this if you are using redundant rings! If you wish to use a redundant ring, it must be configured in each node's clusternode entry. See below for an example.

Default is none
Valid options are active, passive, or none

Manual Ring Recovery

If a ring fails and then is restored, you must manually run the following to re-enable the ring.

corosync-cfgtool -r

Verify the state by then running:

corosync-cfgtool -s

secauth

This attribute specifies whether HMAC/SHA1 authentication should be used to authenticate all messages or not. It further specifies that all data should be encrypted with the sober128 encryption algorithm to protect data from eavesdropping.

If the totem ring is on a private, secure network, disabling this can improve performance. Please test to see if the extra performance is worth the reduced security.

Q. Is the default actually 'on'?

The default is on
Valid values are on and off

keyfile

No info

Q. In objctl, there is a value called 'totem.key=<cluster_name>'. Is this related?

Unknown default
Unknown valid values

Tag; interface

The totem tag supports one optional interface child tag. If you use this child tag, be sure to use start and end tags.

	<totem ...>
		<interface ... />
	</totem>

ringnumber

This sets the ring number, with 0 being the primary ring and 1 being the secondary ring. Currently, only two rings are supported.

No default value
Valid values are 0 and 1

bindnetaddr

This tells totem which network interface to use. It must match the subnet of your chosen interface. The final octet must be 0.

This can be an IPv6 address, however, you will be required to set the nodeid value above. Further, there will be no automatic interface selection within a specified subnet as there is with IPv4.

Q. With IPv6, how then is the given interface chosen?

No default value
See description for valid values

mcastaddr

This sets the multicast address used by the totem protocol on this ring. Avoid the 224.0.0.0/8 range as that is used for configuration. If you use an IPv6 address, be sure to specify a nodeid value above.

Q. Is there a default? Is it automatically calculated like in cman?

No default
Must be a valid IPv4 or IPv6 IP address

mcastport

This sets the UDP port used with the multicast address above.

Q. Can the port be below 1024?

No default
Must be a valid port (natural number between 1 and 65535).

broadcast

No info

Q. Can the port be below 1024?

Unknown default
Must be a valid broadcast address

totem example

This example shows secauth being disabled and the optional <interface...> child tag defining the first ring's parameters.

        <totem secauth="off" rrp_mode="passive">
                <interface ringnumber="0" bindnetaddr="10.0.1.0" mcastaddr="239.192.122.47" mcastport="5405" />
        </totem>

Note: Please see the bug numbered 624289.

quorumd; Quorum Daemon

In older versions of RHCS, a quorum partition was used to maintain quorum, with the network acting as a fallback. This eventually faded out of fashion and quorum disk partitions were rarely used. Today, quorum partitions are still not required, but they are coming back into fashion as a way to improve the reliability of a cluster in a multiple-failure state and to provide more intelligent quorum.

Let's look at a couple of examples:

If you have a four-node cluster and two nodes fail, the surviving two nodes will not have quorum because normal quorum requires a majority (n/2+1). In this case, your cluster would shut down when it could have kept going. Adding a quorum disk would have allowed the surviving two nodes to maintain quorum.

If you have a four-node cluster and a network event occurred where only one node retained access to a critical network, you would want that one node to proceed and you would rather fence the three nodes that lost access. Under normal IP quorum, the opposite would happen because, by simple majority, the one good node would be fenced by the three other nodes. The quorumd daemon can have heuristics added. In this case, we would configure each node's quorumd to first check that critical network connection. The three nodes would see that they'd lost the link and remove themselves from the cluster. In this way, only the one good node would remain up and win quorum thanks to the votes assigned to the quorum disk.

In short, the quorum disk allows a much more fine grained control of quorum in corner-case failure states.

This section is not required and can be left out when you aren't using a quorum disk partition.

A quorum partition cannot be used in clusters greater than 16 nodes. This is because the latency in clusters larger than 16 nodes makes quorum disks unreliable. With 17 or more nodes, you must use IP-based (totem protocol) quorum only.

A quorum disk must be a raw 10MB or larger (11MB recommended) partition on an iSCSI or SAN device. It is recommended that your nodes use multipath to access the quorum disk. You can not use a CLVM partition.

Q. On a 2-node DRBD partition, can a raw 10MB partition be used? This is probably irrelevant as there is the 'two_node' cman option, but might be useful for the heuristics in a split brain.

Reference; redhat article from 2007.

interval

This controls how often, in seconds, the quorum daemon on a node will attempt to write its status to the quorum disk and read the status of other nodes. The higher this value is, the less chance that a transient error will dissolve quorum, but the longer it will take to detect and recover from a failure.

Please see the heuristic element below for heuristics intervals. Q. Is this accurate? Q. Does this control the heuristics or disk poll?

Default is 2
Must be a natural number

tko

If a node fails the heuristics checks and/or fails to contact the quorum disk after this many intervals, it will be declared dead and will be fenced (a "Technical Knock Out"). To determine how long this will actually take, multiply interval by tko and you will have the value in seconds.

If you are using Oracle RAC, be sure that this and the interval values are high enough to give the RAC a chance to react to a failure first. So if your RAC timeout is set to 60 seconds, and you are using the default interval of 2, it is recommended to set this to at least 35 (70 seconds).

Q. Is there a modern variant on the 'cman_deadnode_timeout' and, if so, does interval*tko still need to be lower? Q. There seems to be no default in objctl.

No default
Must be a natural number

votes

This is the number of votes assigned to the quorum disk. This value should be the total number of votes of your cluster minus the minimum number of nodes your cluster can operate with. For example, if you have a four-node cluster that can operate with just one node, you would set this to 3 (4-1). This value must be set when using a quorum disk as there is no default.

Q. Is this true, or would the votes be calculated?

No default
Must be a natural number

min_score

The minimum score for a node to be considered alive. If omitted or set to 0, the default function, floor((n+1)/2), is used, where n is the sum of the heuristics scores. The minimum score value must never exceed the sum of the heuristic scores. If set higher, it will be impossible for the heuristics tests to pass. If the resulting score is below this value, the node will reboot to try to return in a better state.

Q. Does it reboot after one failure?

See above for default
Must be a natural number

device

The storage device the quorum daemon uses. The device must be the same on all nodes. It has no default and must be set unless you set label below. For example, if you created your quorum disk with the call:

mkqdisk -c /dev/sdi1 -l rac_qdisk

This will be set to /dev/sdi1. When possible, set the label option below instead, as it is more robust. If you use 'label' instead of this, then the device does *not* need to be the same among nodes. In short, don't set this unless you have a good reason to.

Q. Is this true?

No default
Must be a valid device path

label

Specifies the quorum disk label created by the mkqdisk utility. If you look at the example given in the device argument above, then this would be rac_qdisk. Setting this instead of device is preferable. If you set this, then device is in fact ignored.

If this field is used, the quorum daemon reads /proc/partitions and checks for qdisk signatures on every block device found, comparing the label against the value below. This is useful in configurations where the quorum device name differs among nodes.

No default
Must be a valid mkqdisk label

quorumd example

No example yet
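
Until a tested example is available, the following is an untested sketch assembled only from the attributes described above. It assumes the hypothetical four-node cluster from the votes description, the default interval, the tko value from the Oracle RAC note and the rac_qdisk label from the mkqdisk call. Heuristics are left out.

```xml
	<!-- interval * tko = 2 * 35 = 70 seconds before a node is declared dead. -->
	<!-- votes = 4 nodes - 1 minimum node = 3. -->
	<!-- label is used instead of device, so the device path may differ between nodes. -->
	<quorumd interval="2" tko="35" votes="3" label="rac_qdisk" />
```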

dlm; The Distributed Lock Manager

The distributed lock manager is used to protect shared resources from corruption by ensuring that the nodes in a cluster work together in an organized fashion. This is particularly critical with clustered file systems like gfs2.

See man dlm_controld

protocol

This tells DLM which protocol to use. By default (detect), DLM automatically determines whether to use TCP or SCTP depending on the rrp_mode. You can force one protocol or the other by setting this to tcp or sctp. If rrp_mode is none, then tcp is used.

Default is detect.
Valid values are detect, tcp and sctp

timewarn

This specifies how many 100ths of a second (centiseconds) to wait before dlm emits a warning via netlink. This value is used for deadlock detection and only applies to lockspaces created with the DLM_LSFL_TIMEWARN flag.

Q. This should be explained better. It relies too heavily on assumed knowledge.

Default is 500 (5 seconds)
Must be a natural number

log_debug

Setting this to 1 will enable DLM debug messages.

Q. Do these messages go to /var/log/messages?

Default is 0 (disabled)
Valid values are 0 and 1

enable_fencing

This controls fencing recovery dependency. Set this to '0' to disable fencing dependency.

Q. Does this allow cman to start when no fence device is configured? Why would a user ever disable this?

Default is 1 (enabled)
Valid values are 0 and 1

enable_quorum

This controls quorum recovery dependency. Set this to 0 to disable quorum dependency.

Q. Does this mean that a non-quorum partition will attempt to continue functioning?

Default is 1 (enabled)
Value must be 0 or 1

enable_deadlk

This controls the deadlock detection code. To enable deadlock detection, set this to 1.

Q. Is this primarily a debugging tool?

Default is 0 (disabled)
Value must be 0 or 1

enable_plock

This controls the posix lock code for clustered file systems. This is required by cluster-aware file systems like GFS2, OCFS2 and similar. In some cases though, like Oracle RAC, plock is implemented internally and thus this needs to be disabled in the cluster. Also, plock can be expensive in terms of latency and bandwidth. Disabling this may help improve performance but should only be done if you are sure you do not need posix locking in your cluster. To disable it, set this to 0.

Unlike flock (file lock), which locks an entire file, plock allows for locking parts of a file. When a plock is set, the file system must know the start and length of the lock. In clustering, this information is sent between the nodes via cpg (the cluster process group), which is a small process layer on top of the totem protocol in corosync.

Messages are of the form take lock (pid, inode, start, length). Delivery of these messages is kept in the same order on all nodes (total order), which is a property of 'virtual synchrony'. For example, if you have three nodes, A, B and C, and each node sends two messages, cpg ensures that the messages all arrive in the same order across all nodes. The messages may arrive as c1,a1,a2,b1,b2,c2. The actual order doesn't matter though, just that it's a consistent order.

For more information on posix locks, see the fcntl man page and read the sections on F_SETLK and F_GETLK.

man fcntl

For more information on cpg, install the corosync development libraries (corosynclib-devel) and then read the cpg_overview man page.

yum install corosynclib-devel
man cpg_overview

Default is 1 (enabled)
Value must be 0 or 1

plock_rate_limit

This controls the rate of plock operations per second. Set a natural number to impose a limit. This might be needed if excessive plock messages are causing network load issues.

Default is 0 (unlimited)
Must be a natural number
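
A sketch of imposing a limit, assuming a hypothetical ceiling of 100 plock operations per second. The number is a placeholder, not a recommendation. A suitable value would have to come from testing on your own network.

```xml
	<!-- Hypothetical limit of 100 plock operations per second; 0 would mean unlimited. -->
	<dlm plock_rate_limit="100" />
```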

plock_ownership

This controls the plock ownership function. When enabled, performance gains may be seen where a given node repeatedly issues the same lock. This can affect backward compatibility with older versions of dlm. To disable it, set this to 0.

Q. Is this right? This should be explained better.

Default is 1 (enabled)
Value must be 0 or 1

drop_resources_time

This is the number of milliseconds to wait before dropping the cache of lock information. The lower this value, the better the performance but the more memory will be used.

NOTE: This value is ignored when plock_ownership is disabled.

Q. Is this true?

Default is 10000 (10 seconds)
Must be a natural number

drop_resources_count

This is the number of cached items to attempt to drop each drop_resources_time milliseconds. The higher this number, the better the potential performance, but the more memory will be used.

NOTE: This value is ignored when plock_ownership is disabled.

Q. Is this right?

Default is 10
Must be a natural number

drop_resources_age

This is the number of milliseconds that a cached item is allowed to go unused before it is set to be dropped. The lower this value, the better the performance but the more memory will be used.

NOTE: This value is ignored when plock_ownership is disabled.

Q. Is this right?

Default is 10000 (10 seconds)
Must be a natural number

dlm example

This example increases memory use to try and gain performance.

	<dlm protocol="detect" drop_resources_time="5000" drop_resources_count="20" drop_resources_age="5000" />

gfs_controld; GFS Control Daemon

There are several <gfs_controld...> tags that are still supported, but they have been deprecated in favour of the <dlm_controld...> tags.

If you wish to use these deprecated tags, please see the gfs_controld man page.

man 8 gfs_controld

enable_withdraw

The one remaining argument that is still current is enable_withdraw. When set to 1, the default, GFS will respond to a withdraw. To disable the response, set this to 0.

Q. What does the response actually do?

Default is 1 (enabled)
Value must be 0 or 1

gfs_controld example

	<gfs_controld enable_withdraw="0" />

fence_daemon; Fencing

This controls how and when a node is fenced.

post_join_delay

When the cluster manager starts on a node, it will see which other nodes it can talk to. If it reaches enough nodes to form quorum, then it will wait for this amount of time, in seconds, for the remaining nodes to appear. If this time passes, the cluster manager will issue fences against the missing nodes. The shorter this time is, the faster missing nodes will be brought online or recovered. If this value is too low, nodes that were booting could be fenced simply for not booting fast enough.

Red Hat recommends setting this to 20 seconds if the default of 6 seconds is too low. Experiment to find the lowest safe value that gives your nodes a fair chance to be started in your cluster.

If this is set to -1, the node will wait forever for other nodes to join. No fence will be issued.

Default is 6 seconds
Value must be -1, 0 or a natural number

post_fail_delay

This tells the cluster to wait the specified number of seconds before fencing a node once it's declared dead. This should only be used when you have a good reason to. If this is set and a node dies, all nodes accessing clustered file systems will hang until the node is finally fenced. This is because recovery can not proceed on these clustered file systems until the fence daemon has reported a successful fence.

The benefit of this option is that, if a node suffers from a transient network fault, it can avoid being fenced by recovering within this time span.

Default is 0 seconds
Value must be 0 or a natural number

clean_start

Warning: Do not use this option unless you fully understand the repercussions of doing so. This option could corrupt clustered file systems.

When this is set to 1, enabled, it tells the cluster manager to assume that all nodes are clean and safe to start. This disables any startup fencing and allows the node to fully start. This is only safe if an administrator can ensure that all nodes are offline and have cleanly unmounted clustered file systems. Its use is strongly discouraged.

Default is 0 (disabled)
Value must be 0 or 1

fence_daemon example

This is a very common configuration where high-speed recovery or start times are not required. This allows nodes to start up to 20 seconds apart from one another before they'd be otherwise fenced.

	<fence_daemon post_join_delay="20" />

fence_xvmd; Xen domU Fencing

Note: You only need to use this section if you are creating a cluster out of Xen virtual machines.

Warning: I've not built this type of a cluster to test with yet. I am not comfortable that I understand the various implications of changing values from their defaults. If you have more information, please contact me.

This daemon works along with the fence_xvm fence agent to control fencing a cluster of Xen domU virtual servers.

When a virtual server needs to fence another virtual server, its cluster manager calls the fence_xvm fence agent in the domU fence domain. The agent in turn connects to the fence_xvmd on dom0 which performs the actual fence.

For it to work, the "parent" dom0 servers must also be in a cman/corosync cluster, but not in the same cluster as the domUs. The dom0 nodes must themselves be configured with fencing for automatic recovery.

debug

This enables debugging output in the log file. The higher this value, the more verbose the debug messages will be.

Default is 0 (disabled)
Must be 0 or a natural number

port

This sets the port to listen on.

Q. TCP and/or UDP? Must it be >1024? Is this the port that the multicast messages from the domU fence domain will come in on?

Default is 1229
Must be a valid port number (1 to 65535)

use_uuid

Fence using the UUID of a domU rather than by its Xen domain name. Set to 1 to enable this.

Q. Is it actually '1' to enable? If so, are the UUIDs to use set in the domU fence domain?

Default is 0 (disabled)
Must be 0 or 1

multicast_address

This is the multicast address to listen for messages on.

Q. Does this have to match the 'multicast_address' set in the domU cluster? Is the 'family' now auto-detected from the format of this attribute's value?

Default is 225.0.0.12 for IPv4 or ff02::3:1 for IPv6
Must be a valid IPv4 or IPv6 address, depending on how

auth

This controls the encryption method used for authentication messages. If you enable encryption, you will need to share the key among the members of the domU fence domain as well as with the dom0 host machines. The authentication mechanism uses a bidirectional challenge-response based on pseudo-random number generation and a shared private key.

If you enable this, you must also set the key_file attribute.

Default is none
Must be none, sha1, sha256 or sha512

hash

This controls the encryption method used for fence messages. If you enable encryption, you will need to share the key among the members of the domU fence domain as well as with the dom0 host machines. The authentication mechanism uses a bidirectional challenge-response based on pseudo-random number generation and a shared private key.

If you enable this, you must also set the key_file attribute.

Default is none
Must be none, sha1, sha256 or sha512

key_file

This is the path to the file with the hash key used for authentication and fence message hashing. This must be set if either auth or hash is set to anything other than none. If both of these are set to none, this will be ignored regardless of its value.

ToDo: Show how to create this file.

Q.: Does this have to be fully defined or can it be relative to a default directory?

No default
Must be a fully qualified path and file name which contains the hash key

uri

No info

Unknown default
Unknown values

multicast_interface

No info

Unknown default
Unknown values

fence_xvmd example

No example yet
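
Until a tested example is available, the following is an untested sketch built only from the attributes described above. The key file path is a hypothetical location and sha256 is an arbitrary pick from the listed values. The port and multicast address are the documented defaults.

```xml
	<!-- auth and hash are both enabled, so key_file must be set. -->
	<!-- /etc/cluster/fence_xvm.key is an assumed path; the same key must be shared with the domU nodes. -->
	<fence_xvmd port="1229" multicast_address="225.0.0.12" auth="sha256" hash="sha256" key_file="/etc/cluster/fence_xvm.key" />
```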

group; Cluster 2 Compatibility

This tag has only one option and is used during an upgrade of a cluster from Cluster version 2 to Cluster version 3. It controls the compatibility modes of fenced, dlm_controld and gfs_controld daemons.

See man 8 groupd

groupd_compat

When set to 1, fenced, dlm_controld and gfs_controld will run in a mode compatible with Cluster version 2 (as used in RHEL/CentOS 5.x and similar). This will allow nodes from those old clusters to participate in the new cluster until they are upgraded.

This should be disabled once the upgrade of all cluster nodes is complete.

Default is 0 (disabled)
Must be 0 or 1

group example

This is the only entry you will use.

	<group groupd_compat="1" />

logging; Global Logging Settings

This controls both the global logging options and the daemon-specific logging options as set by child-tags.

For more info, see man 5 cluster.conf

to_syslog

This controls whether log messages are sent to syslog. Messages sent to syslog will, on most distributions, be written in /var/log/messages. Set this to no to disable.

Default is yes
Must be yes or no

to_logfile

This controls whether log messages are sent to a specific log file. Set this to yes to enable it. If set to yes, you must also set logfile.

Default is no
Must be yes or no

syslog_facility

This sets the syslog facility to assign to log messages. It is recommended to only use local0 through local7.

Default is local4
Valid values are auth, authpriv, daemon, cron, ftp, lpr, kern, mail, news, syslog, user, uucp, local0, local1, local2, local3, local4, local5, local6 and local7

syslog_priority

This sets the syslog priority of log messages.

Default is info
Valid values are Emergency, Alert, Critical, Error, Warning, Notice, Info and Debug

logfile

When to_logfile is set to yes, this must be set to define which file the logs are written to. It must be the full path and file name of the desired log file.

No default
Must be a valid path and file name

debug

When set to on, extra debugging information will be sent to the log(s).

Default is off
Valid options are on and off

Tag; logging_daemon

Zero or more logging_daemon elements are supported within the <logging ...> tag. Each one can be used to control specific logging settings for the named daemon.

Optional attributes include to_syslog, to_logfile, syslog_facility, syslog_priority, logfile_priority, logfile and debug. All of these optional attributes share the same defaults and valid values as in the parent tag. When not defined, the value assigned to the parent tag is inherited.

The following two attributes are supported by the logging_daemon element. The name attribute is required.

name

This is the name of the daemon being configured. This attribute is required.

No default
Valid values are corosync, qdiskd, groupd, fenced, dlm_controld, gfs_controld and rgmanager

subsys

By default, the settings for the named daemon apply to all of the corosync subsystems. You can configure the given subsystems individually by naming them using the subsys attribute. You can have multiple matching <logging_daemon name="foo" ...> entries provided that each has a unique subsys attribute value. If any subsystem is named, then any unnamed subsystem will use the values set in the parent logging tag.

No default (all subsystems configured)
Valid values are CLM, CPG, MAIN, SERV, CMAN, TOTEM, QUORUM, CONFDB, CKPT and EVT
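
As a rough illustration of per-subsystem settings, the sketch below configures two corosync subsystems separately. The choice of subsystems and the log file path are hypothetical. As described above, any corosync subsystem not named here would use the values from the parent logging tag.

```xml
<logging debug="off">
	<!-- Two entries share name="corosync"; each has a unique subsys value. -->
	<logging_daemon name="corosync" subsys="TOTEM" debug="on" />
	<!-- Hypothetical log file path, used only for this subsystem. -->
	<logging_daemon name="corosync" subsys="QUORUM" to_logfile="yes" logfile="/var/log/cluster/corosync-quorum.log" />
</logging>
```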

logging example

no example yet.
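
Until a verified example is added, here is a minimal sketch built only from the attributes described in this section. The log file paths are hypothetical and the values are illustrative, not recommendations.

```xml
<logging to_syslog="yes" syslog_facility="local4" to_logfile="yes" logfile="/var/log/cluster/cluster.log" debug="off">
	<!-- rgmanager inherits the settings above, except for its own log file and debug value. -->
	<logging_daemon name="rgmanager" logfile="/var/log/cluster/rgmanager.log" debug="on" />
	<!-- qdiskd still logs to syslog (inherited) but does not write to a log file. -->
	<logging_daemon name="qdiskd" to_logfile="no" />
</logging>
```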

clusternodes; Defining Cluster Nodes

This section is where all nodes in the cluster are defined, including their fence devices and other node-specific attributes. It has no attributes of its own; rather, it contains zero or more <clusternode ...> child tags.

Tag; clusternode

Each node in the cluster must have its own <clusternode ...> tag. The tag itself supports four attributes and can contain two child tags.

clusternode's name attribute

This is the name of the cluster node. It should match the name returned by uname -n. All other nodes must be able to connect to the node by that name. If necessary, edit your /etc/hosts file or your DNS server to ensure that this name resolves properly. Further, the IP address returned when resolving this name determines the interface used for the primary totem ring.

All node names must resolve to an IP address in the same network as all other nodes. If a node resolves to an IP on another network, its totem protocol messages will not reach the other nodes and it will not be able to join the cluster. Node names containing an underscore (_) may cause problems and are ill-advised.

The cluster will try to resolve the value here using the following methods, in order:

It looks up $HOSTNAME in cluster.conf.
If this fails, it strips the domain name from $HOSTNAME and looks that up in cluster.conf.
If this fails, it looks in cluster.conf for a fully-qualified name whose short version matches the short version of $HOSTNAME.
If all of this fails, it searches the interfaces list for an (IPv4 only) address that matches a name in cluster.conf.
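
To make the lookup order concrete, assume a node whose $HOSTNAME is an-node01.alteeve.com (a host name borrowed from the altname example on this page). The three entries below are alternatives, not a working set, and the comments are a reading of the steps above rather than tested behaviour. The an-node01.example.com name is hypothetical.

```xml
<!-- Would be found at the first step: identical to $HOSTNAME. -->
<clusternode name="an-node01.alteeve.com" nodeid="1" />

<!-- Would be found at the second step: $HOSTNAME with the domain name stripped. -->
<clusternode name="an-node01" nodeid="1" />

<!-- Would be found at the third step: a different fully-qualified name
     whose short version (an-node01) matches the short version of $HOSTNAME. -->
<clusternode name="an-node01.example.com" nodeid="1" />
```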

This attribute is required.

Q. Can this be the IP address of the interface the primary ring will run on instead of the hostname?

No default
Must match the name returned by uname -n and must be a valid host name.

Thanks to Christine Caulfield of Red Hat for linking to her paper that clarified this.

clusternode's nodeid attribute

This is a natural number used to reference the node. It must be unique among the nodes in your cluster.

This attribute is required.

Q. Is this true, or will it automatically generate an ID?

No default
Must be a unique natural number

clusternode's votes attribute

This tells the cluster how many votes a given node contributes towards quorum. This should be set to 1 unless you have a special case and understand the reasons for setting a higher value.

This attribute is required.

Q. Is this true, or will it default to '1'?

No default
Must be a natural number

clusternode's weight attribute

This sets the dlm locking weight.

Normally, when no node has a weight value set, the locking tasks are evenly distributed among the nodes in the cluster. By setting weights, you can set a given node to bear more of the locking responsibility. When this is set, the nodes with higher weights become known as "lock masters". If a non-master node fails, recovery will be faster. If a master node fails, then the locking load will be redistributed among the surviving nodes.

See man 8 dlm_controld for more information.

Default is 0
Must be 0 or a natural number
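
A sketch of clusternode entries using all four attributes described above. The host names and weight values are illustrative only. Going by the description above, the first node is given a higher weight so that it would carry more of the locking responsibility.

```xml
<clusternodes>
	<!-- Higher weight: per the text above, this node would bear more of the dlm locking work. -->
	<clusternode name="an-node01.alteeve.com" nodeid="1" votes="1" weight="1" />
	<!-- The documented default weight of 0, written out for clarity. -->
	<clusternode name="an-node02.alteeve.com" nodeid="2" votes="1" weight="0" />
</clusternodes>
```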

Tag; altname

This provides a means of setting up a second, redundant totem ring. The redundant ring lets the nodes keep communicating if the primary ring fails, which could happen if a main switch failed. Without this, should the primary network fail, your nodes would all separate and there would be no way to gain quorum, stopping all hosted services. In corosync or openais, this would be known as ring number 1.

At the time of this writing, totem rings will not automatically recover after a failure. Please see the note in rrp_mode for information on how to manually recover a ring.

altname's name attribute

This is the host name that resolves to the interface connected to the network that the redundant ring will communicate on. All other nodes must have their redundant ring operating on the same network.

Q. Can this be the IP address of the interface the redundant ring will run on instead of the hostname?

No default
Must be a host name that resolves to an IP of the interface the redundant ring will communicate on

altname's port attribute

This is the port used for the redundant ring communication.

Q. TCP and/or UDP? Must it be >1024? Is this the port that the multicast messages from the domU fence domain will come in on?

Default is ?
Must be a valid port number (1 to 65535)

altname's mcast attribute

This is the multicast address to use for the redundant ring communication.

Q. Are there similar recommendations or restriction as with the primary ring?

Unknown default
Must be a valid IPv4 or IPv6 multicast address

Example; altname

This is a typical configuration for a given node's redundant ring.

Note: This shows just enough of the configuration to give the altname entry context.

	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1">
			<altname name="an-node01-sn" port="5406" mcast="239.192.122.46" />
		</clusternode>
		<clusternode name="an-node02.alteeve.com" nodeid="2">
			<altname name="an-node02-sn" port="5406" mcast="239.192.122.46" />
		</clusternode>
	</clusternodes>

Tag; fence

One or more fence child tags may be defined per <clusternode ...>. These tell the fence daemon, via the cluster manager, which fence device(s) to use, what port or address the node is connected to and what fence action to take.

This is not used to configure the fence device itself. This tag is specifically for node-specific fence device configuration information.

The <fence> tag does not have any attributes. It contains one or more <method> tags.

Tag; method

A <method> tag tells the cluster manager what steps to take to accomplish a fence attempt. Usually there is only one step, like calling an IPMI device on a node and telling it to turn off. In some cases though, it is necessary to take multiple steps to accomplish a single fence attempt. A good example of this would be when using two addressable PDUs to fence either side of a redundant power supply. In this case, you would need to call off on both devices to ensure the node died before you could call on for either side to begin rebooting the node.

The <method> tag supports one attribute, name. It contains one or more <device> tags.

method's name attribute

This is a name or number meant to identify the method. It has no programmatic meaning and simply needs to be unique among the methods for the given node.

No default
Must be unique per node

Tag; device

This tag tells the cluster manager which fencedevice to call in order to perform the defined fence action against the node.

There is one required attribute, name, though many more attributes may be defined depending on the fence device in use. All attributes other than name will be passed to the fence agent associated with the appropriate fencedevice.

device's name attribute

This must match the name attribute of a fencedevice.

The cluster manager uses this to find out which physical (or logical) fence device to use, and in turn which fence agent to call. Any attributes beyond name will be passed to that fence agent. Generally this includes the port the node is connected to and what action to take. In the case of single-node fence devices, like IPMI, only the action is defined.

Please see fencedevices and individual fence devices for other attributes that may be used here.

No default
Must match the name of a <fencedevice...>

Examples; method

This example shows a primary IPMI fence method with a backup fence method using two addressable PDUs on a node with redundant power supplies.

Note: Just enough of the configuration file is shown to give context.

	<clusternodes>
		<clusternode name="node1" nodeid="1">
			<fence>
				<method name="ipmi1">
					<device name="ipmi1" action="reboot" /> 
				</method>
				<method name="pdu1">
					<device name="pdu1" port="11" action="off"/>
					<device name="pdu2" port="11" action="off"/>
					<device name="pdu1" port="11" action="on"/>
					<device name="pdu2" port="11" action="on"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>

In this example, if the IPMI fence method succeeded, then the cluster manager would stop there. However, if the IPMI fence failed, it would fall back on the second method using the PDUs.

fencedevices; Defining Fence Devices

This section is where any fence devices named in the clusternode section are defined.

There are no supported attributes. It contains zero or more <fencedevice> child tags. See man 8 fenced for more information.

Tag; fencedevice

This tag is used to define a given fence device. There must be a fencedevice entry with a name matching each device specified in the clusternode section.

There may be other valid attributes beyond the two listed below, depending on the fence agent being defined. Please refer to the documentation of each fence device for the extra attributes supported.

fencedevice's name attribute

This is the name of the fence device. This name is searched for by the fence daemon when it decides to fence a node. As such, it must match the name given to a device under the appropriate method for a given node.

No default
Must match the name of a device specified in a given clusternode's method
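
The name matching described above can be sketched as follows. The device name ipmi_an01, the method name and the address are hypothetical. What matters is that the name in the node's device tag and the name in the fencedevice tag are identical.

```xml
<clusternodes>
	<clusternode name="an-node01.alteeve.com" nodeid="1">
		<fence>
			<method name="ipmi">
				<!-- This name must equal the fencedevice name below. -->
				<device name="ipmi_an01" action="reboot" />
			</method>
		</fence>
	</clusternode>
</clusternodes>
<fencedevices>
	<!-- The remaining attributes are specific to the fence agent in use. -->
	<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="10.255.128.1" login="name" passwd="secret" />
</fencedevices>
```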

fencedevice's agent attribute

This is the name of the script that fenced will call and pass arguments to. Fence agents will generally be found in /sbin/ on RHEL/CentOS and in /usr/sbin/ on Fedora.

For details on how fence agents work, please see the FenceAgentAPI definition.

Q. Where is the directory to find these agents set? Compile-time option?

No default
Must either match the file name of the fence agent or be a full path and file name of the agent to use

Examples; fencedevice

This example shows two fence device configurations. The first is for an IPMI device and the second is for a Node Assassin. The additional attributes shown are specific to the given fence agents.

Note: Only enough of the configuration file is shown to give context.

	<fencedevices>
		<fencedevice name="ipmi1" agent="fence_ipmilan" ipaddr="10.255.128.1" login="name" passwd="secret" />
		<fencedevice name="na1" agent="fence_na" ipaddr="batou.alteeve.com" login="name" passwd="secret" quiet="1" />
	</fencedevices>

rm; The Resource Manager

The resource manager, also known as the resource group manager, handles clustered services.

When the cluster gains quorum, the resource manager brings clustered services online by starting them on their assigned nodes. In the event of a failure, the resource manager will wait until the node has been fenced and the cluster has been reconfigured. Once completed, it will recover the services that had been on the affected node(s).

On a node that has lost quorum, the resource manager will stop all clustered services. These services will not restart until quorum has been regained.

Key terms to be familiar with are resource tree, resource group and service. All three refer to a collection of resources defined by and contained in a <service> tag. For more information, please see Red Hat's article defining resource trees.

For more general information on resource groups, see man rgmanager.

log_level

This attribute is deprecated.

Please refer to logging; Global Logging Settings for configuring logging.

This sets the verbosity of messages sent to syslog. The following list indicates the priority of messages. Messages of level equal to or lower than the value set here will be logged. Messages from levels higher will be suppressed.

Log levels:
- 0 - System is unusable, emergency messages
- 1 - Action must be taken immediately
- 2 - Critical conditions
- 3 - Error conditions
- 4 - Warning conditions
- 5 - Normal but significant condition
- 6 - Informational
- 7 - Debug-level messages

Default is 5
Must be an integer between 0 and 7

log_facility

This attribute is deprecated.

Please refer to logging; Global Logging Settings for configuring logging.

This sets the syslog facility to use when sending messages to syslog.

Default is daemon
Valid values are auth, authpriv, daemon, cron, ftp, lpr, kern, mail, news, syslog, user, uucp, local0, local1, local2, local3, local4, local5, local6 and local7

status_child_max

This sets the maximum number of status check threads that may run on any given node at a time. Specifically, it controls how many instances of clustat queries may be outstanding.

Do not change this unless you have a good reason to and understand the effects of your change.

Default is 5
Must be a natural number

status_poll_interval

No info

Unknown default
Must be a natural number

transition_throttling

No info

Unknown default
Must be a natural number

central_processing

No info

Unknown default
Must be a natural number

Tag; failoverdomains

Fail-over domains are collections of nodes that a defined service may relocate to when the host node fails. They also control how, or if, a service is relocated as nodes return to the cluster.

The child tags, elements and attributes below this element are used to define preferred nodes, preferred pools of nodes known as domains and whether services are allowed to move off of preferred domains as a last resort. When a node from a preferred domain recovers, the service will always migrate back to the preferred domain. Beyond that though, you can configure whether a service returns to a preferred node once it has moved off or not.

More information

Tag; failoverdomain

This defines a given fail-over domain in your cluster.

name

This sets the name of the fail-over domain. It must be unique among failoverdomain elements.

No default
Unknown restrictions on the name

ordered

By default, a service will migrate to any node in the fail-over domain. If you set this to 1 (enabled), then the service will always migrate to the online node with the highest priority (lowest number value).

Default is 0 (unordered)
Must be 0 (unordered) or 1 (ordered)

Must be a natural number
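
As a sketch of how an ordered fail-over domain might look: the failoverdomainnode child tag and its priority attribute are not described in this section, so treat them as assumptions here, along with the domain name and host names.

```xml
<rm>
	<failoverdomains>
		<failoverdomain name="an_primary" ordered="1">
			<!-- With ordered="1", the online node with the lowest priority number is preferred. -->
			<failoverdomainnode name="an-node01.alteeve.com" priority="1" />
			<failoverdomainnode name="an-node02.alteeve.com" priority="2" />
		</failoverdomain>
	</failoverdomains>
</rm>
```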

Examples

Examples of CentOS 5 cluster.conf configurations.

Two Node configuration
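
For orientation only, the following is a minimal two-node skeleton assembled from the tags covered above. It is not the configuration this heading refers to. The cluster name, node names, fence device names, addresses and credentials are all hypothetical, and the cman two_node and expected_votes attributes are assumed from the cman section rather than described here. Validate any real file with xmllint as shown at the top of this page before use.

```xml
<?xml version="1.0"?>
<cluster name="an-cluster" config_version="1">
	<!-- Assumed two-node settings; see the cman section for details. -->
	<cman two_node="1" expected_votes="1" />
	<clusternodes>
		<clusternode name="an-node01.alteeve.com" nodeid="1" votes="1">
			<fence>
				<method name="ipmi">
					<device name="ipmi_an01" action="reboot" />
				</method>
			</fence>
		</clusternode>
		<clusternode name="an-node02.alteeve.com" nodeid="2" votes="1">
			<fence>
				<method name="ipmi">
					<device name="ipmi_an02" action="reboot" />
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<fencedevices>
		<!-- Each fencedevice name matches a device name used by a node above. -->
		<fencedevice name="ipmi_an01" agent="fence_ipmilan" ipaddr="10.255.128.1" login="name" passwd="secret" />
		<fencedevice name="ipmi_an02" agent="fence_ipmilan" ipaddr="10.255.128.2" login="name" passwd="secret" />
	</fencedevices>
</cluster>
```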

© Alteeve's Niche! Inc. 1997-2024		Anvil! "Intelligent Availability®" Platform
`legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.`