AN!Wiki :: High-Availability Clustering in the Open Source Ecosystem

Warning: This is not considered complete yet, and may contain mistakes and factual errors.

High-Availability Clustering in the Open Source Ecosystem

Two Histories, One Future

Prologue

Open-source high-availability clustering has a complex history that can cause confusion for new users. Many traces of this history are still found on the Internet, causing many new users to make false starts down deprecated paths in their early efforts to learn HA clustering.

The principal goal of this document is to provide clarity on the future of open source clustering. If it succeeds in this goal, future new users will be less likely to start down the wrong path.

In order to understand the future, we must start in the past. Documenting the past is always a tricky prospect and prone to error. Should you have insight that would help improve the completeness or accuracy of this document, please contact us at "docs@alteeve.ca". We will reply as quickly as possible and will improve this document with your help.

Two Types of Clusters; HA and HPC

There are two distinct types of "clusters": High-Availability clusters and High-Performance Compute clusters. The former, HA clusters, are designed to ensure the availability of services in spite of otherwise-catastrophic failures. The latter, HPC clusters, are designed to speed up the processing of problems beyond the abilities of any single server.

This document speaks specifically about the history and future of high-availability clusters.

A Brief History of Two Pasts and One Future

Two entirely independent attempts to create an open-source high-availability platform were started in the late 1990s. The earliest of these two stacks was SUSE's "Linux HA" project. The other stack would become Red Hat's "Cluster Services", though its genesis was spread across two projects: Mission Critical Linux and Sistina's "Global File System".

These two projects remained entirely separate until 2007 when, out of the Linux HA project, Pacemaker was born as a cluster resource manager that could take membership from and communicate via Red Hat's OpenAIS or SUSE's Heartbeat.

In 2008, an informal meeting was held where both SUSE and Red Hat developers discussed reusing some code. SUSE's main interest was CRM/pacemaker and Red Hat's main interest was OpenAIS. By this time, both companies had large installation bases. Both understood that this merger effort would take many years to complete to ensure that there was minimal impact on existing users.

In 2013, Red Hat announced full support for Pacemaker with the release of Red Hat Enterprise Linux 6.5. In RHEL 6, pacemaker still draws its quorum and membership from cman. Elsewhere, distributions that support corosync version 2 can draw their quorum and membership from it instead.

All new users to high availability clustering should now use "Pacemaker" for resource management and "corosync" for cluster membership, quorum and communications. Any existing users of other, older cluster software should now be actively developing a migration path to this new stack.

Necessary Vocabulary

There is some jargon in high-availability clustering. Understanding these terms should help those who are new to HA clustering understand the topic we will shortly discuss. Readers who are new to HA clustering would benefit from jumping down to the glossary section below before proceeding.

SUSE's Story; The Early Days of Linux HA

In 1998, Alan Robertson of Bell Labs (and later, IBM) created the "Linux HA" project and, with it, a new protocol called "Heartbeat". In 1999, a resource manager was built on top of this new protocol, producing a monolithic, two-node cluster stack. This provided membership, message passing, fencing and resource management in a single program. Robertson was hired in 2000 by S.u.S.E. Linux GmbH.

With the release of SUSE 7 in August of 2000, support for heartbeat began and continues to the present (2014). Around this same time, work began on the Open Cluster Framework, "OCF". The idea was to have standard APIs for membership and messaging as well, but only the resource agent calling conventions survived. Conectiva was an early contributor to Heartbeat.

Heartbeat "version 1" was limited to two nodes (the membership layer supported more nodes, but the resource management did not) and provided a basic resource management model, but it had the benefit of being very easy to use.

In 2003, SUSE's Lars Marowsky-Brée conceived of a new project called the "cluster resource manager", "crm", spurred on by the limitations of the heartbeat resource manager. Andrew Beekhof was hired by SUSE to implement this new component. The CRM was created as a new program that was dedicated to resource management and designed to support up to 16 nodes. The crm sat on top of heartbeat and used it for membership, fencing and cluster communication.

Red Hat's Cluster Services; A More Complicated Story

Where SUSE's Linux HA project was fairly linear, Red Hat's history is somewhat more involved. It was born out of two projects started by two different companies.

Mission Critical Linux

In 1999, a group of engineers, mostly former DEC employees, founded a company called “Mission Critical Linux”. The goal of this company was to create an enterprise-grade Linux distribution based around a high-availability cluster stack.

In 2000, Mission Critical Linux looked at the Linux HA project and decided that it did not suit their needs. They did, however, take STONITH from the Linux HA project and adapt it for their own. They built an HA platform called “Kimberlite” which supported two-node clusters.

Sistina Software's Global File System

Separately in 199?, a group of people from the University of Minnesota began work on a clustered file system they called “Global File System”. In 2000, this group started a company called Sistina Software.

Sistina's other notable contribution to Linux was the Logical Volume Manager. Although they did not develop version 1 of the software, which was the project of Heinz Mauelshagen, they took over at version 2 and made sure it could be clustered. So, although this was not initially a cluster project, it is one of the more important contributions to Linux and deserves to be noted.

In 2003, Red Hat purchased Sistina Software.

The Birth of Red Hat Cluster Services

In 2002, when MCL closed, Red Hat took their “Kimberlite” HA stack and used it to create “Red Hat Cluster Manager” version 1, called “clumanager” for short. This was introduced in Red Hat Advanced Server version 2.1 which itself became “Red Hat Enterprise Linux”.

Like the Linux HA project's Heartbeat program, cman was a monolithic program providing cluster communications, membership, quorum and resource management.

In 2003, with the acquisition of Sistina Software, GFS was introduced. Unlike traditional file systems, GFS was designed to coordinate access to a common file system from multiple nodes. It did this by way of a lock manager - originally GuLM, later DLM. In 20??, LVM would be extended to support DLM, allowing for multiple nodes to manage LVM resources on shared storage in a coordinated manner.

Around the same time, Kimberlite was largely rewritten, extending commercially supported cluster sizes to up to 8 nodes, though technically, clusters could have more nodes.

Another core component of Red Hat's cluster stack was Quorum Disk, called “qdisk” for short. It used SAN storage as a key component of intelligently determining what nodes had quorum, particularly in a failure state. It did this by having one or more quorum votes.

Should the cluster partition, qdisk could use heuristics to decide which partition would get its votes and thus have quorum. Through these heuristics, for example, qdisk could give its vote to a smaller partition if that partition still had access to critical resources, like a router.
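The heuristic-driven vote described above can be illustrated with a small sketch. This is not qdisk's actual implementation or configuration format. The `Partition` class, the `award_qdisk_votes` function and the "highest node count wins" tie-break are all hypothetical, chosen only to show how extra votes can make a smaller partition quorate.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical view of one side of a partitioned cluster."""
    name: str
    node_votes: int          # one vote per surviving node in this partition
    passes_heuristics: bool  # e.g. "can this partition still reach the router?"


def award_qdisk_votes(partitions, qdisk_votes, total_node_votes):
    """Return {partition name: is quorate} after the quorum disk votes.

    Assumption: the quorum disk gives all of its votes to a single
    partition that passes the heuristics (the largest one, if several do).
    """
    expected_votes = total_node_votes + qdisk_votes
    needed = expected_votes // 2 + 1  # simple majority of all expected votes

    candidates = [p for p in partitions if p.passes_heuristics]
    winner = max(candidates, key=lambda p: p.node_votes, default=None)

    result = {}
    for p in partitions:
        votes = p.node_votes + (qdisk_votes if p is winner else 0)
        result[p.name] = votes >= needed
    return result


# Three-node cluster with a two-vote quorum disk, split 1 / 2.
# Only the single-node partition can still reach the router.
small = Partition("small", node_votes=1, passes_heuristics=True)
large = Partition("large", node_votes=2, passes_heuristics=False)
outcome = award_qdisk_votes([small, large], qdisk_votes=2, total_node_votes=3)
print(outcome)
```

In this example the single node that still reaches the router ends up with three of five expected votes and is quorate, while the two-node partition is not.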

Fenced; The fence daemon used power or fabric fencing to forcibly power off, or logically disconnect from the (storage) network, any node that stopped responding.
Clustered LVM; This extended normal LVM to support shared storage inside the cluster. It was designed at the same time as the LVM2 codebase and originally had its own lock manager (though the latter was never released).

GuLM was the lock manager written by Sistina solely for GFS use. Shortly before Red Hat took over Sistina, the DLM "Distributed Lock Manager" was developed for use with cman. DLM is designed using the principles of the DEC OpenVMS lock manager to be a general purpose facility for locking and fully distributed around the cluster. It is also used by clvmd as well as GFS1 & GFS2.

Q. When and why was fencing split out?

It was originally part of lock_gulmd, but the architecture of DLM/cman required a stand-alone fence daemon.

Q. When did qdisk get added? Was it from MCL or was it part of the re-write?

It was created around 2004, using some work done by MCL and extended by Red Hat. Check git history.

Outlier; OpenAIS

The OpenAIS project was created as an attempt to build an open source implementation of the SA Forum's AIS API. This API is designed to provide a common mechanism for highly available applications and hardware.

OpenAIS was developed by Steven Dake, then working at MontaVista Software, as a personal project. Christine Caulfield drove the Red Hat adoption of OpenAIS.

In the open source high availability world, OpenAIS provided a very robust platform for multi-node cluster membership and communication. Principally, it managed who was allowed to be part of the cluster via “closed process groups” (CPG). It ensured that all nodes in the CPG received messages in the same order, and that all nodes confirmed receipt of a given message before moving on to the next message, in a process called “virtual synchrony”. This was implemented using multicast groups, allowing for the efficient distribution of messages to many nodes in the cluster.
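The two guarantees described above (closed membership, and ordered delivery with confirmation from every member before the next message) can be modelled in a few lines. This is a toy, single-process sketch and not how OpenAIS itself is implemented. The `Node` and `ClosedProcessGroup` classes and their methods are hypothetical names invented for illustration.

```python
class Node:
    """Hypothetical cluster node that records messages in delivery order."""

    def __init__(self, name):
        self.name = name
        self.delivered = []

    def receive(self, seq, message):
        self.delivered.append((seq, message))
        return True  # acknowledge receipt


class ClosedProcessGroup:
    """Toy model: only members may send, and every member must
    acknowledge a message before the next one is accepted."""

    def __init__(self, members):
        self.members = list(members)
        self.next_seq = 0

    def broadcast(self, sender, message):
        if sender not in self.members:
            raise PermissionError(f"{sender.name} is not in the group")
        seq = self.next_seq
        # Deliver to every member (including the sender) and collect acks.
        acks = [node.receive(seq, message) for node in self.members]
        if not all(acks):
            raise RuntimeError(f"message {seq} was not confirmed by all members")
        self.next_seq += 1
        return seq


an_a01, an_a02, outsider = Node("an-a01"), Node("an-a02"), Node("outsider")
group = ClosedProcessGroup([an_a01, an_a02])
group.broadcast(an_a01, "start service")
group.broadcast(an_a02, "service started")
print(an_a01.delivered == an_a02.delivered)
```

Because one sequence counter is shared and advanced only after all acknowledgements arrive, every member sees the same messages in the same order. A real implementation must also cope with lost messages and nodes that fail mid-delivery, which this sketch ignores.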

OpenAIS itself has no mechanism for managing services, which is why we do not consider it an HA project in its own right. It cannot start, stop, restart, migrate or relocate services on nodes. As we will see though, it played a significant role in early open source clustering, thus deserving immediate mention here.

First Contact

In 2004, a cluster summit was held in Prague, attended by both SUSE and Red Hat developers.

SUSE's Lars Marowsky-Brée presented on the CRM version 2. Pacemaker's Andrew Beekhof began looking into the viability of running the new CRM on OpenAIS.

Red Hat presented its cluster managers, cman and GuLM. It also presented DLM, GFS and a new independent fencing mechanism.

Red Hat decided to support most of the Linux HA project's OCF resource agent draft version 1.0 API.

With the release of Red Hat Enterprise Linux 5, the kernel based cman was replaced with user-space cman using OpenAIS.

Red Hat; 2004 to 2008

In 2005, cluster manager's resource management was split off and the Resource Group Manager, called “rgmanager”, was created. It sat on top of “cman” which continued to provide quorum and cluster communication and messaging.

Initially, Red Hat opted not to support OpenAIS, choosing instead to stick with their kernel-based cman. In 2006, however, support for OpenAIS was added. GuLM was deprecated and cman was reduced to providing membership and quorum for OpenAIS. OpenAIS became the cluster communication layer.

In 200?, Red Hat released RHEL 5, introducing Red Hat Cluster Services version 2. From a functional level, RHCS stable 2 changed minimally from the initial release of RHCS. OpenAIS was used as the cluster communication layer and qdisk became optional.

Note: When RHEL 6 was released, Red Hat switched to the name "High-Availability Add-On" and retired the "RHCS" name.

SUSE; 2004 to 2008: Heartbeat and Pacemaker

With the release of SLES 9 in 2004, SUSE, in partnership with Oracle, released OCFS2, the first cluster-aware file system merged into the upstream Linux kernel. The first version of OCFS was never publicly released. This first version did not include heartbeat integration.

On July 29, 2005, "Heartbeat version 2" was released, introducing the CRM. Version 2 introduced support for larger clusters, more complex resource management and support for OpenAIS as the membership and communication layer. Heartbeat introduced the CCM, the cluster consensus manager, an upgraded membership algorithm that gathered the list of agreed members and handled quorum.

In 2006, the release of SLES 10 brought support for heartbeat plus the new CRM. Along with the new CRM's raw XML configuration, the python-based hb_gui was introduced to simplify using the new, more powerful and thus complex, CRM. At the same time, SUSE introduced support for OCFS2 under heartbeat/CRM control as well as support for the short-lived clustered EVMS2, based on heartbeat/CCM integration.

Enter Pacemaker

Until Heartbeat version 2.1.3, the CRM was coupled tightly to Heartbeat, matching its versioning and release cycle. In 2007, it was decided to break the CRM out of the main Heartbeat package into a new project called "Pacemaker". It was extended to support both heartbeat and "OpenAIS" for cluster membership and communication. Core libraries were also split out of heartbeat into "cluster-glue"; the core infrastructure for local IPC, plugin loading, logging and so forth. This package originally included heartbeat's resource agents, fencing agents and the stonith daemon. This allowed these components to be used independently of heartbeat, including when pacemaker was run on top of OpenAIS. Quorum became a plug-in.

At this time, SUSE hired Dejan Muhamedagic who wrote the crm shell, an extremely popular command line tool for managing pacemaker clusters.

On January 17, 2008, Pacemaker version 0.6.0 was released, introducing support for OpenAIS. This remained the main version until the release of Pacemaker version 1.0 in 2010. Until then, working with Pacemaker required directly working with its "Cluster Information Base", CIB, an XML configuration and state file that was as powerful as it was user unfriendly. With the release of version 1.0, the CIB was abstracted away behind the "Cluster Resource Manager" called "crm". Support for Red Hat's "cman" quorum provider was also added.

On February 16, 2010, pacemaker 1.1 was released as version 1.1.1. It introduced resource-based service placement, an ability to serialize unrelated sets of resources and other features.

Exit Heartbeat

On June 17, 2009, SUSE released SLES 11, marking the first SLES version to ship with Pacemaker on OpenAIS instead of heartbeat. As part of the SLES10 to SLES11 transition, the SLES cluster stack was split into its own product extension, which included the crm shell. In this release, OCFS2 was based on controld that utilized OpenAIS and certain SAF AIS modules like checkpoint (CKPT), the in-kernel file system and DLM code which was shared with GFS2. By this time, clustered EVMS2 was no longer actively maintained upstream, and was thus replaced with clustered LVM2, which is based on OpenAIS.

With the release of SLES 11 SP1 on June 2, 2010, SLES switched from OpenAIS to Corosync, except for OCFS2 which still required the wire-protocol-compatibility of OpenAIS's CKPT. At this time, SLES introduced Hawk, a web-based front-end for pacemaker that used crmsh in the background. With the release of SLES 11 SP3 on July 1, 2013, the cluster-glue LRM was dropped in favour of Pacemaker's internal LRM.

The Heartbeat project reached version 3, which marked something of a feature reversion. The CRM of heartbeat 2 was removed and version 1 style resource management was restored. Heartbeat 3 can still be used instead of corosync under Pacemaker. That said, this is largely discouraged. Heartbeat development has ceased and there are no plans to restart development.

It is worth noting that LinBit, the company behind the very popular DRBD project, still provides commercial support for Heartbeat based clusters and still maintains the existing code base. However, its user base is declining, though it has some fans who love its simplicity. The flexibility that comes with Pacemaker certainly raises the barrier to entry for many users.

With no plans to restart development, heartbeat as a project should be considered "deprecated" and its use is generally discouraged now.

2008; Merge Begins

In 2008, an informal meeting was held in Prague between SUSE and Red Hat HA cluster developers.

At this meeting, it was decided to work towards reusing some code from each project. Some details were decided quickly, others would take many years. Red Hat's principal focus was OpenAIS and the Linux HA project's principal interest was Pacemaker.

One of the early changes was to strip out core functions from OpenAIS and create a new project called “Corosync”, first released in 2009. OpenAIS became an optional plug-in for corosync, should a future project wish to implement the entire AIS API.

Corosync itself was designed to be OpenAIS, simplified to just the core functions needed by HA clustering. Both pacemaker and Red Hat's cluster stack would go on to adopt it. Red Hat Enterprise Linux 6.0 would introduce Red Hat cluster services version 3. RHCS became an optional RHEL Add-On called “High-Availability Add-On”. It separated GFS2 as an optional Add-On called “Resilient Storage”. Users who purchased RS got the HA add-on as well.

In 2010, Pacemaker added support for cman, allowing pacemaker to take quorum and membership from cman. Pacemaker quickly became the most popular resource manager thanks to its good support across multiple Linux distributions and its powerful, flexible resource management. Red Hat's “rgmanager”, despite being very stable, was less flexible and its popularity waned.

With the release of Red Hat Enterprise Linux 6.0, pacemaker was introduced under Red Hat's “Technology Preview” program. Between RHEL 6.0 and 6.4, Pacemaker underwent rapid development. Support was added for Red Hat's fence agents. Benefits of the heartbeat project's resource agents were back-ported to Red Hat's resource agents. Pacemaker version 1.1.10, released with RHEL 6.5, added support for Red Hat style fence methods and fence ordering.

2014; Here and Forward

Now we have a decent understanding of where we are. It's time to discuss where things are going.

Wither the Old

Red Hat builds its Red Hat Enterprise Linux releases on the “upstream” Fedora community's releases. Red Hat is not obliged to follow Fedora, but examining it is often a good indication of what the next RHEL will look like.

The Red Hat “cluster” package, which provided “cman” and “rgmanager”, came to an end in Fedora 16 at version 3.2. Corosync reached version 2 and with that, it became a quorum provider. As such, both “cman” and “rgmanager” development ceased. DLM, still used by GFS2 and clustered LVM, was split out into a stand-alone project.

Red Hat has committed to supporting and maintaining cman and rgmanager through the life of RHEL 6, which is scheduled to continue until 2020. As such, users of Red Hat's traditional cluster stack will be supported for years to come. The writing does appear to be on the wall though, and cman and rgmanager are not expected to be included with RHEL 7.

In a similar manner, LinBit still supports the Heartbeat package. They have not announced an end to support, and they have diligently supported it for some years now. Like Red Hat's “cluster” package though, it has not been actively developed in some time and there are no plans to restart it.

A Word of Warning

All users of Red Hat's cluster suite and SUSE/LinBit's heartbeat packages should be making migration plans. Readers starting new projects should not use either of these projects.

The Future! Corosync and Pacemaker

If the predictions of RHEL 7 are correct, it will mark the end of the merger of the two stacks. From then until the foreseeable future, all open source, high availability clusters will be based on corosync version 2 or higher and pacemaker version 1.1.10 or higher.

Much simpler!

All other components, such as gfs2, clvmd, resource agents and fence agents, will be simple additions, based on the needs of the user.

A Short Discussion on Management Tools

Thus far, this paper has focused on the core components of the two, now one, HA stacks.

There are a myriad of tools out there designed to simplify the use and management of these clusters. Trying to create an exhaustive list would require its own paper, but a few key ones are worth discussing.

Pacemaker Tools

With the release of Pacemaker 1.0, the “crm shell” called “crmsh” was introduced. It was built specifically to configure and manage pacemaker, abstracting away the complex XML found in the “cib.xml” file. It is a very mature tool that most existing pacemaker users are well familiar with.

As part of the adoption of pacemaker by Red Hat, a new tool called the “Pacemaker Configuration System”, or “pcs”, was introduced with the release of RHEL 6.3. Its goal is to provide a single tool to configure pacemaker, corosync and other components of the cluster stack.

Red Hat Tools

Red Hat's “cman” and “rgmanager” based clusters can be configured directly by working with the “cluster.conf” XML configuration file. Alternatively, Red Hat has a web-based management tool called “Luci”, which can run on any machine in or outside the cluster. It makes use of programs called “ricci” and “modclusterd” for manipulating the cluster's configuration and pushing changes to other nodes in the cluster.

The 'luci' program will be replaced by pcsd, which uses a REST-like API, in RHEL version 7 and beyond.

Thanks

This document would not exist without the help of:

Andrew Beekhof - Red Hat, Pacemaker creator
Lon Hohberger - Red Hat, Software Engineer
Fabio M. Di Nitto - Red Hat, Supervisor, Software Engineering
Steven Dake - Red Hat
Christine Caulfield - Red Hat
Lars Marowsky-Brée - SUSE, Distinguished Engineer

Glossary

This is not an attempt at a complete glossary of terms one will come across while learning about HA clustering. It is a short list of core terms and concepts needed to understand topics covered in this paper.

Fencing

When a node fails in an HA cluster, it enters an “unknown state”. The surviving node(s) can not simply assume that it has failed. To do so would be to risk a split-brain condition, even with the use of quorum. Clustered services that had been running on the lost node are likewise in an unknown state, so they can not be safely recovered until the host node is put into a known state.

The mechanism of putting a node into a known state is called “fencing”. Its fundamental role is to ensure that the lost node can no longer provide its clustered services or access clustered resources. There are two effective fencing methods: Fabric Fencing and Power Fencing.

In Fabric fencing, the lost node is isolated. This is generally done by forcibly disconnecting it from the network and/or shared storage.

In Power fencing, the lost node is forced to power off. It is crude but effective.
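The rule that services cannot be recovered until the lost node is in a known state can be sketched as follows. This is an illustration only and not the API of any real fence daemon or fence agent. The function names, the node name and the idea of trying methods in a fixed order until one succeeds are assumptions made for the example.

```python
def fence_node(node, methods):
    """Try each (name, callable) fence method in order.

    Each callable is a hypothetical stand-in for a real fence agent and
    returns True only when it has confirmed the node is fenced.
    Returns the name of the method that worked, or None.
    """
    for name, method in methods:
        if method(node):
            return name
    return None


def recover_services(node, services, methods):
    """Return the services that may be recovered from the lost node."""
    used = fence_node(node, methods)
    if used is None:
        # The node is still in an unknown state, so recovery is blocked.
        return []
    return list(services)


def power_fence(node):
    return False  # pretend the power device did not respond


def fabric_fence(node):
    return True   # pretend the switch confirmed the port was disabled


methods = [("power", power_fence), ("fabric", fabric_fence)]
print(fence_node("an-a02", methods))
print(recover_services("an-a02", ["vm01", "webserver"], methods))
```

The important design point is the early return: when no fence method confirms success, nothing is recovered, because guessing that the node is dead is exactly what risks a split-brain.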

Quorum

Quorum is, at its most basic, “simple majority”. It provides a mechanism that a node can use to determine if it is allowed to provide clustered resources. In its basic form, “quorum” is determined by dividing the number of nodes in a cluster by two, adding one and then rounding down. For example, a five node cluster, divided by two, is “2.5”. Adding one is “3.5” and then rounded down is “3”. So in a five-node cluster, three nodes must be able to reach each other in order to be “quorate”.

A node must be “quorate” in order to start and run clustered services. If a node is “inquorate”, it will refuse to start any cluster resources and, if it was running any already, will shut down its services as soon as quorum is lost.

Quorum can not be used on basic two-node clusters.
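The arithmetic above is small enough to write out directly. The sketch below assumes one vote per node, and the function names are made up for illustration. It also shows the likely reason quorum is of no use on a basic two-node cluster: the threshold is two, so a lone surviving node can never be quorate.

```python
def quorum_threshold(total_nodes):
    """Divide the node count by two, add one, then round down."""
    return total_nodes // 2 + 1


def is_quorate(reachable_nodes, total_nodes):
    """A partition is quorate when it can reach at least the threshold."""
    return reachable_nodes >= quorum_threshold(total_nodes)


print(quorum_threshold(5))  # five-node cluster from the example above
print(is_quorate(3, 5))     # three nodes that can reach each other
print(is_quorate(2, 5))     # the other two nodes
print(quorum_threshold(2))  # two-node cluster
print(is_quorate(1, 2))     # the survivor after its peer is lost
```

For whole numbers, integer division by two followed by adding one gives the same result as the "divide, add one, round down" wording in the text.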

Split-Brain

In HA clustering, the most dangerous scenario is a “split-brain” condition. This is where a cluster splits into two or more “partitions” and each partition independently provides the same HA resources at the same time. Quorum helps prevent split-brain conditions, though it is not a perfect solution.

Split-brain conditions are ultimately only avoidable with proper fencing use. Should a split-brain occur, data divergence, data loss and file system corruption become likely.

STONITH

In the Linux HA project, the term “stonith” is used to describe fencing. It is an acronym for “shoot the other node in the head”, a term that particularly references power fencing. Please see “Fencing” for more information.

Any questions, feedback, advice, complaints or meanderings are welcome.
© Alteeve's Niche! Inc. 1997-2024		Anvil! "Intelligent Availability®" Platform
`legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.`
