
NAME

OpenMosix::HA -- High Availability (HA) layer for an openMosix cluster

SYNOPSIS

  use OpenMosix::HA;

  my $ha = new OpenMosix::HA;

  # start the monitor daemon 
  $ha->monitor;

DESCRIPTION

This module provides the basic functionality needed to manage resource startup and restart across a cluster of openMosix machines.

This gives you a high-availability cluster with low hardware overhead. In contrast to traditional HA clusters, we use the openMosix cluster membership facility, rather than hardware serial cables or extra ethernet ports, to provide heartbeat and to detect network partitions.

All you need to do is build a relatively conventional openMosix cluster, install this module on each node, and configure it to start and manage your HA processes. You do not need the relatively high-end server machines which traditional HA requires. There is no need for chained SCSI buses (though you can use them) -- you can instead share disks among many nodes via any number of other current technologies, including SAN, NAS, GFS, or Firewire (IEEE-1394).

Commercial support is available for OpenMosix::HA as well as for openMosix clusters and related products and services: see "SUPPORT".

QUICK START

See http://www.Infrastructures.Org for cluster management techniques, including clean ways to install, replicate, and update nodes.

To use OpenMosix::HA to provide high availability for processes hosted on an openMosix cluster, work through the sections below: install the module on each node, describe your processes and resource groups in "/var/mosix-ha/cltab", list the resource groups and their target runlevels in "/var/mosix-ha/hactl", and start the mosha monitor on each node, normally via a "respawn" entry in /etc/inittab.

INSTALLATION

Use Perl's normal sequence:

  perl Makefile.PL
  make
  make test
  make install

You'll need to install this module on each node in the cluster.

This module includes a script, "mosha", which will be installed when you run 'make install'. See the output of perl -V:installscript to find out which directory the script is installed in.
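For example, on a typical installation the output looks something like this (the directory shown is illustrative; it will vary with your Perl installation):

  installscript='/usr/bin';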

CONCEPTS

See "CONCEPTS" in Cluster::Init for more discussion of basic concepts used here, such as resource group, high availability cluster, and high throughput cluster.

Normally, a high-throughput cluster computing technology is orthogonal to the intent of high availability, particularly if the cluster supports process migration, as in openMosix. When ordinary openMosix nodes die, any processes migrated to or spawned from those nodes will also die. The higher the node count, the more frequently these failures are likely to occur.

If the goal is high availability, then node failure in an openMosix cluster presents two problems: (1) All processes which had migrated to a failed node will die; their stubs on the home node will receive a SIGCHLD. (2) All processes which had the failed node as their home node will die; their stubs will no longer exist, and the migrated processes will receive SIGKILL.

Dealing with (1) by itself might be easy; just use the native UNIX init's "respawn" to start the process on the home node. Dealing with (2) is harder; you need to detect death of the home node, then figure out which processes were spawned from there, and restart them on a secondary node, again with a "respawn". If you also lose the secondary node, then you need to restart on a tertiary node, and so on. And managing /etc/inittab on all of the nodes would be an issue; it would likely need to be both dynamically generated and different on each node.

What's really needed is something like "init", but one which acts cluster-wide, using a single replicated configuration file, providing both respawn for individual dead processes and migration of entire resource groups away from dead home nodes. That's what OpenMosix::HA does.

If processes are started via OpenMosix::HA, any processes and resource groups which fail due to node failure will automatically restart on other nodes. OpenMosix::HA detects node failure, selects a new node out of those currently available, and deconflicts the selection so that two nodes don't restart the same process or resource group.

There is no "head" or "supervisor" node in an OpenMosix::HA cluster -- there is no single point of failure. Each node makes its own observations and decisions about the start or restart of processes and resource groups.

You can build OpenMosix::HA clusters of dissimilar machines -- any given node only needs to provide the hardware and/or software to support a subset of all resource groups. OpenMosix::HA is able to test a node for eligibility before attempting to start a resource group there -- resource groups will "seek" the nodes which can support them.

IO fencing (the art of making sure that a partially-dead node doesn't continue to access shared disk or other resources) can be handled as it is in conventional HA clusters: by exclusive device logins when using Firewire, or by distributed locks when using GFS or a SAN.

In the Linux HA community, simpler, more brute-force methods for IO fencing are also used, involving network-controlled powerstrips or X10 controllers. These methods are usually termed STOMITH or STONITH -- "shoot the other machine (or node) in the head". OpenMosix::HA provides a callback hook which can be used to trigger these external STOMITH actions.

RESOURCE GROUP LIFECYCLE

Each OpenMosix::HA node acts independently, while watching the activity of others. If any node sees that a resource group is not running anywhere in the cluster, it attempts to start the resource group locally by following the procedure described here. The following discussion is from the perspective of that local node.

The node watches all other nodes in the cluster by consolidating /mfs/*/var/mosix-ha/clstat into the local "/var/mosix-ha/hastat". It then ensures that each resource group configured in "/var/mosix-ha/cltab" is running somewhere in the cluster, at the runlevel specified in "/var/mosix-ha/hactl".
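As an illustration only (this is not the module's actual code), the consolidation step amounts to collecting each node's status file over MFS:

  # illustrative sketch of the consolidation described above
  for my $clstat (glob "/mfs/*/var/mosix-ha/clstat") {
      my ($node) = $clstat =~ m{^/mfs/([^/]+)/};  # node number, from the MFS path
      # ...merge this node's status into the local /var/mosix-ha/hastat
  }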

If a resource group is found to be not running anywhere in the cluster, then the local OpenMosix::HA will attempt to transition the resource group through each of the following runlevels on the local node, in this order:

  plan
  test
  start (or whatever is named in hactl)
  stop  (later, at shutdown)

The following is a detailed discussion of each of these runlevels.

plan

Under normal circumstances, you should not create a 'plan' runlevel entry in "/var/mosix-ha/cltab" for any resource group. This is because 'plan' is used as a collision-detection phase and is intended to be a no-op; anything you run at the 'plan' runlevel will be run on multiple nodes simultaneously.

When starting a resource group on the local node, OpenMosix::HA will first attempt to run the resource group at the 'plan' runlevel. If there is a 'plan' runlevel in "/var/mosix-ha/cltab" for this resource group, then OpenMosix::HA will execute it; otherwise, it will just set the runlevel to 'plan' in its own copy of "/var/mosix-ha/clstat".

After several seconds in 'plan' mode, OpenMosix::HA will check other nodes, to see if they have also started 'plan' or other activity for the same resource group.

If any other node shows 'plan' or other activity for the same resource group during that time, then OpenMosix::HA will conclude that there has been a collision, "stop" the resource group on the local node, and pause for several seconds.

The "several seconds" described here is dependent on the number of nodes in the cluster and a collision-avoidance random backoff calculation.

test

You should specify at least one 'test' runlevel, with runmode also set to 'test', for each resource group in "/var/mosix-ha/cltab". This entry should test the resource group's prerequisites, and its command should exit with a non-zero return code if the test fails.

For example, if /usr/bin/foo requires the 'modbar' kernel module, then the following entries in "/var/mosix-ha/cltab" will do the job:

  foogrp:foo1:test:test:/sbin/modprobe modbar
  foogrp:foo2:start:respawn:/usr/bin/foo

...in this example, modprobe will exit with an error if 'modbar' can't be loaded on this node.

If a 'test' entry fails, then OpenMosix::HA will conclude that the node is unusable for this resource group. It will discontinue startup, and will cleanup by executing the "stop" entry for the resource group.

After a 'test' has failed and the resource group stopped, another node will typically detect the stopped resource group within several seconds, and execute "plan" and "test" again there. This algorithm continues, repeating as needed, until a node is found that is eligible to run the resource group. (For large clusters with small groups of eligible nodes, this could take a while. I'm considering adding a "preferred node" list in hactl to shorten the search time.)

start

After the 'test' runlevel passes, and if there are still no collisions detected, then OpenMosix::HA will start the resource group, using the runlevel specified in "/var/mosix-ha/hactl".

This runlevel is normally called 'start', but could conceivably be any string matching /[\w\d]+/; you could use a numerical runlevel, a product or project name, or whatever fits your needs. The only other requirement is that the string you use must be the same as whatever you used in "/var/mosix-ha/cltab".
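For example, to run a resource group at a runlevel named 'production' (the group, tag, and command names here are hypothetical), you would put this in "/var/mosix-ha/hactl":

  webgrp production

...and matching entries in "/var/mosix-ha/cltab":

  webgrp:web1:test:test:/usr/local/bin/webcheck
  webgrp:web2:production:respawn:/usr/local/bin/webserver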

stop

If you issue a "shutdown", then OpenMosix::HA will transition all resource groups to the 'stop' runlevel. If there is a 'stop' entry for the resource group in "/var/mosix-ha/cltab", then it will be executed.

You do not need to specify a 'stop' entry in "/var/mosix-ha/cltab"; you can specify one if you'd like to do any final cleanup, unmount filesystems, etc.
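For example, a hypothetical cleanup entry might look like this (the names are illustrative, and the 'wait' runmode is an assumption here; see Cluster::Init for the runmodes it supports):

  foogrp:foo9:stop:wait:/bin/umount /export/foo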

METHODS

new()

Loads Cluster::Init, but does not start any resource groups.

Accepts an optional parameter hash which you can use to override module defaults. Defaults are set for a typical openMosix cluster installation. Parameters you can override include:

mfsbase

MFS mount point. Defaults to /mfs.

mynode

Mosix node number of the local machine. You should only override this for testing purposes.

varpath

The local path under / where the module should look for the hactl and cltab files, and where it should put clstat and clinit.s; this is also the subpath where it should look for these things on other machines, under /mfsbase/NODE. Defaults to var/mosix-ha.

timeout

The maximum age (in seconds) of any node's clstat file, after which the module considers that node to be stale, and calls for a STOMITH. Defaults to 60 seconds.

mwhois

The command to execute to get the local node number. Defaults to "mosctl whois". This command must print some sort of string on STDOUT; a /(\d+)/ pattern will be used to extract the node number from the string.

stomith

The CODE reference to execute when a machine needs to be STOMITHed. The node number will be passed as the first argument. Defaults to an internal function which just prints "STOMITH node N" on STDERR.
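Putting these together, a constructor call which overrides a few defaults might look like the following sketch (the values shown are illustrative only):

  use OpenMosix::HA;

  my $ha = new OpenMosix::HA (
      mfsbase => "/mfs",             # MFS mount point (the module default)
      varpath => "var/mosix-ha",     # config/status subpath (the module default)
      timeout => 120,                # illustrative: treat nodes as stale after 2 minutes
      mwhois  => "mosctl whois",     # command which prints the local node number
      stomith => sub { warn "STOMITH node $_[0]\n" },  # illustrative callback
  );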

monitor()

Starts the monitor daemon. Does not return.

The monitor does the real work for this module; it ensures the resource groups in "/var/mosix-ha/cltab" are each running somewhere in the cluster, at the runlevels specified in "/var/mosix-ha/hactl". Any resource groups found not running are candidates for a restart on the local node.

Before restarting a resource group, the local monitor announces its intentions in the local clstat file, and observes clstat on other nodes. If the monitor on any other node also intends to start the same resource group, then the local monitor will detect this and cancel its own restart. The checks and restarts are staggered by random times on various nodes to prevent oscillation.

See "CONCEPTS".

UTILITIES

mosha

OpenMosix::HA includes mosha, a script which is intended to be started as a "respawn" entry in each node's /etc/inittab. It requires no arguments.
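For example, an /etc/inittab entry might look like this (the id field and the path to mosha are assumptions; use the directory reported by perl -V:installscript):

  MH:2345:respawn:/usr/bin/mosha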

This is a simple script; all it does is create an OpenMosix::HA object and call the "monitor" method on that object.
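In other words, mosha amounts to little more than the following sketch:

  #!/usr/bin/perl -w
  use strict;
  use OpenMosix::HA;

  # create the HA object and run the monitor daemon; monitor() does not return
  my $ha = new OpenMosix::HA;
  $ha->monitor;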

FILES

/var/mosix-ha/cltab

The main configuration file; describes the processes and resource groups you want to run in the cluster.

See "/etc/cltab" in Cluster::Init for the format of this file -- it's the same file; OpenMosix::HA tells Cluster::Init to place cltab under /var/mosix-ha instead of /etc. For a configured example, see t/master/mfs1/1/var/mosix-ha/cltab in the OpenMosix::HA distribution.

See "RESOURCE GROUP LIFECYCLE" for runmodes and entries you should specify in this file; specifically, you should set up at least one 'test' entry and one 'start' entry for each resource group.

You do not need to replicate this file to any other node -- OpenMosix::HA will do it for you.

/var/mosix-ha/hactl

The HA control file; describes the resource groups you want to run, and the runlevels you want them to execute at. See the "CONCEPTS" paragraph about the "start" runlevel. See t/master/mfs1/1/var/mosix-ha/hactl for an example.

You do not need to replicate this file to any other node -- OpenMosix::HA will do it for you.

The format is one resource group per line, whitespace delimited; a '#' marks a comment:

  # resource_group  runlevel
  mygroup start
  foogroup start
  bargroup 3
  bazgroup 2
  # missing or commented means 'stop' -- the following two 
  #    lines are equivalent:
  redgrp stop
  # redgrp start

/var/mosix-ha/hastat

The cluster status file. Rebuilt periodically on each node by consolidating /mfs/*/var/mosix-ha/clstat. Each node's version of this file normally matches the others. Interesting to read; can be eval'd by other Perl processes for building automated monitoring tools.
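For example, a monitoring script might load it like this (a sketch only; the layout of the resulting data structure is not documented here):

  # slurp and eval the consolidated cluster status
  open(my $fh, "<", "/var/mosix-ha/hastat") or die "hastat: $!";
  my $hastat = eval do { local $/; <$fh> };
  die $@ if $@;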

/var/mosix-ha/clstat

The per-node status file; see "/var/run/clinit/cltab" in Cluster::Init. Not very interesting unless you're troubleshooting OpenMosix::HA itself -- see /var/mosix-ha/hastat instead.

BUGS

The underlying module, Cluster::Init, has a Perl 5.8 compatibility problem, documented there; fix targeted for next point release.

Quorum counting accidentally counts nodes that are up but not running OpenMosix::HA; easy fix, to be done in next point release.

This version currently spits out debug messages every few seconds.

No test cases for monitor() yet.

Right now we don't detect or act on errors in cltab.

At this time, mosha is a very minimal script which just gets the job done, and probably will need some more work once we figure out what else it might need to do.

SUPPORT

Commercial support for OpenMosix::HA is available at http://clusters.TerraLuna.Org. On that web site, you'll also find pointers to the latest version, a community mailing list, and other cluster management software.

You can also find help for general infrastructure (and cluster) administration at http://www.Infrastructures.Org.

AUTHOR

        Steve Traugott
        CPAN ID: STEVEGT
        stevegt@TerraLuna.Org
        http://www.stevegt.com

COPYRIGHT

Copyright (c) 2003 Steve Traugott. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

SEE ALSO

Cluster::Init, openMosix.Org, qlusters.com, Infrastructures.Org