
HA Controller Swiftlet

Introduction

The High Availability Controller Swiftlet is responsible for synchronizing the ACTIVE and STANDBY HA instances. It maintains a replication channel between them, uses heart beat messages to detect a failed ACTIVE instance, and initiates and controls the failover process.

It performs the following tasks:

Configurable Entities

The following entities can be configured:

Negotiation Timeout

When an HA instance is started and its last saved HA state is not STANDALONE, it waits for a connection with the other HA instance to negotiate its state. If the "Negotiation Timeout" is reached and no connection with the other HA instance has been established, the HA instance starts up and switches to state STANDALONE if the last saved state was STANDBY or ACTIVE (it continues waiting if the last saved state was UNKNOWN). If the other HA instance is then started and its last saved state was also STANDALONE, there is a consistency problem, because two HA instances in state STANDALONE are not allowed. Therefore, one HA instance is immediately and automatically shut down with an exception. See the "Problem Handling" section for how to resolve this.

The default for the negotiation timeout is 120000 ms (2 minutes). If you start both HA instances and start the STANDBY first, you have this amount of time to start the other HA instance. To avoid trouble, always start the ACTIVE/STANDALONE instance first. To find out which HA instance is ACTIVE or STANDALONE, check the file haspool/<instance>/ha.state.
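The startup rule described above can be sketched as a small decision function for the case where no connection with the other instance has been established. This is only an illustration: the class, enum and method names are hypothetical and not part of the Swiftlet; only the state names and the 120000 ms default come from the text.

```java
// Sketch only: all names here are hypothetical, not SwiftMQ API.
final class NegotiationSketch {
    enum HaState { UNKNOWN, STANDALONE, ACTIVE, STANDBY }
    enum Decision { START_IMMEDIATELY, KEEP_WAITING, SWITCH_TO_STANDALONE }

    // Default taken from the documentation: 120000 ms (2 minutes).
    static final long DEFAULT_NEGOTIATION_TIMEOUT_MS = 120_000L;

    /**
     * Decides what an instance does while no peer connection exists.
     * lastSaved is the state read from ha.state, waitedMs the time waited so far.
     */
    static Decision decide(HaState lastSaved, long waitedMs, long timeoutMs) {
        // A STANDALONE instance does not wait for negotiation at all.
        if (lastSaved == HaState.STANDALONE) {
            return Decision.START_IMMEDIATELY;
        }
        // Timeout not yet reached: keep waiting for the other instance.
        if (waitedMs < timeoutMs) {
            return Decision.KEEP_WAITING;
        }
        // After the timeout, UNKNOWN keeps waiting; ACTIVE/STANDBY go STANDALONE.
        if (lastSaved == HaState.UNKNOWN) {
            return Decision.KEEP_WAITING;
        }
        return Decision.SWITCH_TO_STANDALONE;
    }
}
```

For example, `decide(HaState.STANDBY, 120_000L, DEFAULT_NEGOTIATION_TIMEOUT_MS)` yields `SWITCH_TO_STANDALONE`, while the same call with `HaState.UNKNOWN` yields `KEEP_WAITING`.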

Preferred Active

One of the two HA instances can be declared as the "Preferred Active" instance. This makes sense if you have a slower machine for the STANDBY, which should only take over operation when the ACTIVE fails. In that case, mark the HA instance on your primary server as "Preferred Active". If there is a failover, the STANDBY becomes the ACTIVE instance, and when the former ACTIVE comes back up, it becomes the STANDBY. If that instance has the "Preferred Active" flag set, a new failover is automatically initiated to switch ACTIVE and STANDBY. Hence, if two HA instances are running, the ACTIVE instance will always be the one with the "Preferred Active" flag set.

A failover caused by a "Preferred Active" setting is indicated on System.out at the ACTIVE instance:

        +++ STANDBY is preferred ACTIVE instance: Failover in 10 seconds ...

The ACTIVE instance will then initiate a new failover to switch ACTIVE and STANDBY.
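The "Preferred Active" rule amounts to a simple check on the ACTIVE instance. The sketch below is a hypothetical illustration, not the Swiftlet's code; the names are invented, and the 10 second delay is taken only from the System.out message shown above.

```java
// Sketch only: hypothetical names, not SwiftMQ API.
final class PreferredActiveSketch {
    // Delay announced in the System.out message ("Failover in 10 seconds ...").
    static final int ANNOUNCED_DELAY_SECONDS = 10;

    /**
     * Evaluated on the ACTIVE instance: a switch is only needed when a STANDBY
     * is connected and it, rather than the current ACTIVE, carries the flag.
     * The activeIsPreferred guard is an assumption for the misconfigured case
     * in which both instances have the flag set.
     */
    static boolean shouldInitiateFailover(boolean standbyConnected,
                                          boolean activeIsPreferred,
                                          boolean standbyIsPreferred) {
        return standbyConnected && standbyIsPreferred && !activeIsPreferred;
    }
}
```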

Replication Channel

The replication channel is the network connection between both HA instances. It consists of a network listener on one HA instance and a network connector on the other instance. It does not matter which instance hosts the listener and which hosts the connector.

The connection is validated by heart beat messages which are sent at intervals (default 2000 ms, 2 sec) from both sides of the connection. The "Heart Beat Missing Threshold" defines the number of missed heart beat messages after which the connection is closed and the appropriate procedure is initiated (e.g. a failover or a switch to STANDALONE). Heart beats can be missed, especially during synchronization when a large store is transferred. It is also possible that the ACTIVE instance is shut down while the network connection at the STANDBY is still alive. This is TCP related and called a "half-open socket". In that case it takes at most 20 seconds (2000 ms x 10) before the failover is initiated.
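The relationship between the heart beat interval, the missing threshold and the worst-case detection time can be illustrated with a small counter. This is a minimal sketch with hypothetical names; the threshold value of 10 is inferred from the "2000 ms x 10" calculation above and is an assumption about the default.

```java
// Sketch only: hypothetical names, not SwiftMQ API.
final class HeartBeatMonitorSketch {
    private final long intervalMs;
    private final int missingThreshold;
    private int missed;

    HeartBeatMonitorSketch(long intervalMs, int missingThreshold) {
        this.intervalMs = intervalMs;
        this.missingThreshold = missingThreshold;
    }

    /** Worst-case time until a dead peer is detected: interval x threshold. */
    long worstCaseDetectionMs() {
        return intervalMs * missingThreshold;
    }

    /** Any received heart beat resets the miss counter. */
    void onHeartBeatReceived() {
        missed = 0;
    }

    /**
     * Called once for every interval that passed without a heart beat.
     * Returns true when the connection should be closed.
     */
    boolean onIntervalWithoutHeartBeat() {
        missed++;
        return missed >= missingThreshold;
    }
}
```

With an interval of 2000 ms and a threshold of 10, `worstCaseDetectionMs()` returns 20000, matching the 20 seconds stated above.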

"Maximum Packet Size" defines the maximum size of the replication packets. The replication channel gets its input from the spool that is filled by the replication tunnels. This data is sent as replication packets over the replication channel. A replication packet is filled up to the maximum size or until the spool is empty. The maximum size should correlate with the router input/output buffers of the network listener and connector. Default is 128 KB.
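The packet filling rule ("up to the maximum size or until the spool is empty") can be sketched as follows. The queue of byte arrays merely stands in for the spool, and all names are hypothetical. How an entry larger than the maximum packet size is handled is not stated in the text; the sketch assumes it is sent alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch only: hypothetical names, not SwiftMQ API.
final class PacketFillSketch {
    // Default from the documentation: 128 KB.
    static final int DEFAULT_MAX_PACKET_SIZE = 128 * 1024;

    /**
     * Drains spool entries into one packet until adding the next entry would
     * exceed maxPacketSize or the spool is empty.
     * Assumption: a single oversized entry is still sent, alone in its packet.
     */
    static List<byte[]> nextPacket(Queue<byte[]> spool, int maxPacketSize) {
        List<byte[]> packet = new ArrayList<>();
        int size = 0;
        while (!spool.isEmpty()) {
            byte[] head = spool.peek();
            if (!packet.isEmpty() && size + head.length > maxPacketSize) {
                break; // packet is full; the entry stays in the spool
            }
            packet.add(spool.poll());
            size += head.length;
        }
        return packet;
    }
}
```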

Spool

The spool works as a buffer between the replication tunnels and the replication channel. It has a memory cache whose size is specified in the attribute "Maximum Cache Size". Default is 5 MB. When the spool grows beyond that size, it swaps to disk into the directory defined in the attribute "Spool Directory". This swap can become quite large during synchronization of ACTIVE and STANDBY, because the whole persistent store of the ACTIVE HA instance (if the Replicated File Store is used) is sent to the spool. You should therefore ensure that the spool directory has enough disk space.

The HA instance will also save its HA state in a file "ha.state" in the spool directory. This file is read on startup to determine the last saved state of this instance. If you delete it, the HA instance will start up in state UNKNOWN.

Static Configuration Entities

The following configuration entities are static and must never be changed.

Configuration Controller

The Swiftlet contains a configuration controller which is responsible for replicating configuration changes from ACTIVE to STANDBY. To do so, it registers on all entities and properties of the management tree of the ACTIVE HA instance, except those defined in the "Replication Exclude" list. This list contains instance-local elements. The "Property Substitutions" list contains properties that are substituted by another property during replication. For example, the property "bindaddress" of JMS listeners is substituted with "bindaddress2" and vice versa.
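The effect of a property substitution can be pictured as a two-way name mapping that is applied when a change is replicated. The following sketch is purely illustrative: the class and method names are hypothetical, and only the "bindaddress"/"bindaddress2" pair comes from the text.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: hypothetical names, not SwiftMQ API.
final class PropertySubstitutionSketch {
    private final Map<String, String> substitutions = new HashMap<>();

    /** Registers a pair that is swapped in both directions ("and vice versa"). */
    void addPair(String first, String second) {
        substitutions.put(first, second);
        substitutions.put(second, first);
    }

    /** Name under which a changed property is applied on the other instance. */
    String targetProperty(String sourceProperty) {
        return substitutions.getOrDefault(sourceProperty, sourceProperty);
    }
}
```

After `addPair("bindaddress", "bindaddress2")`, a change to "bindaddress" on the ACTIVE would be applied to "bindaddress2" on the STANDBY in this model, while properties without a substitution keep their name.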

Replication Tunnels

A replication tunnel is a generic tunnel consisting of a source at the ACTIVE HA instance and a sink at the STANDBY HA instance. A replication tunnel is identified by a "Tunnel Address". Currently there are 3 tunnels defined: one for the configuration controller, one for the queues to replicate JMS message IDs for duplicate message detection, and one for the store. Each tunnel is versioned to ensure Continuous Availability. The supported versions are contained in the attribute "Protocol Versions".

Threadpool Freezes

When the STANDBY connects to the ACTIVE HA instance, it must get a consistent snapshot of the ACTIVE instance. The router must stop for a short moment to generate such a snapshot. This is done by freezing its thread pools. While the thread pools are frozen, no activity occurs and a snapshot can be taken. Thread pools are frozen in HA state ACTIVE_SYNC_PREPARE. After all pools have reported their frozen state, the HA state changes to ACTIVE_SYNC. The respective Swiftlets then create a snapshot of their state and publish it to the spool, from where it is sent to the STANDBY HA instance over the replication channel. At the end, all pools are unfrozen, the ACTIVE HA instance turns into HA state ACTIVE, and the STANDBY HA instance into state STANDBY.

The list "Threadpool Freezes" contains the thread pools which must be frozen/unfrozen in exactly the order in which they are defined in the list.
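The synchronization sequence on the ACTIVE instance can be summarized as an ordered trace of steps. The sketch below only records the order of events as described in this section; it does not freeze real thread pools, and all names apart from the HA states are hypothetical. Both freezing and unfreezing follow the configured list order, as the text states.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical names, not SwiftMQ API.
final class SyncSequenceSketch {
    /**
     * Returns the ordered steps of one synchronization run for the given
     * thread pools, listed in their "Threadpool Freezes" order.
     */
    static List<String> run(List<String> poolsInConfiguredOrder) {
        List<String> trace = new ArrayList<>();
        trace.add("state ACTIVE_SYNC_PREPARE");
        for (String pool : poolsInConfiguredOrder) {
            trace.add("freeze " + pool);
        }
        // All pools have reported their frozen state.
        trace.add("state ACTIVE_SYNC");
        trace.add("snapshot to spool, then to STANDBY via replication channel");
        for (String pool : poolsInConfiguredOrder) {
            trace.add("unfreeze " + pool);
        }
        trace.add("state ACTIVE");
        return trace;
    }
}
```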