High Quality JMS Messaging.

Keepalive Mechanism

If a routing connection is established and one side disconnects, the other side (the other router) might not detect it. This is a property of TCP rather than a problem in SwiftMQ. To detect these invalid connections, SwiftMQ uses a keepalive mechanism on routing connections, which works as follows:

If a connector of one router connects to a listener of another router, both sides pass through several stages (connect, authentication, recovery, etc.) until they reach the final delivery stage, in which the actual message exchange takes place. The keepalive mechanism starts at the beginning of the delivery stage. The earlier stages are protected by timers, which ensure that a stage does not get stuck waiting for a reply that will never arrive because of a disconnect during that stage. If a timer fires while its stage is still waiting, the connection is closed.

The time after which such an "early" stage becomes invalid is specified by the attribute "stage-valid-timeout". The default is 15 seconds. Another attribute, "reject-disconnect-delay", specifies the time after which a rejected connection (e.g. router name already connected, authentication failed) is closed. The two attributes depend on each other. When a connector connects to a listener, it initializes a timer with the value of "stage-valid-timeout". This ensures that the stage becomes invalid if no reply arrives. Say the router name is already connected, so the listener rejects the connection and sends a negative reply. The listener then initializes a timer with the value of "reject-disconnect-delay", and the connection is physically closed when that timer fires.
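
The stage timer rule can be illustrated with a small sketch. This is not SwiftMQ source code: the class StageGuard and its methods are hypothetical names, and time is passed in explicitly as milliseconds so the behavior is easy to follow. It only models the rule described above, namely that a stage is disconnected if its timer fires while it is still waiting for a reply.

```java
// Hypothetical sketch of a stage timer; names are illustrative, not SwiftMQ API.
final class StageGuard {
    // Default of "stage-valid-timeout" as stated in the text: 15 seconds.
    static final long DEFAULT_STAGE_VALID_TIMEOUT_MILLIS = 15_000L;

    private final long deadlineMillis;
    private boolean waiting = true;

    StageGuard(long startMillis, long stageValidTimeoutMillis) {
        this.deadlineMillis = startMillis + stageValidTimeoutMillis;
    }

    // The peer answered, so the stage is no longer in waiting mode.
    void onReply() {
        waiting = false;
    }

    // True if the timer has fired and the stage is still waiting for a reply.
    boolean shouldDisconnect(long nowMillis) {
        return waiting && nowMillis >= deadlineMillis;
    }
}
```

With the default timeout, a guard started at time 0 reports no disconnect at 14999 ms and a disconnect at 15000 ms, unless onReply() was called in between.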

The keepalive mechanism consists of keepalive messages which are sent from both sides of the connection at the interval specified in the attribute "keepalive-interval" of the routing listener. This value is passed to the connector during the connect stage. Each side has a counter which is initialized to 5. When a keepalive message is received, the counter is incremented. When a keepalive message is sent, the counter is decremented. If the counter reaches 0, the connection is probably invalid and is physically closed.
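
The counter rule can be sketched as follows. This is an illustration under stated assumptions, not the SwiftMQ implementation: the class KeepaliveCounter and its method names are hypothetical, and the sketch follows the text literally (increment on receive, decrement on send, close at 0). Whether the real counter has an upper bound is not stated in the text, so none is modeled here.

```java
// Hypothetical sketch of the keepalive counter; not SwiftMQ source code.
final class KeepaliveCounter {
    // Initial counter value stated in the text.
    static final int INITIAL_VALUE = 5;

    private int counter = INITIAL_VALUE;

    // A keepalive message arrived from the other side.
    void onReceived() {
        counter++;
    }

    // This side sent a keepalive message.
    // Returns true if the counter reached 0 and the connection should be closed.
    boolean onSent() {
        counter--;
        return counter <= 0;
    }

    int value() {
        return counter;
    }
}
```

If the other side is silent, five consecutive calls to onSent() bring the counter from 5 to 0, and the fifth call reports that the connection should be closed. On a healthy connection, sends and receives roughly alternate and the counter stays near its initial value.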

A routing connection is considered valid until the keepalive counter reaches 0. However, it might already be disconnected at one end. That router then tries to reconnect, and the attempt is rejected because a given router can only be connected once. The disconnected router is able to reconnect only after the stale routing connection has been closed by the keepalive mechanism. Therefore, the maximum time a router has to wait until it is able to reconnect is 5 * <keepalive-interval>. The attribute "keepalive-interval" of the routing listener can be set to a lower value to shorten this time.
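
As a worked example of this bound, the following sketch computes the maximum wait for two interval values. The class and method names are hypothetical, and the intervals of 60 and 10 seconds are example values chosen for illustration, not documented defaults.

```java
import java.time.Duration;

// Illustrative arithmetic only; names are hypothetical, not SwiftMQ API.
final class ReconnectWait {
    // Initial keepalive counter value given in the text.
    static final int KEEPALIVE_COUNTER_START = 5;

    // Upper bound on how long a stale connection can block a reconnect.
    static Duration maxWait(Duration keepaliveInterval) {
        return keepaliveInterval.multipliedBy(KEEPALIVE_COUNTER_START);
    }

    public static void main(String[] args) {
        // Example intervals, not documented defaults.
        System.out.println(maxWait(Duration.ofSeconds(60)).getSeconds()); // 300
        System.out.println(maxWait(Duration.ofSeconds(10)).getSeconds()); // 50
    }
}
```

Lowering the interval from 60 to 10 seconds reduces the worst-case wait from 300 to 50 seconds, at the cost of more frequent keepalive messages.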