

Chapter 2

Redundant Signaling Subsystem Architecture


2.1 Definitions and Terminology
2.2 Reference Configurations
2.2.1 Single-Node Configuration
2.2.2 Dual-Node Configuration
2.2.3 Standalone Configuration
2.3 Software Architecture
2.3.1 SS7 Message Transfer Part (MTP)
2.3.2 SS7 ISUP
2.3.3 SS7 SCCP
2.3.4 SS7 TCAP
2.3.5 TX Monitor (txmon)
2.3.6 Health Management Interface (HMI)
2.3.7 Health Management API (HM-API)
2.4 Functional Description
2.4.1 Board State Model
2.4.2 Redundant Signaling Subsystem Initialization
2.4.3 Failure Detection and Recovery
2.5 Hot Swap Support
2.6 Configuration and Management
2.6.1 Configuration Utilities and APIs
2.6.2 Control, Status, and Statistics
2.6.3 Alarms

2.1 Definitions and Terminology

A pair of signaling boards deployed in a redundant signaling configuration is known as a mated pair.

A dedicated high speed Ethernet link, called the inter-board communication (IBC) link, connects a mated pair of boards. The IBC link allows the boards to exchange signaling messages and state information.

At any particular point in time, one board in a mated pair is designated the primary board. The primary board handles all live signaling traffic for the pair (even though active signaling links may be terminated on both boards). The other board in the pair is designated as the backup board. The backup board maintains state information so that it can take over operation from the primary board when needed.

Conveying state information from the primary board to the backup board, or from an application to the backup board (for ISUP), is known as checkpointing. A checkpoint defines a known state from which the backup board will begin operating when taking over from the primary board.

If the boards in a mated pair are unable to communicate over the IBC link, the boards are said to be isolated or in isolation. The primary board is still operational during isolation, but it cannot checkpoint state information to its backup and may be running at reduced capacity.

Reversing the roles of the boards in a mated pair - i.e., switching the backup board to primary mode and vice versa - in response to a failure or for maintenance purposes is known as a switchover (also known as a failover).

2.2 Reference Configurations

The following sections describe the capabilities and operation of the signaling and health management subsystems in terms of two reference configurations: a single-node configuration and a dual-node configuration.

In these reference configurations, the signaling application is referred to as a signaling server, providing service to one or more signaling clients. The client-server model illustrated here is a common architecture for distributed call-processing applications but others are possible. The choice of application model is strictly up to the system designer.

A third reference configuration, a non-redundant standalone configuration, is also described briefly.

2.2.1 Single-Node Configuration

A single-node configuration employs two TX boards in a single node (chassis) for board-level redundancy, as shown in Figure 1. In this configuration a single application, if desired, can monitor and control the primary and backup boards and perform all call processing functions. This is the simplest migration path from an existing non-redundant call processing application to a redundant signaling subsystem.

Figure 1. Single Node Reference Configuration


A single-node redundant signaling subsystem can survive both signaling link and board failures without a service outage. In addition, one board may be taken out of service at a time for upgrade or reconfiguration without impacting the service provided by the application.

2.2.2 Dual-Node Configuration

A dual node configuration (shown in Figure 2) employs two chassis, each with a single TX board for signaling. The dual node reference model assumes a signaling server application which, like the TX boards, operates in a primary/backup manner. The primary and backup server applications can communicate call states, if desired, through an application-specific interprocess communication (IPC) mechanism.

Figure 2. Dual Node Reference Configuration


A dual node configuration has all the reliability attributes of the single node configuration but can also survive a failure or planned outage (e.g. for upgrade or reconfiguration) of an entire node without a service outage. The cost of this added reliability is in the increased complexity of the server application(s). In the dual node configuration, monitoring and control of the boards must be shared between applications on each node. If active calls are to be maintained across an outage, call state information must also be exchanged between nodes.

2.2.3 Standalone Configuration

A standalone configuration consists of a single non-redundant signaling board in a single node. In this configuration, the health management subsystem is used primarily to monitor the board for failures and take corrective action, such as reloading the failed board and/or notifying maintenance personnel.

A standalone configuration does not have the availability properties of a redundant configuration, but the health management subsystem can still be a valuable tool for quickly detecting failures and minimizing the duration of the service outage that results.

2.3 Software Architecture

Figure 3 illustrates the functional components and information flows in the software architecture model. The figure shows a dual node system although the components for a single node system are similar. Figure 3 is followed by a brief description of each module and its information flows relating to high availability.

Figure 3. Software Architecture

2.3.1 SS7 Message Transfer Part (MTP)

The SS7 MTP layers provide the physical signaling link termination, data link control, message routing, and network management functions for the signaling node(s). In a redundant configuration, active signaling links can be terminated on both boards. That is, both boards may be active up through MTP layer 2. All traffic received on either board is forwarded to the MTP 3 layer on the primary board for processing. Likewise, all outgoing traffic is routed to the primary board by the application for delivery. The primary MTP 3 distributes outgoing traffic across all available links on both boards.
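The distribution of outgoing traffic across links on both boards can be pictured with a small model. This is a sketch only: the link names and the `distribute` function are hypothetical, and simple round-robin is used here for illustration, whereas real MTP 3 load sharing is based on the signaling link selection (SLS) field.

```python
def distribute(messages, links):
    """Assign each outgoing message to an available link, round-robin.

    `links` maps a (hypothetical) link name to True if the link is available.
    """
    available = [name for name, up in links.items() if up]
    if not available:
        raise RuntimeError("no signaling links available")
    return [(msg, available[i % len(available)]) for i, msg in enumerate(messages)]


# Links terminated on both boards of the mated pair; one link is down.
links = {"board1/link0": True, "board1/link1": False, "board2/link0": True}
plan = distribute(["m0", "m1", "m2"], links)
```

In this model the failed link is simply skipped, which mirrors the idea that the primary MTP 3 uses whatever links remain available on either board.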

Changes in the status of signaling links and routes are checkpointed by the primary MTP 3 layer to the backup MTP 3 layer so that the backup is in the correct state in the event that it must become the primary.

2.3.2 SS7 ISUP

The SS7 ISUP layer provides for the establishment, supervision, and clearing of all circuit-switched connections. In a redundant configuration, the primary board handles all live traffic. The backup ISUP layer remains in a state ready to assume control when needed. In order to preserve active calls in the case of a failure of the primary board, the call processing application may checkpoint updates of circuit states to the backup ISUP layer (through the standard ISUP API) as calls progress or as circuits become blocked and unblocked.
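As a rough illustration of application-driven checkpointing, the sketch below keeps a local circuit table and forwards every state change to the backup. The class, the `send_checkpoint` callback, and the state names are assumptions made for this example; the actual checkpoint primitive is part of the ISUP API and is not reproduced here.

```python
class CheckpointingCircuitTable:
    """Application-side circuit table that mirrors changes to the backup."""

    def __init__(self, send_checkpoint):
        self._states = {}
        # Hypothetical callback standing in for the ISUP checkpoint request.
        self._send_checkpoint = send_checkpoint

    def set_state(self, cic, state):
        if self._states.get(cic) == state:
            return  # no change, nothing to checkpoint
        self._states[cic] = state
        self._send_checkpoint(cic, state)


# A dictionary stands in for the backup ISUP layer's view of the circuits.
backup_view = {}
table = CheckpointingCircuitTable(lambda cic, state: backup_view.__setitem__(cic, state))
table.set_state(5, "ANSWERED")
table.set_state(7, "BLOCKED")
table.set_state(5, "IDLE")
```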

The ISUP distribution package includes a sample call processing application, isupdemo, which illustrates the checkpointing of circuit states in various situations. For more information on isupdemo, see Appendix C.

2.3.3 SS7 SCCP

SS7 SCCP provides services for routing non-circuit related traffic, including Global Title translations. In a redundant configuration, the primary node will receive all live SCCP traffic. The backup SCCP will be ready to take over in case of a primary failure or switchover. The SCCP task checkpoints relevant routing information without any application involvement.

2.3.4 SS7 TCAP

SS7 TCAP provides services for non-circuit related messaging, often destined for databases such as Local Number Portability lookups. In a redundant configuration, the primary node will receive all live TCAP traffic. The backup TCAP will be ready to take over in case of a primary failure or switchover. The TCAP task is configured to checkpoint some or all transactions. This is done automatically without any application responsibilities.

The TCAP distribution package includes a sample redundant application, tcapdemo, which illustrates application behavior during transactions and switchovers. For more information on tcapdemo, see Appendix D.

2.3.5 TX Monitor (txmon)

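The TX Monitor (txmon) is a board-resident task that provides the health management functions on the TX boards. As described elsewhere in this chapter, its functions include establishing communication with the mate board after download (section 2.4.2), detecting software failures on the board and reporting them to the HMI (section 2.4.3), and generating alarms relating to the state of the board (section 2.6.3).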

2.3.6 Health Management Interface (HMI)

The HMI is a host-based service (Windows NT) or daemon process (UNIX) that provides the actual execution of control functions requested by applications using the HM API. It supports multiple user applications and distributes asynchronous board events to all registered user applications. It also continuously monitors all configured boards in order to detect board failures.

2.3.7 Health Management API (HM-API)

The HM API provides a function-call library for applications to monitor and control the state of TX boards on their local machine. It provides primitives to download a board, halt a board, retrieve the current state of a board, and set a board into primary or backup state.

In addition, user applications may register to receive asynchronous events indicating changes in the state of a board. These events include notifications that a board has failed, been downloaded or halted, set into the primary or backup states, or has become connected to or isolated from its mate board.
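The registration model, in which every registered application receives each asynchronous board event, resembles a simple publish/subscribe hub. The sketch below is a toy model of that idea; `EventHub` and its methods are hypothetical and are not the HM API registration calls.

```python
class EventHub:
    """Toy model of event fan-out to all registered applications."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def publish(self, board, event):
        # Every registered application sees every event, including the
        # application whose request caused it.
        for callback in list(self._callbacks):
            callback(board, event)


hub = EventHub()
seen_a, seen_b = [], []
hub.register(lambda board, event: seen_a.append((board, event)))
hub.register(lambda board, event: seen_b.append((board, event)))
hub.publish(1, "HMI_EVN_NOWPRIMARY")
```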

The RMG sample application, included in the HMI distribution package, illustrates the use of the HM API. The RMG application performs the role of the management application shown in Figure 3. For more information, see Appendix B.

2.4 Functional Description

The following sections describe how the components interoperate to provide a redundant system.

2.4.1 Board State Model

For health management purposes, each board in a redundant pair is in one of several states, as described in the following table. Boards change state as a result of application commands issued through the health management API or other external events, such as hardware or software failures on the board.

State Name    Description
----------    -----------
Starting      Initial state of each board immediately after download, waiting for command from application to be primary or backup.
Primary       The board is active, and is the primary member of a redundant board pair.
Backup        The board is active, and is the backup member of a redundant board pair.
Shutdown      Reserved for future use.
Failed        The board is not operational due to a hardware or software failure, or has been halted by an application. Application may attempt to reload the board.
Standalone    The board is not equipped or not licensed for redundant operation and is running as a standalone signaling board.
Stopped       The board has been extracted.
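An application that tracks board states may find it convenient to model them explicitly. In the sketch below, the state names come from the table above, but the set of allowed transitions is an assumption inferred from the procedures described in this chapter (download, switchover, failure, reload, extraction); it is not a definition taken from the HMI service.

```python
from enum import Enum


class BoardState(Enum):
    STARTING = "Starting"
    PRIMARY = "Primary"
    BACKUP = "Backup"
    SHUTDOWN = "Shutdown"      # reserved for future use
    FAILED = "Failed"
    STANDALONE = "Standalone"
    STOPPED = "Stopped"


S = BoardState

# Assumed transitions, inferred from the text rather than specified by it.
ALLOWED = {
    S.STARTING: {S.PRIMARY, S.BACKUP, S.FAILED, S.STOPPED},
    S.PRIMARY: {S.BACKUP, S.FAILED, S.STOPPED},
    S.BACKUP: {S.PRIMARY, S.FAILED, S.STOPPED},
    S.STANDALONE: {S.FAILED, S.STOPPED},
    S.FAILED: {S.STARTING, S.STANDALONE, S.STOPPED},   # reload or extraction
    S.STOPPED: {S.STARTING, S.STANDALONE},             # insertion and reload
    S.SHUTDOWN: set(),
}


def can_transition(old, new):
    """Return True if the assumed model permits moving from old to new."""
    return new in ALLOWED[old]
```

A check such as `can_transition(S.STANDALONE, S.PRIMARY)` returns False in this model, reflecting that a standalone board does not take part in a mated pair.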

2.4.2 Redundant Signaling Subsystem Initialization

The signaling subsystem initialization phase involves downloading and configuring each board, setting it into the appropriate state (either primary or backup), and binding the applications and SS7 layers together.

Board download is initiated by an application issuing an hmiLoadBoard request through the HM API. Once a board download (including configuration) is initiated, an HMI_EVN_LOADING event is sent to all applications registered with the HM API, including the application that initiated the download. Once the download is complete and the board is ready for operation, the application receives an HMI_EVN_STARTING event (or an HMI_EVN_STANDALONE event, if the board is not in a redundant configuration).

The HMI service cannot currently detect certain types of board download failures. Applications should therefore start a timer when initiating a download and treat its expiry before the HMI_EVN_STARTING (or HMI_EVN_STANDALONE) event arrives as a failed download. A suitable timer duration ranges from about five seconds for a normal-sized configuration to ten seconds or longer for a very large configuration.
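One way to implement such a download timer is sketched below. It assumes, purely for illustration, that the application's event callback places event names on a `queue.Queue`; how events are actually delivered by the HM API is not shown here, and `wait_for_download` is a hypothetical helper.

```python
import queue
import time

# Events that indicate the download has completed (names from this chapter).
DONE_EVENTS = ("HMI_EVN_STARTING", "HMI_EVN_STANDALONE")


def wait_for_download(events, timeout_s):
    """Return the completion event name, or None if the timer expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None
        try:
            event = events.get(timeout=remaining)
        except queue.Empty:
            return None
        if event in DONE_EVENTS:
            return event
        # Other events, such as HMI_EVN_LOADING, are ignored while waiting.


events = queue.Queue()
events.put("HMI_EVN_LOADING")
events.put("HMI_EVN_STARTING")
result = wait_for_download(events, timeout_s=5.0)
```

A return value of None would be treated as a failed download, at which point the application could retry the load or halt the board.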

After download, each board is initially in the STARTING state, waiting for an hmiPrimary or hmiBackup command. In the STARTING state, the various protocol tasks may be configured and BIND requests may be honored, but no links are enabled and no actual data traffic is accepted. During this time the txmon process attempts to establish communication with its mate board.

The determination of which board should be primary and which should be backup is application specific. Once this is determined, the application issues an hmiPrimary [hmiBackup] command to each board, as appropriate, through the HM API. Once the command is accepted, an HMI_EVN_NOWPRIMARY [HMI_EVN_NOWBACKUP] event is sent to all applications registered with the HM API, including the application that initiated the request.

During this period each application also binds to its service provider layer through the appropriate API call (i.e., call processing applications bind to the ISUP layer, direct MTP 3 user applications bind to the MTP 3 layer, etc.). On a single node signaling subsystem, the application typically binds to its service provider on both the primary and backup boards. On a dual node signaling subsystem, the primary application binds to the primary board and the backup application binds to the backup board.

Upon a successful bind, the service user (e.g., application) is notified of the board status - primary or backup - via a status indication event. The service user must wait for the "now primary" status indication event before starting data traffic. This event always precedes any incoming data traffic being delivered to the service user and signals that normal data transfer may begin in either direction.
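The rule that traffic must not start before the "now primary" indication can be enforced with a simple gate in the service user. The class and the indication strings below are placeholders chosen for this sketch; the real status indication values are defined by each SS7 layer's API.

```python
class ServiceUser:
    """Toy service user that refuses to send until told it is primary."""

    def __init__(self):
        self.role = None
        self.sent = []

    def on_status(self, indication):
        # Called when a status indication arrives from the service provider.
        self.role = indication

    def send(self, message):
        if self.role != "NOW_PRIMARY":
            return False  # not yet primary: do not start data traffic
        self.sent.append(message)
        return True


user = ServiceUser()
before = user.send("IAM")        # rejected, no status indication yet
user.on_status("NOW_PRIMARY")
after = user.send("IAM")         # accepted
```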

2.4.3 Failure Detection and Recovery

This section distinguishes the different types of failures possible, and the methods by which the system recovers from them.

Signaling Link Failures

Signaling link failures are handled completely within the MTP layers. An application is not informed of a signaling link failure unless the failure leaves a concerned destination unreachable (in which case the application receives a PAUSE event for that destination) or the application/user part has explicitly registered for link status events.

Likewise, signaling link recovery is automatic and transparent to MTP user parts and applications unless it results in a previously unreachable destination becoming reachable or the application/user part has explicitly registered for link status events.

Signaling Board Failures

A signaling board failure is detected by the HMI service on the local signaling node. A failure can be a software failure on the board, detected by the txmon process and reported to the HMI, or a hard failure such that the HMI loses communication with the board. Both are reported to registered applications as board failures so that recovery can take place. Two different recovery scenarios are distinguished: failure of the primary board and failure of the backup board.

Upon receipt of an HMI event indicating a failure of the primary board, the application typically initiates a switchover to the backup signaling board by issuing an hmiPrimary command through the HM API. Call processing applications (or applications using other SS7 signaling services) must then wait for the NOW PRIMARY status indication from their service provider(s) before resuming data traffic.

Once the switchover has been initiated, an application may initiate reloading of the failed board with the hmiLoadBoard command. Once the download has completed (HMI_EVN_STARTING event received), the application can set the reloaded board into the backup state with the hmiBackup command. At that point any SS7 service applications must re-bind to their service providers. Any failed signaling links terminated on the reloaded board are automatically activated by the primary MTP 3.

After the reload and rebind, the TCAP and SCCP tasks will automatically resynchronize with the primary TX board. The backup is then ready to take over operation. The backup ISUP layer will consider all circuits to be idle. At this point it is recommended that the application re-synchronize the backup by checkpointing all non-idle circuit states via the ISUP API. The recovered board is then ready to take over the role of primary when needed.

If the board fails to reload cleanly (the HMI_EVN_STARTING event is not received within a reasonable time period), as might be the case with a true hardware failure, the board should be halted with the hmiHaltBoard command. Manual intervention is then required to recover the failed board.
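The recovery sequence for a failed primary board can be summarized in a short sketch. The `FakeHmi` class below is a stand-in that only records which commands would be issued; its method names and arguments are assumptions, not the real HM API signatures, and the rebind and resynchronization steps are indicated only by a comment.

```python
class FakeHmi:
    """Stand-in that records commands; not the real HM API."""

    def __init__(self, reload_ok=True):
        self.reload_ok = reload_ok
        self.calls = []

    def primary(self, board):
        self.calls.append(("hmiPrimary", board))

    def backup(self, board):
        self.calls.append(("hmiBackup", board))

    def load_board(self, board):
        self.calls.append(("hmiLoadBoard", board))

    def halt_board(self, board):
        self.calls.append(("hmiHaltBoard", board))

    def wait_for_starting(self, board, timeout_s):
        # A real application would wait for HMI_EVN_STARTING with a timer.
        return self.reload_ok


def recover_primary_failure(hmi, failed, mate, timeout_s=10.0):
    hmi.primary(mate)              # switch over to the former backup
    hmi.load_board(failed)         # attempt to reload the failed board
    if hmi.wait_for_starting(failed, timeout_s):
        hmi.backup(failed)         # reloaded board becomes the new backup
        # Applications would now re-bind and re-checkpoint circuit states.
        return "reloaded-as-backup"
    hmi.halt_board(failed)         # reload failed: manual intervention needed
    return "halted"


hmi = FakeHmi(reload_ok=True)
outcome = recover_primary_failure(hmi, failed=1, mate=2)
```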

Failure of a backup board is detected and reported in the same fashion as the primary board. The board is typically reloaded (if possible) and set into the BACKUP state. Applications must re-bind with their service providers. Any failed signaling links terminated on the reloaded board are automatically activated by the primary MTP 3.

After a backup with TCAP and/or SCCP is brought back into service, the backup is ready to take over, since TCAP and SCCP automatically resynchronize with the primary TX board.

Signaling Node Failures

Detection of signaling node failures in a dual-node configuration is completely application specific. No monitoring of the host or application status is done by the signaling subsystem. Recovery scenarios are similar to the failed board recovery scenarios described above.

When a primary signaling node fails, it is up to an application on the backup node to detect the failure and set the backup board into primary operation with the hmiPrimary command. During the outage of the primary node, messages arriving on signaling links terminated on the backup node are queued if possible, waiting for the switchover. If the traffic load is too heavy or the failure detection and recovery take too long, the links may be placed in a local processor outage state and the queued messages may be lost.

Following the restoration of the failed signaling node, the signaling board in the failed node is reloaded and placed into backup state. Again the sequence is the same as the recovery of a failed board described above, except that some additional synchronization is required between the applications on the primary and backup nodes in order to convey changes in circuit status that occurred while the failed node was unavailable. This synchronization is completely application specific.

Recovery of a failed backup signaling node is similar to the recovery of a failed backup board described in the previous section. No disruption of signaling traffic is expected in this case. If the failure is a total failure (the primary board detects the failure of the backup board), signaling links terminating on the failed board are declared failed until the backup board is restored. Blocking or resetting of any voice circuits that may have been terminated on the failed node is strictly up to the application.

If only the backup host processor fails, the signaling links terminating on the backup board will remain operational until the backup node is rebooted.

Synchronization of circuit states that may have changed while the backup node was out of service is the responsibility of the application.

Signaling Board Isolation

Signaling board isolation occurs when the inter-board link fails. In this case, neither board can communicate with the other, and neither can distinguish this case from a failure of its mate board.

During isolation, the primary board keeps running but at (potentially) reduced capacity since the signaling links on the backup board cannot be accessed. Normal checkpointing of ISUP circuit states to the backup board from the host may still take place.

When isolation is detected on the backup board, the active signaling links are put into an isolated state, queuing inbound packets but still delivering any queued outbound packets, and a short isolation timer is set. If the isolation ceases before the timer expires, normal traffic is resumed starting with the queued packets. If the isolation timer expires before the isolation condition is corrected, then the isolated links are placed into the local processor outage (LPO) state and the queued inbound packets are discarded.
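The isolation timer behavior on a backup board link can be modeled as a small state machine. The sketch below uses explicit timestamps instead of a real timer so that it is easy to follow; the class, its method names, and the timer value are assumptions, and delivery of queued outbound packets is not modeled.

```python
class BackupLink:
    """Toy model of one signaling link on an isolated backup board."""

    def __init__(self, isolation_timer_s):
        self.isolation_timer_s = isolation_timer_s
        self.state = "ACTIVE"
        self.queued = []
        self._isolated_at = None

    def _expire_if_due(self, now):
        if self.state == "ISOLATED" and now - self._isolated_at >= self.isolation_timer_s:
            self.state = "LPO"      # local processor outage
            self.queued.clear()     # queued inbound packets are discarded

    def isolate(self, now):
        self.state = "ISOLATED"
        self._isolated_at = now

    def receive(self, packet, now):
        self._expire_if_due(now)
        if self.state == "ISOLATED":
            self.queued.append(packet)
        # Normal forwarding (ACTIVE) and LPO handling are outside this model.

    def restore(self, now):
        """Isolation ends: return the packets with which traffic resumes."""
        self._expire_if_due(now)
        delivered, self.queued = self.queued, []
        self.state = "ACTIVE"
        return delivered


link = BackupLink(isolation_timer_s=5.0)
link.isolate(now=0.0)
link.receive("a", now=1.0)
link.receive("b", now=2.0)
resumed = link.restore(now=3.0)   # restored before the timer expires
```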

Switching the backup board into primary mode (such as would be the case if the primary board failed) clears the isolated/LPO condition on those links and resumes normal traffic flow.

After the failed inter-board link is restored, the active MTP 3 layer clears the LPO condition on the isolated links to restore normal traffic and checkpoints any route or link states that may have changed during the isolation.

Planned Switchovers

In some cases it may be necessary to remove a primary board from service in order to upgrade the software or hardware on the signaling node or the board itself. The recommended procedure in this case is to manually switch the backup board into primary mode before shutting down the (now backup) board or node, as follows.

Once the applications have agreed that a switchover is necessary, the primary board is set into backup mode with the hmiBackup command. This effectively sets all signaling links into a flow-controlled state, resulting in all inbound packets being queued. Each layer (starting from MTP 3 on up) then sends a status indication (NOW BACKUP) to each of its service users.

Due to queuing between layers and within the device driver, the application may still receive some incoming signaling traffic between the issuing of the hmiBackup request and receipt of the NOW BACKUP status indication. For ISUP messages, the following procedure is recommended:

For TCAP and ISUP messages, the following is recommended:

During this period the application should not generate any new outbound signaling traffic.

Once the NOW BACKUP status indication has been received, indicating the end of any in-progress signaling traffic, the mate board is set into primary mode. This restarts the flow of signaling traffic to/from the mate board/node, including any messages queued within MTP 2 during the switchover.

Note: It is possible that packets may be lost during a switchover. In addition, heavy traffic during a switchover may result in either or both boards becoming congested due to the queuing of incoming packets. For these reasons, planned switchovers are not recommended during periods of heavy load (i.e., maintenance should be scheduled during off-peak periods whenever possible).
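The ordering of a planned switchover can be expressed compactly as follows. The three callables are hypothetical wrappers around the hmiBackup command, the hmiPrimary command, and the wait for the NOW BACKUP status indication; returning False when the indication does not arrive is a choice made for this sketch, since the text does not specify that case.

```python
def planned_switchover(hmi_backup, hmi_primary, wait_now_backup, old_primary, old_backup):
    """Demote the current primary, wait for NOW BACKUP, then promote the mate."""
    hmi_backup(old_primary)
    # The application generates no new outbound traffic during this wait.
    if not wait_now_backup(old_primary):
        return False
    hmi_primary(old_backup)
    return True


log = []
ok = planned_switchover(
    hmi_backup=lambda board: log.append(("hmiBackup", board)),
    hmi_primary=lambda board: log.append(("hmiPrimary", board)),
    wait_now_backup=lambda board: True,
    old_primary=1,
    old_backup=2,
)
```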

2.5 Hot Swap Support

The HMI service supports Hot Swap for CompactPCI installations. When the ejector handle is lifted to indicate that the board is to be extracted, the HMI service issues an HMI_EVN_EXTRACT event to all applications registered for receipt of asynchronous events. A redundancy manager application receiving this event takes the necessary steps to prepare any applications associated with the board for its removal. The controlling application then calls hmiStop, which causes HMI to send an HMI_EVN_STOP event to all applications registered for receipt of asynchronous events. Upon receipt of this event, applications should close all communications channels to the board. After all applications have closed their communications channels to the board, the Hot Swap LED lights, indicating that the board is ready for extraction.

When a board is inserted into the chassis, the HMI service will issue an HMI_EVN_INSERT event to all applications registered for receipt of asynchronous events. When the redundancy manager application receives this event, it should initiate a load of the board.
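The hot swap event handling described above can be summarized as a mapping from event to expected actions. This is a descriptive sketch only: the function and the action strings are hypothetical, and the split between the redundancy manager and other applications follows the text of this section.

```python
def hot_swap_actions(event, is_redundancy_manager):
    """Return the actions an application is expected to take for an event."""
    if event == "HMI_EVN_EXTRACT" and is_redundancy_manager:
        return ["prepare applications for board removal", "hmiStop"]
    if event == "HMI_EVN_STOP":
        # Applies to every application with channels open to the board.
        return ["close communication channels to the board"]
    if event == "HMI_EVN_INSERT" and is_redundancy_manager:
        return ["hmiLoadBoard"]
    return []


manager_steps = hot_swap_actions("HMI_EVN_EXTRACT", is_redundancy_manager=True)
```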

2.6 Configuration and Management

The following subsections explain how the hardware and software are configured and managed.

2.6.1 Configuration Utilities and APIs

In general, each signaling board is loaded and configured independently. Each configuration utility or API call sends configuration packets only to the target board - no configuration requests are explicitly exchanged between boards. Therefore, each node in a multiple node signaling subsystem must have its own copy of each SS7 configuration file/database, or be able to access a common file/database through networking. Similarly, each dynamic configuration API call must be executed on both boards.

In general, the configurations downloaded to the SS7 layers are identical on both boards in a pair. In order to support configuration changes without a service outage, however, it is sometimes necessary to download a backup board with a new configuration, make it the primary, and then reload the other board with the new configuration. In this case, the configurations on the two boards are out of sync for some time period. This is allowed for during the checkpointing of state information between boards.

2.6.2 Control, Status, and Statistics

Control, status, and statistics requests are also applied individually to each board in a mated pair. Control requests (enable/disable signaling links, block/unblock voice circuits, etc.) can only be issued to the primary board. Control requests issued to the backup board are rejected with an "invalid state" indication.

Status type requests can be issued to either the primary or the backup board. The results returned by a board reflect the status of the entity as currently viewed by that board. Thus, status requests issued to the backup board can be used to determine if an event, such as a call being answered, has been checkpointed correctly to the backup.

Statistics requests can also be issued to either the primary or backup board. The statistics returned reflect events that occurred on that board only; no attempt is made to collate statistics between the primary and backup boards.
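The acceptance rules for the three request types can be captured in a few lines. The function below is a hypothetical model of those rules, not part of any management API; the role and request-kind strings are placeholders.

```python
def check_request(kind, board_role):
    """Model which management requests a board accepts, by role."""
    if kind not in ("control", "status", "statistics"):
        raise ValueError(f"unknown request kind: {kind}")
    if kind == "control" and board_role == "backup":
        return "invalid state"   # control requests are rejected on the backup
    return "ok"                  # status and statistics work on either board


results = {
    ("control", "primary"): check_request("control", "primary"),
    ("control", "backup"): check_request("control", "backup"),
    ("status", "backup"): check_request("status", "backup"),
}
```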

2.6.3 Alarms

Each board generates its own independent alarms. In a dual node signaling subsystem, the MTP 2 alarms associated with a particular link appear on the node that the link terminates on, not necessarily the active node. MTP 3, SCCP, TCAP, and ISUP alarms relating to operational events will typically appear only on the active node.

The txmon task on each board generates alarms relating to the state of that board: transitions between primary and backup mode, failures of tasks on the board, and changes in the status of the inter-board link.





tech_support@nmss.com
Copyright © 2000, Natural MicroSystems, Inc. All rights reserved.