- This section describes the types of failure that can occur and how the system recovers from each.
Signaling Link Failures
- Signaling link failures are handled completely within the MTP layers. Applications are not informed of a signaling link failure unless the failure leaves a concerned destination unreachable (in which case the application receives a PAUSE event for that destination) or the application/user part has explicitly registered for link status events.
- Likewise, signaling link recovery is automatic and transparent to MTP user parts and applications unless it results in a previously unreachable destination becoming reachable or the application/user part has explicitly registered for link status events.
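The notification rule above can be illustrated with a small model. This is only a sketch under stated assumptions: `MtpModel`, its fields, and the `LINK_UP`/`LINK_DOWN` event names are hypothetical and are not part of the product API. `PAUSE` comes from the text; `RESUME` is the standard MTP name for the matching "destination reachable again" indication and is assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class MtpModel:
    """Toy model of link/destination bookkeeping (hypothetical, not the product API)."""
    routes: dict                           # destination -> list of link ids that reach it
    link_events_registered: bool = False   # True if the app registered for link status events
    up_links: set = field(default_factory=set)

    def __post_init__(self):
        # Start with every configured link in service.
        for links in self.routes.values():
            self.up_links |= set(links)

    def _reachable(self, dest):
        return bool(set(self.routes[dest]) & self.up_links)

    def set_link(self, link, up):
        """Change one link's state and return the events the application would see."""
        before = {d: self._reachable(d) for d in self.routes}
        if up:
            self.up_links.add(link)
        else:
            self.up_links.discard(link)
        events = []
        # Link status events are delivered only to applications that asked for them.
        if self.link_events_registered:
            events.append(("LINK_UP" if up else "LINK_DOWN", link))
        # Otherwise the application only hears about changes in reachability.
        for dest in sorted(self.routes):
            after = self._reachable(dest)
            if before[dest] and not after:
                events.append(("PAUSE", dest))
            elif not before[dest] and after:
                events.append(("RESUME", dest))
        return events


mtp = MtpModel(routes={"1.1.1": ["L0", "L1"], "2.2.2": ["L1"]})
print(mtp.set_link("L1", up=False))  # 2.2.2 loses its only link; 1.1.1 still has L0
```

In this model, losing one of two links to a destination produces no event at all for an unregistered application, which matches the "transparent" behavior described above.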
Signaling Board Failures
- A signaling board failure is detected by the HMI service on the local signaling node. A failure can be a software failure on the board, detected by the txmon process and reported to the HMI, or a hard failure such that the HMI loses communication with the board. Both are reported to registered applications as board failures so that recovery can take place. Two different recovery scenarios are distinguished: failure of the primary board and failure of the backup board.
- Upon receipt of an HMI event indicating a failure of the primary board, the application typically initiates a switchover to the backup signaling board by issuing an hmiPrimary command through the HM_API. Call processing applications (or applications using other SS7 signaling services) must then wait for the NOW PRIMARY status indication from their service provider(s) before resuming data traffic.
- Once the switchover has been initiated, an application may initiate reloading of the failed board with the hmiLoadBoard command. Once the download has completed (HMI_EVN_STARTING event received), the application can set the reloaded board into the backup state with the hmiBackup command. At that point any SS7 service applications must re-bind to their service providers. Any failed signaling links terminated on the reloaded board are automatically activated by the primary MTP 3.
- After the reload and rebind, the TCAP and SCCP tasks will automatically resynchronize with the primary TX board. The backup is then ready to take over operation. The backup ISUP layer will consider all circuits to be idle. At this point it is recommended that the application re-synchronize the backup by checkpointing all non-idle circuit states via the ISUP API. The recovered board is then ready to take over the role of primary when needed.
- If the board fails to reload cleanly (the HMI_EVN_STARTING event is not received within a reasonable time period), as might be the case with a true hardware failure, the board should be halted with the hmiHaltBoard command. Manual intervention is then required to recover the failed board.
- Failure of a backup board is detected and reported in the same fashion as failure of the primary board. The board is typically reloaded (if possible) and set into the BACKUP state. Applications must re-bind with their service providers. Any failed signaling links terminated on the reloaded board are automatically activated by the primary MTP 3.
- After a backup board running TCAP and/or SCCP is brought back into service, it is ready to take over, since TCAP and SCCP automatically resynchronize with the primary TX board.
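The primary-board recovery sequence described in this section can be sketched as follows. This is a minimal illustration, not the HMI API: `RecordingHmi`, `recover_failed_primary`, the `rebind` callback, and the 60-second reload timeout are all assumptions made for the example (the text only says "a reasonable time period"). The command and event names are taken from the text but are passed as plain strings rather than real calls.

```python
class RecordingHmi:
    """Hypothetical stand-in for the HMI service; records commands instead of issuing them."""

    def __init__(self, reload_succeeds=True):
        self.reload_succeeds = reload_succeeds
        self.commands = []

    def command(self, name, board):
        self.commands.append((name, board))

    def wait_for_event(self, event, board, timeout_s):
        # A real application would block on its event queue until `event`
        # arrives for `board` or `timeout_s` elapses. Here the outcome is canned.
        return self.reload_succeeds


def recover_failed_primary(hmi, failed, mate, rebind, reload_timeout_s=60.0):
    """Run the recovery steps for a failed primary; return the final state of `failed`."""
    hmi.command("hmiPrimary", mate)          # switch over to the backup board
    # (Applications wait for NOW PRIMARY from their providers before resuming traffic.)
    hmi.command("hmiLoadBoard", failed)      # start reloading the failed board
    if not hmi.wait_for_event("HMI_EVN_STARTING", failed, reload_timeout_s):
        hmi.command("hmiHaltBoard", failed)  # reload never completed: halt the board
        return "HALTED"                      # manual intervention is now required
    hmi.command("hmiBackup", failed)         # reloaded board becomes the backup
    rebind(failed)                           # SS7 service applications re-bind
    return "BACKUP"


hmi = RecordingHmi(reload_succeeds=True)
state = recover_failed_primary(hmi, failed=1, mate=2, rebind=lambda board: None)
print(state, hmi.commands)
```

Recovery of a failed backup board follows the same shape without the initial switchover step. Checkpointing of non-idle ISUP circuit states after the rebind is left out of the sketch because it depends on the application's own circuit records.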
Signaling Node Failures
- Detection of signaling node failures in a dual-node configuration is completely application specific. No monitoring of the host or application status is done by the signaling subsystem. Recovery scenarios are similar to the failed board recovery scenarios described above.
- When a primary signaling node fails, it is up to an application on the backup node to detect the failure and set the backup board into primary operation with the hmiPrimary command. During the outage of the primary node, messages arriving on signaling links terminated on the backup node are queued if possible, waiting for the switchover. If the traffic load is too heavy or the failure detection and recovery take too long, the links may be placed in a local processor outage state and the queued messages may be lost.
- Following the restoration of the failed signaling node, the signaling board in the failed node is reloaded and placed into backup state. Again the sequence is the same as the recovery of a failed board described above, except that some additional synchronization is required between the applications on the primary and backup nodes in order to convey changes in circuit status that occurred while the failed node was unavailable. This synchronization is completely application specific.
- Recovery of a failed backup signaling node is similar to the recovery of a failed backup board described in the previous section. No disruption of signaling traffic is expected in this case. If the failure is a total failure (the primary board detects the failure of the backup board), signaling links terminating on the failed board are declared failed until the backup board is restored. Blocking or resetting of any voice circuits that may have been terminated on the failed node is strictly up to the application.
- If only the backup host processor fails, the signaling links terminating on the backup board will remain operational until the backup node is rebooted.
- Synchronization of circuit states that may have changed while the backup node was out of service is the responsibility of the application.
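Because node failure detection is left entirely to the application, the mechanism is a design choice. One common option is a heartbeat between the applications on the two nodes. The sketch below assumes that approach; `MateNodeMonitor`, its timeout, and the callback are hypothetical and are not provided by the signaling subsystem. Time is passed in explicitly so the logic is easy to follow and test.

```python
class MateNodeMonitor:
    """Hypothetical heartbeat watchdog run by the application on the backup node."""

    def __init__(self, timeout_s, on_primary_lost):
        self.timeout_s = timeout_s
        self.on_primary_lost = on_primary_lost  # e.g. issues hmiPrimary for the local board
        self.last_heartbeat = None
        self.tripped = False                    # latched so the callback fires only once

    def heartbeat(self, now):
        """Record a heartbeat received from the application on the primary node."""
        self.last_heartbeat = now

    def poll(self, now):
        """Return True exactly once, when the primary node is first declared failed."""
        if self.tripped or self.last_heartbeat is None:
            return False
        if now - self.last_heartbeat > self.timeout_s:
            self.tripped = True
            self.on_primary_lost()
            return True
        return False


actions = []
monitor = MateNodeMonitor(timeout_s=2.0, on_primary_lost=lambda: actions.append("hmiPrimary"))
monitor.heartbeat(now=0.0)
monitor.poll(now=1.5)   # still within the timeout, nothing happens
monitor.poll(now=2.5)   # heartbeat overdue, switchover is requested
print(actions)
```

The timeout trades false switchovers against the risk described above: the longer detection takes, the more likely the links on the backup node enter local processor outage and queued messages are lost.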
Signaling Board Isolation
- Signaling board isolation occurs when the inter-board link fails. In this case, neither board can communicate with the other, and neither can distinguish this condition from a failure of its mate board.
- During isolation, the primary board keeps running but at (potentially) reduced capacity since the signaling links on the backup board cannot be accessed. Normal checkpointing of ISUP circuit states to the backup board from the host may still take place.
- When isolation is detected on the backup board, the active signaling links are put into an isolated state, queuing inbound packets but still delivering any queued outbound packets, and a short isolation timer is set. If the isolation ceases before the timer expires, normal traffic is resumed starting with the queued packets. If the isolation timer expires before the isolation condition is corrected, then the isolated links are placed into the local processor outage (LPO) state and the queued inbound packets are discarded.
- Switching the backup board into primary mode (such as would be the case if the primary board failed) clears the isolated/LPO condition on those links and resumes normal traffic flow.
- After the failed inter-board link is restored, the active MTP 3 layer clears the LPO condition on the isolated links to restore normal traffic and checkpoints any route or link states that may have changed during the isolation.
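The behavior of an active link on the isolated backup board can be modeled as a small state machine. This is a sketch for illustration only: the class, state names other than LPO, and the timer value are hypothetical, and the model assumes that nothing is accepted on a link while it is in LPO.

```python
from collections import deque


class BackupLink:
    """Toy model of one active link on an isolated backup board (hypothetical)."""

    def __init__(self, isolation_timer_s):
        self.isolation_timer_s = isolation_timer_s
        self.state = "NORMAL"
        self.inbound = deque()      # inbound packets held while isolated
        self.isolated_at = None

    def isolate(self, now):
        """Inter-board link lost: start queuing and arm the isolation timer."""
        if self.state == "NORMAL":
            self.state, self.isolated_at = "ISOLATED", now

    def receive(self, packet):
        """Return the packet if it is delivered upward now, else None."""
        if self.state == "NORMAL":
            return packet
        if self.state == "ISOLATED":
            self.inbound.append(packet)
        return None                 # assumption: nothing is accepted while in LPO

    def tick(self, now):
        """Timer check: on expiry, enter LPO and discard queued inbound packets."""
        if self.state == "ISOLATED" and now - self.isolated_at >= self.isolation_timer_s:
            self.state = "LPO"
            self.inbound.clear()

    def restore(self):
        """Isolation ended or board switched to primary; return packets to resume with."""
        queued = list(self.inbound)
        self.inbound.clear()
        self.state, self.isolated_at = "NORMAL", None
        return queued


link = BackupLink(isolation_timer_s=1.0)
link.isolate(now=10.0)
link.receive("msg-1")        # queued, not delivered
print(link.restore())        # isolation cleared before expiry: queued traffic resumes
```

`restore()` covers both exits described above: the inter-board link coming back, and the backup board being switched into primary mode.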
Planned Switchovers
- In some cases it may be necessary to remove a primary board from service in order to upgrade the software or hardware on the signaling node or the board itself. The recommended procedure in this case is to manually switch the backup board into primary mode before shutting down the (now backup) board or node, as follows.
- Once the applications have agreed that a switchover is necessary, the primary board is set into backup mode with the hmiBackup command. This effectively sets all signaling links into a flow-controlled state, resulting in all inbound packets being queued. Each layer (starting from MTP 3 on up) then sends a status indication (NOW BACKUP) to each of its service users.
- Due to queuing between layers and within the device driver, the application may still receive some incoming signaling traffic between the issuing of the hmiBackup request and receipt of the NOW BACKUP status indication. For ISUP messages, the following procedure is recommended:
- All others: Discard and allow far end to timeout and retry if desired.
- For TCAP and ISUP messages, the following is recommended:
- During this period the application should not generate any new outbound signaling traffic.
- Once the NOW BACKUP status indication has been received, indicating the end of any in-progress signaling traffic, the mate board is set into primary mode. This restarts the flow of signaling traffic to/from the mate board/node, including any messages queued within MTP 2 during the switchover.
Note: It is possible that packets may be lost during a switchover. In addition, heavy traffic during a switchover may result in either or both boards becoming congested due to the queuing of incoming packets. For these reasons, planned switchovers are not recommended during periods of heavy load (i.e., maintenance should be scheduled during off-peak periods whenever possible).
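The ordering of a planned switchover can be sketched as a small coordinator. This is an illustration under stated assumptions, not the product API: `PlannedSwitchover`, the `send_command` callback, and the phase names are hypothetical, and the set of service providers an application waits on (ISUP and TCAP in the usage below) depends on which layers it is bound to.

```python
class PlannedSwitchover:
    """Hypothetical coordinator for the planned switchover steps described above."""

    def __init__(self, providers, send_command):
        self.pending = set(providers)   # providers that still owe a NOW BACKUP indication
        self.send_command = send_command
        self.phase = "IDLE"

    def start(self, primary_board):
        """Step 1: demote the current primary; its links become flow controlled."""
        self.send_command("hmiBackup", primary_board)
        self.phase = "DRAINING"

    def may_send_outbound(self):
        """No new outbound signaling traffic while the switchover is in progress."""
        return self.phase != "DRAINING"

    def on_now_backup(self, provider, mate_board):
        """Step 2: once every provider has reported NOW BACKUP, promote the mate."""
        self.pending.discard(provider)
        if self.phase == "DRAINING" and not self.pending:
            self.send_command("hmiPrimary", mate_board)
            self.phase = "DONE"
        return self.phase


sent = []
switch = PlannedSwitchover(["ISUP", "TCAP"], lambda cmd, board: sent.append((cmd, board)))
switch.start(primary_board=1)
switch.on_now_backup("ISUP", mate_board=2)
switch.on_now_backup("TCAP", mate_board=2)
print(sent)
```

Waiting for every NOW BACKUP indication before promoting the mate is what keeps in-progress traffic from being split across the two boards.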