By Rich Pellegrini, Product Marketing, NMS Communications
The following article is an excerpt from the NMS white paper, Economics of Five-Nines Systems: CompactPCI vs. AdvancedTCA. The complete paper contains a cost comparison of the various examples given in this article, as well as a discussion of the operational and opportunity cost of service failures and downtime.
An important attribute of any system that provides a critical communications or transaction-based service is that it be highly available (HA), from both a user perspective and a system perspective. Examples of communications systems in this space include telecom value-added services such as conferencing, wireless music services, and voice and video portals, for which an interruption in service can mean lost revenue.
Attributes of Five-Nines HA Systems
The overall “availability” of a service is determined by both how often outages occur and how long it takes to recover from them. The goal is to design a system such that outages occur infrequently, and when they do occur, they can be identified and repaired quickly. For telecom and transaction-based monetary systems with advertised high availability, service is guaranteed not just 99 percent of the time, but 99.999 percent of the time, or with five-nines availability. This translates into about five minutes and fifteen seconds of downtime per year—or virtually continuous uptime.
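The downtime figure above follows from simple arithmetic. The short Python sketch below reproduces it, assuming a 365-day year; the function name is illustrative only and does not come from the white paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare the downtime budget at each level of "nines".
for label, percent in [
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]:
    minutes = annual_downtime_minutes(percent)
    print(f"{label:>11} ({percent}%): {minutes:,.2f} minutes per year")
```

At 99 percent availability the budget is more than three and a half days per year; at five nines it shrinks to a little over five minutes.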
Commercially Available Technologies for use in HA System Designs
An HA telecom system can be developed using either rackmount servers or blade servers.
Rackmount Servers
Rackmount servers are turnkey computing chassis/enclosures, often in 1U or 2U form factors, that include processors, memory, disk drives, power, cooling fans, networking interfaces, and chassis management. All computing components are on a single motherboard, with limited expansion slots for specialized peripherals. Rackmount servers are offered in both redundant and non-redundant models.
Blade Servers
Blade servers provide basic hardware, such as power, cooling fans, chassis management, and backplane signal distribution, in a base rackmount chassis. Processing, storage, and interface functionality is available in the form of plug-in cards or “blades.” Blade servers are offered in standards-based form factors, such as PICMG-compliant CompactPCI (cPCI) or AdvancedTCA (ATCA), and in proprietary form factors, such as those offered by Dell, IBM, or HP (HP also offers standards-based blade servers).
The focus of this article is on developing HA systems using standards-based cPCI and ATCA blade servers.
Building a High Availability System
When considering current cPCI product offerings, there are two generations to consider—the base PICMG 2.0 generation and the later cPCI Packet-Switched Backplane (cPSB) PICMG 2.16 generation. (Refer to the complete white paper for a detailed discussion of the attributes of the PICMG 2.0 and 2.16 specifications.)
PICMG 2.0 Example
Fully redundant systems built with PICMG 2.0 generation cPCI equipment consist of a split backplane chassis with two distinct PCI bus domains representing a cluster in a box. This architecture implements full 2N system redundancy with up to seven peripheral boards per node, as two redundant host SBCs interact to determine their active or standby role. For telecommunications-based systems, external digital cross-connect equipment would be required to switch between the telecom interfaces of the active and standby clusters. A redundant system with this description is shown in Figure 1.
Figure 1: PICMG 2.0 Generation Chassis Example
Assuming a 2,000-port interactive voice response (IVR) telephony system constructed with 480-port telephony boards, five telephony boards would be required for each redundant cluster, for a total of ten boards. That is 4,800 installed ports serving a 2,000-port requirement, stranding 2,800 ports of capacity.
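The board count for the 2N case can be checked with a few lines of Python. This is a minimal sketch of the sizing arithmetic only; the function and field names are illustrative and do not come from the white paper.

```python
import math


def size_2n_system(required_ports: int, ports_per_board: int) -> dict:
    """Size a 2N system: two identical clusters, each able to carry the full load."""
    boards_per_cluster = math.ceil(required_ports / ports_per_board)
    total_boards = 2 * boards_per_cluster
    installed_ports = total_boards * ports_per_board
    return {
        "boards_per_cluster": boards_per_cluster,
        "total_boards": total_boards,
        "installed_ports": installed_ports,
        # Ports that are paid for but not needed to carry the required load.
        "stranded_ports": installed_ports - required_ports,
    }


sizing = size_2n_system(required_ports=2000, ports_per_board=480)
print(sizing)
```

Most of the stranded capacity is the entire standby cluster, which sits idle until a failover occurs.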
PICMG 2.16 Example
Redundant systems built with PICMG 2.16 generation cPCI equipment consist of a redundant host chassis with two bridgeable PCI bus domains, representing a dual-host single system with up to fourteen peripheral boards. Two redundant host SBCs interact to determine their active or standby role. The active SBC utilizes a PCI bridge card to gain control of the second PCI bus segment. The standby SBC has no PCI bus control until it is required to become the active system controller. A redundant system with this description is shown in Figure 2. Note that the IPMI bus is omitted from the diagram so that other components can be shown.
Figure 2: PICMG 2.16 Generation Chassis Example (with legacy PCI Bus, Redundant Host Configuration)
Assuming the same 2,000-port telephony system constructed with 480-port telephony boards, only six telephony boards would be required to meet the required density, regardless of which host was the active chassis controller. Because the telecom interface boards are a pooled resource in this case, an N+1 redundancy architecture is employed, allowing the host to swap in a spare resource for a failed one.
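The N+1 sizing can be sketched the same way. The Python below is an illustrative check, with hypothetical function names, that six boards still meet the 2,000-port requirement after any single board fails, while five boards do not.

```python
import math


def size_n_plus_1(required_ports: int, ports_per_board: int) -> int:
    """Return the board count for N+1: enough boards for the load, plus one spare."""
    working_boards = math.ceil(required_ports / ports_per_board)
    return working_boards + 1


def survives_single_failure(board_count: int, required_ports: int,
                            ports_per_board: int) -> bool:
    """True if the remaining boards still cover the load after one board fails."""
    return (board_count - 1) * ports_per_board >= required_ports


boards = size_n_plus_1(required_ports=2000, ports_per_board=480)
print(boards, survives_single_failure(boards, 2000, 480))
print(boards - 1, survives_single_failure(boards - 1, 2000, 480))
```

Pooling the boards behind either host is what makes a single spare sufficient, in contrast to duplicating every board as in the 2N case.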
Autonomous PICMG 2.16 Example
AUTONOMOUS OPERATION
The dictionary definition of “autonomous” implies totally independent operation. But “autonomous” used in the context of PICMG 2.16 and ATCA means that the device can be controlled remotely through IP (Start, Stop, Query) and it is not dependent on a computing bus within the chassis.
Redundant systems built with purely autonomous PICMG 2.16 generation equipment would consist of a non-PCI, fully packet-switched backplane chassis with two redundant SBC system hosts, two Gigabit Ethernet fabric switches, and up to sixteen peripheral/processor boards. The two redundant host SBCs would interact to determine their active or standby role and would communicate with the autonomous telecom blades using an Ethernet-based control protocol. There is no PCI bus in the chassis, so no bus arbitration or bridge hardware is required. A redundant system with this description is illustrated in Figure 3.
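The interaction between redundant hosts and autonomous blades can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the protocol used by any particular product: the host names, the priority-based tie-break, and the `Blade.handle` method are all hypothetical, and a plain method call stands in for what would be an Ethernet-based control message in a real chassis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Command(Enum):
    START = "start"
    STOP = "stop"
    QUERY = "query"


@dataclass
class Host:
    name: str
    priority: int        # assumed tie-break: the lowest value becomes active
    healthy: bool = True


@dataclass
class Blade:
    slot: int
    running: bool = False

    def handle(self, command: Command) -> str:
        """Apply a control command and report the resulting state."""
        if command is Command.START:
            self.running = True
        elif command is Command.STOP:
            self.running = False
        # QUERY changes nothing; it only reports the current state.
        return "running" if self.running else "stopped"


def elect_active(hosts: List[Host]) -> Optional[Host]:
    """Return the healthy host with the lowest priority value, or None."""
    healthy = [host for host in hosts if host.healthy]
    return min(healthy, key=lambda host: host.priority) if healthy else None


hosts = [Host("sbc-a", priority=1), Host("sbc-b", priority=2)]
blades = [Blade(slot) for slot in range(1, 7)]

active = elect_active(hosts)            # sbc-a is active at first
for blade in blades:
    blade.handle(Command.START)         # the active host starts all six blades

hosts[0].healthy = False                # simulate failure of the active host
active = elect_active(hosts)            # the standby host takes over
states = [blade.handle(Command.QUERY) for blade in blades]
print(active.name, states)
```

Because the blades hold their own state and respond to whichever host addresses them, a host failover does not require restarting them in this model.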
Figure 3: PICMG 2.16 Generation Chassis Example (pure Autonomous Blade Server with Redundant Host Configuration)
Assuming the same 2,000-port telephony system constructed with 480-port telephony blades, only six autonomous telephony blades would be required in order to provide an N+1 redundant telecom interface architecture, regardless of which host was the active chassis controller.
AdvancedTCA Example
ATCA, defined by the PICMG 3.x series of specifications, represents the third generation of standards-based carrier-grade platform hardware. This specification set builds on the HA improvements of the PICMG 2.16 hardware generation by improving chassis management, redundancy, modularity, blade size, and power allotment. (Refer to the complete white paper for a detailed discussion of the attributes of the AdvancedTCA specifications.)
Redundant systems built with purely autonomous ATCA generation equipment would consist of a dual-Gigabit Ethernet, fully packet-switched backplane chassis with two redundant SBC system hosts, two Gigabit Ethernet dual fabric switches, and up to ten peripheral/processor blades. Note that the increased blade width allows for fewer peripheral node blades than an equivalent 19-inch rackmount PICMG 2.16 generation chassis. The two redundant host SBCs would interact to determine their active or standby role and would communicate with the autonomous telecom blades using an Ethernet-based control protocol. An ATCA redundant system with this description is shown in Figure 4.
Figure 4: ATCA Generation Chassis Example (pure Autonomous Blade System with Redundant Host Configuration)
Although the larger blade size of the ATCA form factor allows for higher density telecom interface hardware, service providers and HA system designers must ensure the loss of a single resource does not cause a service interruption for more than a certain percentage of subscribers.
Assuming the same 2,000-port telephony system constructed with 480-port telephony blades, in order to minimize the number of ports impacted during a blade failure, six autonomous telephony blades would be required for an N+1 redundant telecom interface architecture, regardless of which host was the active chassis controller.
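The trade-off between blade density and failure impact can be made concrete with a short Python sketch. Two assumptions here are not from the white paper: traffic is spread evenly across all installed blades, and the acceptable impact is set at 20 percent of active ports. The 1,920-port blade is a hypothetical higher-density part used only for comparison.

```python
import math


def blades_needed(required_ports: int, ports_per_blade: int,
                  max_impact_percent: int) -> int:
    """Return the blade count that satisfies both capacity and impact limits."""
    # Capacity constraint: enough blades for the load, plus one spare (N+1).
    for_capacity = math.ceil(required_ports / ports_per_blade) + 1
    # Impact constraint: with traffic spread evenly, one blade failure
    # touches 1/blades of the active ports.
    for_impact = math.ceil(100 / max_impact_percent)
    return max(for_capacity, for_impact)


standard = blades_needed(2000, ports_per_blade=480, max_impact_percent=20)
dense = blades_needed(2000, ports_per_blade=1920, max_impact_percent=20)
print(standard, dense)
```

Under these assumptions the denser blade reduces the count only slightly, because the impact limit, not raw capacity, becomes the binding constraint.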
The Cost of High Availability
While the cost of creating HA systems using cPCI and ATCA varies (see the complete white paper for details), more than just the equipment cost must be considered when making a choice between cPCI and ATCA. Other items include the cost of:
- Service monitoring and maintenance personnel
- Platform software system upgrades (patch and version upgrades)
- Repair personnel and associated logistics required to perform a hardware maintenance action (including problem-debugging cost, which depends on how many levels of support must be engaged given the fault reporting and logging/debugging capabilities of the platform)
- Operational and opportunity loss due to service failures and downtime
The costs associated with service downtime can include lost revenue, decline in customer satisfaction, loss of a customer to the competition (which is an easier decision with the advent of local number portability), and even damage to reputation via bad publicity (press or blog/message board) or word of mouth.
Developers of live, revenue-generating services who require high availability must choose platform standards and designs that provide numerous and well-planned HA features. ATCA includes a number of HA features that eliminate single points of failure and allow system developers to truly reach five-nines availability with reduced time-to-market and development costs.
Although ATCA is not the final word in HA platforms, service providers and telecom equipment manufacturers should make a concerted effort to require ATCA platforms, and solution providers should make a concerted effort to transition to them, in order to reduce operating costs by taking advantage of the improved HA features ATCA offers.