Fault tolerance is the ability of equipment or a system to remain operational after one or more of its components fail.
The reliability of a fault-tolerant system is characterized by its availability, usually expressed as a "number of nines". For example, an ordinary website may promise 99% availability (two nines), while the database of a Sberbank-level organization guarantees 99.9999% (six nines).
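To make the "number of nines" concrete, the short sketch below converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored for simplicity


def downtime_per_year(availability_percent: float) -> float:
    """Return the permitted downtime in hours per year for a given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.9999):
    hours = downtime_per_year(pct)
    print(f"{pct}% -> {hours:.4f} h/year ({hours * 3600:.1f} s)")
```

Two nines allow roughly 87.6 hours (about 3.65 days) of downtime per year, while six nines allow only about 31.5 seconds.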
Specifications
A fault-tolerant system is characterized by the presence of redundant elements. Conventionally, they belong to the following types:
1. The software part. An identical copy of the application runs on each module of the information system. Control software is mandatory: it monitors the status of each node and redirects the load.
A typical example is the clustering scheme based on the Veritas Cluster Module. If one element fails, the control software removes it from the cluster and redistributes the load among the remaining nodes.
2. The hardware part. It is similar to the previous one, but here redundancy is provided at the level of logical modules or equipment. For example, a data storage system has duplicate elements: two controllers, two power supplies, two network adapters, etc. If one of the modules fails, its load passes to the second one.
Redundancy at the hardware level implies the presence of several devices with similar characteristics. An example is a high-density server with computing nodes installed inside it.
3. The disaster-resistant part. This type of redundancy is used only for mission-critical systems, because it is expensive and requires qualified specialists.
The redundancy scheme is scaled up to the level of data centers. Identical infrastructures are built at two different sites, a communication link is established between them, and specialized software then coordinates the two sites.
The first such software was created by NetApp, known for its technological innovations in the field of data storage systems. The vendor developed MetroCluster, a product that fully duplicates all data center components at a remote site. Even if one of the data centers shuts down completely, the second one takes over within a few seconds.
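All three types of redundancy above rely on the same mechanism: a failed element is detected, excluded, and its load is passed to the remaining ones. The sketch below illustrates that idea only; it is not how Veritas or NetApp products are implemented. The `Node` class, the `redistribute` function, and the even split of the load are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True
    load: float = 0.0  # arbitrary load units (hypothetical)


def redistribute(nodes: list[Node]) -> list[Node]:
    """Drop failed nodes and spread their load evenly over the survivors."""
    alive = [n for n in nodes if n.healthy]
    if not alive:
        raise RuntimeError("no healthy nodes left in the cluster")
    orphaned = sum(n.load for n in nodes if not n.healthy)
    share = orphaned / len(alive)
    for n in alive:
        n.load += share
    return alive


cluster = [Node("node-a", load=30.0), Node("node-b", load=30.0), Node("node-c", load=30.0)]
cluster[1].healthy = False  # simulate a failure reported by the monitoring software
cluster = redistribute(cluster)
for n in cluster:
    print(n.name, n.load)
```

A real cluster manager would also need health checks, quorum handling, and a check that the surviving nodes have enough spare capacity; those are left out to keep the sketch short.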
To build fault-tolerant systems, the customer’s current infrastructure is first audited to identify vulnerabilities.
The next step is to assess the risks of losing each of the infrastructure elements. Different scenarios are considered, with a focus on those in which the client would suffer the greatest losses. Based on this information, a scheme for building a fault-tolerant system from the necessary elements is developed. As a result, the client receives a comprehensive solution that covers the risks as fully as possible at an acceptable cost.
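The trade-off between risk and cost described above can be sketched as a simple comparison: the expected yearly loss from downtime with a single element versus the loss plus the cost of a duplicate. This is a minimal illustration, not an audit methodology. All names and figures are hypothetical, and the two units are assumed to fail independently, which real hardware sharing a rack or power feed may not do.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 8760


@dataclass
class Element:
    name: str
    availability: float    # fraction of time one unit is up, e.g. 0.99
    loss_per_hour: float   # business loss per hour of downtime (hypothetical)
    duplicate_cost: float  # yearly cost of running a second unit (hypothetical)


def expected_loss(availability: float, loss_per_hour: float) -> float:
    """Expected yearly loss from downtime at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR * loss_per_hour


def evaluate(e: Element) -> tuple[float, float]:
    """Return (yearly loss with one unit, yearly loss plus duplicate cost with two)."""
    single = expected_loss(e.availability, e.loss_per_hour)
    # Assumption: independent failures, so both units are down with probability (1 - a)^2.
    paired_availability = 1 - (1 - e.availability) ** 2
    paired = expected_loss(paired_availability, e.loss_per_hour) + e.duplicate_cost
    return single, paired


elements = [
    Element("storage controller", 0.99, 1000.0, 20000.0),
    Element("test server", 0.99, 10.0, 20000.0),
]
for e in elements:
    single, paired = evaluate(e)
    verdict = "duplicate" if paired < single else "keep single"
    print(f"{e.name}: {single:.0f} vs {paired:.0f} -> {verdict}")
```

With these made-up figures, duplicating the storage controller pays for itself, while duplicating the test server does not. This is why redundancy is applied selectively rather than to every element.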
Fault tolerance is an important indicator of any information system. Redundancy can be provided at different levels of the information system, from the software up to the data center.