-
chevron_right
Erlang Solutions: Why do systems fail? Tandem NonStop system and fault tolerance
news.movim.eu / PlanetJabber • 3 October • 6 minutes
If you’re an Elixir, Gleam, or Erlang developer, you’ve probably heard about the capabilities of the BEAM virtual machine, such as concurrency, distribution, and fault tolerance. Fault tolerance was one of the biggest concerns of Tandem Computers. They created their Tandem Non-Stop architecture for high availability in their systems, which included ATMs and mainframes.
In this post, I’ll be sharing the fundamentals of the NonStop architecture design with you. Their approach to achieving high availability in the presence of failures is similar to some implementations in the Erlang Virtual Machine, as both rely on concepts of processes and modularity.
Systems with High Availability
Why do systems fail? This question should probably be asked more often, considering all the factors it involves. It was central to the NonStop architecture because achieving high availability depends on understanding system failures.
For tandem systems , any system has critical components that could potentially cause failures. How often do you ask yourself how long can your system operate before a failure? There is a metric known as MTBF (mean time between failures), which is calculated by dividing the total operating hours of the system by the number of failures. The result represents the hours of uninterrupted operation.
Many factors can affect the MTBF, including administration, configuration, maintenance, power outages, hardware failures, and more. So, how can you survive these eventualities to achieve at least virtual high availability in your systems?
High availability in hardware has taught us important insights about continuous operation. Some hardware implementations rely on decomposing the system into modules, allowing for modularity to contain failures and maintain operation through backup modules instead of breaking the whole system and needing to restart it. The main concept, from this point of view, is to use modules as units of failure and replacement.
High Availability for Software Systems
But what about the software’s high availability? Just as with hardware, we can find important lessons from operative system designers who decompose systems into modules as units of service. This approach provides a mechanism for having a unit of protection and fault containment.
To achieve fault tolerance in software, it’s important to address similar insights from the NonStop design:
- Modularity through processes and messages.
- Fault containment.
- Process pairs for fault tolerance.
- Data integrity.
Can you recognise some similarities so far?
The NonStop architecture essentially relies on these concepts. The key to high availability, as I mentioned before, is modularity as a unit of service failure and protection.
A process should have a fail-fast mechanism, meaning it should be able to detect a failure during its operation, send a failure signal and then stop its operation. In this way, a system can achieve fault detection through fault containment and by sharing no state.
Another important consideration for your system is how long it takes to recover from a failure. Jim Gray, software designer and researcher at Tandem Computers, in his paper ”Why computers stop and what can be done about it?” proposed a model of failure affected by two kinds of bugs: Bohrbugs, which cause critical failures during operation, and Heisenbugs, which are more soft and can persist in the system for years.
Implementing Processes-Pairs Strategies
The previous categorisation helps us to understand better strategies for implementing processes-pairs design, based on a primary process and a backup process:
- Lockstep: Primary and backup processes execute the same task, so if the primary fails, the backup continues the execution. This is good for hardware failures, but in the presence of Heisenbugs, both processes will remain the failure.
- State checkpointing: A requestor entity is connected to a processes-pair. When the primary process stops operation, the requestor switches to the backup process. You need to design the requestor logic.
- Automatic checkpointing: Similar to the previous, but using the kernel to manage the checkpointing.
- Delta checkpointing : Similar to state checkpointing but using logical rather than physical updates.
- Persistence: When the primary process fails, the backup process starts its operation without a state. The system must implement a way to synchronise all the modules and avoid corrupt interaction.
All of these insights are drawn from Jim Gray’s paper, written in 1985 and referenced in Joe Armstrong’s 2003 thesis, “Making Reliable Distributed Systems in the presence of software errors” . Joe emphasised the importance of the Tandem NonStop system design as an inspiration for the OTP design principles.
Elixir and High Availability
So if you’re a software developer learning Elixir, you’ll probably be amazed by all the capabilities and great tooling available to build software systems. By leveraging frameworks like Phoenix and toolkits such as Ecto, you can build full-stack systems in Elixir. However, to fully harness the power of the Erlang virtual machine (BEAM) you must understand processes.
Just as the Tandem computer system relied on transactions, fault containment and a fail-fast mechanism, Erlang achieves high availability through processes. Both systems consider it important to modularise systems into units of service and failure: processes.
About the process
A process is the basic unit of abstraction in Erlang, a crucial concept because the Erlang virtual machine (BEAM) operates around this. Elixir and Gleam share the same virtual machine, which is why this concept is important for the entire ecosystem.
A process is:
- A strongly isolated entity.
- Creation and destruction is a lightweight operation.
- Message passing is the only way to interact with processes.
- Share no state.
- Do what they are supposed to do or fail.
Just remember, these are the fundamentals of Erlang, which is considered a message-oriented language, and its virtual machine (BEAM), on which Elixir runs.
If you want to read more about processes in Elixir I recommend reading this article I wrote: Understanding Processes for Elixir Developers.
I consider it important to read papers like Jim Gray’s article because they teach us the history behind implementations that attempt to solve problems. I find it interesting to read and share these insights with the community because it’s crucial to understand the context behind the tools we use. Recognising that implementations exist for a reason and have stories behind them is essential.
You can find many similarities between Tandem and Erlang design principles:
- Both aim to achieve high availability .
- Isolation of operations is extremely important to contain failure.
- Processes that share no state are crucial for building modular systems.
- Process interactions are key to maintaining operation in the presence of errors. While Tandem computers implemented process-pairs design, Erlang implemented OTP patterns .
To conclude
Take some time to read about the Tandem computer design. It’s interesting because these features share significant similarities with OTP design principles for achieving high availability. Failure is something we need to deal with in any kind of system, and it’s important to be aware of the reasons and know what you can do to manage it and continue your operation. This is crucial for any software developer, but if you’re an Elixir developer, you’ll probably dive deeper into how processes work and how to start designing components with them and OTP.
Thanks for reading about the Tandem NonStop system. If you like this kind of content, I’d appreciate it if you shared it with your community or teammates. You can visit this public repository on GitHub where I’m adding my graphic recordings and insights related to the Erlang ecosystem or contact the Erlang Solutions team to chat more about Erlang and Elixir.
Illustrations by Visual Partner-Ship @visual_partner
Jaguares, ESL Americas Office
@carlogilmar
The post Why do systems fail? Tandem NonStop system and fault tolerance appeared first on Erlang Solutions .