Aeron Cluster
Traditional Service
We all recognise a 'traditional' service that looks something like this:
Clients send in messages. The service processes them, builds up some internal state and sends response messages. Any modifications to the state are also written to a database, so if the service crashes, the state isn't lost.
This is fine for moderate loads, but the database is a source of contention, so it becomes a bottleneck as the load increases. This kind of architecture will never be able to process 100,000s of messages per second with consistent, 99.999th percentile latencies measured in microseconds. Not on a typical budget anyway. Aeron Cluster aims to solve this.
Deterministic State Machines
Take a simple calculator, which we'll use as a proxy for the service. If you feed it the inputs 8, x5, +2 and =, it will output 42 and have 42 as its internal state, ready for the next input.
You can do this over and over and it will always behave in the same way. It's a deterministic state machine.
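As a minimal sketch in Java (the Calculator class and its apply method are made-up names for this page, not part of Aeron), the calculator above might look like this:

```java
// Illustrative sketch only - Calculator is a made-up class for this page, not an Aeron API.
// Given the same sequence of inputs, it always produces the same outputs and final state.
public final class Calculator
{
    private long state = 0;

    public long apply(final String input)
    {
        if ("=".equals(input))
        {
            return state;                              // '=' just reports the current state
        }

        final char op = input.charAt(0);
        if (Character.isDigit(op))
        {
            state = Long.parseLong(input);             // a bare number replaces the state, e.g. "8"
        }
        else
        {
            final long operand = Long.parseLong(input.substring(1));
            switch (op)
            {
                case 'x': state *= operand; break;     // "x5"
                case '+': state += operand; break;     // "+2"
                case '-': state -= operand; break;
                default: throw new IllegalArgumentException("unknown operator: " + op);
            }
        }
        return state;
    }
}
```

Feeding it "8", "x5", "+2", "=" always yields 8, 40, 42, 42 - the same outputs and the same final state every time.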
Replicated State Machines
State Machine Replication is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas.
If you have a deterministic state machine, you can replicate it for fault tolerance. If you put all your calculators in a box, somehow give them all the same inputs (A), and somehow merge the outputs (B), then you have a level of fault tolerance. Some of them can fail and the box will continue to operate without users of the box noticing. Users have the experience of interacting with a single, reliable state machine.
But now you've got the issue of A and B: coordinating client interactions. This is what the Raft Consensus Algorithm is all about, which is what Aeron Cluster provides an implementation of. The box is Aeron Cluster and the calculator is your service. If your service can be written as a deterministic state machine, Aeron Cluster can work with it to provide a high performance, fault-tolerant service.
Replicated Log (Raft & Aeron Cluster in a Nutshell)
Replicated State Machines are typically implemented using a Replicated Log, which is the case with Raft and Aeron Cluster.
Cluster Member
In Raft, each state machine is combined with a Consensus Module and a Log, creating a cluster member.
The Log contains all the inputs from different clients, which are put into a single sequence. This is the order that each cluster member will process the inputs in, one after the other, in single file.
The Consensus Module contains the Raft algorithm, which is in charge. It controls how the member manages its Log, when it feeds Log entries into its state machine, and how it interacts with the other members.
The State Machine is isolated from the outside world and only takes input from the Log. This is needed for determinism, as each state machine needs to process the same inputs.
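To make the determinism point concrete, here is a small sketch (reusing the illustrative Calculator from above, so again not an Aeron API) showing that replicas fed the same Log end up in the same state:

```java
import java.util.List;

// Sketch only: three replicas of the illustrative Calculator, each fed the same Log
// (the same inputs, in the same order), end up in exactly the same state.
public final class ReplicationDemo
{
    public static void main(final String[] args)
    {
        final List<String> log = List.of("8", "x5", "+2", "=");
        final Calculator[] replicas = { new Calculator(), new Calculator(), new Calculator() };

        long lastOutput = 0;
        for (final String entry : log)
        {
            for (final Calculator replica : replicas)
            {
                lastOutput = replica.apply(entry);     // every replica processes every entry, in order
            }
        }

        System.out.println("all replicas agree, state = " + lastOutput);   // prints 42
    }
}
```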
Cluster
Each cluster member runs on a different machine, and they work together to form a cluster (there is no separate process implementing A and B above).
One of the cluster members is elected leader and the others become followers. The leader is in charge. Its Consensus Module receives inputs from clients and sequences them into its Log. The leader replicates its Log to the followers, via their Consensus Modules. A follower can reject a Log entry in some circumstances, such as if it believes it was sent from an old leader that has since been replaced.
Once the leader knows a majority of the members have accepted a Log entry, it has achieved consensus that the entry has been accepted by the cluster as a whole. At that point, the entry is referred to as committed. Committed entries will never be lost, and are safe to process.
The safety properties of the consensus algorithm ensure that, logically at least, committed entries will never be lost or overwritten. They are safe, even if cluster members crash and restart, or leadership changes multiple times. Physical safety is provided by the Log being on disk on a majority of the members.
Once a new Log entry reaches committed status, that implicitly commits any earlier Log entries. The Consensus Modules feed committed entries to their state machines. Given that the state machines are deterministic, this means the state on each member should be the same (determinism is critical). Outputs from the leader's state machine are sent to the clients.
Aeron Cluster Service
The above section is the theory. Aeron Cluster is an implementation of that theory, which comes with its own implementation details. Naturally, it is built using Aeron Transport and Aeron Archive. They provide efficient alternatives to some of Raft's mechanics that don't affect how the overall algorithm works, so all the safety guarantees still hold.
Here's a quick look at how Aeron Cluster works. There are a lot of moving parts, but hopefully it makes sense after reading the Transport and Archive overviews, and the sections above.
With Aeron Cluster, messages from clients are referred to as Ingress messages. They arrive in the Cluster on an Ingress Channel on the leader, which puts them into an Aeron Transport log buffer per client. A log buffer is just a chunk of shared memory for holding messages.
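For a feel of the client side, here is a hedged sketch of sending an Ingress message with the Java AeronCluster client. The class name, channel strings, endpoints and payload are placeholders for illustration, and a real setup also needs a running Media Driver and the cluster members themselves:

```java
import io.aeron.cluster.client.AeronCluster;
import org.agrona.ExpandableArrayBuffer;

// Hedged sketch of a client sending an Ingress message. The channels, endpoints and
// payload are placeholders; a real deployment also needs a running Media Driver and
// the cluster members listed in ingressEndpoints.
public final class IngressClientSketch
{
    public static void main(final String[] args)
    {
        try (AeronCluster cluster = AeronCluster.connect(
            new AeronCluster.Context()
                .ingressChannel("aeron:udp")                                        // Ingress Channel to the leader
                .ingressEndpoints("0=localhost:9002,1=localhost:9102,2=localhost:9202")
                .egressChannel("aeron:udp?endpoint=localhost:0")                    // where Egress responses arrive
                .egressListener((sessionId, ts, msg, msgOffset, msgLength, header) ->
                    System.out.println("egress: " + msg.getStringAscii(msgOffset)))))
        {
            final ExpandableArrayBuffer buffer = new ExpandableArrayBuffer();
            final int length = buffer.putStringAscii(0, "x5");

            while (cluster.offer(buffer, 0, length) < 0)                            // retry on back pressure
            {
                cluster.pollEgress();                                               // keep the session serviced while waiting
            }

            cluster.pollEgress();                                                   // poll for the response on the Egress channel
        }
    }
}
```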
As they arrive, the Consensus Module copies Ingress messages into a single sequence - the Log, which is another Aeron Transport log buffer (the shared use of the term 'log' is a coincidence).
The followers have an Aeron Transport subscription to the leader's Log. Whenever the leader adds new messages, they are replicated to the followers.
Each member uses Aeron Archive to record the Log. Whenever new messages are added to the Log, they are appended to the end of a file on disk - the Log Recording.
The followers send an AppendPosition message back to the leader, telling it how much they have appended to disk. They don't need to be in sync, and will usually be at slightly different positions on a busy system.
Consensus: the leader calculates the position that a majority (more than half) of the members have on disk and sends this to the followers - this is the CommitPosition (the position of the latest committed entry, in Raft terms). If one of the members dies, this amount of data is guaranteed to be on at least one other member, from which it can be recovered.
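As a worked illustration (this is not Aeron's internal code), the CommitPosition is the highest Log position that at least a majority of members have safely on disk. Sorting the reported append positions and picking the value a majority have reached gives exactly that:

```java
import java.util.Arrays;

// Worked illustration only - this is not Aeron's internal code. The CommitPosition is
// the highest Log position that at least a majority of members have appended to disk.
public final class CommitPositionSketch
{
    static long commitPosition(final long[] appendPositions)
    {
        final long[] sorted = appendPositions.clone();
        Arrays.sort(sorted);                           // ascending order
        // With n members a majority is n/2 + 1, so the value at index (n - 1) / 2 is
        // guaranteed to have been reached by at least a majority of members.
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(final String[] args)
    {
        // the leader has appended to position 100, the followers to 96 and 92
        System.out.println(commitPosition(new long[] { 100, 96, 92 }));   // prints 96
    }
}
```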
The leader and followers are then safe to process up to the CommitPosition, applying the newly committed entries to their replicated state machines - your service.
The services respond to clients via Egress channels. Only the leader's responses get through - responses from the followers are silently dropped by Aeron Cluster. The followers essentially run in the background, ready for failover. If the leader dies, at least one of the followers is guaranteed to have the Log up to the CommitPosition on disk, and failover to a new leader is automatic.
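To ground the service side, here is a hedged sketch of the calculator written against the Java ClusteredService interface. The class name and wiring are illustrative, and snapshotting, timers and error handling are left empty; treat it as a sketch rather than a complete service:

```java
import io.aeron.ExclusivePublication;
import io.aeron.Image;
import io.aeron.cluster.codecs.CloseReason;
import io.aeron.cluster.service.ClientSession;
import io.aeron.cluster.service.Cluster;
import io.aeron.cluster.service.ClusteredService;
import io.aeron.logbuffer.Header;
import org.agrona.DirectBuffer;
import org.agrona.ExpandableArrayBuffer;

// Hedged sketch of a ClusteredService: the illustrative calculator from earlier, driven
// purely by committed Log entries. Snapshotting, timers and error handling are omitted.
public final class CalculatorClusteredService implements ClusteredService
{
    private final Calculator calculator = new Calculator();            // illustrative class from earlier
    private final ExpandableArrayBuffer egressBuffer = new ExpandableArrayBuffer();
    private Cluster cluster;

    public void onStart(final Cluster cluster, final Image snapshotImage)
    {
        this.cluster = cluster;                                         // snapshot restore omitted in this sketch
    }

    public void onSessionMessage(
        final ClientSession session,
        final long timestamp,
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final Header header)
    {
        final long result = calculator.apply(buffer.getStringAscii(offset));
        final int egressLength = egressBuffer.putStringAscii(0, Long.toString(result));

        // Followers also call offer, but only the leader's Egress reaches the client.
        while (session.offer(egressBuffer, 0, egressLength) < 0)
        {
            cluster.idleStrategy().idle();                              // back pressure: retry politely
        }
    }

    public void onSessionOpen(final ClientSession session, final long timestamp) {}
    public void onSessionClose(final ClientSession session, final long timestamp, final CloseReason closeReason) {}
    public void onTimerEvent(final long correlationId, final long timestamp) {}
    public void onTakeSnapshot(final ExclusivePublication snapshotPublication) {}
    public void onRoleChange(final Cluster.Role newRole) {}
    public void onTerminate(final Cluster cluster) {}
}
```

The key point is that all state changes are driven by onSessionMessage, which only ever sees committed Log entries, in the same order on every member.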
The above steps might seem like a lot of overhead, but they all have a lot of mechanical sympathy. It's mostly shovelling bytes from one place to the next, without interpreting them, and with no contention. Processors love streaming through sequences. When messages arrive at a high rate, they are processed in batches, not individually. Writing to the Log Recording is just streaming out to the end of a file, not jumping around to different positions on disk. All the service code runs on a single thread, so there is no contention or locking required.
If the processing is kept sensible and efficient, with very little garbage, it really is possible to process 100,000s of messages per second with 99.999th percentile latencies measured in microseconds.
Further details
This concludes the Aeron Cluster overview, but if you want to dig a bit deeper, still at a non-technical level, see the Raft and Raft Implementation pages.