Raft Consensus Algorithm
A description of Aeron Cluster really needs to start with an overview of the Raft Consensus Algorithm (see Diego Ongaro and John Ousterhout's paper, "In Search of an Understandable Consensus Algorithm"). This is the theory behind Aeron Cluster, but it isn't theory for the sake of it. Everything in this description of Raft applies to Aeron Cluster, so it's like learning Aeron Cluster by stealth. After that, we'll look at how Aeron Cluster has implemented Raft.
This page builds on terms introduced in the Overview.
Raft
Raft is a consensus algorithm for managing a replicated log. Consensus algorithms allow a group of machines to work together, providing a single interface to clients, where the group can tolerate failure of a minority of its members. Raft was designed to be easier to understand and implement than other consensus algorithms. It separates out the key elements of leader election, log replication and safety.
Safety here refers to the correctness of the algorithm, in that the logic will not lose data that has been accepted by a majority of the cluster members (committed data). Raft provides a formal proof for this. That is not to say physical disks won't crash, but that is addressed by committed data being replicated to a number of members.
Organisation
As per the Overview, a cluster has a number of members, usually 3 or 5. One is the leader and the others are followers. They each have a log, containing all the inputs from clients, which the leader replicates to the followers. Each member has a Consensus Module, which implements the consensus algorithm and coordinates everything. Each member has a copy of the service that processes the log, depicted here as a simple calculator.
Consensus
Raft requires consensus between members when voting and appending to the log. Consensus means getting agreement from a majority of the members, which means more than half (including the member requiring consensus). The remaining members are the minority.
Majority and minority are important terms. Majority refers to the minimum number of cluster members that need to agree in order for the cluster to function. The minority can also be keeping up and agreeing, but the cluster can function without them. The cluster's performance is determined by the majority and is not affected by the minority. The cluster can tolerate losing a minority of its members - they can be slow, or even offline, and the cluster will still operate at full speed.
- A 3 node cluster requires 2 nodes for consensus, so it can tolerate losing 1 node
- A 5 node cluster requires 3 nodes for consensus, so it can tolerate losing 2 nodes
- A 4 node cluster doesn't make sense - it still requires 3 nodes for consensus, so it only tolerates losing 1 node, and the additional cost of the 4th node doesn't buy you anything beyond a 3 node cluster (the quorum arithmetic is sketched below)
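As a quick illustration of the arithmetic (a sketch only, not Aeron Cluster's API), the quorum size for a cluster of n members is n/2 + 1, rounding the division down:

```java
// Illustrative only - shows the quorum arithmetic, not Aeron Cluster's API.
public final class Quorum
{
    // Minimum number of members (including the member requiring consensus) needed for agreement.
    static int quorumSize(final int memberCount)
    {
        return (memberCount / 2) + 1;
    }

    public static void main(final String[] args)
    {
        // 3 members -> quorum 2, tolerates 1 failure
        // 4 members -> quorum 3, still only tolerates 1 failure
        // 5 members -> quorum 3, tolerates 2 failures
        for (final int n : new int[] { 3, 4, 5 })
        {
            System.out.println(n + " members: quorum " + quorumSize(n) + ", tolerates " + (n - quorumSize(n)));
        }
    }
}
```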
Cluster Roles
The cluster members can have the roles of leader, follower or candidate. The leader is chosen in an election. The candidate role is only used during an election - the successful candidate will become leader and any other candidates will become followers. Followers are passive, only responding to inputs from the leader.
Clients know about all the members, but only talk to the leader. If a client contacts a follower, the follower redirects it to the leader. This is how clients discover the leader.
Leader
In Raft, the leader is a 'strong' leader, compared with some other consensus algorithms. Log information only flows from the leader to the followers.
The leader is solely responsible for updating the log. It replicates it to the followers and tells them which entry they can process up to. The followers treat their logs as read-only. The leader has authority and does not need to consult followers before making updates. If the leader fails, or the followers lose contact with it, a new one is elected.
Electing a leader requires consensus, which requires a majority of cluster members. This means it's impossible for the cluster to experience a split brain with 2 leaders.
Terms
In Raft, the period of time from one election to the next is referred to as a term. Whenever there is an election, a new term starts. If the election results in a leader, the leader leads for the entirety of that term. If the election results in a split vote, the term ends without a leader, which triggers another election, which starts another term.
Terms are numbered with consecutive integers. Each time members communicate with each other, they include the current term number in the message. This lets them easily identify when they have received a message from a member in a newer or older term.
Imagine if the leader of term 1 loses contact with the followers, but is still running. After a timeout, the followers would elect a new leader, which would start term 2. If the leader of term 1 then reconnected and sent a message containing term number 1, the others would ignore it as they are in term 2. Conversely, if the leader of term 1 received a message from the new leader, it would see term 2 is higher than its own term number, accept the new leader, and become a follower in term 2.
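The term check described above can be sketched as follows. This is illustrative only - the names are made up and it is not Aeron Cluster's implementation:

```java
// Illustrative sketch of the term check on an incoming message (names are made up).
final class TermCheck
{
    private long currentTerm = 2;     // e.g. this member is in term 2
    private boolean leader = false;   // whether this member currently believes it is the leader

    void onMessage(final long senderTerm)
    {
        if (senderTerm < currentTerm)
        {
            return; // stale message, e.g. from the disconnected leader of term 1 - ignore it
        }

        if (senderTerm > currentTerm)
        {
            currentTerm = senderTerm; // a newer term exists - adopt it and step down if leading
            leader = false;
        }

        // senderTerm == currentTerm: process the message as normal
    }
}
```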
Each log entry contains its index in the log, the term number in which it was added and the message to be processed, so a log spanning the two terms above might look like the sketch below.
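A minimal illustration of such a log, reusing the calculator messages from earlier; the actual values are made up:

```java
import java.util.List;

// Illustrative only: each log entry carries its index, the term it was added in, and the client message.
record LogEntry(long index, long term, String message) { }

final class ExampleLog
{
    // A log spanning two terms: the first two entries were appended by the leader of term 1,
    // the rest by the leader of term 2.
    static final List<LogEntry> LOG = List.of(
        new LogEntry(1, 1, "8"),
        new LogEntry(2, 1, "x5"),
        new LogEntry(3, 2, "+2"),
        new LogEntry(4, 2, "="));
}
```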
Remote Procedure Calls
Raft defines interactions between members as Remote Procedure Calls. The Aeron Cluster implementation is quite different, but at a high level, the same information is passed between members, so it's worth understanding.
Raft defines two RPCs for consensus and one related to snapshots.
AppendEntries - replicate log entries
This is issued by the leader to the followers to replicate a batch of new log entries to them. It can also be sent with no entries, which acts as a heartbeat to retain leadership. The followers reply, indicating whether they accept or reject the new entries.
Fields

- term - the leader's term number
- prevLogIndex - identifies the expected last log index on the follower, which these entries will follow
- prevLogTerm - the term of the entry at prevLogIndex
- entries[] - the entries to add
- leaderId - so followers can redirect clients
- leaderCommit - the commit index from the leader for previously replicated entries

Result

- term - the follower's term number (could be ahead if the sender is an old leader)
- success - true if the follower accepts the entries
Note
The RPC contains 3 distinct groups of fields, which was presumably done to keep Raft simple. In Aeron Cluster, these are sent in 3 separate messages.
The first 4 fields convey new log entries for replication. prevLogIndex and prevLogTerm act as a consistency check on the follower. The follower checks that it has an entry at prevLogIndex and that it was added during term prevLogTerm. If not, it returns success=false.

The leaderId field tags along for a free ride. In Aeron Cluster, this is sent to followers in a NewLeadershipTerm message at the start of a term, rather than being included in every message.

The leaderCommit field tells followers the current commit index. This is calculated from the result of a previous AppendEntries RPC, so it doesn't relate to the first 4 fields in the current RPC. In Aeron Cluster, this is sent in a separate CommitPosition message.
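As an illustration of the consistency check, here is a simplified follower-side sketch of handling AppendEntries, following the paper's description. The class and method names are made up for this example and this is not Aeron Cluster's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative follower-side handling of AppendEntries (not Aeron Cluster code).
final class FollowerLog
{
    record Entry(long term, String message) { }

    private long currentTerm;
    private final List<Entry> log = new ArrayList<>(); // log index 1 is held at list position 0

    boolean onAppendEntries(
        final long leaderTerm, final long prevLogIndex, final long prevLogTerm, final List<Entry> entries)
    {
        if (leaderTerm < currentTerm)
        {
            return false; // reject: the sender is a leader from an old term
        }
        currentTerm = leaderTerm;

        // Consistency check: the follower must already hold an entry at prevLogIndex from prevLogTerm.
        if (prevLogIndex > 0 &&
            (log.size() < prevLogIndex || log.get((int)prevLogIndex - 1).term() != prevLogTerm))
        {
            return false; // the leader will retry with earlier entries until the logs line up
        }

        // Simplification: truncate anything after prevLogIndex, then append. A real implementation
        // only removes existing entries when they conflict with the new ones.
        while (log.size() > prevLogIndex)
        {
            log.remove(log.size() - 1);
        }
        log.addAll(entries);

        return true;
    }
}
```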
RequestVote - request vote from other members
During an election, candidates issue RequestVote RPCs to the other members. The RPC contains the index and term number of the candidate's last log entry. The other members can use this to decide whether the candidate has enough of the log to become leader. The other members reply, voting for or against the candidate.
Fields

- term - the candidate's term number (the proposed new term number)
- candidateId - the sender's id
- lastLogIndex - the index of the candidate's last log entry
- lastLogTerm - the term of the candidate's last log entry

Result

- term - the follower's term number (could be ahead of the candidate's)
- voteGranted - true if the follower is voting for the candidate
InstallSnapshot - replicate a chunk of a snapshot to a follower
The leader issues InstallSnapshot to replicate a chunk of a snapshot to a follower. Aeron Cluster supports snapshotting, but it is not implemented in this way, so this isn't covered here. See page 12 of the Raft paper if you want more details.
Leader Election
Members start in the follower role. A member remains a follower for as long as it receives RPCs from the leader or a candidate. If a follower hasn't received an RPC within the election timeout, it starts a new election.
A member starts an election by incrementing its current term number to create a 'candidate' term number, moving to the candidate role, voting for itself, then issuing RequestVote RPCs to the other members. It then waits until either it wins the election, another member wins the election, or the election times out.
Note
When a member 'starts' an election, it doesn't necessarily mean the other members will stop what they're doing and join it. It's just asking the other members to vote for it, to become leader for the candidate term number. If the majority of the members have already had an election for that, or a later term number, the leader would tell the candidate the current term number in an AppendEntries RPC. The candidate would then realise it's out of date and would become a follower in that term. The majority of the cluster, processing client requests, would continue uninterrupted.
A candidate wins an election if it receives votes from a majority of the members (including itself). Each member votes only once in a given term. Voting is on a first-come-first-served basis, but there is an election restriction that contributes to the algorithm's safety, where voters check how up to date the candidate is (described below in Safety). Majority voting means an election can never result in more than one leader.
A candidate finds out it has lost an election by receiving an AppendEntries RPC from another member, containing a term number matching its candidate term number, or higher. It would recognise the sender as the new leader and would become a follower.
If the vote is split between members, the election times out and the members start a new election, in the next term. In order to avoid repeated split votes, each member has a slightly different randomised election timeout, so one member usually times out before the others and requests votes first. This solution adds a small delay to elections, but is simpler and easier to understand than the alternatives.
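A minimal sketch of a randomised election timeout, assuming a timeout range similar to the one suggested in the Raft paper; the names and values are illustrative, not Aeron Cluster's configuration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative randomised election timeout (values and names are made up for this sketch).
final class ElectionTimer
{
    private static final long MIN_TIMEOUT_MS = 150;
    private static final long MAX_TIMEOUT_MS = 300;

    private long deadlineMs;

    // Called on start-up and whenever a message is received from the leader or a candidate.
    void reset(final long nowMs)
    {
        deadlineMs = nowMs + ThreadLocalRandom.current().nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS + 1);
    }

    // If the deadline passes without hearing from a leader or candidate, start an election.
    boolean isExpired(final long nowMs)
    {
        return nowMs >= deadlineMs;
    }
}
```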
Log Replication
Once elected, the leader listens for client requests and adds them to its log.
It replicates new entries by making an AppendEntries RPC to each follower. The RPC is repeated if a follower is stopped or the RPC wasn't delivered by the network, so each follower should eventually receive it. Having the index in each entry means receiving it more than once is idempotent.
The followers reply to the RPC, saying whether they accept or reject the new entries. They would normally accept them, but could reject them if the term is older than the follower's current term (sent by an old leader).
As the responses come in, the leader assesses whether it has consensus that the new entries have been accepted by a majority of the members. Once it has consensus, the entries are referred to as being committed. The leader keeps track of the last committed entry. All entries prior to the last committed entry are implicitly committed. Waiting for only a majority means it runs as fast as the majority - a minority of slow, or stopped followers does not impact performance. Whatever happens in the future, the algorithm will not overwrite a committed entry. This is one of the Safety guarantees from Raft.
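One way to picture the consensus check: the leader tracks the highest log index each member has acknowledged and commits up to the highest index that a majority has stored. A minimal sketch, with made-up names (not Aeron Cluster's implementation):

```java
import java.util.Arrays;

// Illustrative only: the commit index is the highest index replicated on a majority of members.
final class CommitIndexCalculator
{
    // matchIndexes holds the highest log index known to be stored on each member, including the leader.
    static long commitIndex(final long[] matchIndexes)
    {
        final long[] sorted = matchIndexes.clone();
        Arrays.sort(sorted);

        // With the array sorted ascending, the value at position (n - quorum) is stored on at least
        // a quorum of members, e.g. [4, 1, 4] -> sorted [1, 4, 4] -> commit index 4.
        final int quorum = (sorted.length / 2) + 1;
        return sorted[sorted.length - quorum];
    }
}
```

(Raft adds one more condition before advancing the commit index this way: the entry at that index must be from the leader's current term.)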
Whenever the commit index advances, the leader applies the newly committed entries to its state machine and returns the result to the client. The leader includes the current commit index in the next AppendEntries RPC, which notifies the followers that they can apply the new entries to their state machines too. It doesn't matter that the members don't update their state machines at the same time.
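And a sketch of applying newly committed entries in order, once the commit index advances (again, illustrative names only):

```java
import java.util.List;

// Illustrative sketch: apply newly committed entries to the state machine, in log order.
final class ApplyLoop
{
    interface StateMachine
    {
        void apply(String message);
    }

    private long lastApplied; // index of the last entry applied to the state machine

    void onCommitIndexAdvanced(final long commitIndex, final List<String> log, final StateMachine stateMachine)
    {
        while (lastApplied < commitIndex)
        {
            lastApplied++;
            stateMachine.apply(log.get((int)lastApplied - 1)); // log index 1 is at list position 0
        }
    }
}
```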
The following steps walk through an example of log replication, using the calculator service.

1. The user sends '8' to the leader. It appends it to its log and sends it to the followers in AppendEntries RPCs. The RPCs also contain the commit index, which is currently 0.
2. Follower 1 adds it to its log and returns success. Follower 2 is being slow.
3. The leader has consensus from a majority of the cluster that the commit index can be increased to 1. This lets it apply '8' to its state machine.
4. The user sends 'x5', '+2' and '=' in rapid succession. The leader appends them to its log and sends them to follower 1 in an AppendEntries RPC. The RPC contains the new commit index of 1.
5. The leader also sends an AppendEntries to follower 2, but it includes all the log entries that follower 2 hasn't replied to yet, in case the original message is lost. The RPC also contains the commit index of 1.
6. Follower 2 processes the first RPC, adding '8' to its log and replying success. The leader updates its record of where follower 2 is up to, but it doesn't affect the commit index.
7. Follower 2 processes the second RPC. The RPC tells it where in the log the entries should be appended, so it knows the '8' is the same as its existing '8'. It adds 'x5', '+2' and '=' after the '8'. It updates its commit index to 1 and applies '8' to its state machine. It replies success to the leader.
8. Follower 1 processes the second RPC and returns success to the leader. All members are in sync at this point.
9. The leader receives the replies and updates the commit index to 4. It applies up to index 4 to its state machine and the state machine outputs an answer. The leader returns this to the user.
10. There are no more inputs, so after a timeout, the leader sends a heartbeat AppendEntries RPC with no entries. This contains the commit index of 4.
11. Both followers apply up to index 4 to their state machines.
Safety
The leader of a cluster will always replicate its log to the followers, possibly overwriting entries they already have in their log. Take the following scenario with 5 cluster members. The boxes are log entries and the figure inside is the term number.
On the left, M1 is leader of term 1. It has appended 5 entries to its log, and has replicated up to a different position on each follower. If M1 dies, an election will be held.
On the right, M3 became the leader of term 2. Its log is now the source of truth, which will be replicated to the followers. If it receives a new client message, it would add it to the end of its log at index 4. When M3 replicates its log to M2, M2 will overwrite the entry it had in index 4 from term 1.
Election Restriction
Raft guarantees that committed entries will never be lost by adding an election restriction:
Raft's Election Restriction is that a candidate cannot become leader unless its log contains all the committed entries. This is required to guarantee that committed entries will never be lost.
It is ok to lose uncommitted entries, because they won't have been processed - it's no different from the leader dying just before the messages were received. However, committed entries may have been processed, so it is not ok to lose them. Imagine an entry that causes a $10M trade to happen between two customers, sending confirmation messages to them and trade information to other systems. Once the message has been processed, you can't lose it as though it never happened.
Up-to-date Check
The election restriction is implemented by each voting member checking whether the candidate is 'up to date', when compared with its own log. A member will only vote for a candidate if it is up to date. A candidate requires a majority of members to vote for it, so at least one of those members will know about all the committed entries.
To start an election, a candidate sends a RequestVote RPC. This contains the index and term number of the candidate's last log entry. Voting members compare this with their own last log entry to see whether the candidate is up-to-date.
The term number is compared first. The candidate's last log entry needs to be in the same or a higher term than the voter's. If it is in the same term, then the candidate's log needs to be at least as long as the voter's.
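A minimal sketch of that comparison, with made-up names (not Aeron Cluster's implementation):

```java
// Illustrative 'up to date' check a voter applies to a candidate before granting its vote.
final class UpToDateCheck
{
    static boolean isCandidateUpToDate(
        final long candidateLastLogTerm,
        final long candidateLastLogIndex,
        final long voterLastLogTerm,
        final long voterLastLogIndex)
    {
        if (candidateLastLogTerm != voterLastLogTerm)
        {
            return candidateLastLogTerm > voterLastLogTerm; // the later term wins
        }

        return candidateLastLogIndex >= voterLastLogIndex; // same term: the longer (or equal) log wins
    }
}
```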
Election Restriction Example
On the left, when M1 was leader, it would have set the commit index. It can only advance the commit index to the position that is on disk on a majority of the members, so the highest it could be is index 3.
When M1 dies, the election restriction would prevent M4 or M5 from winning the election, because they don't have all the committed entries. The following election outcomes are possible:
- M2 could become leader, with votes from any of M3, M4 and M5. If a new client message is received in term 2, it would be added at index 5. The uncommitted entry in index 5 on M1 would be lost. The entry from term 1 in index 4 would be preserved. M2 would replicate its log to M3, M4 and M5, so their logs would eventually become identical to M2's. If M1 came back online, it would become a follower in term 2 and its log would also become identical to M2's.
- M3 could become leader with votes from M4 and M5. M2 would not vote for it, as M2 contains more entries, but M2 could be offline. If M3 became leader, it would protect the committed entries, but it could accept a new entry in index 4. If M1 or M2 came back online, M3 would replicate its new entry in index 4 onto M1 or M2, overwriting their uncommitted entry (M1 would also remove its entry in index 5).
- M4 could not become leader, because it can only secure a vote from M5. M2 and M3 would not vote for it because of the election restriction.
- M5 could not become leader because none of the other members would vote for it because of the election restriction.
Other Log Examples
Figure 7 on page 7 of the Raft paper contains several scenarios that are worth reproducing here (credit to the Raft authors). These all show what might be in a follower's log when a leader wins an election. Let's say the leader is leading term 8.
When the leader comes to power, it's possible for a follower to be missing entries.
A follower could be missing a lot of entries if it's been down for several terms.
A follower could have extra uncommitted entries. This could happen if the follower was leader of term 6 and crashed before committing index 11. Term 7 could have been a split vote. The follower could have restarted, then the leader could have won the election for term 8 with votes from other members that didn't receive the uncommitted entry.
Another example of a follower with extra uncommitted entries. The follower could have been leader for term 7, adding uncommitted entries, before crashing. Again, the new leader for term 8 could have won with votes from other members.
When the leader comes to power, a follower can have both missing entries and extra uncommitted entries. The follower could have been leader for term 4, crashing with 2 uncommitted entries. It could have remained down for several terms, missing out on terms 5 and 6.
Another example of a follower with both missing entries and extra uncommitted entries. The follower could have been leader for term 2, added entries and crashed before committing them. It could have become leader for term 3, added more entries and crashed again before committing any of them. It could have remained down for several terms, allowing another leader to populate index 4 with an entry from term 4, etc. The follower would lose all these uncommitted entries, when it replicates the log from the new leader.
Snapshots
When a cluster member restarts, it rebuilds its state by applying the entire log to its state machine. Obviously, the longer the log, the longer it will take to replay and the more storage it will require. For some systems that don't need to carry state from one business day to the next, it might be acceptable to clear the log at the start of each day. In those systems, the log is bounded by one day's worth of business, so it might be acceptable to replay all of it, which keeps the system simple.
For other systems, state needs to be held forever. For those, the log would continue to grow in an unbounded fashion, which is impractical. There are various approaches to compacting the log, but Raft employs a simple snapshotting mechanism. The entire state of the state machine is written out to disk and the log up to that point is discarded. The snapshot contains the index and term number of the last log entry that was applied to the state machine.
When a cluster node starts, it would read the latest snapshot into its state machine, then apply log entries starting from the entry after the last entry in the snapshot. In theory, there should be no difference between a cluster member that has restarted from a snapshot and one that didn't restart (this should also be the case in practice, but the implementation of snapshotting can be tricky and is often the source of bugs!).
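A minimal sketch of that restart sequence, assuming the snapshot records the index and term of the last entry it covers; the types and names are made up, not Aeron Cluster's API:

```java
import java.util.List;

// Illustrative restart sequence: load the latest snapshot, then replay only the log entries after it.
final class Restart
{
    record Snapshot(long lastIncludedIndex, long lastIncludedTerm, String state) { }

    record LogEntry(long index, long term, String message) { }

    interface StateMachine
    {
        void restore(String state);

        void apply(String message);
    }

    static void recover(final Snapshot snapshot, final List<LogEntry> log, final StateMachine stateMachine)
    {
        stateMachine.restore(snapshot.state()); // rebuild state as it was when the snapshot was taken

        for (final LogEntry entry : log)
        {
            if (entry.index() > snapshot.lastIncludedIndex())
            {
                stateMachine.apply(entry.message()); // only entries after the snapshot are replayed
            }
        }
    }
}
```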
In Raft, each cluster member takes its own snapshot. Aeron Cluster is the same, but taking a snapshot requires the cluster member to stop processing the log while the snapshot is taken. This introduces a latency spike, so snapshots are often taken at the end of the business day, when the system is less busy. If this is not possible, Aeron Premium provides a feature where snapshots can be taken by a dedicated cluster member that does not participate in consensus.
Aeron Cluster implements snapshotting, but with some implementation differences, so it's not worth going into more details about Raft snapshotting.
Comparison with other consensus algorithms
In Raft, the leader of the group plays a stronger role than in other consensus algorithms. Only the leader appends to the log, and followers replicate the leader's log. If a follower starts up and has a log that contains entries not on the leader (an edge case), the follower discards them and makes its log match the leader's.
During leader elections, Raft uses randomised timers to reduce the chance of split votes. This adds a small delay to elections, but it's effective and easier to understand than other alternatives.
Raft introduces a new approach for membership changes, where servers can be added or removed from the cluster configuration. However, this isn't supported by Aeron Cluster, so won't be covered here. Aeron Cluster used to support members dynamically joining using its own mechanism, but that was removed because it had a number of edge cases. An alternative may return as an Aeron Premium feature.