State Machine

The Election object is implemented as a state machine, shown below. If you're like me, you'll want a printable copy.

As a rough outline, you can think of the Canvass column as the happy path for a leader, the FollowerBallot column as the happy path for a follower, and the FollowerLogReplication column as the states for Archive Replication and Catchup, which back-fill missing Log entries.

[State machine diagram. Cluster Role key: Follower, Leader, Candidate. Reading it by column:

Canvass column (leader happy path): Init (clear the decks) -> Canvass (exchange AppendPositions, decide if candidate) -> Nominate (wait for random nomination deadline) -> CandidateBallot (request votes, await votes, check if won) -> LeaderLogReplication (await AppendPosition quorum) -> LeaderReplay (replay and process Log to AppendPosition) -> LeaderInit (start recording live Log and subscribe to it) -> LeaderReady (wait for followers to catch up) -> Closed.

FollowerBallot column (follower happy path): FollowerBallot (follower awaits election result) -> FollowerReplay (replay and process Log to AppendPosition) -> FollowerLogInit (ensure Log Subscription exists with live endpoint) -> FollowerLogAwait (wait for leader to connect, then record and join Log) -> FollowerReady (election done, connect ingress) -> Closed.

FollowerLogReplication column (Replication and Catchup): FollowerLogReplication (replicate missing entries from old leadership term), FollowerCatchupInit (request catchup from start of leadership term), FollowerCatchupAwait (wait for leader to connect, then record and join Log), FollowerCatchup (process catchup messages, merge to live Log when near).

Transition labels include: multi node / single node (out of Init), election timeout, possible candidate, become candidate, become leader, need to catch up in current leadership term, current leadership term not started yet, already exists / created Log Subscription, timeout, done, [on RequestVote and voted for them], [on NewLeadershipTerm, needs to replicate an old leadership term], [on NewLeadershipTerm for a different logLeadershipTermId], [on NewLeadershipTerm, no Log to replicate].]

Example Election

The Election state machine can be quite daunting when you haven't seen it before, so this section walks through the simplest example of a successful Election. It focuses on 2 members out of a 3 node cluster, just to make the description shorter. A third member would just be another follower, doing what the first follower does (in this example).

The description below is in the form of a dialogue between the two members, with a description of each state as they pass through it. This simple example won't use any of the states in the right-hand column of the diagram above - no Replication or Catchup. Replication and Catchup are described in the next section.

Member 1: INIT

The Election starts in the INIT state. In this state, the member clears any previous Election state, so it's ready to enter a new one. It closes the Ingress and any Egress connections. It stops any Catchup or Replication activities, if it was part-way through an Election before starting this one. It moves to CANVASS if it's a multi node cluster, or LEADER_LOG_REPLICATION if it's a single node cluster. In this case, it moves to CANVASS.
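To make that branch concrete, here's a minimal sketch of the decision (the enum and method names are illustrative, not Aeron's actual code):

// Illustrative sketch only; Aeron's real Election class is more involved.
enum ElectionState { INIT, CANVASS, LEADER_LOG_REPLICATION }

final class InitTransition
{
    // A single-node cluster has nobody to canvass, so it can establish
    // itself as leader immediately.
    static ElectionState afterInit(final int clusterSize)
    {
        return clusterSize > 1 ? ElectionState.CANVASS : ElectionState.LEADER_LOG_REPLICATION;
    }
}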

Member 2: INIT

All members start off in the same way, so this member also starts in INIT, clears any old state, then moves to CANVASS.

Member 1: CANVASS

In CANVASS, the member periodically sends a CanvassPosition message to all the other members, publishing the leadership term and position of the last entry in its Log Recording. It then waits for the other members to send theirs, or, if there's already a leader, for it to send a NewLeadershipTerm (there could already be a leader if Member 1 started up late and was joining a cluster that had already held an Election).

Member 2: CANVASS

In this case, there isn't already a leader. Member 2 also sends a CanvassPosition message periodically.

When Member 2 receives the CanvassPosition from Member 1, it evaluates whether its own Log is in the highest leadership term, with the highest log position, across all the members. The member waits for a CanvassPosition from each of the other cluster members, or for a timeout (10 seconds), then evaluates again to see whether enough members have responded for consensus (a member may not be running - which is effectively the case here, since this example ignores the third member).

Let's say both members have the same entries in their Log, so they are equally valid candidates. This member evaluates that it is a candidate to become leader, so it sets a random nomination deadline and moves to NOMINATE.
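As a hedged sketch (the field names are invented for illustration), the candidacy check boils down to comparing (leadership term, log position) pairs:

// Illustrative only: a member is a possible candidate if no other member
// canvassed a higher (logLeadershipTermId, logPosition) pair than its own.
final class CanvassEvaluation
{
    static boolean isPossibleCandidate(
        final long myTermId, final long myPosition,
        final long[] otherTermIds, final long[] otherPositions)
    {
        for (int i = 0; i < otherTermIds.length; i++)
        {
            if (otherTermIds[i] > myTermId ||
                (otherTermIds[i] == myTermId && otherPositions[i] > myPosition))
            {
                return false; // someone else has a more complete Log
            }
        }
        return true; // our Log is at least as complete as everyone else's
    }
}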

Member 2: NOMINATE

It waits in NOMINATE until the random nomination deadline expires, or it receives a NewLeadershipTerm from another member.

While in NOMINATE, it continues to publish a CanvassPosition periodically to the other members, in case others are still starting up.

Member 1: CANVASS

When Member 1 receives the CanvassPosition from Member 2, it also evaluates that it is a candidate, so it also sets a random nomination deadline and moves to NOMINATE.

Member 1: NOMINATE

Let's say Member 1's nomination deadline expires first. It clears any old voting records it holds about votes from other members, then marks a vote for itself. In a 3 node cluster, it only needs one more vote to win. It moves to CANDIDATE_BALLOT.
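A sketch of the two calculations in play here; the half-timeout-plus-jitter formula is an assumption for illustration, not Aeron's exact deadline arithmetic:

import java.util.concurrent.ThreadLocalRandom;

// Illustrative nomination timing and vote accounting.
final class Nomination
{
    // Randomising the deadline spreads candidates out in time, making a
    // split vote unlikely. (Assumed formula: half the election timeout
    // plus random jitter.)
    static long nominationDeadlineNs(final long nowNs, final long electionTimeoutNs)
    {
        return nowNs + (electionTimeoutNs / 2)
            + ThreadLocalRandom.current().nextLong(electionTimeoutNs);
    }

    // With the self-vote already marked, a candidate in a cluster of
    // clusterSize members needs this many more FOR votes to reach quorum.
    static int extraVotesNeeded(final int clusterSize)
    {
        return (clusterSize / 2 + 1) - 1; // e.g. 1 more in a 3 node cluster
    }
}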

Member 1: CANDIDATE_BALLOT

In this state, it sends a RequestVote to the other members, asking them to vote for it, then waits for them to reply with a Vote (which can be FOR or AGAINST).

Follower: NOMINATE

Member 2 receives the RequestVote before its nomination deadline expires. It sends a Vote to Member 1, voting FOR it becoming leader. Member 2 switches to the follower role, so we'll now refer to it as the Follower. It moves to FOLLOWER_BALLOT.

Follower: FOLLOWER_BALLOT

The Follower waits for the outcome of the Election (in a real example, there would be another member voting). If the Election ends up being a split vote, they would all start again in CANVASS, starting another leadership term.

Member 1: CANDIDATE_BALLOT

When Member 1 receives the Vote message, it records it in its voting records, then evaluates whether it has won the Election. In this case, it was a landslide! It moves to LEADER_LOG_REPLICATION, and we'll start referring to it as the Leader.

If the election had timed out, it would have moved back to CANVASS.
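The win condition itself is just majority arithmetic; a sketch:

// Illustrative: the candidate wins once FOR votes (its own included)
// reach a majority of the cluster.
final class BallotCount
{
    static boolean hasWon(final int votesFor, final int clusterSize)
    {
        return votesFor >= clusterSize / 2 + 1;
    }
}

// hasWon(2, 3) == true: the self-vote plus the Follower's vote is a majority.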

Leader: LEADER_LOG_REPLICATION

It sends a NewLeadershipTerm to the Follower, proclaiming its win.

It starts publishing a CommitPosition to all the Followers periodically, telling them the consensus position. It keeps doing this for the entirety of the leadership term.

It creates its Log Publication, which will eventually be used to publish new Log messages to the Followers, and be recorded into its Log Recording.

In this state, the Leader waits for a quorum of members to reach the same position in their Log Recording as itself. The Followers sent their Log position in the Vote message. If they were behind, the Leader would wait for them to use Replication to catch up and send an AppendPosition containing a Log position matching its own.

In this example, the Followers already have the same entries in their Log Recording as the Leader, so the Leader moves to LEADER_REPLAY.
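A sketch of the quorum test described above (names invented for illustration):

// Illustrative: counts the leader plus any followers whose AppendPosition
// has reached the end of the leader's Log Recording.
final class AppendPositionQuorum
{
    static boolean reached(
        final long leaderAppendPosition, final long[] followerPositions, final int clusterSize)
    {
        int members = 1; // the leader's own recording is already there
        for (final long position : followerPositions)
        {
            if (position >= leaderAppendPosition)
            {
                members++;
            }
        }
        return members >= clusterSize / 2 + 1;
    }
}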

Leader: LEADER_REPLAY

In this state, the Leader gets its Archive to replay its Log Recording, so that its Consensus Module and Clustered Service can process it (the Consensus Module only processes the log to keep track of client sessions). The Log is replayed either from the snapshot position, if it loaded a snapshot before starting the Election, or from the start.
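A sketch of the replay bounds, assuming a NULL_POSITION sentinel like Aeron's for "no snapshot loaded":

// Illustrative replay bounds for LEADER_REPLAY (and FOLLOWER_REPLAY).
final class ReplayBounds
{
    static final long NULL_POSITION = -1;

    // Start from the loaded snapshot if there is one, else from position 0.
    static long startPosition(final long snapshotPosition)
    {
        return NULL_POSITION == snapshotPosition ? 0 : snapshotPosition;
    }

    // Replay runs to the end of the Log Recording (the append position).
    static long replayLength(final long startPosition, final long appendPosition)
    {
        return appendPosition - startPosition;
    }
}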

The Leader periodically sends a NewLeadershipTerm to the followers, in case they are slow to start. It also continues to send a CommitPosition whenever the position changes, or periodically otherwise.

When the Leader has processed the Log to the end of the Log Recording, it moves to LEADER_INIT.

Follower: FOLLOWER_BALLOT

The Follower receives the NewLeadershipTerm, telling it that Member 1 won the Election. It looks at the NewLeadershipTerm message to see whether it needs to Replicate some missing Log entries from the Leader. In this example, it has the same Log Recording contents as the Leader, so it doesn't, and moves to FOLLOWER_REPLAY.

Follower: FOLLOWER_REPLAY

This state is the follower equivalent of LEADER_REPLAY: the Follower replays its Log Recording back into itself and processes it up to the end of the Recording. When it finishes, it moves to FOLLOWER_LOG_INIT.

Leader: LEADER_INIT

In this state, it asks its Archive to start recording its Log Publication to its Log Recording. It also asks its Clustered Service to start listening to the Log Publication.

It then moves to LEADER_READY.

Follower: FOLLOWER_LOG_INIT

If it hasn't already created a Log Subscription (it hasn't in this example), it creates one, then moves to FOLLOWER_LOG_AWAIT.

If the Follower had performed Catchup, it would already have a Log Subscription, in which case, it would have moved to FOLLOWER_READY.

Follower: FOLLOWER_LOG_AWAIT

It asks its Archive to start recording its Log Subscription to its Log Recording. It also asks its Clustered Service to start listening to the Log Subscription.

Then it moves to FOLLOWER_READY.

Leader: LEADER_READY

In this state, the leader waits for the followers to reach the position at the end of the Leader's Log Recording. The Leader reset its record of their positions in LEADER_REPLAY, so it needs to receive at least one AppendPosition from each of them.

It continues to publish CommitPosition and NewLeadershipTerm messages to the followers periodically.

Follower: FOLLOWER_READY

It publishes its AppendPosition, telling the Leader that it is at the position at the end of its Log Recording (matching the Leader's). The Election object tells the Consensus Module that the election is complete, and it connects to the Ingress.

The Election moves to CLOSED and the Consensus Module discards it.

Leader: LEADER_READY

Now that the Leader has received the AppendPosition message, it knows the Follower is at the correct position. The Election object tells the Consensus Module that the election is complete, and it connects to the Ingress.

The Election moves to CLOSED and the Consensus Module discards it.

Replication and Catchup Details

Now that we've introduced the state machine, let's revisit the Replication and Catchup scenario from the last page, looking in more detail. This provides a real example of how some of the Consensus messages are used. Here's the same diagram again:

[Timeline diagram: Term 0 (leader: member 1, "M1 wins VOTE Term 0") ends at position B; Term 1 (leader: member 2, "M2 wins VOTE Term 1") runs from B to C; Term 2 (leader: member 1, "M1 wins VOTE Term 2", current) runs from C to D. Member 0, whose Log ends at position A in Term 0, back-fills A to B and B to C with two rounds of Replication, then joins the live Log via Catchup from C.]

When member 0 starts, it knows nothing about the other members and doesn't know it is behind, so it enters an election and reaches the CANVASS state. It sends out a CanvassPosition message, saying "I'm member 0, I'm in Term 0, and my Log Recording ends in Term 0 at position A".

1st CanvassPosition

Example message that matches the scenario:

CanvassPosition {
    logLeadershipTermId=0   // my log ends in term 0
    logPosition=1312        // my log ends at position A
    leadershipTermId=0      // I am in term 0
    followerMemberId=0      // I am member 0
    protocolVersion=65536
}

The leader is in a higher term than member 0, so the CanvassPosition message does not trigger an election on it. Instead, the leader responds with a NewLeadershipTerm message, which says "your log is on Term 0, but I'm on Term 2 at position D; your next Term is 1, it starts at position B and is currently at position C (its end position)".

1st NewLeadershipTerm
NewLeadershipTerm {
    logLeadershipTermId=0           // echoed from CanvassPosition
    nextLeadershipTermId=1          // your next term is Term 1
    nextTermBaseLogPosition=2848    // it starts at position B
    nextLogPosition=4064            // it ends at position C
    --- above here is in response to the CanvassPosition, below is about the current leadership term
    leadershipTermId=2              // leader is on term 2
    termBaseLogPosition=4064        // it starts at position C
    logPosition=6496                // it is currently at position D
    leaderRecordingId=0             // if you want to replicate my Log, it's this recordingId
    timestamp=1737306989652
    leaderMemberId=1                // I'm member 1
    logSessionId=1115055142
    appVersion=1
    isStartup=FALSE
}

Now member 0 can see that it is behind. It already knows that it is at position A. It has been told that the next leadership term runs from B to C, and the current leadership term is from C to D (if there were more leadership terms in between, it would learn about them as it progressed through multiple rounds of Replication).

Member 0 can see that it needs to back-fill from A to B next, so it moves to FOLLOWER_LOG_REPLICATION. This is where it uses Archive Replication to replicate the Log from A to B, from the leader's Archive, straight into its own Archive. It only replicates one leadership term at a time, so this only back-fills the rest of Term 0. It sends a ReplicateRequest2 to its own Archive, which sends a ReplayRequest to the leader's Archive.
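To tie the numbers together, here's a sketch of how the replication window falls out of the NewLeadershipTerm above (positions hard-coded from this example):

// Illustrative: one round of Replication back-fills at most one
// leadership term, stopping at the next term's base position.
final class ReplicationWindow
{
    public static void main(final String[] args)
    {
        final long localRecordingEnd = 1312;       // A: end of member 0's Recording
        final long nextTermBaseLogPosition = 2848; // B: from NewLeadershipTerm

        final long stopPosition = nextTermBaseLogPosition;    // stop at B
        final long length = stopPosition - localRecordingEnd; // 1536 bytes

        // matches stopPosition in ReplicateRequest2 and length in ReplayRequest below
        System.out.println("stopPosition=" + stopPosition + " length=" + length);
    }
}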

1st ReplicateRequest2 and ReplayRequest
ReplicateRequest2 {             // member 0 Consensus Module to member 0 Archive
    controlSessionId=537541647
    correlationId=69
    srcRecordingId=0            // leaderRecordingId from NewLeadershipTerm
    dstRecordingId=0            // local Log Recording on member 0
    stopPosition=2848           // stop at B (the Archive will start at A - the end of the local Recording)
    channelTagId=-1
    subscriptionTagId=-1
    srcControlStreamId=10
    fileIoMaxLength=-1
    replicationSessionId=68
    srcControlChannel='aeron:udp?endpoint=localhost:9101'
    liveDestination=''
    replicationChannel='aeron:udp?endpoint=localhost:0'
    encodedCredentials=0 bytes of raw data
    srcResponseChannel=''
}

ReplayRequest {                 // member 0 Archive to leader's Archive
    controlSessionId=190436538
    correlationId=77
    recordingId=0
    position=1312               // start position A
    length=1536                 // length from A to B
    replayStreamId=100
    fileIoMaxLength=-1
    replayToken=-1
    replayChannel='aeron:udp?session-id=68|endpoint=localhost:49946'
}

When the Archive Replication is complete, member 0 returns to the CANVASS state and repeats the process. It sends out another CanvassPosition message, now saying "I'm in Term 2 (because I will join the new leader), and my Log Recording ends in Term 0 at position B".

2nd CanvassPosition
CanvassPosition {
    logLeadershipTermId=0   // my log still ends in term 0
    logPosition=2848        // my log ends at position B
    leadershipTermId=2      // I am in term 2, even though my log is behind
    followerMemberId=0      // I am member 0
    protocolVersion=65536
}

The leader responds with another NewLeadershipTerm message, which is almost identical to the last one (only the current log position and timestamp have moved on), because member 0's log still hasn't started term 1. It says the same: "your log is on Term 0, but I'm on Term 2 at position D; your next Term is 1, it starts at position B and is currently at position C (its end position)".

2nd NewLeadershipTerm
NewLeadershipTerm {
    logLeadershipTermId=0           // echoed from CanvassPosition
    nextLeadershipTermId=1          // your next term is Term 1
    nextTermBaseLogPosition=2848    // it starts at position B
    nextLogPosition=4064            // it ends at position C
    leadershipTermId=2              // leader is on term 2
    termBaseLogPosition=4064        // it starts at position C
    logPosition=6592                // it is currently at position D (which has increased a little)
    leaderRecordingId=0             // if you want to replicate my Log, it's this recordingId
    timestamp=1737306989744
    leaderMemberId=1
    logSessionId=1115055142
    appVersion=1
    isStartup=FALSE
}

Member 0 can see that its term 0 is complete, so it moves to FOLLOWER_LOG_REPLICATION again, this time to replicate term 1, from B to C.

2nd ReplicateRequest2 and ReplayRequest
ReplicateRequest2 {             // member 0 Consensus Module to member 0 Archive
    controlSessionId=537541647
    correlationId=85
    srcRecordingId=0            // leaderRecordingId from NewLeadershipTerm
    dstRecordingId=0            // local Log Recording on member 0
    stopPosition=4064           // stop at C (the Archive will start at B - the end of the local Recording)
    channelTagId=-1
    subscriptionTagId=-1
    srcControlStreamId=10
    fileIoMaxLength=-1
    replicationSessionId=84
    srcControlChannel='aeron:udp?endpoint=localhost:9101'
    liveDestination=''
    replicationChannel='aeron:udp?endpoint=localhost:0'
    encodedCredentials=0 bytes of raw data
    srcResponseChannel=''
}

ReplayRequest {                 // member 0 Archive to leader's Archive
    controlSessionId=190436539
    correlationId=93
    recordingId=0
    position=2848               // start position B
    length=1216                 // length from B to C
    replayStreamId=100
    fileIoMaxLength=-1
    replayToken=-1
    replayChannel='aeron:udp?session-id=84|endpoint=localhost:54958'
}

When the Archive Replication is complete, member 0 returns to the CANVASS state and repeats the process. It sends out another CanvassPosition message, now saying "I'm in Term 2 (because I will join the new leader), and my Log Recording ends in Term 1 at position C".

3rd CanvassPosition
CanvassPosition {
    logLeadershipTermId=1   // my log now ends in term 1
    logPosition=4064        // my log ends at position C
    leadershipTermId=2      // I am in term 2, even though my log is behind
    followerMemberId=0      // I am member 0
    protocolVersion=65536
}

The leader responds with another NewLeadershipTerm message. This one says: "your log is on Term 1, but I'm on Term 2 at position D; your next Term is 2, it starts at position C and hasn't ended yet (it's still being appended to)".

3rd NewLeadershipTerm
NewLeadershipTerm {
    logLeadershipTermId=1           // echoed from CanvassPosition
    nextLeadershipTermId=2          // your next term is Term 2
    nextTermBaseLogPosition=4064    // it starts at position C
    nextLogPosition=-1              // it hasn't ended yet (still being appended to)
    leadershipTermId=2              // leader is on term 2
    termBaseLogPosition=4064        // it starts at position C
    logPosition=6656                // it is currently at position D (which has increased some more)
    leaderRecordingId=0             // if you want to replicate my Log, it's this recordingId
    timestamp=1737306989974
    leaderMemberId=1
    logSessionId=1115055142
    appVersion=1
    isStartup=FALSE
}

This time, member 0 can see that it is behind, within the current leadership term, so it needs to switch to Catchup. But first, it needs to process the entries that it hasn't processed yet, including those that it has just replicated, so it moves to FOLLOWER_REPLAY. In this example, it didn't load a snapshot, so it needs to replay and process the Log from the beginning. It sets up an Archive Replay from its own Archive, into a temporary Subscription (not the Log), from the start of the Log to C. The Consensus Module and Clustered Service listen to the replay and process the entries. Once complete, the Archive Replay connection is closed.

ReplayRequest
ReplayRequest {                 // member 0 Consensus Module to member 0 Archive
    controlSessionId=537541647
    correlationId=116
    recordingId=0               // member 0's Log Recording id
    position=0                  // start from the beginning (no snapshot loaded)
    length=4064                 // stop at C
    replayStreamId=103
    fileIoMaxLength=-1
    replayToken=-1
    replayChannel='aeron:ipc'
}

Once FOLLOWER_REPLAY has finished, it needs to decide whether its Log is behind the leader's Log in the current leadership term. It is in this example, so it needs to perform Catchup, and moves to FOLLOWER_CATCHUP_INIT. If it wasn't behind (e.g. when all the members start at the same time, the leader doesn't start appending to the new leadership term until all the followers are ready), then it would have moved straight to FOLLOWER_LOG_INIT.
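A sketch of that branch (illustrative names, not Aeron's actual code):

// Illustrative: after FOLLOWER_REPLAY, catch up only if the local Log
// Recording is behind the leader within the current leadership term.
enum NextState { FOLLOWER_CATCHUP_INIT, FOLLOWER_LOG_INIT }

final class AfterFollowerReplay
{
    static NextState next(final long localAppendPosition, final long leaderLogPosition)
    {
        return localAppendPosition < leaderLogPosition
            ? NextState.FOLLOWER_CATCHUP_INIT // behind: replay merge needed
            : NextState.FOLLOWER_LOG_INIT;    // already level with the leader
    }
}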

In FOLLOWER_CATCHUP_INIT, member 0 adds a Catchup destination to its Log Subscription. This is a different endpoint (port number) that member 0 is listening on to receive Log messages, which will go into its Log log buffer, distinct from the live Log destination that it would normally receive new Log entries from the leader. This is the start of replay merge.
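Under the hood this uses Aeron's manual multi-destination Subscription. Here's a hedged sketch with placeholder endpoints and stream id (the real cluster configuration differs):

import io.aeron.Aeron;
import io.aeron.Subscription;

// Sketch of a manual multi-destination Log Subscription, the mechanism
// behind the catchup/live split. Endpoints and stream id are placeholders.
public class CatchupDestinationSketch
{
    public static void main(final String[] args)
    {
        try (Aeron aeron = Aeron.connect())
        {
            // control-mode=manual: destinations are added/removed explicitly
            final Subscription logSubscription =
                aeron.addSubscription("aeron:udp?control-mode=manual", 100);

            // catchup destination: the leader's Archive replays old Log entries here
            logSubscription.addDestination("aeron:udp?endpoint=localhost:9005");

            // later, when 'near' the live position, the live destination is added too:
            // logSubscription.addDestination("aeron:udp?endpoint=localhost:9003");
            // and after StopCatchup, the catchup destination is removed:
            // logSubscription.removeDestination("aeron:udp?endpoint=localhost:9005");
        }
    }
}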

Then it sends a CatchupPosition to the leader. This asks the leader to replay its Log Recording into the Catchup endpoint.

CatchupPosition
CatchupPosition {
    leadershipTermId=2                  // member 0 is in term 2
    logPosition=4064                    // replay from C
    followerMemberId=0                  // member 0 wants the catchup
    catchupEndpoint='localhost:9005'    // member 0's catchup endpoint, where the leader's Archive will replay to
}

Then it moves to FOLLOWER_CATCHUP_AWAIT. Here, it waits for the leader's Archive to connect to its Catchup endpoint. Once connected, it starts recording the Log Subscription and asks the Clustered Service to start processing any new messages on it.

Then it moves to FOLLOWER_CATCHUP. Here, it receives Log messages on its Catchup endpoint and processes them. While processing, it sends AppendPosition messages to the leader, telling it the position it is up to.

AppendPosition
AppendPosition {
    leadershipTermId=2
    logPosition=5280        // the position member 0 has written into its Log Recording
    followerMemberId=0
    flags=1                 // flag to say it's in 'catchup mode'
}

Member 0 is also receiving CommitPosition messages from the leader, telling it where position D is at (to be more precise, it's the consensus position - the minimum position written to the Log Recording across a majority of the members).

CommitPosition
CommitPosition {
    leadershipTermId=2
    logPosition=6880        // position D
    leaderMemberId=1
}
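For what it's worth, the consensus position just described can be computed like this (a sketch; the member positions are invented to match the example):

import java.util.Arrays;

// Illustrative: the consensus position is the highest position that a
// majority of members have written to their Log Recordings.
final class ConsensusPosition
{
    static long compute(final long[] appendPositions)
    {
        final long[] sorted = appendPositions.clone();
        Arrays.sort(sorted); // ascending
        final int quorum = sorted.length / 2 + 1;
        return sorted[sorted.length - quorum]; // quorum-th highest position
    }

    public static void main(final String[] args)
    {
        // leader at 6880, one follower keeping up, member 0 still catching up
        System.out.println(compute(new long[] { 6880, 6880, 5280 })); // 6880
    }
}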

When the amount written to member 0's Log Recording is 'near' to D (within a quarter of the Log log buffer term length), member 0 adds the live Log destination to its Log Subscription. This means it can receive Log entries from two sources, Catchup and Live, both of which are written into its Log log buffer. At some point they will overlap, when Log entries from the Catchup stream reach the start of those from the live stream.
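The 'near' test might look like this sketch (the quarter-of-term-length threshold comes from the text above; the names are illustrative):

// Illustrative: join the live destination once the catchup stream is
// within a quarter of a term buffer of the leader's commit position.
final class NearLiveCheck
{
    static boolean nearLive(
        final long appendPosition, final long commitPosition, final int termBufferLength)
    {
        return commitPosition - appendPosition <= termBufferLength / 4;
    }
}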

When the leader receives an AppendPosition from member 0 that confirms it has caught up to the end of the leader's Log, it stops the Archive replay on the Catchup stream. The leader sends a StopCatchup to member 0.

StopCatchup
StopCatchup {
    leadershipTermId=3
    followerMemberId=0
}

This is the trigger for member 0 to remove the Catchup destination from its Log Subscription, as it has now caught up. It stays listening to the live Log messages on the live endpoint. At this point, member 0 has completed replay merge and it moves to FOLLOWER_LOG_INIT. Its Election quickly steps through the last few steps and closes. Member 0 is now a working member of the cluster.