Elections

What is an Election? What are its outcomes? An Election is not just about choosing a leader. When the leader opens the doors to the Ingress and starts appending new messages to the Log, it needs the majority of the members to be ready, so it can achieve consensus on them immediately, so they can be processed. For this reason, an Election also coordinates the recovery of at least a majority of the members, before completing. Only then can the Ingress be opened.

There are also constraints imposed by Raft on an Election. Once a Log message (an entry) has been confirmed to be on disk on a majority of the members, Raft considers it Committed. Committed entries are eligible to be processed, so they cannot be lost, as business decisions may have been made on them. The cluster commits itself to not losing Committed entries. When all the cluster members start, some may have fewer entries in their Log than others. The Election process must choose a leader that contains all the Committed entries, so none are lost.

Leadership Terms¶

A leadership term is a logical concept, representing the period of time from one Election until the next. It has an integer id, which is equivalent to an Election number. Each Election creates a new leadership term, even if it fails to elect a leader (there could be a split vote).

The current leadership term id is passed in every consensus message, so it is easy to spot a message from a stale member, or for a member to spot when itself is stale. When members vote, the vote request is for a given leadership term and members only vote once within it. A successful Election requires a majority of the members to agree on the leader for the given term, which will go on to lead until the next Election (term).

Where do Elections fit in?¶

When a cluster member starts, it first loads its latest snapshot, if there is one. If there was one, its logPosition will be the snapshot position. If not, logPosition will be 0. When the member gets to processing the Log, it will process it from the logPosition.

After the snapshot has been loaded, the member will enter an Election. This term isn't quite accurate - when a member 'enters an Election', that just means it goes through its Election process (state machine). If the other members are at the same stage, they will interact and indeed hold an Election to elect a leader. But if the member is behind, and the other nodes have already had an Election, there is no need for them to stop. Instead, the leader tells it how to get back up to speed, so it can rejoin the cluster as a follower.

An Election can also be triggered for various reasons while the cluster is running. The most obvious is if the leader dies. The followers will time out waiting for it to communicate with them, then hold an Election.

Reloading State¶

After a leader has been elected, each member will replay its Log from the logPosition to rebuild its state. If a member joins a cluster late, it will replay its Log before joining the cluster.

When members replay the Log, there are no Egress connections connected, so duplicate messages are not published.

Election Implementation¶

When a cluster member wants to enter an Election, its ConsensusModuleAgent creates an Election object. A lot of state is injected into the Election object, and it can send Consensus messages to the other cluster members. In the ConsensusModuleAgent's duty cycle, if it has an Election object, it calls doWork() on it. When the ConsensusModuleAgent receives certain Consensus messages, it forwards them to the Election object. When the Election is complete, the ConsensusModuleAgent discards the Election object and sets its election field to null.

The Election object is a large state machine, which is described on the next page.

Replication and Catchup¶

If a cluster member starts and finds its Log Recording does not have all the entries that the leader has, then it needs to back-fill them from the leader. The way it does that depends on whether they are missing from just the current leadership term, or also from one or more earlier leadership terms.

Consider the following example. The top row shows the Log for members 1 and 2, which have been through 3 elections, so there are 3 leadership terms. They are still running in term 2, which is being appended to at position D (a moving target). The bottom row shows the Log for member 0, which died at position A, so it is missing entries from A to D.

When member 0 starts, any leadership terms earlier than the current one will be back-filled using Archive Replication. This is where the leader's Archive replays the Log Recording straight into member 0's Archive. The entries are not processed at this stage, it's just data replication. Replication is performed for each term in turn, so there will be a Replication session from A to B, then another from B to C.

When member 0's Log Recording contains up to the start of the current leadership term, i.e. position C, it replays (processes) the new Log entries, from A to C.

It then switches to using Catchup to back-fill the current leadership term, i.e. C to D (the moving target). Catchup works by getting the leader's Archive to replay straight into member 0's Log log buffer (where it would normally receive live Log messages from the leader). As these arrive, it processes them, and its Archive appends them to its Log Recording. When the Archive replay gets 'close' to the live Log position D, member 0 also subscribes to the live Log. This results in two senders writing to different parts of member 0's Log log buffer. Archive replay will be writing from C onwards and the live Log will start writing at some point probably beyond D. When these overlap, the Archive replay is stopped, and member 0 continues seamlessly processing the live Log messages. This is 'replay merge'.

Note: if a follower is restarted and comes up in the same leadership term, it won't need to use Archive Replication and will go straight to catchup.

An in-depth look at the above scenario is presented on the next page, once we've looked at the Election state machine. See Replication and Catchup Details.

Election triggers¶

An Election is triggered when a ConsensusModuleAgent is started, and several other places, listed below:

when a member receives an unexpected RequestVote, NewLeadershipTerm or CommitPosition message in a newer leadership term than its own
when a member receives a RecordingSignalEvent message from its Archive, saying it has unexpectedly stopped recording the Log
when the leader hasn't received an AppendPosition message from enough followers in the last 10 seconds to achieve consensus
when a follower hasn't received a CommitPosition message from the leader in the last 10 seconds
when a follower loses its connection to the leader, i.e. its Log Image closes