Snapshots
When a cluster member starts, it replays its Log in order to recreate its internal state. Some applications don't run 24/7 and don't need to carry state forward from one day to the next. For those, it may make sense to clear the Log each day (stopping the application and deleting all the files). But for others, the Log will continue to grow, so replaying it from the start will become prohibitively time-consuming.
Instead, a snapshot of the cluster state can be taken at a given Log position. At startup, the snapshot is loaded and the Log is replayed from that position onwards. Older parts of the Log Recording can then be archived.
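To make the idea concrete, here is a minimal sketch in plain Java. It does not use the Aeron API: CounterService, Snapshot and replay are hypothetical names invented for illustration, the Log is modelled as a list of numbers, and the Log position is simplified to a list index rather than a byte position in a Recording.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Aeron code): state is the running total of all Log entries. */
final class SnapshotReplaySketch {

    /** Hypothetical snapshot: the state as it was at a given Log position. */
    static final class Snapshot {
        final long logPosition;
        final long total;

        Snapshot(long logPosition, long total) {
            this.logPosition = logPosition;
            this.total = total;
        }
    }

    /** Hypothetical service whose state is rebuilt by applying Log entries in order. */
    static final class CounterService {
        long total;
        long logPosition; // index of the next Log entry to apply

        void apply(long value) {
            total += value;
            logPosition++;
        }

        Snapshot takeSnapshot() {
            return new Snapshot(logPosition, total);
        }

        static CounterService fromSnapshot(Snapshot snapshot) {
            CounterService service = new CounterService();
            service.total = snapshot.total;
            service.logPosition = snapshot.logPosition;
            return service;
        }
    }

    /** Applies every Log entry from the service's current position to the end. */
    static int replay(CounterService service, List<Long> log) {
        int applied = 0;
        for (int i = (int) service.logPosition; i < log.size(); i++) {
            service.apply(log.get(i));
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        List<Long> log = new ArrayList<>(Arrays.asList(5L, 7L, 11L));

        CounterService live = new CounterService();
        replay(live, log);
        Snapshot snapshot = live.takeSnapshot(); // taken at position 3

        // The Log keeps growing after the snapshot.
        log.add(13L);
        log.add(17L);

        // Starting from nothing means replaying the whole Log.
        CounterService fromScratch = new CounterService();
        int fullCount = replay(fromScratch, log);

        // Starting from the snapshot means replaying only the tail.
        CounterService restored = CounterService.fromSnapshot(snapshot);
        int tailCount = replay(restored, log);

        System.out.println(fromScratch.total + " after replaying " + fullCount + " entries");
        System.out.println(restored.total + " after replaying " + tailCount + " entries");
    }
}
```

Both start-up paths arrive at the same state, but the restored service only has to apply the entries appended after the snapshot position.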
Taking a snapshot
When a snapshot is taken, it affects every cluster member. They all receive the request from the leader and all stop processing the Log in order to take the snapshot at the same Log position. For this reason, snapshots are not taken automatically. They are taken on request, either using ClusterTool or from a cluster client by calling sendAdminRequestToTakeASnapshot() on AeronCluster, so it is up to you to decide when to take them. Standby Snapshots, an Aeron Premium feature described below, addresses this.
When a snapshot is taken, the Consensus Module writes out some state to one Recording in Aeron Archive, and each Clustered Service writes out its state to its own Recording. One snapshot therefore consists of multiple Recordings, all at the same Log position.
Loading a snapshot is done by getting the Archive to replay the Recordings, which the Consensus Module and Clustered Services subscribe to.
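Because one snapshot is spread across several Recordings, a snapshot is only usable if every component has a Recording at the same Log position. The following sketch illustrates that rule in plain Java. It is not Aeron code: SnapshotRecording and latestCompleteSnapshot are hypothetical names, and the use of -1 as the Consensus Module's service id is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Aeron code) of a snapshot made up of several Recordings. */
final class SnapshotSetSketch {

    /** Assumed convention for this sketch: the Consensus Module has service id -1. */
    static final int CONSENSUS_MODULE_ID = -1;

    /** Hypothetical, simplified description of one snapshot Recording. */
    static final class SnapshotRecording {
        final long recordingId;
        final long logPosition;
        final int serviceId;

        SnapshotRecording(long recordingId, long logPosition, int serviceId) {
            this.recordingId = recordingId;
            this.logPosition = logPosition;
            this.serviceId = serviceId;
        }
    }

    /** True if the Consensus Module and every service have a Recording at logPosition. */
    static boolean isComplete(List<SnapshotRecording> recordings, long logPosition, int serviceCount) {
        boolean[] seen = new boolean[serviceCount + 1]; // slot 0 is the Consensus Module
        for (SnapshotRecording r : recordings) {
            boolean knownId = r.serviceId >= CONSENSUS_MODULE_ID && r.serviceId < serviceCount;
            if (r.logPosition == logPosition && knownId) {
                seen[r.serviceId + 1] = true;
            }
        }
        for (boolean present : seen) {
            if (!present) {
                return false;
            }
        }
        return true;
    }

    /** Returns the Recordings of the most recent complete snapshot, or an empty list. */
    static List<SnapshotRecording> latestCompleteSnapshot(List<SnapshotRecording> recordings, int serviceCount) {
        long best = -1;
        for (SnapshotRecording r : recordings) {
            if (r.logPosition > best && isComplete(recordings, r.logPosition, serviceCount)) {
                best = r.logPosition;
            }
        }
        List<SnapshotRecording> result = new ArrayList<>();
        for (SnapshotRecording r : recordings) {
            if (r.logPosition == best) {
                result.add(r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<SnapshotRecording> recordings = Arrays.asList(
            new SnapshotRecording(10, 1024, CONSENSUS_MODULE_ID),
            new SnapshotRecording(11, 1024, 0),
            new SnapshotRecording(12, 1024, 1),
            new SnapshotRecording(20, 4096, CONSENSUS_MODULE_ID),
            new SnapshotRecording(21, 4096, 0)); // service 1 is missing at 4096

        List<SnapshotRecording> usable = latestCompleteSnapshot(recordings, 2);
        System.out.println("position " + usable.get(0).logPosition + ", recordings " + usable.size());
    }
}
```

In this example the newer set at position 4096 is skipped because one service has no Recording there, so the complete set at position 1024 is chosen.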
Snapshot contents
The contents of a Consensus Module snapshot are described here.
The contents of a Clustered Service snapshot are described here.
Whenever a snapshot is made, an entry is added to the cluster's recording log.
Standby Snapshots
Standby Snapshots are part of Aeron Cluster Standby, which is an Aeron Premium feature. I haven't used this feature, but I believe it works as follows.
An additional 'standby' member can be added to a cluster for the purpose of taking snapshots. For a 3 node cluster, this means the 3 active members can continue processing the log without any delay incurred by taking a snapshot. Because snapshotting no longer affects the performance of the active cluster, snapshots can be taken more frequently, which reduces the startup time of the active members.
The log is replicated to the standby member, as it is to the followers, but the standby member does not participate in consensus. Standby snapshots are stored as recordings in Aeron Archive on the standby member. When a standby snapshot is taken, a StandbySnapshot message is sent to the active members on the consensus channel. They all store the information in an entry in their RecordingLog, with a 'standby snapshot' entry type. This includes the channel of the standby member's Aeron Archive, so that the active members can access the snapshot, should they need to.
When an active cluster member restarts, it queries the RecordingLog for standby snapshots. If it decides it wants to start from a standby snapshot, it replicates it from the standby member's Aeron Archive to its local Archive and adds it to its RecordingLog as a normal snapshot. From there, it can load the snapshot.
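The restart path just described can be sketched as follows. This is a plain Java model, not the Aeron Cluster Standby implementation: Entry, EntryType and chooseStartingSnapshot are hypothetical names, replication is simulated by adding a new entry, and the rule used to decide (prefer a standby snapshot only when it is newer than the latest local one) is an assumption, since the actual decision logic is not described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Aeron code) of choosing a snapshot at restart. */
final class StandbySnapshotSketch {

    enum EntryType { SNAPSHOT, STANDBY_SNAPSHOT }

    /** Hypothetical, simplified RecordingLog entry. */
    static final class Entry {
        final EntryType type;
        final long logPosition;
        final String archiveEndpoint; // standby member's Archive; null for local snapshots

        Entry(EntryType type, long logPosition, String archiveEndpoint) {
            this.type = type;
            this.logPosition = logPosition;
            this.archiveEndpoint = archiveEndpoint;
        }
    }

    /** Returns the entry of the given type with the highest Log position, or null. */
    static Entry latest(List<Entry> recordingLog, EntryType type) {
        Entry best = null;
        for (Entry e : recordingLog) {
            if (e.type == type && (best == null || e.logPosition > best.logPosition)) {
                best = e;
            }
        }
        return best;
    }

    /**
     * Assumed rule: use the standby snapshot only if it is newer than the latest
     * local snapshot. Returns null if there is no snapshot of either kind.
     */
    static Entry chooseStartingSnapshot(List<Entry> recordingLog) {
        Entry local = latest(recordingLog, EntryType.SNAPSHOT);
        Entry standby = latest(recordingLog, EntryType.STANDBY_SNAPSHOT);
        if (standby == null || (local != null && local.logPosition >= standby.logPosition)) {
            return local;
        }
        // Stand-in for replicating the Recording from standby.archiveEndpoint into the
        // local Archive, then recording it as a normal snapshot.
        Entry replicated = new Entry(EntryType.SNAPSHOT, standby.logPosition, null);
        recordingLog.add(replicated);
        return replicated;
    }

    public static void main(String[] args) {
        List<Entry> recordingLog = new ArrayList<>();
        recordingLog.add(new Entry(EntryType.SNAPSHOT, 1000, null));
        // "standby-host:8010" is a made-up endpoint for illustration.
        recordingLog.add(new Entry(EntryType.STANDBY_SNAPSHOT, 5000, "standby-host:8010"));

        Entry chosen = chooseStartingSnapshot(recordingLog);
        System.out.println(chosen.type + " at " + chosen.logPosition
            + ", entries in log: " + recordingLog.size());
    }
}
```

After the call, the recording log holds a normal snapshot entry at the standby snapshot's position, which is what the member then loads.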
In addition to replicating standby snapshots at startup, there is a NodeControl utility that can be used to trigger replication on a specific cluster member. It sets a flag in the cnc.dat file, which is specific to that cluster member.
Snapshot process
todo: describe what happens when ClusterTool requests a snapshot