Determinism
Deterministic behaviour in the Clustered Service (your application code) is key to Aeron Cluster.
Aeron Cluster requires that you write your server-side application code as a deterministic state machine that only takes input from the Log.
If your application code is not deterministic, then the members of a multi node cluster could become out of sync, rendering them worthless. Even on a single node cluster, it means replaying the Log on startup might not recreate the same state. If Bob placed an order on your system that was accepted, then you restarted the system and replaying Bob's order causes it to be rejected and not appear in the system's internal state, what are you going to tell Bob?
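The replay guarantee can be sketched as follows. This is a minimal, hypothetical state machine (the class and method names are ours, not Aeron's) whose state is a pure function of the Log messages it consumes, so replaying the same Log into a fresh instance always recreates the same state:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: state is derived only from Log messages,
// so replaying the Log rebuilds identical state.
class OrderService
{
    private final List<String> acceptedOrders = new ArrayList<>();

    // Every state change is driven solely by a Log message.
    void onLogMessage(final String order)
    {
        if (!order.isEmpty())   // a deterministic validation rule
        {
            acceptedOrders.add(order);
        }
    }

    List<String> state()
    {
        return acceptedOrders;
    }
}

public class ReplayDemo
{
    public static void main(final String[] args)
    {
        final List<String> log = List.of("bob:buy:100", "alice:sell:50");

        // The live run and a restart-time replay consume the same Log...
        final OrderService live = new OrderService();
        final OrderService replayed = new OrderService();
        log.forEach(live::onLogMessage);
        log.forEach(replayed::onLogMessage);

        // ...so they must arrive at the same state. If the validation rule
        // consulted a clock or config file, this could fail, and Bob's
        // accepted order might vanish on restart.
        System.out.println(live.state().equals(replayed.state())); // true
    }
}
```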
What to avoid
Making the application code deterministic is done through self-control. Aeron Cluster will feed the application Log messages, but the application needs to behave. Here are some examples of what not to do:
- don't access the system clock (Aeron Cluster provides the current time with each input message)
- don't access other services on the network
- don't generate random numbers or GUIDs (unless they are seeded the same on each member)
- don't start other threads
- don't drive application behaviour from configuration files, which could differ across cluster members or change between a run and a later replay of the Log. Instead, try to configure by sending in Log messages
- be careful with data structures that have an undefined iteration order, e.g. Maps
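The randomness point can be illustrated with a short sketch. If a seed is agreed via a Log message (the value below is hypothetical), every member draws an identical sequence from `java.util.Random`; by contrast, `UUID.randomUUID()` or an unseeded `Random` would diverge across members:

```java
import java.util.Random;

public class SeededRandomDemo
{
    public static void main(final String[] args)
    {
        // Hypothetical: the seed arrived in a Log message, so every
        // member of the cluster sees the same value.
        final long seedFromLog = 42L;

        // Two members seeding identically draw identical sequences.
        final Random memberA = new Random(seedFromLog);
        final Random memberB = new Random(seedFromLog);

        for (int i = 0; i < 5; i++)
        {
            if (memberA.nextLong() != memberB.nextLong())
            {
                throw new AssertionError("members diverged");
            }
        }
        System.out.println("identical sequences"); // prints "identical sequences"
    }
}
```

The same principle applies to time: use the timestamp Aeron Cluster delivers with each input message rather than reading the system clock, so a replay sees the same times the live run did.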
As always, there will be exceptions, like logging, or publishing metrics, but these shouldn't affect application state. It's always worth bearing in mind whether something can affect determinism.
Tales from the coalface
Imagine you have a 3 node cluster where all the members load snapshot 'n' when they start. At some point, they take snapshot 'n + 1', then one of the members restarts. It will start and load snapshot 'n + 1', so it has started from a different snapshot than the others. In an ideal world, this should not matter, but I have seen bugs uncovered by this scenario.
Imagine you have a bug where you are inadvertently iterating over a Map, which has an undefined, yet consistent iteration order. You might not spot that bug for a long time, because the iteration order is consistent, so all the cluster members have the same state. However, one day, one of the cluster members is restarted and loads a later snapshot.
If you add an entry M to a Map, there may be a hash collision, where its ideal position is occupied and it ends up in an alternate position. The ideal hash position may become free later, as other entries are removed. This leaves entry M in its alternate position, even though its ideal position is now free.
Now imagine you restart a cluster member and it loads its snapshot. The Map contains the same entries as the other cluster members, but now entry M could end up in its ideal position. When the Map is iterated over, the iteration order could differ from that on the other cluster members. This may seem innocuous, but small differences can be amplified and cause significant issues.
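One defence is to avoid depending on `HashMap` iteration order at all. A small sketch: `HashMap` order is an artefact of hashing and insertion history and is not part of its contract, whereas `TreeMap` iterates in key order regardless of how the map was built, so iteration-dependent logic stays deterministic even after a snapshot reload rebuilds the map:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class IterationOrderDemo
{
    public static void main(final String[] args)
    {
        // HashMap: iteration order is undefined and can change when the
        // map is rebuilt (e.g. after loading a snapshot).
        final Map<String, Integer> undefinedOrder = new HashMap<>();

        // TreeMap: iterates in sorted key order, independent of
        // insertion history, hash collisions, or rebuilds.
        final Map<String, Integer> deterministicOrder = new TreeMap<>();

        for (final String key : new String[] {"charlie", "alice", "bob"})
        {
            undefinedOrder.put(key, key.length());
            deterministicOrder.put(key, key.length());
        }

        System.out.println(deterministicOrder.keySet()); // [alice, bob, charlie]
    }
}
```

`LinkedHashMap` (insertion order) is another option when sorting is too costly, provided entries are inserted in the same order on every member and during replay.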
I'm not going to admit to being on the end of the on-call phone at 3 a.m., loading snapshots and replaying Logs to investigate a production issue, but if I did, I'd learn to be more careful.