Log Buffers

Log buffers are where the application writes messages prior to them being sent to a receiving machine, or being read directly by a local Subscription.

Each log buffer file contains 3 equal length sections known as Terms, followed by a Metadata section. Messages are written into Term 0 until it is full, then the Publication moves onto Term 1. This continues until the end of Term 2 is reached, at which point it wraps back to the start of Term 0 again, rather like writing to a ring buffer. Older messages that have been read are overwritten with zeros, so Term 0 will be clean by the time it is used again.

A log buffer is not a persistent store for messages. That's the purpose of Aeron Archive - it creates recording files on disk (rather than in shared memory), which form a permanent record of all messages that flowed through a log buffer.

[Diagram: the log buffer file laid out as Term 0, Term 1 and Term 2]

The Terms are written to using a Concurrent / Exclusive Publication in the Aeron Client. They are read by a Subscription (for an IPC Publication) or the Sender (for a Network Publication). In both cases, the consumer regularly feeds back to the Publication a 'receive window' of how much more data it can consume. If the consumer is slow and drops too far behind, the size of the window will reduce and ultimately, the Publication will stop allowing new messages to be written to the log buffer in order to apply back-pressure further upstream. This means the log buffer can never become full of messages to the point of overflowing.

The cleaning of older messages after they have been read means that if you ever look at a log buffer file's contents, you will typically only see a small number of messages, with zeros in the rest of the Term buffers.

Image log buffers

Messages are published to a Publication log buffer. For Network Publications, this is replicated into an Image log buffer on the receiving machine. When up to date, the Image log buffer will be identical to the Publication log buffer, apart from one or two small differences in the metadata section (described more below).

If you are debugging an issue and want to compare Publication and Image log buffers, bear in mind that they will probably be cleaned at different rates, so the oldest messages in one may have been zeroed already in the other.

Terms

Each Term's length must be a power of 2 between 64 KB and 1 GB inclusive (which gives 15 valid Term lengths). The Term length is also used to calculate the maximum length of a message, which is termLength / 8, up to a maximum of 16MB.
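The Term length rules above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Aeron's own code: the class and method names are hypothetical, and only the limits (64 KB, 1 GB, termLength / 8, 16 MB) come from the text.

```java
// Hypothetical helper illustrating the Term length rules; not part of the Aeron API.
final class TermLengths {
    static final int TERM_MIN_LENGTH = 64 * 1024;            // 64 KB
    static final int TERM_MAX_LENGTH = 1024 * 1024 * 1024;   // 1 GB
    static final int MAX_MESSAGE_LENGTH = 16 * 1024 * 1024;  // 16 MB cap

    /** True if termLength is a power of 2 between 64 KB and 1 GB inclusive. */
    static boolean isValidTermLength(int termLength) {
        return termLength >= TERM_MIN_LENGTH
            && termLength <= TERM_MAX_LENGTH
            && Integer.bitCount(termLength) == 1; // exactly one bit set = power of 2
    }

    /** Maximum message length: termLength / 8, capped at 16 MB. */
    static int maxMessageLength(int termLength) {
        return Math.min(termLength / 8, MAX_MESSAGE_LENGTH);
    }

    public static void main(String[] args) {
        int valid = 0;
        for (int shift = 0; shift < 31; shift++) {
            if (isValidTermLength(1 << shift)) {
                valid++;
            }
        }
        System.out.println("valid Term lengths: " + valid);
        System.out.println("max message for 64 KB Term: " + maxMessageLength(TERM_MIN_LENGTH));
        System.out.println("max message for 1 GB Term: " + maxMessageLength(TERM_MAX_LENGTH));
    }
}
```

Note that the 16 MB cap only takes effect for Terms larger than 128 MB, since 128 MB / 8 is exactly 16 MB.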

Each Term has a 32-bit TermId that is previous TermId + 1. The initial TermId is randomly generated when the Publication is created, and is stored in the metadata section of the log buffer. The metadata also contains ActiveTermCount, which starts at 0 and increments each time a Term rotates (when a Publication moves from one Term to the next). The current Term buffer (0, 1 or 2) is calculated using ActiveTermCount % 3. The active Term's TermId is InitialTermId + ActiveTermCount.

In the diagram below, InitialTermId is 1005. When the log buffer was created, Term 0 was the active Term and you could consider Terms 1 and 2 to be the previous Terms, i.e. TermIds 1003 and 1004. When ActiveTermCount reaches 3, Term 0 becomes the active Term again, and it will be used for TermId 1008 rather than 1005. By this time, all messages in Term 0 for TermId 1005 will have been read and cleaned.

Note that InitialTermId can be negative. All the logic is the same, but it can be confusing to look at as the numbers increment towards zero and appear to decrease rather than increase (-2999, -2998, -2997, etc).

[Diagram: Term rotation with InitialTermId 1005 in the Metadata. ActiveTermCount 0: Term 0 is active with TermId 1005 (Terms 1 and 2 are shown as TermIds 1003 and 1004). ActiveTermCount 1: Term 1 is active with TermId 1006. ActiveTermCount 2: Term 2 is active with TermId 1007. ActiveTermCount 3: Term 0 is active again with TermId 1008.]


The Publication starts by writing messages to the active Term, which is ActiveTermCount % 3 = Term 0. Its TermId is InitialTermId + ActiveTermCount = 1005.

When the Publication gets to the end of Term 0, the Terms rotate. ActiveTermCount is incremented, which means Term 1 becomes the active Term and it is now used for TermId 1006. The Publication starts to write messages to Term 1. Note that earlier messages at the start of Term 0 have been cleaned.

Similarly, when the Publication gets to the end of Term 1, the Terms rotate again. ActiveTermCount becomes 2, Term 2 becomes the active Term and it now has TermId 1007.

When the Publication gets to the end of Term 2, the Terms rotate again. ActiveTermCount becomes 3. The current Term buffer is ActiveTermCount % 3 = 0. The current TermId is InitialTermId + ActiveTermCount = 1008, so Term 0 is now used for TermId 1008 rather than 1005. The Publication now writes to Term 0. This continues until ActiveTermCount maxes out a 32 bit integer (not in our lifetimes).
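The rotation walkthrough above comes down to two small calculations. Here is a minimal sketch of them, using hypothetical class and method names (Aeron's own equivalents live in LogBufferDescriptor) and assuming ActiveTermCount is non-negative.

```java
// Hypothetical sketch of the Term rotation arithmetic; names are illustrative only.
final class TermRotation {
    static final int PARTITION_COUNT = 3; // Term 0, Term 1, Term 2

    /** Which of the three Term buffers is active for a given ActiveTermCount. */
    static int activeTermIndex(int activeTermCount) {
        return activeTermCount % PARTITION_COUNT;
    }

    /** TermId of the active Term. Plain int arithmetic, so it behaves as a 32-bit value. */
    static int activeTermId(int initialTermId, int activeTermCount) {
        return initialTermId + activeTermCount;
    }

    public static void main(String[] args) {
        int initialTermId = 1005;
        for (int count = 0; count <= 3; count++) {
            System.out.println("ActiveTermCount=" + count
                + " -> Term " + activeTermIndex(count)
                + ", TermId " + activeTermId(initialTermId, count));
        }
    }
}
```

The same functions work unchanged for a negative InitialTermId: starting from -2999, an ActiveTermCount of 2 gives TermId -2997.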


Messages

DATA frames

When the application calls on a Publication to publish a message, the Publication writes the message to the log buffer Term in a DATA frame. This starts with a DATA frame header, followed by the message payload, optionally followed by unused space to pad the frame out to a multiple of 32 bytes. This means each frame is aligned on (starts on) a 32 byte boundary "for efficiency of operation and reduced latency".

For example, consider writing a 100 byte message. The DATA frame header is 32 bytes long, so the frame length is 32 + 100 = 132 bytes. 132 is written to the Frame Length field in the frame header. The aligned frame length is 132 rounded up to the next multiple of 32, which is 160. There are 28 unused bytes at the end of the frame. The DATA frame header of the next message would start at byte 160 below.

[Diagram: a frame laid out on a byte scale from 0 to 160. Bytes 0 to 32: DATA frame header. Bytes 32 to 132: 100 byte message (payload). Bytes 132 to 160: unused. Frame length = 132, aligned frame length = 160.]
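The 100 byte example can be reproduced with a standard round-up-to-power-of-2 calculation. This is a sketch with hypothetical names; only the 32 byte header length and 32 byte alignment are taken from the text.

```java
// Hypothetical sketch of the frame length arithmetic; not the Aeron implementation.
final class FrameAlignment {
    static final int HEADER_LENGTH = 32;   // DATA frame header length
    static final int FRAME_ALIGNMENT = 32; // frames start on 32 byte boundaries

    /** Rounds value up to the next multiple of alignment (alignment must be a power of 2). */
    static int align(int value, int alignment) {
        return (value + (alignment - 1)) & ~(alignment - 1);
    }

    /** The value written to the Frame Length field: header plus payload. */
    static int frameLength(int payloadLength) {
        return HEADER_LENGTH + payloadLength;
    }

    /** The space the frame actually occupies in the Term. */
    static int alignedFrameLength(int payloadLength) {
        return align(frameLength(payloadLength), FRAME_ALIGNMENT);
    }

    public static void main(String[] args) {
        int payload = 100;
        int frame = frameLength(payload);
        int aligned = alignedFrameLength(payload);
        System.out.println("frame length = " + frame);
        System.out.println("aligned frame length = " + aligned);
        System.out.println("unused bytes = " + (aligned - frame));
    }
}
```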

PAD frames

If the application attempts to publish a message and it won't fit in the remaining space in the Term, the Publication writes a PAD frame as the last frame in the Term to consume the remaining space (not to be confused with the alignment padding that may appear at the end of each DATA frame). The Publication then returns ADMIN_ACTION to the application, indicating that the Term rotated. If the application tries to send the message again, it will be written to the start of the next Term. When a PAD frame is sent over the network to a receiving machine, only the PAD frame header is sent. PAD frames are ignored by a Subscription and not returned to the receiving application.
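To make the end-of-Term behaviour concrete, here is a toy model of it. It is a sketch under several assumptions: the class and its fields are hypothetical, the 256 byte Term is far below the real 64 KB minimum so that rotation happens quickly, and `offer` returns a Term offset whereas a real Publication returns a stream position. The value -3 matches `Publication.ADMIN_ACTION` in the Aeron client.

```java
// Toy model of end-of-Term padding and rotation; not the Aeron implementation.
final class TermAppendSketch {
    static final int HEADER_LENGTH = 32;
    static final int FRAME_ALIGNMENT = 32;
    static final long ADMIN_ACTION = -3; // same value as Publication.ADMIN_ACTION

    final int termLength;
    int activeTermCount = 0;
    int tail = 0;          // next write offset in the active Term
    int lastPadLength = 0; // length of the most recent PAD frame, kept for illustration

    TermAppendSketch(int termLength) {
        this.termLength = termLength;
    }

    /** Returns the Term offset the frame was written at, or ADMIN_ACTION if the Term rotated. */
    long offer(int payloadLength) {
        int aligned = (HEADER_LENGTH + payloadLength + FRAME_ALIGNMENT - 1) & ~(FRAME_ALIGNMENT - 1);
        if (tail + aligned > termLength) {
            lastPadLength = termLength - tail; // the PAD frame consumes the rest of the Term
            activeTermCount++;                 // rotate to the next Term
            tail = 0;
            return ADMIN_ACTION;
        }
        int offset = tail;
        tail += aligned;
        return offset;
    }

    public static void main(String[] args) {
        TermAppendSketch publication = new TermAppendSketch(256); // tiny Term, illustration only
        for (int i = 0; i < 3; i++) {
            long result;
            // The application simply retries when the Term rotated.
            while ((result = publication.offer(100)) == ADMIN_ACTION) {
                System.out.println("Term rotated after a " + publication.lastPadLength + " byte PAD frame, retrying");
            }
            System.out.println("message " + i + " written at offset " + result
                + " in Term " + (publication.activeTermCount % 3));
        }
    }
}
```

The retry loop in `main` mirrors what an application would typically do with a real Publication: treat ADMIN_ACTION as a transient result and offer the message again.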

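[Diagram: PAD frames. A run of DATA frames in Term 0 ends with a PAD frame that fills the remaining space, and the following DATA frames start at the beginning of Term 1.]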

PAD frames can also appear in the middle of a Term buffer in a couple of scenarios.

  • If the application fails halfway through publishing a message, it causes a blocked message. The Publication will attempt to unblock the message by replacing it with a PAD frame to skip it (and increment the UNBLOCKED_PUBLICATIONS system counter).
  • When rebuilding an Image log buffer, if a DATA frame message is lost on the network, it will leave a gap in the Image. If the Subscription is configured to be unreliable (the default is reliable), Aeron will fill the gap with a PAD frame rather than send a NAK.

Message Fragmentation

A Network Publication has a default MTU of 1408 bytes (which happens to be a multiple of 32). IPC Publications have the same default, but it can be overridden separately. The MTU for an IPC Publication can be larger, but if you ever have a scenario where you want to record your IPC Publication (using Aeron Archive) and replay it over the network, then the network MTU needs to be taken into consideration.

When a Publication is created, its mtuLength is written into the log buffer's metadata. When the Concurrent / Exclusive Publication is created in the client, it reads the mtuLength, subtracts the DATA frame header length (32) and remembers that as the maxPayloadLength. Whenever the application tries to publish a message, if it is longer than maxPayloadLength, the message is fragmented.

Fragmenting a message means breaking it up into maxPayloadLength chunks (apart from the last chunk) and writing it to the log buffer using multiple, contiguous DATA frames that do not span a Term boundary. If they don't all fit in the active Term, a PAD frame is written, and they are all written to the start of the next Term. When writing a fragmented message, the total aligned length of all fragments is reserved before any of the fragments are written. For a Concurrent Publication, this means a message's fragments will never be interleaved with those of another message, written by another thread.

Fragmenting when writing to the log buffer means the fragments are ready to be sent over the network by the Sender with no additional work to do.

Each DATA frame header contains B & E flags to indicate whether a frame begins or ends a fragmented message. The first frame will have B set, the last frame will have E set and any middle frames will have neither set. Unfragmented messages have both B & E flags set.
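The fragmentation rules and the B & E flags can be sketched as a planning function that works out the payload length and flags of each DATA frame. The class, its nested type and method names are hypothetical. The flag values 0x80 (B) and 0x40 (E) are the ones used in Aeron's frame header, but treat them as an assumption here since the text above does not state them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of how a message is split into DATA frames; not the Aeron implementation.
final class FragmentationSketch {
    static final int HEADER_LENGTH = 32;
    static final int FRAME_ALIGNMENT = 32;
    static final int BEGIN_FLAG = 0x80; // B
    static final int END_FLAG = 0x40;   // E

    /** One planned DATA frame: its payload length and B/E flags. */
    static final class Fragment {
        final int payloadLength;
        final int flags;

        Fragment(int payloadLength, int flags) {
            this.payloadLength = payloadLength;
            this.flags = flags;
        }
    }

    /** Splits a message into chunks of at most mtuLength - HEADER_LENGTH bytes. */
    static List<Fragment> plan(int messageLength, int mtuLength) {
        int maxPayloadLength = mtuLength - HEADER_LENGTH;
        List<Fragment> fragments = new ArrayList<>();
        int remaining = messageLength;
        do {
            int chunk = Math.min(remaining, maxPayloadLength);
            int flags = 0;
            if (remaining == messageLength) {
                flags |= BEGIN_FLAG; // first frame of the message
            }
            if (chunk == remaining) {
                flags |= END_FLAG;   // last frame of the message
            }
            fragments.add(new Fragment(chunk, flags));
            remaining -= chunk;
        } while (remaining > 0);
        return fragments;
    }

    /** Total space to reserve in the Term before writing any of the fragments. */
    static int totalAlignedLength(List<Fragment> fragments) {
        int total = 0;
        for (Fragment fragment : fragments) {
            int frameLength = HEADER_LENGTH + fragment.payloadLength;
            total += (frameLength + FRAME_ALIGNMENT - 1) & ~(FRAME_ALIGNMENT - 1);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = plan(3000, 1408);
        for (Fragment fragment : fragments) {
            System.out.println("payload=" + fragment.payloadLength
                + " flags=0x" + Integer.toHexString(fragment.flags));
        }
        System.out.println("reserved=" + totalAlignedLength(fragments));
    }
}
```

With the default MTU of 1408, a 3000 byte message becomes three frames with payloads of 1376, 1376 and 248 bytes. A message that fits in one frame gets a single frame with both flags set, which matches the unfragmented case described above.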

When the application polls a Subscription, it passes a FragmentHandler, which the Subscription calls back with each fragment. The Subscription does not do message reassembly. Some Aeron clients may be able to process fragmented messages directly and wouldn't want the overhead of message assembly imposed on them.

If your application wants reassembly, it can pass a FragmentAssembler (which is an implementation of a FragmentHandler) to the Subscription. It copies fragments into a buffer to reassemble the original message, before delegating to another FragmentHandler (provided by your application) that will only be given whole messages.
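The idea behind reassembly can be shown with a toy handler driven by the B & E flags. This is not Aeron's FragmentAssembler: the class and method signatures below are hypothetical, it works on plain byte arrays rather than Aeron buffers, and it tracks only one stream of fragments at a time. The flag values 0x80 (B) and 0x40 (E) are an assumption not stated in the text.

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;

// Toy reassembler illustrating the idea only; not Aeron's FragmentAssembler.
final class ReassemblySketch {
    static final int BEGIN_FLAG = 0x80; // B
    static final int END_FLAG = 0x40;   // E

    private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
    private final Consumer<byte[]> delegate; // receives whole messages only
    private boolean assembling = false;

    ReassemblySketch(Consumer<byte[]> delegate) {
        this.delegate = delegate;
    }

    /** Called once per fragment, in the order the fragments appear in the log buffer. */
    void onFragment(byte[] payload, int flags) {
        boolean begin = (flags & BEGIN_FLAG) != 0;
        boolean end = (flags & END_FLAG) != 0;

        if (begin && end) {
            delegate.accept(payload); // unfragmented message: pass straight through
            return;
        }
        if (begin) {
            pending.reset();          // start of a new fragmented message
            assembling = true;
        }
        if (!assembling) {
            return;                   // middle or end fragment with no begin: ignore it
        }
        pending.write(payload, 0, payload.length);
        if (end) {
            delegate.accept(pending.toByteArray());
            assembling = false;
        }
    }

    public static void main(String[] args) {
        ReassemblySketch assembler = new ReassemblySketch(
            message -> System.out.println("whole message of " + message.length + " bytes"));
        assembler.onFragment(new byte[1376], BEGIN_FLAG);
        assembler.onFragment(new byte[1376], 0);
        assembler.onFragment(new byte[248], END_FLAG);
    }
}
```

The copy into `pending` is the overhead mentioned above, which is why reassembly is opt-in rather than built into the Subscription.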

Message Header

DATA and PAD frames share the same Message Header, which essentially contains:

  • Frame Length (header length + data length)
  • BES flags: (B)egin, (E)nd and End of (S)tream
  • Type = HDR_TYPE_DATA / HDR_TYPE_PAD
  • Term Offset
  • Session ID
  • Stream ID
  • Term ID

Each DATA / PAD frame header contains the TermId that the message is in, so all messages in the Term with TermId 1005 will have TermId=1005 in their header. Each message also has its own TermOffset written into the frame header, so if a message is written starting at offset 160 in the Term, its header will contain TermOffset=160. For a Network Publication, this self-describing information is used by the Receiver.

When the Receiver receives a DATA / PAD frame, it will use the StreamId and SessionId to find the correct Image log buffer. It then uses the TermId and TermOffset to insert the message in the correct location. The Receiver just reads packets and files them in the correct place. No need for temporary buffers, other data structures or any other overhead.
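The "file it in the correct place" step can be sketched as follows. This is a toy model with hypothetical names: the three Terms are plain byte arrays, the lookup by StreamId and SessionId is omitted, and it assumes the frame's TermId is not behind InitialTermId and that the frame fits within the Term.

```java
// Toy model of placing a received frame into an Image log buffer; not the Aeron Receiver.
final class ImageInsertSketch {
    static final int PARTITION_COUNT = 3;

    final int initialTermId;
    final byte[][] terms; // Term 0, Term 1, Term 2

    ImageInsertSketch(int initialTermId, int termLength) {
        this.initialTermId = initialTermId;
        this.terms = new byte[PARTITION_COUNT][termLength];
    }

    /**
     * Copies a frame to the location described by its own header fields.
     * Returns the index of the Term buffer it was written to.
     */
    int insert(int termId, int termOffset, byte[] frame) {
        int termCount = termId - initialTermId;  // how many rotations since the first Term
        int index = termCount % PARTITION_COUNT; // which of the three Term buffers
        System.arraycopy(frame, 0, terms[index], termOffset, frame.length);
        return index;
    }

    public static void main(String[] args) {
        ImageInsertSketch image = new ImageInsertSketch(1005, 64 * 1024);
        byte[] frame = new byte[160];
        System.out.println("TermId 1005 -> Term " + image.insert(1005, 160, frame));
        System.out.println("TermId 1006 -> Term " + image.insert(1006, 0, frame));
        System.out.println("TermId 1008 -> Term " + image.insert(1008, 0, frame));
    }
}
```

Because each frame carries its own TermId and TermOffset, frames can be inserted in whatever order they arrive, with no intermediate queue or reordering buffer.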

Metadata

Finally, each log buffer file ends with a metadata section, described in LogBufferDescriptor. Some fields are only used in Publication log buffers, some only in Image log buffers and some are used in both.

| Field | Used in | Description |
| --- | --- | --- |
| Tail Counter 0 | Pub | Tail position for Term 0 (64 bits; the top 32 bits contain the TermId) |
| Tail Counter 1 | Pub | Tail position for Term 1 (64 bits; the top 32 bits contain the TermId) |
| Tail Counter 2 | Pub | Tail position for Term 2 (64 bits; the top 32 bits contain the TermId) |
| Active Term Count | Pub | The active Term count used by the producer of this log. Starts at 0 and increments each time a Term rotates |
| End of Stream Position | Pub/Img | If the stream has ended, this is the end position |
| Is Connected | Pub | True if there is at least one subscriber connected |
| Active Transport Count | Img | The number of active transports for the Image. Written by PublicationImage |
| Registration / Correlation ID | Pub/Img | The correlation id of the command that created this log buffer |
| Initial Term Id | Pub/Img | A random 32-bit number (can be negative) |
| Default Frame Header Length | Pub/Img | The length of the Default Frame Header (see below) |
| MTU Length | Pub/Img | When message length + HEADER_LENGTH > MTU Length, the message is written to the log buffer in fragments. Should not be greater than the network MTU |
| Term Length | Pub/Img | The length of each Term buffer. Must be a power of 2 between 64 KB and 1 GB inclusive |
| Page Size | Pub/Img | Page size in bytes to align all files created by the Media Driver to (set their length to a multiple of). Must be between 4 KB (the default) and 1 GB inclusive. The file system must support the requested size. Also used to touch a byte in each page of a log buffer file on creation if not using sparse files |
| Default Frame Header | Pub/Img | The default values written to each 32-byte message header. Contains StreamId, SessionId, TermId and other values |
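The Tail Counter layout described in the table (a 64-bit value with the TermId in the top 32 bits) can be illustrated with a pair of pack and unpack helpers. The names are hypothetical, and the sketch assumes the low 32 bits hold the tail offset within the Term.

```java
// Hypothetical sketch of the 64-bit Tail Counter layout; names are illustrative only.
final class TailCounterSketch {
    /** Combines a TermId (top 32 bits) and a tail offset (low 32 bits) into one long. */
    static long pack(int termId, int tailOffset) {
        return ((long) termId << 32) | (tailOffset & 0xFFFF_FFFFL);
    }

    /** Extracts the TermId. The arithmetic shift keeps negative TermIds negative. */
    static int termId(long rawTail) {
        return (int) (rawTail >> 32);
    }

    /** Extracts the tail offset from the low 32 bits. */
    static int tailOffset(long rawTail) {
        return (int) (rawTail & 0xFFFF_FFFFL);
    }

    public static void main(String[] args) {
        long rawTail = pack(1005, 160);
        System.out.println("raw=0x" + Long.toHexString(rawTail));
        System.out.println("termId=" + termId(rawTail) + " tailOffset=" + tailOffset(rawTail));
    }
}
```

Keeping both values in a single 64-bit word means they can be read or updated together in one atomic operation.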