memory ordering is not what you think

Your compiler reorders your code. Your CPU reorders your code. Your cache reorders your code. The memory you think you wrote is, in various real senses, not there yet. And yet most programs work, because for decades the dominant CPU was unusually well-behaved. That is changing, and the cracks are where the interesting bugs live.

I first ran into this when some lock-free code that had been running fine on x86 started losing messages on a Graviton instance. It took me a week to figure out why. This post is the explanation I needed then.

the myth

The myth is that instructions execute in the order you wrote them. They do not. The machine underneath is a deeply pipelined, out-of-order, superscalar engine that will execute any two independent instructions in either order, in parallel, or both, whichever is faster. It puts them back in program order at the very end, for your thread, observed from your thread.

The catch is that last clause. Program order is guaranteed for a single thread observing itself. It is not guaranteed for another thread watching the same memory. What thread B sees, when it reads the addresses thread A wrote, is defined by a memory model, a formal contract between the CPU and the programmer about which reorderings the hardware may perform, and which it may not.

Every architecture has a different one. x86 has a strong model called TSO. ARM has a weak (relaxed) model. The difference is not theoretical. Code that has been running correctly on Intel for a decade will sporadically lose messages on a Graviton or an M-series core, and the bug will not reproduce under a debugger because the reorderings only happen when the stars align.

the store buffer

The root cause is a small piece of hardware called the store buffer. A store to memory is expensive: even an L1 hit is about 4 cycles, and a cache line that needs to be fetched before write is much worse. To avoid stalling the pipeline on every store, the CPU writes the store into a per-core buffer and moves on. The buffer drains to the cache asynchronously, when the cache line is available and coherence permits.

A store goes into the buffer and the core continues. A subsequent load from the same core, to a different address, can execute immediately. It does not wait for the store to drain. From the core's own viewpoint, program order is preserved (store-buffer forwarding makes its own stores visible to its own loads). From another core, the store is not visible yet. That gap is the entire memory-ordering problem.

Two consequences fall out of this design:

Store-buffer forwarding. A core always sees its own most recent store, even if it has not drained to cache. So a single thread never observes its own stores out of order. The madness only starts when a second thread is involved.
Store-load reordering. A later load can pass an earlier store to a different address, because the load does not wait for the store buffer to drain. This is the only reordering x86 permits, and it is the one that breaks code that assumed "stores are visible immediately."

x86 TSO

x86's model is TSO (Total Store Order), formalized by Owens, Sarkar, and Sewell in 2009. The model gives you three guarantees and one allowance:

Stores from a single core become visible to other cores in the order they were issued. No store-store reordering.
Loads do not reorder with loads. Load-load preserved.
Loads do not reorder with earlier stores to the same address. Load-after-store preserved.
The one allowance: a load may be reordered after an earlier store to a different address. This is the store-buffer effect. The load runs while the store is still in the buffer.

The fourth item is the entire relaxation of x86. Everything else is sequentially consistent. That makes x86 feel safe, and it is, mostly. "Mostly" is not a memory model.
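That one allowance can be written down as a litmus test, usually called store buffering. What follows is a minimal C++ sketch, not a rigorous harness: the name count_both_zero and its parameters are my own. Each thread stores to one variable and then loads the other. If both loads return 0, each load ran ahead of the other thread's store. With seq_cst on every access the C++ standard forbids that outcome. With release stores and acquire loads it is permitted, on x86 as much as on ARM.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Store-buffering litmus test. Runs the two-thread pattern `iterations`
// times and returns how many runs ended with both loads reading 0.
// `count_both_zero` is a hypothetical helper name for this sketch.
int count_both_zero(int iterations, bool use_seq_cst) {
    const std::memory_order store_order =
        use_seq_cst ? std::memory_order_seq_cst : std::memory_order_release;
    const std::memory_order load_order =
        use_seq_cst ? std::memory_order_seq_cst : std::memory_order_acquire;

    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t0([&] {
            x.store(1, store_order);   // store to x ...
            r1 = y.load(load_order);   // ... then load from y
        });
        std::thread t1([&] {
            y.store(1, store_order);   // store to y ...
            r2 = x.load(load_order);   // ... then load from x
        });
        t0.join();
        t1.join();

        assert(r1 != -1 && r2 != -1);  // both threads recorded a result
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero(n, true) must return 0 on any conforming implementation. Calling it with false may return a nonzero count, but thread startup dominates this naive loop, so a zero there proves nothing about the hardware.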

ARM

ARM permits much more. Store-Store, Load-Load, Load-Store, and Store-Load can all be reordered, subject to explicit dependency ordering (a load that depends on a prior load's address is ordered, a store that depends on a prior load's value is ordered, etc). The model is described as a long list of allowed reorderings plus a set of barrier and acquire/release instructions that restore ordering where you need it.

The practical effect: on ARM, two consecutive stores to different addresses can be observed by another core in either order. On x86 they cannot. If you wrote a data-then-flag pattern and relied on x86's store-store ordering, your code is correct on Intel and subtly broken on every ARM server in the cloud. This is why Apple Silicon and AWS Graviton have been a slow-motion reckoning for a lot of lock-free code.

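First op, then second op     x86 (TSO)    ARM
Load, then load              ordered      reorder
Load, then store             ordered      reorder
Store, then store            ordered      reorder
Store, then load             reorder      reorder
Dependent accesses           ordered      ordered

Read the table as "may the first op be reordered after the second?" The single reorder cell in the x86 column is the entire difference between x86 and sequential consistency. The ARM column is almost all reorder. You get to choose what stays ordered, via barriers or acquire/release.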

the litmus test

Two threads, two shared variables, both initialized to zero.

// x and y initially 0

// Thread 0            // Thread 1
x = 1;                 r1 = y;
y = 1;                 r2 = x;

Can r1 = 1, r2 = 0 ever happen? Can Thread 1 see the write to y but miss the write to x, even though Thread 0 wrote x first?

On x86 TSO: no. Thread 0's stores are ordered (store-store preserved). If Thread 1 sees y = 1, then x = 1 must already be visible. So r2 must read 1. No barrier needed.
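On ARM relaxed: yes. The two stores can become visible in either order, and Thread 1's two loads can be reordered as well. Thread 1 can read y = 1 and still read x = 0. To forbid it you need ordering on both sides: a barrier between the two stores (or release semantics on the second store), and a barrier between the two loads (or acquire semantics on the first load).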

This is the message-passing pattern. It shows up everywhere: publishing a pointer and a ready flag, enqueueing an item and incrementing a count, writing a struct and setting its valid bit. The producer does the data store first and the flag store second. The consumer reads the flag first and the data second. On x86 it works for free. On ARM it is broken without a barrier, and the breakage is rare and non-reproducible. I spent a week on this once. I do not recommend it.

The producer stores data then flag. The consumer loads flag then data. The forbidden outcome r1=1, r2=0 ("I saw the flag but not the data") is impossible on x86 and merely unlikely on ARM. A release store on the flag plus an acquire load on the flag closes the window on every architecture.
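Here is roughly what that looks like in C++, as a sketch rather than production code. The names message_pass_once, payload, and ready are mine. The payload is a plain int on purpose: the release store on the flag and the acquire load that observes it are what make the read of the payload safe.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing with a release store and an acquire load on the flag.
// Returns the payload value the consumer observed.
// All names here are hypothetical, chosen for this sketch.
int message_pass_once(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the flag
    int seen = -1;

    std::thread producer([&] {
        payload = value;                               // data first
        ready.store(true, std::memory_order_release);  // flag second
    });
    std::thread consumer([&] {
        // Flag first: spin until the release store is observed.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;                                // data second
    });
    producer.join();
    consumer.join();

    // The acquire load synchronized with the release store, so the
    // consumer cannot have read the stale payload.
    assert(seen == value);
    return seen;
}
```

If both orderings were weakened to memory_order_relaxed, the read of payload would be a data race, and "I saw the flag but not the data" would be back on the table.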

barriers

A memory barrier is an instruction that tells the CPU to drain whatever is in its store buffer before proceeding past this point (a store barrier), or to not let any load cross this point until earlier loads have completed (a load barrier), or both (a full barrier). On x86 that is mfence, sfence, lfence. On ARM it is dmb, dsb, isb, and the lighter-weight ldar / stlr (acquire-load / release-store) that order only what they need to.
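In C++ the closest thing to a standalone barrier is std::atomic_thread_fence. The sketch below redoes the flag-and-data pattern with relaxed accesses on the flag and explicit fences around them. The function name fenced_message_pass_once is my own, and how a compiler lowers each fence (typically some dmb variant on ARM, often no instruction at all on x86) is up to the implementation, so treat the mapping as an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing using standalone fences instead of ordered accesses.
// `fenced_message_pass_once` is a hypothetical name for this sketch.
int fenced_message_pass_once(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = value;
        // Store-side barrier: the payload write may not move below it.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Load-side barrier: the payload read may not move above it.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });
    producer.join();
    consumer.join();

    assert(seen == value);  // guaranteed by the fence pair
    return seen;
}
```

A fence orders everything on its side of the line, not just one variable, which is why it tends to cost more than the targeted alternatives.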

The cost is real. A full barrier is tens of cycles on a modern core. It forces the store buffer to drain, which means waiting for the cache coherence protocol, which means waiting for the bus. This is why modern languages give you the lighter options:

Release store (stlr on ARM, mov on x86, release is free on TSO). All earlier loads and stores on this core are ordered before the release store, so they are visible to any core whose acquire load of the same address observes the stored value.
Acquire load (ldar on ARM, free on x86). No later load or store on this core moves before the acquire.
Together, release + acquire on the same variable give you message-passing on every architecture without a full barrier. This is what std::atomic<T> with memory_order_release / memory_order_acquire gets you in C++, and what Rust's Ordering::Release / Ordering::Acquire gets you.

Use the weakest ordering that is correct. Stronger than necessary costs cycles. Weaker than necessary costs bugs. memory_order_seq_cst (the default in most languages) is correct and slow. memory_order_relaxed is fast and almost never what you want for cross-thread communication. Release/acquire is the sweet spot for almost every publish pattern.
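For contrast, here is one of the few places where relaxed is the right answer: a counter where only the final total matters and nothing else is published through it. This is a sketch with names of my own choosing (count_events, hits). Each increment is still atomic, and joining the threads is what makes the final read see every increment.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// A statistics counter bumped by several threads with relaxed ordering.
// `count_events` and `hits` are hypothetical names for this sketch.
long count_events(int num_threads, int per_thread) {
    assert(num_threads >= 0 && per_thread >= 0);

    std::atomic<long> hits{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic read-modify-write, but no ordering with respect
                // to any other memory access.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // join() provides the synchronization

    return hits.load(std::memory_order_relaxed);
}
```

The moment a reader uses the counter to decide that some other data is ready, relaxed stops being enough and you are back to release/acquire.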

seqlocks

A sequence lock is a reader-writer pattern for mostly-read data where readers do not want to take a lock. The writer does:

seq++;          // odd: write in progress
write(data);
seq++;          // even: write done

The reader does:

do {
  s1 = seq;              // must be acquire-ish
  read(data);
  s2 = seq;              // needs a read barrier before it
} while (s1 != s2 || (s1 & 1));

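If s1 == s2 and even, no write happened during the read, and the data is consistent. On x86 this works as written at the hardware level, because loads are not reordered with other loads and stores are not reordered with other stores (the compiler still has to be told not to reorder them). On ARM, the reads of data can be reordered around the reads of seq, so you can see a torn write with an even sequence number. To make it correct on ARM, the data loads have to stay between the two seq loads. The first seq load is an acquire (smp_load_acquire in kernel terms). The second needs a read barrier (smp_rmb) in front of it, because an acquire on that load would not stop the earlier data loads from sinking below it. The writer needs the mirror image: write barriers so that the odd seq is visible before the data, and the data is visible before the even seq.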
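A portable version in C++ might look like the sketch below. It assumes a single writer, and the names SeqLockPair and count_torn_reads are mine, not from any library. The data fields are relaxed atomics so that a read which overlaps a write is not a data race in the language's eyes. The reader uses an acquire load for the first seq read and an acquire fence before the second.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// A single-writer seqlock protecting two 64-bit values.
// `SeqLockPair` is a hypothetical type for this sketch.
struct SeqLockPair {
    std::atomic<std::uint32_t> seq{0};
    std::atomic<std::uint64_t> a{0}, b{0};

    void write(std::uint64_t va, std::uint64_t vb) {
        std::uint32_t s = seq.load(std::memory_order_relaxed);
        assert((s & 1) == 0);  // single writer: no write already in progress
        seq.store(s + 1, std::memory_order_relaxed);          // odd
        std::atomic_thread_fence(std::memory_order_release);  // odd seq before data
        a.store(va, std::memory_order_relaxed);
        b.store(vb, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // data before even seq
    }

    void read(std::uint64_t& va, std::uint64_t& vb) const {
        std::uint32_t s1, s2;
        do {
            s1 = seq.load(std::memory_order_acquire);  // data loads stay below
            va = a.load(std::memory_order_relaxed);
            vb = b.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);  // data loads stay above
            s2 = seq.load(std::memory_order_relaxed);
        } while (s1 != s2 || (s1 & 1));
    }
};

// Writer publishes (i, 2*i); reader counts snapshots where b != 2*a.
int count_torn_reads(std::uint64_t writes) {
    SeqLockPair p;
    std::atomic<bool> done{false};
    int torn = 0;

    std::thread writer([&] {
        for (std::uint64_t i = 1; i <= writes; ++i) p.write(i, 2 * i);
        done.store(true, std::memory_order_release);
    });
    std::thread reader([&] {
        while (!done.load(std::memory_order_acquire)) {
            std::uint64_t a = 0, b = 0;
            p.read(a, b);
            if (b != 2 * a) ++torn;
        }
    });
    writer.join();
    reader.join();
    return torn;
}
```

count_torn_reads should return 0 on any architecture. Dropping either the acquire on the first load or the fence before the second is the kind of change that will still pass on x86 and fail, eventually, on ARM.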

The Linux kernel's read_seqcount_begin / read_seqcount_retry are exactly this, written portably. The whole READ_ONCE / WRITE_ONCE / smp_store_release / smp_load_acquire vocabulary in the kernel exists to express these orderings without pulling in the full (and slow) smp_mb() where it is not needed. The formalization of this by Paul McKenney and his co-authors into the Linux Kernel Memory Model (LKMM), checked by the herd7 tool, is one of the great underappreciated engineering-precision efforts of the last decade. If you want to go deeper than this post, start with his book and the herd7 documentation.

Memory models are contracts, not descriptions. The CPU is not telling you what it does. It is telling you what you are allowed to assume. Anything you assume beyond the contract is a bug that has not happened yet, and "has not happened yet" across a few billion cycles per second is a much shorter time than it sounds.

thanks: to everyone who pointed out on Twitter that the original version of this post conflated ldar/stlr ordering semantics with the older dmb barriers. They are related but not identical. The acquire/release semantics of ldar/stlr are actually stronger than dmb ish in some subtle cases involving RCpc vs RCsc. I have simplified here. See the ARMv8 ARM (Section B2.9) for the authoritative version.

references

Owens, Sarkar, Sewell (2009). "A Better x86 Memory Model: x86-TSO." The formalization of x86 that compiler writers actually use.
McKenney, et al. "A (Rare) Depth Tour of the Linux Kernel Memory Model" (LPC 2018). lpc.events. Plus the herd7 tool for checking litmus tests.
ARM. "ARM Architecture Reference Manual" (ARMv8, Section B2.9, "The memory model"). The authoritative source for ARM ordering. Not light reading.
Maranget, Sarkar, Sewell (2012). "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models." The best primer on relaxed models I have read.
Boehm, Demsky (2014). "Outlining the C++ Memory Model." ACM. Why memory_order_consume was a mistake.