| # Atomics |
| |
| Rust pretty blatantly just inherits the memory model for atomics from C++20. This is not |
| due to this model being particularly excellent or easy to understand. Indeed, |
| this model is quite complex and known to have [several flaws][C11-busted]. |
| Rather, it is a pragmatic concession to the fact that *everyone* is pretty bad |
at modeling atomics. At the very least, we can benefit from existing tooling and
| research around the C/C++ memory model. |
| (You'll often see this model referred to as "C/C++11" or just "C11". C just copies |
the C++ memory model, and C++11 was the first version of the model, but it has
| received some bugfixes since then.) |
| |
| Trying to fully explain the model in this book is fairly hopeless. It's defined |
| in terms of madness-inducing causality graphs that require a full book to |
| properly understand in a practical way. If you want all the nitty-gritty |
| details, you should check out the [C++ specification][C++-model]. |
| Still, we'll try to cover the basics and some of the problems Rust developers |
| face. |
| |
| The C++ memory model is fundamentally about trying to bridge the gap between the |
| semantics we want, the optimizations compilers want, and the inconsistent chaos |
| our hardware wants. *We* would like to just write programs and have them do |
| exactly what we said but, you know, fast. Wouldn't that be great? |
| |
| ## Compiler Reordering |
| |
| Compilers fundamentally want to be able to do all sorts of complicated |
| transformations to reduce data dependencies and eliminate dead code. In |
| particular, they may radically change the actual order of events, or make events |
| never occur! If we write something like: |
| |
| <!-- ignore: simplified code --> |
| ```rust,ignore |
| x = 1; |
| y = 3; |
| x = 2; |
| ``` |
| |
| The compiler may conclude that it would be best if your program did: |
| |
| <!-- ignore: simplified code --> |
| ```rust,ignore |
| x = 2; |
| y = 3; |
| ``` |
| |
| This has inverted the order of events and completely eliminated one event. |
| From a single-threaded perspective this is completely unobservable: after all |
| the statements have executed we are in exactly the same state. But if our |
program is multi-threaded, we may have been relying on `x` to actually be
set to 1 before `y` was assigned. We would like the compiler to be
| able to make these kinds of optimizations, because they can seriously improve |
| performance. On the other hand, we'd also like to be able to depend on our |
| program *doing the thing we said*. |
| |
| ## Hardware Reordering |
| |
| On the other hand, even if the compiler totally understood what we wanted and |
| respected our wishes, our hardware might instead get us in trouble. Trouble |
| comes from CPUs in the form of memory hierarchies. There is indeed a global |
| shared memory space somewhere in your hardware, but from the perspective of each |
| CPU core it is *so very far away* and *so very slow*. Each CPU would rather work |
| with its local cache of the data and only go through all the anguish of |
talking to shared memory when it doesn't actually have that memory in
| cache. |
| |
| After all, that's the whole point of the cache, right? If every read from the |
| cache had to run back to shared memory to double check that it hadn't changed, |
| what would the point be? The end result is that the hardware doesn't guarantee |
| that events that occur in some order on *one* thread, occur in the same |
| order on *another* thread. To guarantee this, we must issue special instructions |
| to the CPU telling it to be a bit less smart. |
| |
| For instance, say we convince the compiler to emit this logic: |
| |
| ```text |
| initial state: x = 0, y = 1 |
| |
| THREAD 1 THREAD 2 |
| y = 3; if x == 1 { |
| x = 1; y *= 2; |
| } |
| ``` |
| |
| Ideally this program has 2 possible final states: |
| |
| * `y = 3`: (thread 2 did the check before thread 1 completed) |
| * `y = 6`: (thread 2 did the check after thread 1 completed) |
| |
| However there's a third potential state that the hardware enables: |
| |
| * `y = 2`: (thread 2 saw `x = 1`, but not `y = 3`, and then overwrote `y = 3`) |
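
To make this concrete, here's a minimal sketch of the same program in real Rust.
It uses `Relaxed` atomics so the sketch stays free of data races (doing this with
plain shared mutable accesses would be undefined behavior); the `y = 2` outcome
is still permitted, though you're only likely to actually observe it on
weakly-ordered hardware:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::thread;

static X: AtomicU32 = AtomicU32::new(0);
static Y: AtomicU32 = AtomicU32::new(1);

fn main() {
    let t1 = thread::spawn(|| {
        Y.store(3, Ordering::Relaxed);
        X.store(1, Ordering::Relaxed);
    });
    let t2 = thread::spawn(|| {
        if X.load(Ordering::Relaxed) == 1 {
            // With Relaxed, seeing `x == 1` does NOT imply seeing `y == 3`,
            // so this load may still return the initial value 1...
            let y = Y.load(Ordering::Relaxed);
            // ...making `y = 2` a legal final state.
            Y.store(y * 2, Ordering::Relaxed);
        }
    });
    t1.join().unwrap();
    t2.join().unwrap();
    println!("final y = {}", Y.load(Ordering::Relaxed));
}
```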
| |
| It's worth noting that different kinds of CPU provide different guarantees. It |
| is common to separate hardware into two categories: strongly-ordered and weakly-ordered. |
| Most notably x86/64 provides strong ordering guarantees, while ARM |
| provides weak ordering guarantees. This has two consequences for concurrent |
| programming: |
| |
| * Asking for stronger guarantees on strongly-ordered hardware may be cheap or |
| even free because they already provide strong guarantees unconditionally. |
| Weaker guarantees may only yield performance wins on weakly-ordered hardware. |
| |
| * Asking for guarantees that are too weak on strongly-ordered hardware is |
| more likely to *happen* to work, even though your program is strictly |
| incorrect. If possible, concurrent algorithms should be tested on |
| weakly-ordered hardware. |
| |
| ## Data Accesses |
| |
| The C++ memory model attempts to bridge the gap by allowing us to talk about the |
| *causality* of our program. Generally, this is by establishing a *happens |
| before* relationship between parts of the program and the threads that are |
| running them. This gives the hardware and compiler room to optimize the program |
| more aggressively where a strict happens-before relationship isn't established, |
| but forces them to be more careful where one is established. The way we |
communicate these relationships is through *data accesses* and *atomic
| accesses*. |
| |
| Data accesses are the bread-and-butter of the programming world. They are |
| fundamentally unsynchronized and compilers are free to aggressively optimize |
| them. In particular, data accesses are free to be reordered by the compiler on |
| the assumption that the program is single-threaded. The hardware is also free to |
| propagate the changes made in data accesses to other threads as lazily and |
| inconsistently as it wants. Most critically, data accesses are how data races |
| happen. Data accesses are very friendly to the hardware and compiler, but as |
| we've seen they offer *awful* semantics to try to write synchronized code with. |
| Actually, that's too weak. |
| |
| **It is literally impossible to write correct synchronized code using only data |
| accesses.** |
| |
| Atomic accesses are how we tell the hardware and compiler that our program is |
| multi-threaded. Each atomic access can be marked with an *ordering* that |
| specifies what kind of relationship it establishes with other accesses. In |
| practice, this boils down to telling the compiler and hardware certain things |
| they *can't* do. For the compiler, this largely revolves around re-ordering of |
| instructions. For the hardware, this largely revolves around how writes are |
| propagated to other threads. The set of orderings Rust exposes are: |
| |
| * Sequentially Consistent (SeqCst) |
| * Release |
| * Acquire |
| * Relaxed |
| |
(Note: We explicitly do not expose the C++ *consume* ordering.)
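
In Rust, these orderings are the variants of the `std::sync::atomic::Ordering`
enum, and every atomic operation takes one as an explicit argument. As a quick
sketch of the shape of the API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static FLAG: AtomicUsize = AtomicUsize::new(0);

fn main() {
    // Plain loads and stores each take an ordering:
    FLAG.store(1, Ordering::Release);
    let _value = FLAG.load(Ordering::Acquire);
    // Read-modify-write operations like `fetch_add` happen atomically
    // and return the previous value:
    let _old = FLAG.fetch_add(1, Ordering::Relaxed);
}
```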
| |
| |
| ## Sequentially Consistent |
| |
| Sequentially Consistent is the most powerful of all, implying the restrictions |
| of all other orderings. Intuitively, a sequentially consistent operation |
| cannot be reordered: all accesses on one thread that happen before and after a |
| SeqCst access stay before and after it. A data-race-free program that uses |
| only sequentially consistent atomics and data accesses has the very nice |
| property that there is a single global execution of the program's instructions |
| that all threads agree on. This execution is also particularly nice to reason |
| about: it's just an interleaving of each thread's individual executions. This |
| does not hold if you start using the weaker atomic orderings. |
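
A minimal sketch of where that global agreement matters is the classic "store
buffering" pattern below: each thread stores to its own flag and then loads the
other thread's flag. With `SeqCst` everywhere, some store must come first in
the single global execution, so at least one load must see the other thread's
store. With any weaker ordering, both loads may return `false`.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

static X: AtomicBool = AtomicBool::new(false);
static Y: AtomicBool = AtomicBool::new(false);

fn main() {
    let t1 = thread::spawn(|| {
        X.store(true, Ordering::SeqCst);
        Y.load(Ordering::SeqCst) // did we see thread 2's store?
    });
    let t2 = thread::spawn(|| {
        Y.store(true, Ordering::SeqCst);
        X.load(Ordering::SeqCst) // did we see thread 1's store?
    });
    let (saw_y, saw_x) = (t1.join().unwrap(), t2.join().unwrap());
    // In the single global order one of the stores happened first, so the
    // other thread's load is guaranteed to observe it:
    assert!(saw_y || saw_x);
}
```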
| |
| The relative developer-friendliness of sequential consistency doesn't come for |
| free. Even on strongly-ordered platforms sequential consistency involves |
| emitting memory fences. |
| |
| In practice, sequential consistency is rarely necessary for program correctness. |
| However sequential consistency is definitely the right choice if you're not |
| confident about the other memory orders. Having your program run a bit slower |
| than it needs to is certainly better than it running incorrectly! It's also |
| mechanically trivial to downgrade atomic operations to have a weaker |
| consistency later on. Just change `SeqCst` to `Relaxed` and you're done! Of |
| course, proving that this transformation is *correct* is a whole other matter. |
| |
| ## Acquire-Release |
| |
| Acquire and Release are largely intended to be paired. Their names hint at their |
| use case: they're perfectly suited for acquiring and releasing locks, and |
| ensuring that critical sections don't overlap. |
| |
| Intuitively, an acquire access ensures that every access after it stays after |
| it. However operations that occur before an acquire are free to be reordered to |
| occur after it. Similarly, a release access ensures that every access before it |
| stays before it. However operations that occur after a release are free to be |
| reordered to occur before it. |
| |
| When thread A releases a location in memory and then thread B subsequently |
| acquires *the same* location in memory, causality is established. Every write |
| (including non-atomic and relaxed atomic writes) that happened before A's |
| release will be observed by B after its acquisition. However no causality is |
| established with any other threads. Similarly, no causality is established |
| if A and B access *different* locations in memory. |
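
Here's a minimal sketch of that causality guarantee as message-passing: the
relaxed write to `DATA` happens before the `Release` store, so once the
consumer's `Acquire` load observes `true`, the write to `DATA` is guaranteed
to be visible:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::thread;

static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed); // this write happens before the release...
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        // Spin until we acquire `true` from the producer's release...
        while !READY.load(Ordering::Acquire) {}
        // ...at which point the earlier write to DATA must be visible.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```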
| |
| Basic use of release-acquire is therefore simple: you acquire a location of |
| memory to begin the critical section, and then release that location to end it. |
| For instance, a simple spinlock might look like: |
| |
| ```rust |
| use std::sync::Arc; |
| use std::sync::atomic::{AtomicBool, Ordering}; |
| |
| fn main() { |
| let lock = Arc::new(AtomicBool::new(false)); // value answers "am I locked?" |
| |
| // ... distribute lock to threads somehow ... |
| |
| // Try to acquire the lock by setting it to true |
    while lock.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed).is_err() {}
| // broke out of the loop, so we successfully acquired the lock! |
| |
| // ... scary data accesses ... |
| |
| // ok we're done, release the lock |
| lock.store(false, Ordering::Release); |
| } |
| ``` |
| |
| On strongly-ordered platforms most accesses have release or acquire semantics, |
| making release and acquire often totally free. This is not the case on |
| weakly-ordered platforms. |
| |
| ## Relaxed |
| |
| Relaxed accesses are the absolute weakest. They can be freely re-ordered and |
provide no happens-before relationship. Still, relaxed operations are
atomic. That is, they don't count as data accesses and any read-modify-write
| operations done to them occur atomically. Relaxed operations are appropriate for |
| things that you definitely want to happen, but don't particularly otherwise care |
| about. For instance, incrementing a counter can be safely done by multiple |
| threads using a relaxed `fetch_add` if you're not using the counter to |
| synchronize any other accesses. |
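
As a minimal sketch of such a counter: every increment is an atomic
read-modify-write, so no updates are lost, even though `Relaxed` establishes
no ordering with any other memory accesses (the final read is synchronized by
joining the threads, not by the counter itself):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Atomic, so concurrent increments can't clobber each other:
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```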
| |
| There's rarely a benefit in making an operation relaxed on strongly-ordered |
| platforms, since they usually provide release-acquire semantics anyway. However |
| relaxed operations can be cheaper on weakly-ordered platforms. |
| |
| [C11-busted]: http://plv.mpi-sws.org/c11comp/popl15.pdf |
| [C++-model]: https://en.cppreference.com/w/cpp/atomic/memory_order |