Erik Rigtorp

AVX loads and stores are atomic

On the latest CPU microarchitectures (Skylake and Zen 2) AVX/AVX2 128b/256b aligned loads and stores are atomic even though Intel and AMD don't officially guarantee this. I wrote a small program, isatomic, that I used to verify this.

The table below contains the results from testing on a few different CPUs. A ✅ indicates that loads and stores are atomic and ❌ indicates that they are not atomic.

| CPU | μarch | 128b aligned | 128b unaligned | 128b split cache line | 256b aligned | 256b unaligned | 256b split cache line | 512b aligned | 512b split cache line |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 3900X | Zen 2 |  |  |  |  |  |  |  |  |
| i7-8550U | Skylake |  |  |  |  |  |  |  |  |
| i7-10510U | Skylake |  |  |  |  |  |  |  |  |
| Xeon Gold 6143 | Skylake |  |  |  |  |  |  |  |  |
| i7-7820X | Skylake |  |  |  |  |  |  |  |  |
| i5-3330 | Ivy Bridge |  |  |  |  |  |  |  |  |
| AMD FX 8320 | Piledriver |  |  |  |  |  |  |  |  |

Update 2020-07-08: Travis Downs suggested that I should make sure the 16B unaligned load/store test crosses both the 16B and 32B alignment boundaries. This makes the previously succeeding 16B unaligned test fail on Zen 2.

What do the processor manuals say about atomic operations? The AMD64 Architecture Programmer's Manual only guarantees that memory operations up to 8 bytes wide and CMPXCHG16B are atomic1. The Intel® 64 and IA-32 Architectures Software Developer's Manual makes similar guarantees2. Additionally, the Intel manual specifically states that AVX instructions do not have any atomicity guarantees3.

Checking instruction μop tables we can see that on AMD Zen 2, 128b and 256b loads and stores execute as 1 μop, while on Intel Skylake loads are 1 μop and stores are 2 μops. Since loads are only a single μop, this indicates that the data path between the AVX register file and the L1 data cache is 256 bits wide. It's then likely that AVX load and store operations are indeed atomic.

Using data from WikiChip I compiled the table below, which shows the data path width (between load-store units, execution units and cache) and the AVX execution unit width. Unsurprisingly, there seems to be a relationship between the minimum of these widths and the maximum AVX load/store width that is atomic.

| μarch | Data path width | AVX execution unit width |
| --- | --- | --- |
| Sandy Bridge | 128b | 256b |
| Haswell | 256b | 256b |
| Skylake (client) | 256b | 256b |
| Skylake (server) | 512b | 512b |
| Sunny Cove | 512b | ? |
| Zen / Zen+ | 256b | 128b |
| Zen 2 | 256b | 256b |

The isatomic tool works by running a thread on each available CPU that loads a value from memory, checks that the load was atomic, stores a new value to the same location, and then repeats this a number of times. Initially I used intrinsic functions to implement this, but when compiling with -march=native, 256b loads are sometimes compiled as two 128b loads, as recommended by Intel for Sandy Bridge and Ivy Bridge4. Instead I ended up using inline assembly:

for (size_t i = 0; i < iters; ++i) {
    int x;
    double y = i % 2 ? 0 : -1;
    asm("vmovdqa %3, %%ymm0;"           // ① load 256b from buf into YMM0
        "vmovmskpd %%ymm0, %0;"         // ② extract the 4 double sign bits into x
        "vmovq %2, %%xmm1;"             // move y into XMM1
        "vpbroadcastq %%xmm1, %%ymm2;"  // broadcast y into all 4 lanes of YMM2
        "vmovdqa %%ymm2, %1;"           // ③ store 256b from YMM2 back to buf
        : "=r"(x), "=m"(buf[0])
        : "r"(y), "m"(buf[0])
        : "%ymm0", "%xmm1", "%ymm2");
    // ④ count occurrences of each value of x
}

The above code ① loads 256b of aligned integer data from buf into YMM0, ② stores a bitmask containing the 4 double-precision sign bits into x, ③ broadcasts y and stores 4 doubles of all negative one or all zeros back to buf, and ④ finally counts how many times each value of x (in the range 0–15) has been seen. Values of x other than 0 and 15 mean that a torn (partial) load or store occurred. I specifically test the integer load/store instructions VMOVDQA and VMOVDQU, but the floating-point load/store instructions VMOVAPS/VMOVAPD will likely behave the same5.

There is also the question “SSE instructions: which CPUs can do atomic 16B memory operations?” on Stack Overflow with additional results and discussion.

I would appreciate it if readers could help me test additional microarchitectures. It would be great to also test the AVX-512 extensions, but I currently don’t have easy access to any machine that supports them.

Thanks to Matt Godbolt and YumiYumiYumi for reporting results for additional CPUs, and to Thiago Macieira for discovering that 256b intrinsics were sometimes compiled as 2×128b loads/stores.

Discussion on Reddit r/simd.


  1. “Single load or store operations (from instructions that do just a single load or store) are naturally atomic on any AMD64 processor as long as they do not cross an aligned 8-byte boundary. Accesses up to eight bytes in size which do cross such a boundary may be performed atomically using certain instructions with a lock prefix, such as XCHG, CMPXCHG or CMPXCHG8B, as long as all such accesses are done using the same technique. (Note that misaligned locked accesses may be subject to heavy performance penalties.) CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions).” ↩︎

  2. “The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically: Reading or writing a byte, Reading or writing a word aligned on a 16-bit boundary, Reading or writing a doubleword aligned on a 32-bit boundary.

    The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically: Reading or writing a quadword aligned on a 64-bit boundary, 16-bit accesses to uncached memory locations that fit within a 32-bit data bus,

    The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically: Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

    Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

    An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section, such page faults may occur even if all accesses are to the same page.” ↩︎

  3. “AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.” ↩︎

  4. Intel® 64 and IA-32 Architectures Optimization Reference Manual:

    “256-bit Fetch versus Two 128-bit Fetches

    On Sandy Bridge and Ivy Bridge microarchitectures, using two 16-byte aligned loads are preferred due to the 128-bit data path limitation in the memory pipeline of the microarchitecture.

    To take advantage of Haswell microarchitecture’s 256-bit data path microarchitecture, the use of 256-bit loads must consider the alignment implications. Instruction that fetched 256-bit data from memory should pay attention to be 32-byte aligned. If a 32-byte unaligned fetch would span across cache line boundary, it is still preferable to fetch data from two 16-byte aligned address instead.” ↩︎

  5. See The microarchitecture of Intel, AMD and VIA CPUs by Agner Fog:

    “The execution units are divided into domains as described on page 113, and there is sometimes a delay of one clock cycle when the output of an instruction in the integer domain is used as input for an instruction in the floating point domain.

    However, such delays are few on the Skylake processor. I found no such delays in the following cases:

    • when a floating point Boolean instruction, such as ORPS is used with integer data
    • when a wrong type of move instruction is used, e.g. MOVPS or MOVDQA
    • when a wrong type of shuffle instruction is used, e.g. SHUFPS or PSHUFD
    • when a wrong type of blend instruction is used, e.g. VPBLENDD or BLENDPS”