Performance impact of split locks

The x86-64 architecture allows unaligned memory access. It even allows atomic operations on data split across two cache lines. This type of atomic operation is called a “split lock”. The name likely comes from the LOCK prefix that is prepended to CPU instructions to make them atomic. There are some references to split locks in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, but their performance impact is not quantified. The best official Intel documentation on split locks I have found is in an archived Intel forum post:

A common atomic lock will not be transferred to a bus lock except in exceptional circumstances of either the memory of the lock residing in uncacheable memory or if the lock extends beyond a cache line boundary splitting cache lines.

Bus locks impact all architectures on all OS’s. Bus locks have a very high performance penalty of ~1000 cycles. It is highly recommended to avoid locks in uncacheable memory and to make sure the memory addresses of the locked are aligned.

The LWN.net article “Detecting and handling split locks” also has additional information and discussion on split locks.

A split lock causes the whole memory bus to be locked, whereas a normal atomic operation locks only the affected cache line. This means a split lock slows down memory accesses for the whole system, not only for the core issuing it.

Measuring split locks performance

Let’s measure how a split lock affects a simple atomic counter. We write a small program that atomically increments an integer one million times and measures how long each operation takes. For aligned atomic access:

constexpr std::size_t niter = 1'000'000;

std::atomic<uint32_t> v{0};
auto start = std::chrono::steady_clock::now();
for (std::size_t i = 0; i < niter; ++i) {
    v.fetch_add(1, std::memory_order_acquire);
}
auto stop = std::chrono::steady_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(
                 stop - start).count() / niter << '\n';

Running this on my computer I get 6 nanoseconds per operation.

For unaligned “split lock” atomic access:

alignas(64) char buf[128] = {};
// Offset 61 makes the 4-byte value span bytes 61..64 of the buffer,
// straddling the 64-byte cache-line boundary.
auto *v = reinterpret_cast<std::atomic<uint32_t>*>(buf + 61);
auto start = std::chrono::steady_clock::now();
for (std::size_t i = 0; i < niter; ++i) {
    v->fetch_add(1, std::memory_order_acquire);
}
auto stop = std::chrono::steady_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(
                 stop - start).count() / niter << '\n';

Running the above on my computer I get 1194 nanoseconds per operation. That’s nearly two hundred times slower, two orders of magnitude!

Detecting split locks

Intel CPUs have a performance counter, SQ_MISC.SPLIT_LOCK, that counts the number of split lock operations. On Linux systems you can read it with the perf command:

$ perf stat -e sq_misc.split_lock ./a.out

 Performance counter stats for './a.out':

         1,000,000      sq_misc.split_lock:u                                        

       1.203341403 seconds time elapsed

       1.199243000 seconds user
       0.000000000 seconds sys

In the above case the program issued 1,000,000 split lock operations.

Conclusion

Avoid split lock atomic operations. They are two orders of magnitude slower than aligned atomic operations, and they degrade memory performance for the entire system.