Latency implications of virtual memory
This is a short guide describing the latency implications of the virtual memory abstraction. If you are building systems that require low and predictable latency, such as realtime audio processing, control systems, and high frequency trading (HFT) / algorithmic trading systems, this guide will be useful to you. It is written from the perspective of the Linux kernel running on the AMD64 / x86-64 architecture, but the general concepts apply to most operating systems and CPU architectures.
In summary, to minimize latency introduced by the virtual memory abstraction you should:
- Minimize page faults by pre-faulting, locking and pre-allocating needed memory. Disable swap.
- Reduce TLB misses by minimizing your working set memory and utilizing huge pages.
- Prevent TLB shootdowns by not modifying your program's page tables after startup.
- Prevent stalls due to page cache writeback by not creating file backed writable memory mappings.
- Disable Linux transparent huge pages (THP).
- Disable Linux kernel samepage merging (KSM).
- Disable Linux automatic NUMA balancing.
When reading or writing to file backed memory that is not in the page
cache [1], or to anonymous memory [2] that has been swapped out,
the kernel must first load the data from the underlying storage device. This is
called a major page fault and incurs overhead similar to issuing a
write system call.
If the page is already in the page cache you will still incur a minor page fault
on first access after calling
mmap, during which the page table is
updated to point to the correct page [3]. For anonymous memory there
will also be a minor page fault on first write access, when an anonymous page is
allocated, zeroed and the page table updated [2]. In other words, memory mappings
are lazily initialized on first use. Note also that access to the page table
during a page fault is protected by locks, leading to scalability issues in
multi-threaded applications [4]. On systems with non-uniform
memory access (NUMA),
automatic NUMA memory balancing
will also cause page faults.
To avoid page faults you can pre-fault the needed memory and disable page cache
eviction for it using the
mlock system call, or pre-fault it at mapping time with the
MAP_POPULATE flag to
mmap (which populates the page tables but does not by itself lock the pages).
You can also disable swap system wide to
prevent anonymous memory from being swapped to disk. Automatic NUMA memory
balancing can be disabled using the following command:
echo 0 > /proc/sys/kernel/numa_balancing
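As a minimal sketch of the pre-faulting and locking described above, the following C helper maps anonymous memory with MAP_POPULATE and then tries to pin it with mlock. The function name alloc_prefaulted is hypothetical, and the sketch assumes that a failed mlock (for example because RLIMIT_MEMLOCK is too low) should be reported rather than treated as fatal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous memory, pre-fault it with MAP_POPULATE and
 * then try to pin it in RAM with mlock(). Returns NULL if the mapping fails.
 * `*locked` is set to 1 if mlock() succeeded and 0 otherwise.
 * alloc_prefaulted is a hypothetical helper name. */
static void *alloc_prefaulted(size_t len, int *locked)
{
    assert(len > 0 && locked != NULL);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *locked = (mlock(p, len) == 0);
    if (!*locked)
        perror("mlock");  /* the memory is still usable, just not pinned */
    return p;
}
```

Calling mlockall(MCL_CURRENT | MCL_FUTURE) once at startup is an alternative that covers the whole address space instead of a single mapping.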
You can monitor the number of page faults using:
ps -eo min_flt,maj_flt,cmd
perf stat -e faults,minor-faults,major-faults
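The same counters can also be read from inside the process, which makes it possible to check that a latency critical section did not fault. The sketch below uses getrusage for this; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Cumulative page fault counters of the calling process. */
struct fault_counts {
    long minor;
    long major;
};

/* Read the minor and major page fault counters via getrusage(). */
static struct fault_counts read_fault_counts(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);  /* can only fail on invalid arguments */
    (void)rc;
    struct fault_counts c = { ru.ru_minflt, ru.ru_majflt };
    return c;
}
```

Take one reading before and one after the critical section; if the memory was pre-faulted and locked, the difference between the two should stay at zero.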
The translation lookaside buffer (TLB) is an on-CPU cache that maps virtual to physical addresses. These mappings are maintained for pages, typically of size 4 KiB, 2/4 MiB or 1 GiB. Usually there are separate TLBs for data (DTLB) and instructions (ITLB), with a shared second level TLB (STLB) [5]. The TLB has a limited number of entries, and if an address is not found in the TLB or STLB, the page table data in the CPU caches or main memory needs to be referenced. This is called a TLB miss [6]. Just as a CPU cache miss is more expensive than a cache hit, a TLB miss is more expensive than a TLB hit.
You can minimize TLB misses by reducing your working set size, making sure to pack your data into as few pages as possible. Additionally, you can utilize larger page sizes than the default 4 KiB. These larger pages are called huge pages [7] and allow you to reference more data using fewer pages.
TLB usage can be monitored using:
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses
If the above command shows that your workload produces a large fraction of TLB misses, huge pages will help reduce that.
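One way to request huge pages explicitly is the MAP_HUGETLB flag to mmap. The sketch below assumes a 2 MiB huge page size and assumes that huge pages were reserved beforehand (for example through /proc/sys/vm/nr_hugepages); if the request fails it falls back to regular pages. The helper name map_huge_or_regular is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB, the usual default on x86-64. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/* Try to map at least `len` bytes backed by reserved huge pages. If that
 * fails, fall back to regular pages. The mapped length (rounded up to a
 * multiple of the huge page size) is stored in *mapped_len and *used_huge
 * tells which path was taken. Returns NULL if both attempts fail. */
static void *map_huge_or_regular(size_t len, size_t *mapped_len, int *used_huge)
{
    assert(len > 0 && mapped_len != NULL && used_huge != NULL);
    size_t rounded = (len + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                   flags | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)
        p = mmap(NULL, rounded, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *mapped_len = rounded;
    return p;
}
```

The caller should pass the rounded length stored in *mapped_len to munmap, not the originally requested length.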
Most processors do not provide coherence guarantees for TLB mappings. Instead the kernel provides this guarantee using a mechanism called a TLB shootdown. It operates by sending inter-processor interrupts (IPIs) that run kernel code to invalidate the stale TLB entries [8]. TLB shootdowns cause each affected core to context switch into the kernel and thus cause latency spikes for the processes running on the affected cores. They also cause TLB misses when an address with an invalidated page table entry is subsequently accessed.
Any operation that narrows a process’ access to memory, like
mprotect, will cause a TLB shootdown. Calls to the C standard library allocator
(free, etc.) can trigger such operations,
but not necessarily on each invocation. Other causes of TLB shootdowns are:
transparent huge pages (THP), memory compaction, kernel samepage
merging (KSM), automatic NUMA memory balancing, page migration
and page cache writeback.
To avoid TLB shootdowns you can map all needed memory at program startup and
avoid calling any functions that modify the page table after that. The
mimalloc allocator can be tuned to
allocate huge pages at program startup and to
never return memory to the OS.
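If you stay with the glibc allocator instead of mimalloc, a related and commonly used technique is to tell malloc never to hand memory back to the kernel. This is a different mechanism from the mimalloc tuning mentioned above and is shown only as a sketch; it assumes glibc, and the helper name keep_heap_resident is hypothetical.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>
#include <string.h>

/* Ask glibc malloc to keep freed memory instead of returning it to the OS.
 * Returns 1 if both options were accepted, 0 otherwise. Assumes glibc. */
static int keep_heap_resident(void)
{
    /* Never shrink the top of the heap when memory is freed. */
    int ok_trim = mallopt(M_TRIM_THRESHOLD, -1);
    /* Never serve large requests with mmap(), so that free() does not
     * end up calling munmap() for them. */
    int ok_mmap = mallopt(M_MMAP_MAX, 0);
    return ok_trim && ok_mmap;
}
```

Call it once at startup, before the first large allocation. The heap can then still grow, but freed memory stays mapped in the process.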
You can monitor the number of TLB shootdowns in
I wrote a test program
tlbshootdown.c to demonstrate how
munmap triggers TLB shootdowns:
perf stat -e tlb:tlb_flush ./tlbshootdown 100000

 Performance counter stats for './tlbshootdown 100000':

           100,016      tlb:tlb_flush

       0.260283596 seconds time elapsed

       0.017426000 seconds user
       0.232625000 seconds sys
In a multi-threaded program this would have triggered 100000 TLB shootdowns.
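The source of tlbshootdown.c is not reproduced here. As a rough illustration of the kind of loop that produces this behavior, the following sketch maps, touches and unmaps a single page repeatedly. It is not the author's program, and the function name map_touch_unmap is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map, touch and unmap one page `iterations` times. Every munmap() of a
 * touched page removes a live page table entry, so the kernel has to flush
 * the corresponding TLB entry. Returns the number of completed iterations. */
static long map_touch_unmap(long iterations)
{
    assert(iterations >= 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long done = 0;
    for (; done < iterations; done++) {
        char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            break;
        *(volatile char *)p = 1;  /* fault the page in so a translation exists */
        if (munmap(p, page) != 0)
            break;
    }
    return done;
}
```

Running such a loop under perf stat -e tlb:tlb_flush should show a flush count that grows with the iteration count.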
Page cache writeback
When a page in the page cache [1] has been modified it is marked as
dirty and needs to be eventually written back to disk. This process is called
writeback and is triggered automatically on a timer or when specifically
requested using system calls such as
msync. If any of the dirty pages are part of a writable memory
mapping, the writeback process must first update the page table to mark the page
as read-only before writing it to disk. Any subsequent memory write to the page
will cause a page fault, letting the kernel update the page cache state to dirty
and mark the page writable again. In practice this means that writeback causes
TLB shootdowns and that writes to pages that are currently being written to disk
must stall until the disk write is complete. This leads to latency spikes for
any process that is using file backed writable memory mappings.
To avoid latency spikes due to page cache writeback you must not create any
file backed (or more precisely page cache backed) writable memory mappings.
Creating anonymous writable memory mappings using
mmap(...MAP_ANONYMOUS) or by
mapping files on Linux
filesystem is fine.
I wrote
writeback.cpp to demonstrate this effect. On
my Ryzen 3900X with a Samsung 970 EVO NVMe SSD running Linux 5.7 I get latency
spikes of hundreds of microseconds due to writeback activity:
# ./writeback 9 $HOME/junkfile.bin
threshold: 17792 ns
pid: 382129
jitter 900985 ns
jitter 19036 ns
jitter 33283 ns
jitter 18515 ns
jitter 777301 ns
jitter 715154 ns
jitter 118063 ns
jitter 661983 ns
jitter 18676 ns
Transparent huge pages
Linux transparent huge page (THP) support allows the kernel to automatically promote regular memory pages into huge pages. Huge pages reduce TLB pressure, but THP support can introduce latency spikes.
THP will try to automatically promote pages into huge pages at the time of allocation during minor page faults. Additionally, the khugepaged daemon runs in the background and will try to promote contiguous ranges of virtual memory into huge pages. If no huge pages are available when requested, the kernel will try to compact memory to make huge pages available.
Both the background promotion of pages by khugepaged and the on-demand compaction of pages by kcompactd cause latency spikes, since they need to move data around and update the page tables. There is also ongoing work to enable proactive memory compaction, which would become another source of latency spikes.
To avoid these latency spikes I recommend disabling THP and instead relying on
manually requesting huge pages. THP can be disabled by supplying the kernel
command line parameter
transparent_hugepage=never or by running the following command:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
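A latency sensitive program may want to verify at startup that THP really is off. The file written above lists all modes and marks the active one with square brackets, for example "always madvise [never]". The sketch below parses that format; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Extract the active mode from a line such as "always madvise [never]".
 * The active value is the one enclosed in square brackets.
 * Returns 0 on success, -1 if there is no bracketed value or it does not
 * fit into `out`. */
static int thp_active_mode(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;
    size_t n = (size_t)(close - open - 1);
    if (n + 1 > out_size)
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Returns 1 if THP is set to "never", 0 if another mode is active and
 * -1 if the setting could not be read or parsed. */
static int thp_is_disabled(void)
{
    char line[128], mode[32];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f)
        return -1;
    char *got = fgets(line, sizeof line, f);
    fclose(f);
    if (!got || thp_active_mode(line, mode, sizeof mode) != 0)
        return -1;
    return strcmp(mode, "never") == 0;
}
```

Logging a warning when thp_is_disabled() does not return 1 is usually enough; whether to refuse to start is a policy decision for the application.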
Kernel samepage merging
Linux kernel samepage merging (KSM) is a feature that de-duplicates
memory pages that contain identical data. The merging process needs to lock the
page tables and issue TLB shootdowns, leading to unpredictable memory access
latencies. KSM only operates on memory pages that have been opted in to samepage
merging with madvise(...MADV_MERGEABLE). If needed, KSM can be disabled system
wide by running the following command:
echo 0 > /sys/kernel/mm/ksm/run
NUMA and page migration
Non-uniform memory access (NUMA) occurs when the memory access time varies with memory location and processor core. You need to take this into account when designing your system.
Additionally, Linux supports automatic page fault based NUMA memory balancing and manual page migration of memory between NUMA nodes. Migration of memory pages between NUMA nodes will cause TLB shootdowns and page faults for applications using the affected memory.
Automatic NUMA memory balancing can be disabled with the following command:
echo 0 > /proc/sys/kernel/numa_balancing
Also make sure to disable the numad user space NUMA memory balancing service.
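Since the KSM and NUMA balancing settings above are system wide and can be changed by other tools, it can be useful to check them from the application at startup. The sketch below reads an integer tunable such as /sys/kernel/mm/ksm/run or /proc/sys/kernel/numa_balancing and reports whether it is zero. The function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a procfs or sysfs file.
 * Returns 0 on success and stores the value in *value, -1 otherwise. */
static int read_int_file(const char *path, long *value)
{
    assert(path != NULL && value != NULL);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if the tunable reads as 0 (off), 0 if it has any other value
 * and -1 if it could not be read, for example because the feature is not
 * compiled into the running kernel. */
static int tunable_is_off(const char *path)
{
    long v;
    if (read_int_file(path, &v) != 0)
        return -1;
    return v == 0;
}
```

A result of -1 for a missing file typically means the feature is absent, which for the purposes of this guide is as good as disabled.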
- Ulrich Drepper (2007). “What Every Programmer Should Know About Memory”. https://www.akkadia.org/drepper/cpumemory.pdf, https://lwn.net/Articles/250967/
- Stack Overflow. “What Every Programmer Should Know About Memory?”. https://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory
- “The Linux Kernel documentation”. https://www.kernel.org/doc/html/latest/index.html
- “AMD64 Architecture Programmer’s Manual”. https://developer.amd.com/resources/developer-guides-manuals/
- “Intel® 64 and IA-32 Architectures Software Developer Manuals”. https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html
- Félix Cloutier. “x86 and amd64 instruction reference”. https://www.felixcloutier.com/x86/
- Nitin Gupta. “Proactive compaction for the kernel”. https://lwn.net/Articles/817905/
- Nitin Gupta. “Proactive Compaction for the Linux kernel”. https://nitingupta.dev/post/proactive-compaction/
“The physical memory is volatile and the common case for getting data into the memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on the subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually gets into the backing storage device. The written pages are marked as dirty and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.” https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html?#page-cache ↩︎
“The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.” https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#anonymous-memory ↩︎
Travis Downs provided an interesting observation on anonymous vs file backed memory: “One note, although it might be a bit obscure for the article, is that there is a quite a difference in minor fault behavior when mapping in pages depending on whether they are file backed or not. If a page is backed by a file, it is subject to “fault around” which means that the kernel will map in nearby pages if they are also present in the page cache (i.e., it was a soft fault). You can tune how many pages are faulted in with /sys/kernel/debug/fault_around_bytes, which by default is 64k (16 pages).
The upshot of all this is reading a mapped file (for the first time) can be faster than reading from anonymous memory, since you get 1/16th of the minor faults when you are reading the file.”
You can find even more details on fault-around here: https://www.realworldtech.com/forum/?threadid=185310&curpostid=185310 ↩︎
The TLB hierarchy for Intel microarchitectures is documented in the “Intel® 64 and IA-32 Architectures Optimization Reference Manual” https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#optimization
“What happens after a L2 TLB miss?”. https://stackoverflow.com/questions/32256250/what-happens-after-a-l2-tlb-miss ↩︎
“The address translation requires several memory accesses and memory accesses are slow relatively to CPU speed. To avoid spending precious processor cycles on the address translation, CPUs maintain a cache of such translations called Translation Lookaside Buffer (or TLB). Usually TLB is pretty scarce resource and applications with large memory working set will experience performance hit because of TLB misses.
Many modern CPU architectures allow mapping of the memory pages directly by the higher levels in the page table. For instance, on x86, it is possible to map 2M and even 1G pages using entries in the second and the third level page tables. In Linux such pages are called huge. Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.” https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html#huge-pages ↩︎
On the ARM architecture TLB entries are invalidated using the