Erik Rigtorp

Low latency tuning guide

This guide describes how to tune your AMD64/x86_64 hardware and Linux system for running real-time or low latency workloads.

The term latency in this context refers to the time between receiving some event and the time when the event was processed. For example:

  • The time from when a network packet is received by the NIC until an application has finished processing the packet.
  • The time from when a request is submitted to a queue until a worker thread has finished processing the request.

To achieve low latency this guide describes how to:

  • Maximize per core performance by maximizing CPU frequency and disabling power saving features.
  • Minimize jitter caused by interrupts, timers and other applications interfering with your workload.

You can measure the reduction in system jitter using my tool hiccups. The example below shows a system where core 3 was isolated; it experienced a maximum jitter of only 18 us, while the non-isolated cores saw maximum jitter in the millisecond range:

$ hiccups | column -t -R 1,2,3,4,5,6
cpu  threshold_ns  hiccups  pct99_ns  pct999_ns    max_ns
  0           168    17110     83697    6590444  17010845
  1           168     9929    169555    5787333   9517076
  2           168    20728     73359    6008866  16008460
  3           168    28336      1354       4870     17869

Hardware tuning

Enable performance mode

The system’s UEFI or BIOS firmware usually has a setting for the energy profile that adjusts the available CPU power states. Set this to “maximum performance” or equivalent.

Disable hyper-threading

Hyper-threading (HT) or simultaneous multithreading (SMT) is a technology that maximizes processor resource usage for workloads with low instructions per cycle (IPC). Since HT/SMT increases contention on processor resources, it’s recommended to turn it off if you want to reduce the jitter this contention introduces.

HT/SMT can be turned off in your system’s UEFI or BIOS firmware settings.
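
On recent kernels you can also check and control SMT at runtime through sysfs (this interface was added in kernel 4.19):

$ cat /sys/devices/system/cpu/smt/active
# echo off > /sys/devices/system/cpu/smt/control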

Enable Turbo Boost

Intel Turbo Boost and AMD Turbo Core allow the processor to automatically overclock itself as long as it stays within its power and thermal envelope. If you have good cooling (set the fan speed to maximum in the BIOS) and disable unused cores, either in the BIOS or using the CPU hotplug functionality, it’s possible to run your application cores continuously at the higher boost frequency.
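
For example, to take a core offline using CPU hotplug (core 7 is used here only as an illustration):

# echo 0 > /sys/devices/system/cpu/cpu7/online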

Check if turbo boost is enabled (this sysfs file is provided by the intel_pstate driver):

$ cat /sys/devices/system/cpu/intel_pstate/no_turbo

Output should be 0 if turbo boost is enabled.

Alternatively you can use cpupower to check the status of turbo boost:

$ cpupower frequency-info

Use the turbostat tool to verify the clock frequency of each core. Note that turbostat itself causes scheduling jitter and should not be run in production.
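
For example, to print per-core frequencies every 5 seconds (flag names may differ between turbostat versions):

# turbostat --interval 5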

Overclocking

Consider overclocking your processor. Running your processor at a higher frequency will reduce jitter and latency. It’s not possible to overclock Intel Xeon server processors, but you can overclock Intel’s consumer gaming processors and AMD’s processors.

Kernel tuning

Use the performance CPU frequency scaling governor

Use the performance CPU frequency scaling governor to maximize core frequency.

Set all cores to use the performance governor:

# find /sys/devices/system/cpu -name scaling_governor -exec sh -c 'echo performance > {}' ';'

This can also be done by using the tuned latency-performance profile:

# tuned-adm profile latency-performance

Verify that the performance governor is used with cpupower:

$ cpupower frequency-info

Isolate cores

By default the kernel scheduler will load balance all threads across all available cores. To stop other threads from interfering with your application threads you can use the kernel command line option isolcpus. It disables scheduler load balancing for the isolated cores and causes threads to be restricted to the non-isolated cores by default. Note that your critical application threads need to be explicitly pinned to the isolated cores in order to run there, as shown below.

For example, to isolate cores 1 through 7, add isolcpus=1-7 to your kernel command line.
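
After booting with this setting, pin your latency critical threads to the isolated cores, for example with taskset (myapp is a placeholder for your application binary):

$ taskset -c 1 ./myapp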

When using isolcpus the kernel will still create several kernel threads on the isolated cores. Some of these kernel threads can be moved to the non-isolated cores.

Try to move all kernel threads to core 0:

# pgrep -P 2 | xargs -i taskset -p -c 0 {}

Alternatively use tuna to move all kernel threads away from cores 1-7:

# tuna --cpus=1-7 --isolate

Verify by using the tuna command to show CPU affinities for all threads:

$ tuna -P

Additionally, kernel workqueues need to be moved away from the isolated cores. To move all workqueues to core 0 (cpumask 0x1):

# find /sys/devices/virtual/workqueue -name cpumask  -exec sh -c 'echo 1 > {}' ';'

Verify by listing current workqueue affinities:

$ find /sys/devices/virtual/workqueue -name cpumask -print -exec cat '{}' ';'

Finally, verify that the cores were successfully isolated by checking how many thread context switches occur per core:

# perf stat -e 'sched:sched_switch' -a -A --timeout 10000

The isolated cores should show a very low context switch count.

There is a work-in-progress patch set to improve task isolation even further; see “A full task-isolation mode for the kernel”.

Reducing timer tick interrupts

The scheduler tick runs regularly on each core in order to switch between runnable threads. This introduces jitter for latency critical applications. If you have isolated your application cores and are running a single application thread per isolated core, you can use the nohz_full kernel command line option to suppress most timer tick interrupts.

For example, to enable nohz_full on cores 1-7, add nohz_full=1-7 rcu_nocbs=1-7 to your kernel command line.

It’s important to note that the timer tick is only disabled when there is only a single runnable thread scheduled on the core. You can see the number of runnable threads per core in /proc/sched_debug.
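
A quick way to check the number of runnable threads per core (on newer kernels this file may instead live at /sys/kernel/debug/sched/debug):

# grep -E 'cpu#|nr_running' /proc/sched_debug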

The virtual memory subsystem runs a per-core statistics update task every 1 second by default. You can make it run less often by setting vm.stat_interval to a higher value:

# sysctl vm.stat_interval=120

Finally, you can verify that the timer interrupts are suppressed by inspecting /proc/interrupts or by running:

# perf stat -e 'irq_vectors:local_timer_entry' -a -A --timeout 1000

Expect isolcpus + nohz_full cores to show a timer interrupt every other second or so.

Interrupt affinity

Reduce jitter from interrupt processing by changing the CPU affinity of the interrupts. This can easily be done by running irqbalance. By default it will automatically isolate the cores specified by the kernel command line parameter isolcpus. You can also specify cores to isolate using the IRQBALANCE_BANNED_CPUS environment variable.

To isolate cores specified in isolcpus:

$ irqbalance --foreground --oneshot

To isolate core 3 (hexadecimal bitmask 0x8):

$ IRQBALANCE_BANNED_CPUS=8 irqbalance --foreground --oneshot

List CPU affinity for all IRQs:

$ find /proc/irq/ -name smp_affinity_list -print -exec cat '{}' ';'
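
You can also pin an individual IRQ to a specific core directly through procfs. For example, to route IRQ 24 (a hypothetical IRQ number, taken from the listing above) to core 0:

# echo 0 > /proc/irq/24/smp_affinity_list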

Finally verify that isolated cores are not receiving interrupts by monitoring /proc/interrupts:

$ watch cat /proc/interrupts

Disable mitigations for CPU vulnerabilities

This is application dependent, but consider disabling the mitigations for CPU vulnerabilities. The mitigations can have a considerable impact on performance. Add mitigations=off to your kernel command line to disable all mitigations.

Also consider using older CPU microcode without the microcode mitigations for CPU vulnerabilities.
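
Check which vulnerabilities affect your system and which mitigations are currently active:

$ grep -r . /sys/devices/system/cpu/vulnerabilities/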

Use cache partitioning

If your processor supports cache partitioning (Intel Cache Allocation Technology) consider using it to allocate most of the last-level cache (LLC) to your application.
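
A minimal sketch using the kernel’s resctrl interface, assuming a CAT-capable processor with a single L3 cache domain; the group name and the cache bitmask ff0 are examples only and depend on your hardware, and $APP_PID stands for your application’s PID:

# mount -t resctrl resctrl /sys/fs/resctrl
# mkdir /sys/fs/resctrl/low_latency
# echo "L3:0=ff0" > /sys/fs/resctrl/low_latency/schemata
# echo $APP_PID > /sys/fs/resctrl/low_latency/tasks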

Application design and tuning

Consider NUMA topology

TODO

Use huge pages

The translation lookaside buffer (TLB) has a limited number of entries. If your application tries to access a memory page whose translation is missing from the TLB, it causes a “TLB miss” requiring the MMU to fetch the page table entry from memory. The default page size is 4096 bytes; by using huge pages of 2 MB or 1 GB you can reduce the number of TLB misses for the same amount of actively used RAM.
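
For example, to reserve 512 huge pages of 2 MB (the count is only an illustration, size it for your application) and check the result:

# sysctl vm.nr_hugepages=512
$ grep Huge /proc/meminfo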

You can monitor TLB misses with the perf tool:

# perf stat -e 'dTLB-load-misses,dTLB-store-misses,iTLB-load-misses' -a -A --timeout 10000

Many low latency tuning guides recommend turning off transparent huge pages (THP), but that is not necessarily a win. jemalloc can be configured to use THP, which can lead to a performance boost. With mimalloc you can directly use 1 GiB pages without THP enabled.

TLB shootdowns

Each process has a page table mapping virtual addresses to physical addresses. Whenever this mapping changes, the TLB needs to be flushed on all cores running that process. This is called a “TLB shootdown” and happens during dirty page writeback and when you call mmap(), munmap(), etc. The TLB shootdown is implemented as an inter-processor interrupt (IPI) that will introduce jitter to your running application. In addition, the subsequent TLB misses introduce memory access latency.

Avoid TLB shootdowns by not memory mapping disk-backed files and by pre-allocating memory.

You can view the number of TLB shootdowns by core in /proc/interrupts.
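
To watch the TLB shootdown counters update in real time:

$ watch -n1 'grep TLB /proc/interrupts'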

Monitor number of TLB flushes:

# perf stat -e 'tlb_flush.dtlb_thread' -a -A --timeout 10000
