Using huge pages on Linux
In this article I will explain when and how to use huge pages.
Workloads that perform random memory accesses across a large working set can be limited by translation lookaside buffer (TLB) misses. You can reduce TLB misses by using page sizes larger than the default 4 KiB. These larger pages are often referred to as huge pages. I will focus on how to use huge pages for program data, but huge pages can also be used for instruction data.
Determining if huge pages are beneficial
To demonstrate when the TLB can become a bottleneck I wrote a sample program hugepages.cpp that uses a large hash table. Since the buckets are accessed in effectively random order, this program performs random memory accesses across a large working set.
#include <absl/container/flat_hash_map.h>
#include <nmmintrin.h> // _mm_crc32_u64

int main(int argc, char *argv[]) {
  // Cheap hardware CRC32 hash so the benchmark is bound by memory accesses,
  // not by hashing.
  struct hash {
    size_t operator()(size_t h) const noexcept { return _mm_crc32_u64(0, h); }
  };
  size_t iters = 10000000;
  absl::flat_hash_map<size_t, size_t, hash> ht;
  ht.reserve(iters); // one large up-front allocation for the table
  for (size_t i = 0; i < iters; ++i) {
    ht.try_emplace(i, i);
  }
  return 0;
}
Compile the sample program:
g++ -std=c++17 -O3 -mavx -DNDEBUG hugepages.cpp /usr/lib64/libabsl_*
Run it with perf to collect relevant performance counters:
$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' ./a.out
Performance counter stats for './a.out':
70,080 faults:u
20,802,877 dTLB-loads:u
19,436,707 dTLB-load-misses:u # 93.43% of all dTLB cache hits
32,872,323 cache-misses:u # 52.279 % of all cache refs
62,878,289 cache-references:u
0.708913859 seconds time elapsed
0.623564000 seconds user
0.076544000 seconds sys
I ran this on an AMD Ryzen 3900X CPU, which has a 2048-entry L2 TLB. With the default 4 KiB page size the TLB can cover a working set of at most 8 MiB (2048 * 4 KiB). Since our working set is much larger than 8 MiB, we see that almost every last level cache (LLC, also commonly referred to as L3 cache) miss comes with a TLB miss.
When you see a large ratio of TLB misses to LLC misses, it's time to investigate whether huge pages can help improve performance.
Using transparent huge pages (THP)
Linux transparent huge page (THP) support allows the kernel to automatically promote regular memory pages into huge pages.
By default THP is in madvise mode, in which case it's necessary to call madvise(...MADV_HUGEPAGE) to explicitly enable THP for a range of memory.
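You can check the current mode through sysfs, where the active mode is shown in brackets:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never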
THP support does not guarantee that huge pages will be allocated. If you use posix_memalign to allocate memory aligned to the huge page size, it's more likely that the memory will be backed by huge pages:
void *ptr = nullptr;
posix_memalign(&ptr, huge_page_size, n); // returns nonzero on failure
madvise(ptr, n, MADV_HUGEPAGE);          // opt this range into THP
In C++ we can achieve this using a custom allocator:
#include <cstddef>    // std::size_t
#include <cstdlib>    // std::free
#include <limits>     // std::numeric_limits
#include <new>        // std::bad_alloc
#include <stdlib.h>   // posix_memalign
#include <sys/mman.h> // madvise

template <typename T> struct thp_allocator {
  constexpr static std::size_t huge_page_size = 1 << 21; // 2 MiB

  using value_type = T;

  thp_allocator() = default;

  template <class U>
  constexpr thp_allocator(const thp_allocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
      throw std::bad_alloc();
    }
    void *p = nullptr;
    // Align to the huge page size so the kernel is more likely to back the
    // allocation with huge pages.
    if (posix_memalign(&p, huge_page_size, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    madvise(p, n * sizeof(T), MADV_HUGEPAGE);
    return static_cast<T *>(p);
  }

  void deallocate(T *p, std::size_t n) { std::free(p); }
};
See hugepages.cpp for example usage.
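For instance, here is a minimal sketch of plugging the allocator into the hash table from the sample program above (reusing the hash functor defined there; the exact template arguments are one possible spelling, not the only one):
absl::flat_hash_map<size_t, size_t, hash, std::equal_to<size_t>,
                    thp_allocator<std::pair<const size_t, size_t>>>
    ht;
ht.reserve(iters); // one large, huge-page-aligned allocation for the table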
Using this posix_memalign trick only makes sense for large allocations that are at least one huge page in size. If you want to use THP with your C standard library (libc) allocator, you can enable THP system wide for all allocations with the following command:
echo always >/sys/kernel/mm/transparent_hugepage/enabled
If you know the addresses and sizes of your heap allocator's arenas, you could use madvise(...MADV_HUGEPAGE) to enable THP for the arenas.
There are potential performance issues with using THP. You can read about them in my article on virtual memory.
Allocating huge pages using mmap
You can allocate huge pages using mmap(...MAP_HUGETLB), either as an anonymous mapping or as a named mapping on hugetlbfs. The size of a huge page mapping needs to be a multiple of the huge page size. For example, to create an anonymous mapping of 8 huge pages of the default 2 MiB size on AMD64 / x86-64:
void *ptr = mmap(NULL, 8 * (1 << 21), PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0); // returns MAP_FAILED if the huge page pool is exhausted
If you need a named huge page mapping, you instead mmap a file descriptor referring to a file on a hugetlbfs filesystem.
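For example, a minimal sketch, assuming hugetlbfs is mounted at /dev/hugepages (both the mount point and the file name are assumptions; adjust them for your system):
#include <fcntl.h>    // open
#include <sys/mman.h> // mmap

int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
void *ptr = mmap(NULL, 8 * (1 << 21), PROT_READ | PROT_WRITE,
                 MAP_SHARED, fd, 0); // MAP_HUGETLB is not needed on hugetlbfs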
Huge pages are allocated from a reserved pool. You can reserve huge pages using the kernel command line parameter hugepages or at runtime using a procfs or sysfs interface. Read the Linux kernel documentation on huge pages for more information on how to reserve huge pages. The simplest way to reserve huge pages of the default size is to use the procfs interface:
echo 20 > /proc/sys/vm/nr_hugepages
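You can verify the reservation through /proc/meminfo; after the command above the output should look something like this:
$ grep HugePages /proc/meminfo
HugePages_Total:      20
HugePages_Free:       20
HugePages_Rsvd:        0
HugePages_Surp:        0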
In C++ we can use a custom allocator to enable containers to use huge pages:
#include <cstddef>    // std::size_t
#include <limits>     // std::numeric_limits
#include <new>        // std::bad_alloc
#include <sys/mman.h> // mmap, munmap

template <typename T> struct huge_page_allocator {
  constexpr static std::size_t huge_page_size = 1 << 21; // 2 MiB

  using value_type = T;

  huge_page_allocator() = default;

  template <class U>
  constexpr huge_page_allocator(const huge_page_allocator<U> &) noexcept {}

  // Round n up to the nearest multiple of the huge page size.
  std::size_t round_to_huge_page_size(std::size_t n) {
    return (((n - 1) / huge_page_size) + 1) * huge_page_size;
  }

  T *allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
      throw std::bad_alloc();
    }
    auto p = static_cast<T *>(mmap(
        nullptr, round_to_huge_page_size(n * sizeof(T)),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0));
    if (p == MAP_FAILED) {
      throw std::bad_alloc();
    }
    return p;
  }

  void deallocate(T *p, std::size_t n) {
    // Unmap the same rounded byte size that allocate() mapped; note that the
    // size must be computed from n * sizeof(T), as in allocate().
    munmap(p, round_to_huge_page_size(n * sizeof(T)));
  }
};
See hugepages.cpp for example usage.
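For instance, a minimal usage sketch (the element count is arbitrary):
std::vector<size_t, huge_page_allocator<size_t>> v;
v.reserve(1 << 20); // 8 MiB of data, backed by four 2 MiB huge pages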
Using the above allocator only makes sense with std::vector, absl::flat_hash_map, rigtorp::HashMap and similar containers that perform few but large allocations. The C++ standards proposal P0401R3 Providing size feedback in the Allocator interface would make this type of custom allocator more useful by letting it return the actual allocated size after rounding up to whole huge pages. I have implemented support for this proposal in my concurrent ring buffer library SPSCQueue.
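To illustrate the idea, here is a rough sketch of what size feedback could look like for the allocator above; the names are modeled on the proposal and should be treated as illustrative, not as the standardized interface:
// Illustrative only: return both the pointer and the actual number of
// elements available after rounding up to whole huge pages.
template <typename T> struct sized_allocation {
  T *ptr;
  std::size_t count; // usable capacity in elements, >= the requested n
};

template <typename T>
sized_allocation<T> allocate_at_least(huge_page_allocator<T> &a, std::size_t n) {
  std::size_t bytes = a.round_to_huge_page_size(n * sizeof(T));
  return {a.allocate(n), bytes / sizeof(T)};
}
A container using such an interface could set its capacity to count instead of n, putting the slack from the page rounding to use.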
Using huge pages for heap allocations with mimalloc
mimalloc is a recently developed (as of 2020) general purpose allocator that has excellent performance and great support for huge pages. By default it enables transparent huge pages (THP) using madvise(...MADV_HUGEPAGE), and it supports enabling the use of 2 MiB and 1 GiB huge pages on AMD64 / x86-64.
Running the original example using 2 MiB pages:
$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
env MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_LARGE_OS_PAGES=1 LD_PRELOAD=./libmimalloc.so ./a.out
Performance counter stats for './a.out':
658 faults:u
8,717,125 dTLB-loads:u
6,320 dTLB-load-misses:u # 0.07% of all dTLB cache hits
23,104,208 cache-misses:u # 64.034 % of all cache refs
36,081,035 cache-references:u
0.543847504 seconds time elapsed
0.511986000 seconds user
0.029820000 seconds sys
We can see a huge reduction in TLB misses, and the program completed faster. On the same AMD Ryzen 3900X CPU as before, the L2 TLB can hold 2048 entries for 2 MiB pages. This translates to a working set size of 4 GiB (2048 * 2 MiB) when using 2 MiB pages.
Running the original example using 1 GiB pages:
$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
env MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_RESERVE_HUGE_OS_PAGES=4 LD_PRELOAD=libmimalloc.so ./a.out
Performance counter stats for './a.out':
532 faults
639,907 dTLB-loads
7,869 dTLB-load-misses # 1.23% of all dTLB cache hits
25,401,262 cache-misses # 35.908 % of all cache refs
70,739,506 cache-references
0.598358478 seconds time elapsed
0.506471000 seconds user
0.089488000 seconds sys
To reserve 1 GiB pages I had to add hugepagesz=1G hugepages=4 to the kernel command line. In this case there was no improvement, since the working set already fits in the L2 TLB when using 2 MiB pages.
See also
- libhugetlbfs can be used to put instruction data (as opposed to only program data) on huge pages.
- jemalloc seems to have some support for THP.
- https://wiki.debian.org/Hugepages
- How do I check for hugepages usage and what is using it?