Erik Rigtorp

Using huge pages on Linux

In this article I will explain when and how to use huge pages.

Workloads that perform random memory accesses across a large working set can be limited by translation lookaside buffer (TLB) misses. You can reduce TLB misses by using page sizes larger than the default 4 KiB. These larger pages are often referred to as huge pages. I will focus on how to use huge pages for program data, but huge pages can also be used for program code.

Determining if huge pages are beneficial

To demonstrate how the TLB can become a bottleneck I wrote a sample program hugepages.cpp that uses a large hash table. Since each bucket is accessed in an effectively random order, this program performs random memory accesses across a large working set.

#include <absl/container/flat_hash_map.h>
#include <nmmintrin.h> // _mm_crc32_u64

int main(int argc, char *argv[]) {
  // Hash based on the CRC32 instruction; spreads consecutive keys across buckets.
  struct hash {
    size_t operator()(size_t h) const noexcept { return _mm_crc32_u64(0, h); }
  };

  size_t iters = 10000000;
  absl::flat_hash_map<size_t, size_t, hash> ht;
  ht.reserve(iters);
  // Inserting keys in hashed (effectively random) order touches the large
  // backing array at random offsets.
  for (size_t i = 0; i < iters; ++i) {
    ht.try_emplace(i, i);
  }

  return 0;
}

Compile the sample program:

g++ -std=c++17 -O3 -mavx -DNDEBUG hugepages.cpp /usr/lib64/libabsl_*

Run it with perf to collect relevant performance counters:

$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' ./a.out

 Performance counter stats for './a.out':

            70,080      faults:u                                                    
        20,802,877      dTLB-loads:u                                                
        19,436,707      dTLB-load-misses:u        #   93.43% of all dTLB cache hits 
        32,872,323      cache-misses:u            #   52.279 % of all cache refs    
        62,878,289      cache-references:u                                          

       0.708913859 seconds time elapsed

       0.623564000 seconds user
       0.076544000 seconds sys

I ran this on an AMD Ryzen 3900X CPU, which has a 2048-entry L2 TLB. Using the default 4 KiB page size, the TLB can cover a working set of up to 8 MiB (2048 * 4 KiB). We see here that almost every last level cache (LLC, also commonly referred to as L3 cache) miss also results in a TLB miss. This is because our working set is much larger than 8 MiB.

When you see large TLB miss to LLC miss ratios it’s time to investigate if huge pages can help improve performance.

Using transparent huge pages (THP)

Linux transparent huge page (THP) support allows the kernel to automatically promote regular memory pages into huge pages.

By default THP is in madvise mode, in which case it's necessary to call madvise(...MADV_HUGEPAGE) to explicitly enable THP for a range of memory.
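
You can check which THP mode is currently active by reading the corresponding sysfs file; the active mode is shown in brackets, for example:

cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never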

THP support does not guarantee that huge pages will actually be allocated. If you use posix_memalign to allocate memory aligned to the huge page size, it's more likely that the memory will be backed by huge pages:

void *ptr = nullptr;
posix_memalign(&ptr, huge_page_size, n);
madvise(ptr, n, MADV_HUGEPAGE);

In C++ we can achieve this using a custom allocator:

#include <cstdlib>    // std::free
#include <limits>     // std::numeric_limits
#include <new>        // std::bad_alloc
#include <stdlib.h>   // posix_memalign
#include <sys/mman.h> // madvise

template <typename T> struct thp_allocator {
  constexpr static std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  thp_allocator() = default;
  template <class U>
  constexpr thp_allocator(const thp_allocator<U> &) noexcept {}

  T *allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
      throw std::bad_alloc();
    }
    void *p = nullptr;
    // posix_memalign returns non-zero and leaves p unchanged on failure,
    // so check its result before advising the kernel about the range.
    if (posix_memalign(&p, huge_page_size, n * sizeof(T)) != 0) {
      throw std::bad_alloc();
    }
    madvise(p, n * sizeof(T), MADV_HUGEPAGE);
    return static_cast<T *>(p);
  }

  void deallocate(T *p, std::size_t n) { std::free(p); }
};

See hugepages.cpp for example usage.
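
For instance, the allocator can be plugged into any allocator-aware container. A minimal sketch using std::vector (assuming the thp_allocator definition above is in scope):

#include <vector>

int main() {
  // Vector whose backing array is allocated huge-page-aligned and marked
  // MADV_HUGEPAGE by thp_allocator.
  std::vector<size_t, thp_allocator<size_t>> v;
  v.reserve(1 << 20); // 8 MiB of size_t, spanning several 2 MiB huge pages
  for (size_t i = 0; i < (1 << 20); ++i) {
    v.push_back(i);
  }
  return 0;
}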

Using this posix_memalign trick only makes sense for large allocations that are at least one huge page in size. If you want to use THP with your C standard library (libc) allocator, you can enable THP system wide for all allocations with the following command:

echo always >/sys/kernel/mm/transparent_hugepage/enabled

If you know the addresses and sizes of your heap allocator’s arenas you could use madvise(...MADV_HUGEPAGE) to enable THP for the arenas.
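As a rough sketch (simulating an arena with an anonymous mapping; a real heap allocator would instead give you the base address and size of its arena):

#include <sys/mman.h> // mmap, madvise, munmap

int main() {
  const size_t arena_size = 1 << 24; // 16 MiB, an example arena size
  // Stand-in for a heap arena: a page-aligned anonymous mapping.
  void *arena = mmap(nullptr, arena_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (arena == MAP_FAILED) {
    return 1;
  }
  // Ask the kernel to back this range with transparent huge pages.
  madvise(arena, arena_size, MADV_HUGEPAGE);
  munmap(arena, arena_size);
  return 0;
}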

There are potential performance issues with using THP. You can read about them in my article on virtual memory.

Allocating huge pages using mmap

You can allocate huge pages using mmap(...MAP_HUGETLB), either as an anonymous mapping or as a named mapping on hugetlbfs. The size of a huge page mapping needs to be a multiple of the huge page size. For example, to create an anonymous mapping of 8 huge pages of the default 2 MiB size on AMD64 / x86-64:

void *ptr = mmap(NULL, 8 * (1 << 21), PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);

If you need a named huge page mapping you instead mmap a file descriptor referring to a file on a hugetlbfs filesystem.
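
A minimal sketch, assuming hugetlbfs is mounted at /dev/hugepages (a common default), that you have write permission on that mount, and that huge pages have been reserved:

#include <fcntl.h>    // open
#include <sys/mman.h> // mmap, munmap
#include <unistd.h>   // close, unlink

int main() {
  const size_t size = 8 * (1 << 21); // 8 huge pages of 2 MiB
  // Create a file on the hugetlbfs mount; the mapping below is backed by
  // huge pages from the reserved pool.
  int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);
  if (fd < 0) {
    return 1;
  }
  void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (ptr == MAP_FAILED) {
    return 1;
  }
  // ... use the memory ...
  munmap(ptr, size);
  close(fd);
  unlink("/dev/hugepages/example");
  return 0;
}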

Huge pages are allocated from a reserved pool. You can reserve huge pages using the kernel command line parameter hugepages or at runtime through the procfs and sysfs interfaces. See the Linux kernel documentation on huge pages for more details. The simplest way to reserve huge pages of the default size is to use the procfs interface:

echo 20 > /proc/sys/vm/nr_hugepages
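
You can verify the reservation by inspecting the HugePages_Total and HugePages_Free fields in /proc/meminfo:

grep Huge /proc/meminfo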

In C++ we can use a custom allocator to enable containers to use huge pages:

#include <cstddef>    // std::size_t
#include <limits>     // std::numeric_limits
#include <new>        // std::bad_alloc
#include <sys/mman.h> // mmap, munmap

template <typename T> struct huge_page_allocator {
  constexpr static std::size_t huge_page_size = 1 << 21; // 2 MiB
  using value_type = T;

  huge_page_allocator() = default;
  template <class U>
  constexpr huge_page_allocator(const huge_page_allocator<U> &) noexcept {}

  // Round up to the nearest multiple of the huge page size, as required
  // for MAP_HUGETLB mappings.
  size_t round_to_huge_page_size(size_t n) {
    return (((n - 1) / huge_page_size) + 1) * huge_page_size;
  }

  T *allocate(std::size_t n) {
    if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
      throw std::bad_alloc();
    }
    auto p = static_cast<T *>(mmap(
        nullptr, round_to_huge_page_size(n * sizeof(T)), PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0));
    if (p == MAP_FAILED) {
      throw std::bad_alloc();
    }
    return p;
  }

  void deallocate(T *p, std::size_t n) {
    // Unmap the same rounded size that was mapped in allocate.
    munmap(p, round_to_huge_page_size(n * sizeof(T)));
  }
};

See hugepages.cpp for example usage.

Using the above allocator only makes sense with std::vector, absl::flat_hash_map, rigtorp::HashMap and similar containers that perform few but large allocations. The C++ standards proposal P0401R3 Providing size feedback in the Allocator interface would make this type of custom allocator more useful by returning the actual allocation size rounded up to the huge page size. I have implemented support for this proposal in my concurrent ring buffer library SPSCQueue.
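
For example, the sample program from the beginning of this article can be switched to huge pages by passing the allocator to absl::flat_hash_map. A sketch, assuming the huge_page_allocator definition above is in scope and that enough huge pages have been reserved (otherwise mmap fails and allocate throws):

#include <absl/container/flat_hash_map.h>
#include <functional>  // std::equal_to
#include <nmmintrin.h> // _mm_crc32_u64
#include <utility>     // std::pair

int main() {
  struct hash {
    size_t operator()(size_t h) const noexcept { return _mm_crc32_u64(0, h); }
  };

  size_t iters = 10000000;
  // The allocator is rebound internally by the container; its MAP_HUGETLB
  // mapping backs the hash table's single large backing array.
  absl::flat_hash_map<size_t, size_t, hash, std::equal_to<size_t>,
                      huge_page_allocator<std::pair<const size_t, size_t>>>
      ht;
  ht.reserve(iters);
  for (size_t i = 0; i < iters; ++i) {
    ht.try_emplace(i, i);
  }
  return 0;
}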

Using huge pages for heap allocations with mimalloc

mimalloc is a recently developed (as of 2020) general purpose allocator that has excellent performance and great support for huge pages. By default it enables transparent huge pages (THP) using madvise(...MADV_HUGEPAGE) and supports enabling use of 2 MiB and 1 GiB huge pages on AMD64 / x86-64.

Running the original example using 2 MiB pages:

$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
      env MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_LARGE_OS_PAGES=1 LD_PRELOAD=./libmimalloc.so ./a.out

 Performance counter stats for './a.out':

               658      faults:u                                                    
         8,717,125      dTLB-loads:u                                                
             6,320      dTLB-load-misses:u        #    0.07% of all dTLB cache hits 
        23,104,208      cache-misses:u            #   64.034 % of all cache refs    
        36,081,035      cache-references:u                                          

       0.543847504 seconds time elapsed

       0.511986000 seconds user
       0.029820000 seconds sys

We can see a huge reduction in TLB misses and the program completed faster. On the same AMD Ryzen 3900X CPU as before, the L2 TLB can hold 2048 entries for 2 MiB pages. This translates to a working set size of 4 GiB (2048 * 2 MiB) when using 2 MiB pages.

Running the original example using 1 GiB pages:

$ perf stat -e 'faults,dTLB-loads,dTLB-load-misses,cache-misses,cache-references' \
    env MIMALLOC_EAGER_COMMIT_DELAY=0 MIMALLOC_RESERVE_HUGE_OS_PAGES=4 LD_PRELOAD=libmimalloc.so ./a.out

 Performance counter stats for './a.out':

               532      faults                                                      
           639,907      dTLB-loads                                                  
             7,869      dTLB-load-misses          #    1.23% of all dTLB cache hits 
        25,401,262      cache-misses              #   35.908 % of all cache refs    
        70,739,506      cache-references                                            

       0.598358478 seconds time elapsed

       0.506471000 seconds user
       0.089488000 seconds sys

To reserve 1 GiB pages I had to add hugepagesz=1G hugepages=4 to the kernel command line. In this case there was no improvement, since the working set already fits in the L2 TLB when using 2 MiB pages.

See also