Thursday, 26 November 2020

Fast allocation of big memory chunks

I've been measuring the performance of different C/C++ allocation (and initialization) techniques for big chunks of contiguous memory. To do so, I allocated (and wrote to) 100 randomly selected sizes, drawn from a uniform distribution over the range 20 to 4096 MB, and measured the time using std::chrono::high_resolution_clock. Each measurement is done in a separate execution of the program, i.e. there should be no memory reuse (at least within the process).
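For illustration, one such measurement could look roughly like the sketch below. It is not the exact benchmark code: the function name alloc_and_write_gbps is made up for this example, malloc stands in for whichever allocator is being tested, and 1 GB is counted as 2^30 bytes.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: allocates `bytes` with malloc, writes to every byte
// with memset, and returns the achieved throughput in GB/s (1 GB = 2^30 bytes).
// Returns a negative value if `bytes` is zero or the allocation fails.
double alloc_and_write_gbps(std::size_t bytes) {
    using clock = std::chrono::high_resolution_clock;
    if (bytes == 0) return -1.0;

    const auto start = clock::now();
    void* p = std::malloc(bytes);
    if (p == nullptr) return -1.0;
    std::memset(p, 1, bytes);  // touch every page so it is actually backed by RAM
    const auto stop = clock::now();

    // Read one byte back so the compiler cannot discard the memset.
    volatile unsigned char sink = static_cast<unsigned char*>(p)[bytes - 1];
    (void)sink;
    std::free(p);

    const double seconds = std::chrono::duration<double>(stop - start).count();
    return (static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0)) / seconds;
}
```

Calling alloc_and_write_gbps once per process, with a size picked at random between 20 and 4096 MB, would mirror the setup described above.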

madvise ON refers to calling madvise with the MADV_HUGEPAGE flag, i.e. enabling transparent huge pages (2 MB on my systems).
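As a minimal sketch of what the madvise ON variant means for the mmap case (Linux only; map_and_fill is a name invented for this example, and the real benchmark may differ in its details):

```cpp
#include <cstddef>
#include <cstring>
#include <sys/mman.h>

// Hypothetical helper: maps `bytes` of anonymous memory, optionally asks the
// kernel to back the range with transparent huge pages, then writes to all of
// it. Returns nullptr if the mapping fails. Release the memory with munmap().
void* map_and_fill(std::size_t bytes, bool use_huge_pages) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;

#ifdef MADV_HUGEPAGE
    if (use_huge_pages) {
        // Advisory only: if the call fails (for example because transparent
        // huge pages are disabled), the range simply keeps normal pages.
        (void)madvise(p, bytes, MADV_HUGEPAGE);
    }
#else
    (void)use_huge_pages;  // flag not available on this platform
#endif

    std::memset(p, 1, bytes);  // the first write faults the pages in
    return p;
}
```

The madvise call has to come before the first write, because the pages are only populated when memset touches them.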

With a single 16 GB DDR4 module running at 2400 MT/s over a 64-bit data bus, the theoretical maximum speed is 17.8 GB/s.
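The arithmetic behind that figure can be sketched as follows. The assumption here, which is not stated explicitly above, is that 1 GB means 2^30 bytes; the function names are made up for the example.

```cpp
// Peak transfer rate of one memory channel, in bytes per second:
// transfers per second times bytes moved per transfer.
constexpr double peak_bytes_per_second(double mega_transfers_per_s,
                                       int bus_width_bits) {
    return mega_transfers_per_s * 1e6 * (bus_width_bits / 8);
}

// Converts bytes to GB, assuming 1 GB = 2^30 bytes.
constexpr double to_binary_gb(double bytes) {
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

// DDR4-2400 on a 64-bit bus: 2400e6 * 8 = 19.2e9 bytes/s,
// which is between 17.8 and 17.9 GB/s in binary units.
constexpr double kSingleChannelGBps =
    to_binary_gb(peak_bytes_per_second(2400.0, 64));
```

Under the same assumption, two channels give twice that, a little under 36 GB/s.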

On Ubuntu 18.04.5 LTS (4.15.0-118-generic), memset of an already allocated memory block gets close to the theoretical limit, while page_aligned allocation + memset is somewhat slower, as expected. The new<double>[]() variant is very slow, probably due to its internal overhead (values in GB/s):

method              madvise     mean    std
memset              madvise OFF 17.3    0.32
page_aligned+memset madvise ON  11.4    0.21
mmap+memset         madvise ON  11.3    0.23
new<double>[]()     madvise ON  3.2     0.06

With two modules, I was expecting close to double the performance (say 35 GB/s) thanks to dual-channel operation, at least for writes:

method              madvise     mean    std
memset              madvise OFF 28.0    0.23
mmap+memset         madvise ON  14.5    0.18
page_aligned+memset madvise ON  14.4    0.17

As you can see, memset reaches only 80% of the theoretical speed. Memory allocation + write speed increases by only 3 GB/s, reaching just 40% of the theoretical speed of the memory.

To make sure that I had not messed something up in the OS (I have been using it for a few years now), I installed a fresh Ubuntu 20.04 (dual boot) and repeated the experiment. The fastest operations were these:

method              madvise     mean    std
memset              madvise OFF 29.1    0.86
page_aligned+memset madvise ON  10.5    0.27
mmap+memset         madvise ON  10.5    0.31

As you can see, the results are reasonably similar for memset, but actually even worse for allocation + write operations.

Are you aware of a faster way of allocating (and initializing) big chunks of memory? For the record, I have tested combinations of malloc, new<float/double>, calloc, operator new, mmap and page_aligned for allocation, and memset and a for loop for writing, together with the madvise flag.
