I have found, in running performance analysis of various string classes, that the allocation, initialization and deallocation overhead cost imposed by fine-grained allocators (tuned for thread globals) scales very badly.
A wonderful example of one of these fine-grained allocators is Makalu; the paper Makalu: Fast Recoverable Allocation of Non-volatile Memory describes in great detail how it works.
This kind of use of thread globals to cache deallocated items and avoid context switches is a great thing, but not for all programs.
I have a graph comparing std::string against three faster string classes I have constructed: Direct, Symbiont and Head. It compares the allocation, initialization and deallocation overhead costs (ADIDOC, for short) of sorting these strings in simple arrays using slab allocation on the stack.
The std::string class has a head with a pointer to a tail; in GCC's libstdc++ that tail was copy-on-write (COW) before C++11 (the C++11 standard actually disallows COW, but the extra fine-grained allocation remains, I'm sure). The heads of the std::strings are allocated in a single slab array, but the tails are all allocated fine-grained at various locations in the memory pool.
The three other classes are each allocated as a single slab array, with no fine-grained allocation whatsoever. Here are three graphs for that overhead in the process of creating, initializing and deallocating 8-byte, 128-byte and 1024-byte strings for sorting (the sorting costs are measured separately and are analyzed in another posting).
NOTES:
- All of the performance data presented was gathered on release builds, of course.
- All slabs are allocated on the stack as automatic variables, in code blocks that deallocate the memory between runs within one process execution, so the stack pointer moves up some gigabytes and then back down.
- Were this all done in an mmap, one would start the process and perform the mmap with the MAP_POPULATE flag before doing any runs, to level the playing field. Since this is on the stack, an attempt is made to accomplish the same thing by allocating a huge array and writing to its end, which appears to absorb all the memory-mapping overhead (page table entry/PTE allocation and TLB misses) for the subsequent runs in the process. For details, see the HotStorage 2017 slides: Efficient Memory Mapped File I/O for In-Memory File Systems.
- The SortBench.cpp program measures different strings and containers against identical template sort code: all strings and containers experience the same number of compare calls and swaps, and run fast or slow depending on the underlying physics of those strings and containers. Sorting is just a method for seeing the scalability of strings and containers.
- For these measurements the sorting numbers don't matter, because it is the setup before sorting and the teardown after sorting that is being measured. For the record, in each set of runs the Direct strings execute first and would take the memory-allocation hit, and the std::strings execute last, as can be seen in the data files.
Quick sort allocation, initialization and deallocation overhead costs (ADIDOC) for 8-byte strings
Quick sort ADIDOC for 128-byte strings
Quick sort ADIDOC for 1024-byte strings
The allocation, initialization and deallocation overhead cost measurements were all done using the nanosecond timer. The code for SortBench.cpp is located on GitHub: [Kaldanes R&D](https://github.com/charlesone/Kaldanes)
The first thing we notice is that this appears to scale very well for one of them. The blue line is the Direct string class, which is just a block of string bytes in a struct as an array element. That is the one class that has no initialization at all, beyond storing the string. After it reaches the scale of 128,000 8-byte strings (1MB), the overhead cost scales essentially flat. Is that the operating system's slab allocator at work? That was my working hypothesis, but I don't think that's where the fairy dust is being applied; it seems like the kernel is doing something special for massive objects.
The second issue is that at the last data point of 134,217,728 8-byte strings (after which std::string [the yellow line] runs out of memory on my 15GB machine, despite storing only 1GB of string data), the std::string class is 5 orders of magnitude worse than the best slab-allocated class with no initialization.
If the Direct strings' use of the slab allocator remains flat at the terabyte level, that would make for an eight-order-of-magnitude difference. Another three orders at the petabyte scale makes an eleven-order-of-magnitude difference (308 days, nearly a year). Using 7,400 cores in parallel might bring that down to an hour, while slab allocation, if still flat at that scale, would remain at 1/10 of a millisecond on one core. This needs to be tested at scale on a Gen-Z or Intel XPoint system. I would presume (but have not tested) that the no-file-backing version of mmap (presumably with the MAP_NORESERVE and MAP_POPULATE flags) would behave the same way.
My question is about the C++ runtime and the kernel: why is this kind of enormous allocation so arbitrarily fast, and at what size does it stop being constant time?