Thursday, October 21, 2021

Why does my CPU incur a mysterious factor 3 slowdown in a simple hash table implementation?

In short:

I have implemented a simple (multi-key) hash table with buckets (each containing several elements) that exactly fit a cacheline. Inserting into a cacheline bucket is very simple, and it is the critical part of the main loop.
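To make the setup concrete, here is a minimal sketch of what such a cacheline-sized bucket could look like. This is an illustrative layout only (7 key/value slots plus a fill counter; the names and field choices are invented here), not the actual struct from the repository:

#include <cstdint>
#include <cstddef>

// Hypothetical bucket layout: a fixed number of (key, value) slots plus a
// fill counter, aligned so that one bucket covers exactly one 64-byte cacheline.
struct alignas(64) bucket_t {
    static const uint32_t N = 7;  // slots per bucket
    uint32_t keys[N];
    uint32_t values[N];
    uint32_t size;                // number of occupied slots
};
static_assert(sizeof(bucket_t) == 64, "a bucket must fit exactly one cacheline");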

I have implemented three versions that produce the same outcome and should behave the same.

The mystery

However, I'm seeing wild performance differences of a surprisingly large factor of 3, despite all versions having the exact same cacheline access pattern and producing identical hash table data.

The best implementation, insert_ok, suffers around a factor 3 slowdown compared to insert_bad & insert_alt on my CPU (i7-7700HQ). One variant, insert_bad, is a simple modification of insert_ok that adds an extra, unnecessary linear search within the cacheline to find the position to write to (which it already knows), and it does not suffer this 3x slowdown.
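For illustration only (the real implementations are in the repository's source and assembly dumps), the difference between the two variants is roughly the following, reusing the hypothetical bucket_t sketched above:

// insert_ok-style: the fill counter already gives the free slot, so write directly.
inline void insert_ok_style(bucket_t& b, uint32_t key, uint32_t value)
{
    if (b.size < bucket_t::N) {
        b.keys[b.size] = key;
        b.values[b.size] = value;
        ++b.size;
    }
    // (handling of a full bucket omitted)
}

// insert_bad-style: identical, except for a redundant linear scan that
// searches for the free slot the fill counter already identifies.
inline void insert_bad_style(bucket_t& b, uint32_t key, uint32_t value)
{
    if (b.size < bucket_t::N) {
        uint32_t i = 0;
        while (i < b.size) ++i;   // unnecessary search; always ends at b.size
        b.keys[i] = key;
        b.values[i] = value;
        ++b.size;
    }
}

Both variants touch exactly the same cacheline, and the redundant scan only adds a handful of predictable compare-and-branch instructions, which is what makes the observed factor 3 so surprising.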

Meanwhile, the exact same executable shows insert_ok to be a factor 1.6 faster than insert_bad & insert_alt on other CPUs (AMD 5950X, Intel i7-11800H).

# see https://github.com/cr-marcstevens/hashtable_mystery
$ ./test.sh
model name      : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
==============================
CXX=g++    CXXFLAGS=-std=c++11 -O2 -march=native -falign-functions=64
tablesize: 117440512 elements: 67108864 loadfactor=0.571429
- test insert_ok : 11200ms
- test insert_bad: 3164ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3366ms
  (outcome identical to insert_ok: true)

tablesize: 117440813 elements: 67108864 loadfactor=0.571427
- test insert_ok : 10840ms
- test insert_bad: 3301ms
  (outcome identical to insert_ok: true)
- test insert_alt: 3579ms
  (outcome identical to insert_ok: true)

My analysis

I've tried various modifications to the C++ loop, but it's inherently so simple that the compiler produces the same assembly. It's really not obvious from the resulting assembly what might cause the factor 3 loss. I've tried measuring with perf, but I can't seem to pinpoint any meaningful difference.

Comparing the assembly of the 3 versions, which are all just relatively small loops, there is nothing that comes close to explaining a factor 3 loss between these versions.

Hence, I presume the 3x slowdown is a weird effect of automatic prefetching, branch prediction, or instruction/jump alignment, or maybe a combination of those.

Does anybody have better insights or ways to measure what effects might actually be at play here?
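One way to dig deeper than perf stat over the whole run is to read hardware counters from inside the process, around just the critical loop. Below is a minimal, Linux-only sketch using the raw perf_event_open syscall; the event selection (branch misses and cache misses) is merely an example, error handling is mostly omitted, and none of this is part of the repository:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// Open one hardware counter for the calling thread (returns -1 on failure).
static int open_counter(uint64_t config)
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

// Run `region` (e.g. one of the insert loops) with two counters enabled
// and print the per-region counter deltas.
template <typename F>
void measure(const char* name, F region)
{
    int fd_br = open_counter(PERF_COUNT_HW_BRANCH_MISSES);
    int fd_cm = open_counter(PERF_COUNT_HW_CACHE_MISSES);

    ioctl(fd_br, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_cm, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd_br, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_cm, PERF_EVENT_IOC_ENABLE, 0);

    region();

    ioctl(fd_br, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_cm, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t branch_misses = 0, cache_misses = 0;
    if (read(fd_br, &branch_misses, sizeof(branch_misses)) != (ssize_t)sizeof(branch_misses))
        branch_misses = 0;
    if (read(fd_cm, &cache_misses, sizeof(cache_misses)) != (ssize_t)sizeof(cache_misses))
        cache_misses = 0;

    std::printf("%s: branch-misses=%llu cache-misses=%llu\n", name,
                (unsigned long long)branch_misses, (unsigned long long)cache_misses);
    close(fd_br);
    close(fd_cm);
}

Calling something like measure("insert_ok", [&]{ /* run the insert loop here */ }); for each variant would give per-variant counter deltas that can be compared directly, and the event list can be swapped for other counters (front-end stalls, DSB/MITE events via PERF_TYPE_RAW, etc.) to test the alignment and prediction hypotheses above.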

Details

I've created a small working C++11 example that demonstrates the problem. The code is available at https://github.com/cr-marcstevens/hashtable_mystery

The repository also includes my own static binaries that demonstrate this problem on my CPU (as different compilers may produce different code), as well as the dumped assembly code for all three hash table versions.

Background

I'm implementing a cryptanalytic attack in C++11 and need to find many collisions between two large lists (both generated on the fly). A crucial part of the attack thus consists of just two critical loops:

  1. first populating a hash table with one list
  2. then matching the other list against the hash table (see the sketch below).
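Purely to show the shape of these two phases (and not the attack itself), here is a self-contained stand-in that uses std::unordered_multimap in place of the custom cacheline hash table and a toy mixing function in place of the on-the-fly list generation; all names here are placeholders:

#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <unordered_map>

// Toy stand-in for on-the-fly list generation.
static inline uint64_t gen(uint64_t seed, uint64_t i)
{
    uint64_t x = seed + i * 0x9E3779B97F4A7C15ULL;
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
    return x;
}

int main()
{
    const std::size_t n = 1 << 20;
    // Stand-in for the custom cacheline hash table (multi-key, hence multimap).
    std::unordered_multimap<uint32_t, uint32_t> ht;
    ht.reserve(n);

    // 1. populate the hash table with the first list
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t e = gen(1, i);
        ht.emplace((uint32_t)e, (uint32_t)(e >> 32));
    }

    // 2. match the second list against the hash table
    std::size_t collisions = 0;
    for (std::size_t i = 0; i < n; ++i) {
        uint64_t e = gen(2, i);
        auto range = ht.equal_range((uint32_t)e);
        for (auto it = range.first; it != range.second; ++it)
            ++collisions;   // a collision between the two lists on the 32-bit key
    }
    std::printf("collisions: %zu\n", collisions);
    return 0;
}

The real code replaces the multimap with the cacheline-bucket table, and both lists are far larger and consumed as they are generated; that table is where the factor 3 appears.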

The hash table operations are thus performance critical, and a factor 3 slowdown means the attack is 3x slower.

Regarding design: besides trying to minimize memory usage, I'm also trying to have a typical hash table operation touch just a single cacheline, as I expect that will increase overall attack performance, especially when running the attack on all CPU cores.
