mardi 26 juin 2018

Inserting multiple not-a-numbers into a std::unordered_set

One of consequences of the IEEE 754 standard is the non-intuitive behavior of std::unordered_set<double>, when not-a-number elements (NANs) are inserted.

Due to the fact that NAN!=NAN, after the following sequence:

#include <iostream>
#include <cmath>
#include <unordered_set>

int main(){
    std::unordered_set<double> set;
    set.insert(NAN);
    set.insert(NAN);
    std::cout<<"Number of elements "<<set.size()<<"\n";  //there are 2 elements!
}

there are two elements in the set(see it live): NAN and NAN!

Mine main issue with this is, that when N NANs are inserted into the hash-set, they all hit the same hash-bucket and the performance of N inserts into the hash-set degenerates to the worst-case running time - O(N^2).

For an example, see the listing at the end of the question or here live - inserting NAN takes some order of magnitude more time than a "normal" floating number.

My question: is it possible (and if yes - how) to tweak std::unordered_set<double> in such a way, that there is at most one NAN-element in the set, no matter the flavor of inserted NANs (NAN, -NAN and so on)?


Listing:

#include <iostream>
#include <cmath>
#include <unordered_set>
#include <chrono>

constexpr int N=5000;
void test_insert(double value)
{
    std::unordered_set<double> s;
    auto begin = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        s.insert(value);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Duration: " << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / 1e9) << "\n";
    std::cout << "Number of elements: "<<s.size()<<"\n";
}

int main(){
    std::cout << "Not NAN\n";
    test_insert(1.0);           //takes 0.0001 s
    std::cout << "NAN\n";
    test_insert(NAN);           //takes 0.2 s
}

Aucun commentaire:

Enregistrer un commentaire