I have 4e7 std::strings, each about 3 to 30 characters long, with many duplicates. I'm putting them in a std::set.
Calling set::insert for each string becomes intractably slow long before it finishes, even though the final set would hold only about 1e7 unique strings. So instead I push_back each string into a vector, sort() and unique() that, and then move the strings into a set.
It's still slow, but at least it finishes: 10 seconds to accumulate the vector, 60 more for sort(), 10 more for unique().
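
For concreteness, the current pipeline looks roughly like this (a minimal sketch; the function name dedupe and the exact move-into-set step are illustrative, not my real code):

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>
    #include <vector>

    // Accumulate, sort, unique, then move the survivors into the set.
    std::set<std::string> dedupe(std::vector<std::string>&& v) {
        std::sort(v.begin(), v.end());                      // ~60 s: the bottleneck
        v.erase(std::unique(v.begin(), v.end()), v.end());  // ~10 s
        return std::set<std::string>(
            std::make_move_iterator(v.begin()),
            std::make_move_iterator(v.end()));
    }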
The bottleneck is sort(). But I don't need the strings to be lexicographically sorted! I just need duplicate strings to be contiguous so unique() can remove them; their order is otherwise irrelevant. Is there a simpler, faster string comparison function I could pass to sort() in place of the default?
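
For example, I'm imagining something like a length-first comparator. I haven't measured whether it actually beats the default operator<, but it is still a strict weak ordering (so sort() + unique() remain correct), and it decides most pairs with a cheap size check instead of touching the bytes:

    #include <string>

    // Hypothetical cheaper ordering: by length first, then by bytes.
    // Duplicates still end up adjacent, which is all unique() needs.
    bool fast_less(const std::string& a, const std::string& b) {
        if (a.size() != b.size())
            return a.size() < b.size();
        return a.compare(b) < 0;  // lexicographic tiebreak within equal lengths
    }

    // usage: std::sort(v.begin(), v.end(), fast_less);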
Or should I build the set faster by iterating over the vector with a hash table on the side, to skip duplicates? Or replace set with hash_set or unordered_set?
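
If the hash route is the answer, I picture something like this sketch (assumes C++11's std::unordered_set; dedupe_hash is a made-up name, and I haven't benchmarked it):

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Insert everything; the hash table discards duplicates as it goes,
    // with no sort pass at all.
    std::unordered_set<std::string> dedupe_hash(std::vector<std::string>&& v) {
        std::unordered_set<std::string> s;
        s.reserve(v.size());  // size for ~4e7 elements up front to avoid rehashing
        for (std::string& str : v)
            s.insert(std::move(str));
        return s;
    }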