Saturday, June 23, 2018

C++ backward regex search

I need to build an ultra-efficient log parser (~1 GB/s). I integrated Intel's Hyperscan library (https://www.hyperscan.io), and it works well to:

  • count the number of occurrences of specified events
  • give the end position of the matches

One limitation is that capture groups cannot be reported, only end offsets. For most matches I only need the count, but for about 10% of them the matching line must be parsed to compute further statistics.
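To make the setup concrete, here is a minimal sketch of the kind of match callback I mean: it counts every match and keeps the end offset only for the patterns whose lines need parsing later. The ScanState type and the on_match name are hypothetical. The parameter list is meant to mirror the shape of Hyperscan's match_event_handler, so please check it against the Hyperscan headers before relying on it.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical per-scan state; none of these names come from Hyperscan.
struct ScanState {
    std::vector<std::size_t> counts;             // one counter per pattern id
    std::unordered_set<unsigned int> parse_ids;  // ids whose lines need parsing
    std::vector<unsigned long long> end_offsets; // end offsets kept for later
};

// Assumed to have the same shape as Hyperscan's match callback:
// (id, from, to, flags, context) -> int, where 0 means "keep scanning".
int on_match(unsigned int id, unsigned long long /*from*/,
             unsigned long long to, unsigned int /*flags*/, void* context) {
    auto* state = static_cast<ScanState*>(context);
    if (id < state->counts.size()) {
        ++state->counts[id];                 // cheap path: just count
    }
    if (state->parse_ids.count(id) != 0) {
        state->end_offsets.push_back(to);    // only the end offset is known
    }
    return 0;
}
```

Only the recorded end offsets are then available when the line has to be recovered.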

The challenge is to efficiently retrieve the line that matched the Hyperscan regex, knowing only the end offset of the match. So far, I have tried:

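std::string data(const std::string* block) const {
   // Capture the last complete line in [begin, begin + end).
   std::regex nlexpr("\n(.*)\n$");
   std::smatch match;
   // The search starts at the beginning of the block, not at `end`.
   std::regex_search(block->begin(), block->begin() + end, match, nlexpr);
   return match[1];
}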

*block points to the file loaded in memory (2 GB, so copying it is not possible). end is the known end offset of the match reported by Hyperscan.

But this is extremely inefficient when the line to match is far into the block. I expected the "$" anchor to make the operation very quick: since the offset is known, the match could be found directly by starting from the end. That is definitely not what happens. The operation takes ~1 s if end = 100000000.
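For illustration, one direction I can think of is to skip the regex for locating the line: scan backward from end for the previous newline, then run any per-line regex on that short range only. The sketch below is a minimal version of that idea, assuming lines are separated by '\n' and that end is one past the last matched byte. The names line_bounds, line_at, and search_line are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Returns [first, last) offsets of the line containing byte end - 1.
// If that byte is itself '\n', the line it terminates is returned.
std::pair<std::size_t, std::size_t>
line_bounds(const std::string& block, std::size_t end) {
    if (block.empty()) return {0, 0};
    if (end > block.size()) end = block.size();
    const std::size_t last = (end == 0) ? 0 : end - 1;

    // Find the newline that terminates the line (or the end of the block).
    std::size_t stop = last;
    if (block[last] != '\n') {
        stop = block.find('\n', last);
        if (stop == std::string::npos) stop = block.size();
    }

    // Walk backward to the previous newline; the line starts just after it.
    std::size_t start = 0;
    if (stop > 0) {
        const std::size_t prev = block.rfind('\n', stop - 1);
        if (prev != std::string::npos) start = prev + 1;
    }
    return {start, stop};
}

// Copies only the matched line, never the whole block.
std::string line_at(const std::string& block, std::size_t end) {
    const auto b = line_bounds(block, end);
    return block.substr(b.first, b.second - b.first);
}

// Runs a per-line regex on the bounded range, without copying anything.
bool search_line(const std::string& block, std::size_t end,
                 const std::regex& expr, std::smatch& out) {
    using diff_t = std::string::difference_type;
    const auto b = line_bounds(block, end);
    return std::regex_search(block.begin() + static_cast<diff_t>(b.first),
                             block.begin() + static_cast<diff_t>(b.second),
                             out, expr);
}
```

With this, the cost depends on the length of the line rather than on how far end is into the block. For example, line_at("aaa\nbbb\nccc\n", 8) would give the second line.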

It is possible to get the start offset of the matches from Hyperscan, but the performance impact is very high (throughput was roughly halved in my tests), so that is not an option.

Any idea how to achieve this? I am using C++11 (so std::regex, whose design is based on Boost.Regex).

Best regards
