c++11: run-length std::string compression with std::adjacent

mardi 15 août 2017

run-length std::string compression with std::adjacent_find

I put together an algorithm on Coliru to run-length encode a std::string for a protocol I am using. I need help with the inner loop of the while loop to make it more efficient. The code here does not actually perform the encoding, it just separates the runs from the non runs (laying the foundation for encoding the string). The algorithm uses the std::adjacent_find algorithm to locate identical character run boundaries.

In the case of the protocol I am using, if the run is at least 3 or more matching adjacent characters, I can run-length encode these. The problem is that the algorithm is inefficient when it comes to allocating and appending elements that are not part of a run. In this case I allocate a std::unique_ptr<std::string> to contain the non run characters, then as soon as I encounter the next run, I store away the non run characters, reset the pointer and save the run of identical characters.

I think it should be possible to tweak the algorithm's inner loop with std::next(adjIter) or something along those lines to efficiently find the next runs and more importantly to skip over the non run characters without going through the inefficient appending code - but I cannot figure it out.

int main(int argc, char *argv[])
{
    // I want to convert this string containing adjacent runs of characters
    std::string testString("---thisssss---is-a---tesst--");

    std::vector<std::string> tokens;
    auto adjIter = testString.begin();
    auto lastIter = adjIter;
    std::unique_ptr<std::string> nonRunChars;
    while ((adjIter = std::adjacent_find(
        adjIter, testString.end(), std::not_equal_to<>())) !=
        testString.end()) {
        auto next = std::string(lastIter, adjIter + 1);
        // append if < run threshold
        if (next.length() < 3) {
            if (!nonRunChars) {
                nonRunChars = std::make_unique<std::string>();
            }
            *nonRunChars += next;
        } else {
            // save non-run temp in the tokens
            if (nonRunChars) {
                tokens.push_back(*nonRunChars);
                nonRunChars.reset();
            }
            tokens.push_back(next);
        }
        lastIter = adjIter + 1;
        adjIter = adjIter + 1;
    }

    if (nonRunChars) {
        tokens.push_back(*nonRunChars);
        nonRunChars.reset();
    }

    std::cout << "String to parse" << std::endl;
    std::cout << "===============" << std::endl;
    std::cout << "'" << testString << "'" << std::endl  << std::endl;

    std::cout << "run length tokenization" << std::endl;
    std::cout << "=======================" << std::endl;
    //std::copy(tokens.begin(), tokens.end(), std::experimental::make_ostream_joiner(std::cout, "\n"));
    std::copy(tokens.begin(), tokens.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
}

c++11

mardi 15 août 2017

run-length std::string compression with std::adjacent_find

Aucun commentaire:

Enregistrer un commentaire