lundi 2 janvier 2017

How to avoiding unnecessary copies with `boost::variant` recursive XML structure

Let me refer to the following example from the boost spirit tutorial about parsing a "mini XML" data structure. My question doesn't actually have anything to do with spirit, it's really about boost::variant, and making efficient "recursive" variant structures.

Here's the code:

struct mini_xml;

typedef
    boost::variant<
        boost::recursive_wrapper<mini_xml>
      , std::string
    >
mini_xml_node;

struct mini_xml
{
    std::string name;                           // tag name
    std::vector<mini_xml_node> children;        // children
};

In the tutorial, they go on to show how to "adapt" the struct mini_xml for boost::fusion, and then write spirit grammars that load data into it.

However, it occurred to me recently that there's a subtle issue in this example that may lead to significant overhead.

The issue is that the variant type, mini_xml_node, is not no-throw move constructible in this example. The reason is that it contains boost::recursive_wrapper. The recursive_wrapper<T> always represents a heap-allocated instance of T, and it does not have an empty state. When a variant containing recursive_wrapper is moved from, a new dynamic allocation is made and a new T is move constructed from the old T -- we can't simply take ownership of the old T, because that would leave the old variant in an empty state. (I only looked into this after I was trying to implement my own variant type -- this is indeed what boost::variant does afaict, and it is indeed not no-throw move constructible in this example.)

Because mini_xml_node is not no-throw move constructible, the container std::vector<mini_xml_node> children will be on the "slow path" -- when it uses std::move_if_noexcept, it will elect to copy the elements rather than move them. This happens for instance every time the vector capacity increases. And when we copy a mini_xml_node, we copy all of its children.

So for instance, if I want to parse from disk an xml tree which has depth d and branching factor b, I estimate that we will end up copying each leaf of the tree about d * log b times, since pushing back into a vector b times causes reallocation about log b times, and it will happen for each ancestor of the leaf.

I don't have an actual application right now where I care about this overhead, but it's easy for me to imagine that I might. For instance I might want to write a high-performance parser using spirit for some small utility program that will check that e.g. a data file has a certain form, or count some statistic of it or something. And then it might very well be that the time to parse dominates the overall runtime of the utility, so these copies might be performance critical.

The question is, what is the simplest and cleanest way to change the code to prevent these copies from happening? I thought of a few approaches but maybe you see a better way.

  1. Quick and dirty: Write a wrapper class that contains a mini_xml_node but the move constructor is explicitly marked noexcept, even though it isn't really noexcept. This will cause termination if an exception is thrown, but that should be pretty rare...

  2. Use an alternative to std::vector that doesn't use std::move_if_noexcept, and instead just uses move. If I understand right, boost::vector does this. The tradeoff is it doesn't have strong exception safety, but it's not clear if that's really a problem here.

  3. (This one doesn't actually work.) Add boost::blank as one of the options for mini_xml_node.

When blank is one of the value types, boost::variant has special code that adapts its behavior -- it will use blank as a natural empty state when default constructing and when making type-changing assignments.

However, it appears that putting blank into boost::variant doesn't go so far as allowing me to no-throw move construct the variant even when it contains a boost::recursive_wrapper type. Tested on coliru with boost 1.63:

http://ift.tt/2iXFwG2

(Maybe this would be a useful feature for boost::variant in the future?)

Aucun commentaire:

Enregistrer un commentaire