Tuesday, October 2, 2018

Is the creation time of a C++11 std::thread dependent on the payload it executes?

I want to know the time overhead of executing a method in a C++11 std::thread (or std::async) compared to direct execution. I know that thread pools can significantly reduce or even completely avoid this overhead, but I'd still like to get a better feeling for the payload size at which creating a thread pays off, and the size at which pooling pays off.

I implemented a simple benchmark myself that boils down to:

#include &lt;chrono&gt;
#include &lt;cmath&gt;
#include &lt;cstddef&gt;
#include &lt;thread&gt;

// Number of benchmark repetitions; the payload round count is varied between
// runs to control how long PayloadFunction spends computing.
constexpr size_t cNumRobustnessIterations = 10000;
constexpr size_t cNumPayloadRounds = 100;

void PayloadFunction(double* aInnerRuntime, const size_t aNumPayloadRounds) {
    double vComputeValue = 3.14159;

    auto vInnerStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < aNumPayloadRounds; ++vIdx) {
        vComputeValue = std::exp2(std::log1p(std::cbrt(std::sqrt(std::pow(vComputeValue, 3.14152)))));
    }
    auto vInnerEnd = std::chrono::high_resolution_clock::now();
    *aInnerRuntime += static_cast<std::chrono::duration<double, std::micro>>(vInnerEnd - vInnerStart).count();

    // Keep the result observable so the compiler cannot drop the loop.
    volatile double vResult = vComputeValue;
    (void)vResult;
}

int main() {
    double vInnerRuntime = 0.0;
    double vOuterRuntime = 0.0;

    auto vStart = std::chrono::high_resolution_clock::now();
    for (size_t vIdx = 0; vIdx < cNumRobustnessIterations; ++vIdx) {
        std::thread vThread(PayloadFunction, &vInnerRuntime, cNumPayloadRounds);
        vThread.join();
    }
    auto vEnd = std::chrono::high_resolution_clock::now();
    vOuterRuntime = static_cast<std::chrono::duration<double, std::micro>>(vEnd - vStart).count();

    // normalize away the robustness iterations:
    vInnerRuntime /= static_cast<double>(cNumRobustnessIterations);
    vOuterRuntime /= static_cast<double>(cNumRobustnessIterations);

    const double vThreadCreationCost = vOuterRuntime - vInnerRuntime;
    (void)vThreadCreationCost;
}

This works quite well and I can get typical thread creation costs of ~20-80 microseconds (us) on Ubuntu 18.04 with a modern Core i7-6700K.

Now comes the curious part: the thread overhead seems to depend, very reproducibly, on the time spent in the payload method! This makes no sense to me, but it happens reproducibly on six different machines running various flavors of Ubuntu and CentOS!

  1. If I spend between 1 and 100us inside PayloadFunction, the typical thread creation cost is around 20us.
  2. When I increase the time spent in PayloadFunction to 100-1000us, the thread creation cost linearly increases to around 40us.
  3. A further increase to more than 10000us in PayloadFunction again linearly increases the thread creation cost to around 80us.
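
To vary the time spent in the payload I simply change the number of rounds passed to PayloadFunction. A minimal sketch of such a sweep, meant to sit in the same file as the benchmark above (the round counts are illustrative, not the exact values from my runs):

#include &lt;cstdio&gt;  // in addition to the includes of the benchmark above

// Runs the benchmark loop once per payload length and prints the payload
// time next to the remaining per-thread overhead.
void RunPayloadSweep() {
    const size_t vRoundCounts[] = {10, 100, 1000, 10000, 100000};  // illustrative

    for (size_t vRounds : vRoundCounts) {
        double vInnerRuntime = 0.0;

        auto vStart = std::chrono::high_resolution_clock::now();
        for (size_t vIdx = 0; vIdx < cNumRobustnessIterations; ++vIdx) {
            std::thread vThread(PayloadFunction, &vInnerRuntime, vRounds);
            vThread.join();
        }
        auto vEnd = std::chrono::high_resolution_clock::now();

        const double vOuterRuntime =
            std::chrono::duration<double, std::micro>(vEnd - vStart).count()
            / static_cast<double>(cNumRobustnessIterations);
        vInnerRuntime /= static_cast<double>(cNumRobustnessIterations);

        std::printf("payload: %10.2f us   overhead: %10.2f us\n",
                    vInnerRuntime, vOuterRuntime - vInnerRuntime);
    }
}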

I did not go to larger ranges, but I can clearly see a linear dependency between payload runtime and thread creation overhead (as computed above). Since I cannot explain this behavior, I assume there must be a pitfall. Can somebody shed some light on this?
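
One thing I could try to narrow this down is to split the measured overhead into a launch part (from constructing the std::thread until its body starts running) and a join part (from the body finishing until join() returns). A minimal standalone sketch of that idea, independent of the benchmark above (the payload is left empty here, and the iteration count is just a placeholder):

#include &lt;chrono&gt;
#include &lt;cstdio&gt;
#include &lt;thread&gt;

using Clock = std::chrono::high_resolution_clock;

// Measures how long it takes from constructing a std::thread until the
// thread body starts, and from the body finishing until join() returns.
int main() {
    double vLaunchUs = 0.0;
    double vJoinUs = 0.0;
    constexpr int cIterations = 10000;

    for (int vIdx = 0; vIdx < cIterations; ++vIdx) {
        Clock::time_point vBodyStart;
        Clock::time_point vBodyEnd;

        const auto vCreate = Clock::now();
        std::thread vThread([&] {
            vBodyStart = Clock::now();
            // (payload would run here)
            vBodyEnd = Clock::now();
        });
        vThread.join();  // join() synchronizes, so reading vBodyStart/vBodyEnd below is safe
        const auto vJoined = Clock::now();

        vLaunchUs += std::chrono::duration<double, std::micro>(vBodyStart - vCreate).count();
        vJoinUs   += std::chrono::duration<double, std::micro>(vJoined - vBodyEnd).count();
    }

    std::printf("avg launch latency: %.2f us, avg join latency: %.2f us\n",
                vLaunchUs / cIterations, vJoinUs / cIterations);
}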
