Tuesday, 11 January 2022

`std::condition_variable::notify_all` deadlocks

I have C++ code in which one thread produces data and pushes it into a queue, and another thread consumes it and passes it on to other libraries for processing.

std::mutex lock;
std::condition_variable new_data;
std::vector<uint8_t> pending_bytes;
bool data_done=false;

// producer
  void add_bytes(size_t byte_count, const void *data)
  {
    if (byte_count == 0)
        return;

    std::lock_guard<std::mutex> guard(lock);
    // Keep the pointer const: the bytes are only read while being copied.
    const uint8_t *typed_data = static_cast<const uint8_t *>(data);
    pending_bytes.insert(pending_bytes.end(), typed_data,
                         typed_data + byte_count);

    new_data.notify_all();
  }

  void finish()
  {
    std::lock_guard<std::mutex> guard(lock);

    data_done = true;
    new_data.notify_all();
  }

// consumer
Result *process(void)
{
  data_processor = std::unique_ptr<Processor>(new Processor());

  bool done = false;
  while (!done)
  {
    std::unique_lock<std::mutex> guard(lock);
    new_data.wait(guard, [&]() {return data_done || pending_bytes.size() > 0;});

    size_t byte_count = pending_bytes.size();
    std::vector<uint8_t> data_copy;
    if (byte_count > 0)
    {
      data_copy = pending_bytes; // vector copies on assignment
      pending_bytes.clear();
    }

    done = data_done;
    guard.unlock();

    if (byte_count > 0)
    {
      data_processor->process(byte_count, data_copy.data());
    }
  }

  return data_processor->finish();
}

Here, Processor is a rather involved class that does a lot of multi-threaded processing, but as far as I can see it should be independent of the code above.

Sometimes the code deadlocks, and I'm trying to find the race condition. My biggest clue is that the producer thread appears to be stuck inside notify_all(). In GDB I get the following backtrace, showing that notify_all is waiting on something:

[Switching to thread 3 (Thread 0x7fffe8d4c700 (LWP 45177))]

#0  0x00007ffff6a4654d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff6a44240 in pthread_cond_broadcast@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x00007ffff67e1b29 in std::condition_variable::notify_all() () from /lib64/libstdc++.so.6
#3  0x0000000001221177 in add_bytes (data=0x7fffe8d4ba70, byte_count=256,
    this=0x7fffc00dbb80) at Client/file.cpp:213

while that same thread also owns the lock (__owner = 45177 matches its LWP):

(gdb) p lock
$12 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 1, __count = 0, __owner = 45177, __nusers = 1, __kind = 0,
        __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},

The other thread is waiting in the condition variable's wait:

[Switching to thread 5 (Thread 0x7fffe7d4a700 (LWP 45180))]
#0  0x00007ffff6a43a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007ffff6a43a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffff67e1aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x000000000121f9a6 in std::condition_variable::wait<[...]::{lambda()#1}>(std::unique_lock<std::mutex>&, [...]::{lambda()#1}) (__p=..., __lock=...,
    this=0x7fffc00dbb28) at /opt/rh/devtoolset-9/root/usr/include/c++/9/bits/std_mutex.h:104

There are two other threads running inside the data-processing part, which also hang in pthread_cond_wait, but as far as I'm aware they do not share any synchronization primitives with the code above (they are just waiting for calls to processor->add_data or processor->finish). Any ideas what notify_all is waiting for, or ways of finding the culprit?

Edit: I reproduced the code with a dummy processor here: https://onlinegdb.com/lp36ewyRSP But, pretty much as expected, this doesn't reproduce the issue, so I assume something more intricate is going on. It may just be a matter of different timings, but could some interaction between condition_variable and OpenMP (used by the real processor) cause this?
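For readers who want to try the locking scheme in isolation, here is a minimal sketch of what such a reproduction could look like. It is not the code behind the link above. The names DummyProcessor, ByteQueue and run_once are made up for this illustration, the dummy processor only counts bytes, and the consumer swaps the buffer out under the lock instead of copying and clearing it, which should behave the same way.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real Processor: it only counts bytes.
struct DummyProcessor {
    std::size_t total = 0;
    void process(std::size_t byte_count, const std::uint8_t *data) {
        assert(data != nullptr);
        total += byte_count;
    }
    std::size_t finish() const { return total; }
};

// Same mutex / condition variable scheme as in the question, wrapped in a
// class (an assumption made here so each run gets fresh state).
class ByteQueue {
public:
    // Producer side: append bytes and wake the consumer.
    void add_bytes(std::size_t byte_count, const void *data) {
        if (byte_count == 0)
            return;
        std::lock_guard<std::mutex> guard(lock_);
        const auto *typed_data = static_cast<const std::uint8_t *>(data);
        pending_bytes_.insert(pending_bytes_.end(), typed_data,
                              typed_data + byte_count);
        new_data_.notify_all();
    }

    // Producer side: signal that no more data will arrive.
    void finish() {
        std::lock_guard<std::mutex> guard(lock_);
        data_done_ = true;
        new_data_.notify_all();
    }

    // Consumer side: drain the queue until finish() has been called,
    // then return the number of bytes handed to the dummy processor.
    std::size_t process() {
        DummyProcessor processor;
        bool done = false;
        while (!done) {
            std::vector<std::uint8_t> chunk;
            {
                std::unique_lock<std::mutex> guard(lock_);
                new_data_.wait(guard, [this] {
                    return data_done_ || !pending_bytes_.empty();
                });
                chunk.swap(pending_bytes_); // take everything, leave it empty
                done = data_done_;
            }
            // The lock is released here, so processing does not block the producer.
            if (!chunk.empty())
                processor.process(chunk.size(), chunk.data());
        }
        return processor.finish();
    }

private:
    std::mutex lock_;
    std::condition_variable new_data_;
    std::vector<std::uint8_t> pending_bytes_;
    bool data_done_ = false;
};

// Runs one producer (the calling thread) against one consumer thread and
// returns how many bytes the consumer processed.
std::size_t run_once(std::size_t chunks, std::size_t chunk_size) {
    ByteQueue queue;
    std::size_t processed = 0;
    std::thread consumer([&] { processed = queue.process(); });

    std::vector<std::uint8_t> payload(chunk_size, 0xAB);
    for (std::size_t i = 0; i < chunks; ++i)
        queue.add_bytes(payload.size(), payload.data());
    queue.finish();

    consumer.join();
    return processed;
}
```

Calling run_once(1000, 256) repeatedly from main (built with -pthread on Linux) exercises the same wait/notify pattern. If a harness like this never hangs, that would be consistent with the guess that the problem lies outside the queue code shown above.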
