c++11: How to achieve a StoreLoad barrier in C++11?

Tuesday, 4 February 2020

How to achieve a StoreLoad barrier in C++11?

I want to write portable code (Intel, ARM, PowerPC...) which solves a variant of a classic problem:

Initially: X=Y=0

Thread A:
  X=1
  if(!Y){ do something }
Thread B:
  Y=1
  if(!X){ do something }

The goal is to avoid a situation in which both threads do something. Please correct me if you see flaws in my reasoning below.

I am aware that I can achieve the goal with memory_order_seq_cst atomic stores and loads as follows:

std::atomic<int> x{0},y{0};
void thread_a(){
  x.store(1);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!x.load()) bar();
}

which achieves the goal, because there must be some single total order on {x.store(1),y.store(1),y.load(),x.load()} events, which must agree with program order "edges":

x.store(1) "in TO is before" y.load()
y.store(1) "in TO is before" x.load()

and if foo() was called, then we have additional edge:

y.load() "reads value before" y.store(1)

and if bar() was called, then we have additional edge:

x.load() "reads value before" x.store(1)

and all these edges combined together would form a cycle:

x.store(1) "in TO is before" y.load() "reads value before" y.store(1) "in TO is before" x.load() "reads value before" x.store(1)

which violates the fact that orders have no cycles.

I intentionally use the non-standard terms "in TO is before" and "reads value before" rather than standard terms like happens-before, because I want feedback on my assumptions that these edges do imply a happens-before relation, that they can be combined in a single graph, and that a cycle in such a combined graph is forbidden. I am not sure about that. What I do know is that this code produces correct barriers with gcc and clang on Intel and with gcc on ARM.

Now, my real problem is a bit more complicated, because I have no control over "X": it's hidden behind some macros, templates etc. I don't even know whether "X" is a single variable or some other concept. All I know is that I have two macros, set() and check(), such that check() returns true "after" another thread has called set(). So conceptually set() is somewhat like "X=1" and check() is like "X", but I have no direct access to the atomics involved, if any.
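The seq_cst version above can be exercised with a small stress harness. This is only a minimal sketch, not a proof: Outcome, run_seq_cst_trial and count_double_entries are hypothetical names introduced here, and boolean flags stand in for foo() and bar(). A count of zero is consistent with the reasoning but cannot confirm it, since a forbidden outcome might simply never show up on a given machine.

```cpp
#include <atomic>
#include <thread>

// Which of the two "do something" branches executed in one run.
struct Outcome {
    bool foo_ran;
    bool bar_ran;
};

// Runs thread_a and thread_b once, with fresh x and y, and reports the result.
Outcome run_seq_cst_trial() {
    std::atomic<int> x{0}, y{0};
    bool foo_ran = false, bar_ran = false;

    std::thread a([&] {
        x.store(1);                     // memory_order_seq_cst by default
        if (!y.load()) foo_ran = true;  // stands in for foo()
    });
    std::thread b([&] {
        y.store(1);
        if (!x.load()) bar_ran = true;  // stands in for bar()
    });
    a.join();  // join() makes the flag writes visible to this thread
    b.join();
    return {foo_ran, bar_ran};
}

// Counts the trials in which both branches ran.
int count_double_entries(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        Outcome o = run_seq_cst_trial();
        if (o.foo_ran && o.bar_ran) ++both;
    }
    return both;
}
```

Calling count_double_entries with a few thousand trials should return 0 if the argument about the single total order holds.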

void thread_a(){
  set();
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  if(!check()) bar();
}

I'm worried that set() might be implemented internally as x.store(1,std::memory_order_release) and check() as x.load(std::memory_order_acquire). If so, check()'s body can be inlined and "reordered" before y.store(1) (see Alex's answer, where they demonstrate that this happens on PowerPC). That would be really bad, because thread_b() could first load the old value of x (which is 0), then thread_a() could run to completion, including foo(), and then thread_b() could run the rest of its body, including bar(). Both foo() and bar() would be called, which is exactly what I need to avoid.
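To make the worry concrete, here is a sketch of what the hidden macros might look like. Everything in the hidden namespace is an assumption made for illustration (the real set() and check() are opaque to me), and the two functions return a bool where the original code would call foo() or bar().

```cpp
#include <atomic>

// ASSUMPTION: hypothetical stand-ins for the opaque set()/check() macros.
// The real macros may be implemented quite differently.
namespace hidden {
std::atomic<int> x{0};
inline void set()   { x.store(1, std::memory_order_release); }
inline bool check() { return x.load(std::memory_order_acquire) != 0; }
}  // namespace hidden

std::atomic<int> y{0};

// Body of thread_a; returns true where the original would call foo().
bool thread_a_would_call_foo() {
    hidden::set();
    return !y.load();
}

// Body of thread_b; returns true where the original would call bar().
bool thread_b_would_call_bar() {
    y.store(1);  // seq_cst store to y
    // The acquire load inside check() is not a seq_cst operation, so the
    // seq_cst total order says nothing about it relative to the store above.
    return !hidden::check();
}
```

Run one after the other on a single thread, the first function returns true and the second returns false; the problematic outcome can only appear when the two bodies run concurrently.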

What are my options to prevent that?

Option A

Try to force a StoreLoad barrier. In practice, this can be achieved with std::atomic_thread_fence(std::memory_order_seq_cst); as explained by Alex in a different answer, all tested compilers emitted a full fence:

x86_64: MFENCE

PowerPC: hwsync

Itanium: mf

ARMv7 / ARMv8: dmb ish

MIPS64: sync

The problem with this approach is that I could not find any guarantee in the C++ rules that std::atomic_thread_fence(std::memory_order_seq_cst) must translate to a full memory barrier. In fact, the concept of atomic_thread_fence in C++ seems to sit at a different level of abstraction than the assembly concept of memory barriers, and deals more with questions like "which atomic operation synchronizes with which". Is there any theoretical proof that the implementation below achieves the goal?

void thread_a(){
  set();
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!y.load()) foo();
}
void thread_b(){
  y.store(1);
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if(!check()) bar();
}
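Option A can be stress-tested in the same spirit. The sketch below assumes, purely for illustration, that set() and check() are a release store and an acquire load on one atomic object (the hidden namespace is hypothetical, as is the function name count_double_entries_with_fences). A result of zero is empirical evidence on one machine, not the theoretical proof asked for above.

```cpp
#include <atomic>
#include <thread>

// ASSUMPTION: hypothetical stand-ins for the opaque set()/check() macros.
namespace hidden {
std::atomic<int> x{0};
inline void set()   { x.store(1, std::memory_order_release); }
inline bool check() { return x.load(std::memory_order_acquire) != 0; }
}  // namespace hidden

// Runs Option A repeatedly and counts trials in which both branches ran.
int count_double_entries_with_fences(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        hidden::x.store(0, std::memory_order_relaxed);  // reset before the threads start
        std::atomic<int> y{0};
        bool foo_ran = false, bar_ran = false;

        std::thread a([&] {
            hidden::set();
            std::atomic_thread_fence(std::memory_order_seq_cst);
            if (!y.load()) foo_ran = true;          // stands in for foo()
        });
        std::thread b([&] {
            y.store(1);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            if (!hidden::check()) bar_ran = true;   // stands in for bar()
        });
        a.join();
        b.join();
        if (foo_ran && bar_ran) ++both;
    }
    return both;
}
```

If the real macros do not use atomics at all, this harness says nothing about them; it only covers the assumed implementation.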

Option B

Use the control we have over Y to achieve synchronization, by using read-modify-write memory_order_acq_rel operations on Y:

void thread_a(){
  set();
  if(!y.fetch_add(0,std::memory_order_acq_rel)) foo();
}
void thread_b(){
  y.exchange(1,std::memory_order_acq_rel);
  if(!check()) bar();
}

The idea here is that accesses to a single atomic (y) must form a single order on which all observers agree, so either fetch_add is before exchange or vice versa.

If fetch_add is before exchange, then the "release" part of fetch_add synchronizes with the "acquire" part of exchange, so all side effects of set() have to be visible to the code executing check(), and bar() will not be called.

Otherwise, exchange is before fetch_add, so fetch_add will see 1 and foo() will not be called. So it is impossible to call both foo() and bar(). Is this reasoning correct?
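The two cases can be probed with a stress harness for Option B. As before, the hidden namespace is a hypothetical stand-in for the real macros, count_double_entries_option_b is a name made up for this sketch, and a clean run only fails to find a counterexample; it does not settle whether the reasoning is correct.

```cpp
#include <atomic>
#include <thread>

// ASSUMPTION: hypothetical stand-ins for the opaque set()/check() macros.
namespace hidden {
std::atomic<int> x{0};
inline void set()   { x.store(1, std::memory_order_release); }
inline bool check() { return x.load(std::memory_order_acquire) != 0; }
}  // namespace hidden

// Runs Option B repeatedly and counts trials in which both branches ran.
int count_double_entries_option_b(int trials) {
    int both = 0;
    for (int i = 0; i < trials; ++i) {
        hidden::x.store(0, std::memory_order_relaxed);  // reset before the threads start
        std::atomic<int> y{0};
        bool foo_ran = false, bar_ran = false;

        std::thread a([&] {
            hidden::set();
            // RMW that leaves y unchanged but takes part in y's modification order.
            if (!y.fetch_add(0, std::memory_order_acq_rel)) foo_ran = true;
        });
        std::thread b([&] {
            // RMW that reads the latest value of y before writing 1.
            y.exchange(1, std::memory_order_acq_rel);
            if (!hidden::check()) bar_ran = true;
        });
        a.join();
        b.join();
        if (foo_ran && bar_ran) ++both;
    }
    return both;
}
```

Both operations on y are read-modify-writes on purpose: each one reads the value written immediately before it in y's modification order, which is what the case analysis above relies on.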

Option C

Use dummy atomics to introduce "edges" which prevent disaster. Consider the following approach:

void thread_a(){
  std::atomic<int> dummy1{};
  set();
  dummy1.store(13);
  if(!y.load()) foo();
}
void thread_b(){
  std::atomic<int> dummy2{};
  y.store(1);
  dummy2.load();
  if(!check()) bar();
}

If you think the problem here is that the atomics are local, then imagine moving them to global scope. In the following reasoning it does not appear to matter, and I intentionally wrote the code this way to expose how funny it is that dummy1 and dummy2 are completely separate.

Why on earth might this work? Well, there must be some single total order of {dummy1.store(13),y.load(),y.store(1),dummy2.load()} which has to be consistent with the program order "edges":

dummy1.store(13) "in TO is before" y.load()
y.store(1) "in TO is before" dummy2.load()

Now, we have two cases to consider: either y.store(1) is before y.load() or after in the total order.

If y.store(1) is before y.load() then foo() will not be called and we are safe.

If y.load() is before y.store(1), then combining it with the two edges we already have in program order, we deduce that:

dummy1.store(13) "in TO is before" dummy2.load()

Now, dummy1.store(13) is a release operation, which releases the effects of set(), and dummy2.load() is an acquire operation, so check() should see the effects of set() and thus bar() will not be called and we are safe. Is it correct to think that check() will see the results of set()? Can I combine "edges" of various kinds ("program order", "total order", "before release", "after acquire") like that? I have serious doubts about this: the C++ rules seem to talk about "synchronizes-with" relations between a store and a load on the same location, and here there is no such situation.

You can play with my implementation of Options A, B and C at https://godbolt.org/z/u3dTa8
