Comparing two methods of C++ ref-counting


To write a ref-counted class, I have seen two different approaches:

Approach 1:

struct RefCounting1 {
  void ref_up() {
    m_ref.fetch_add(1, std::memory_order_relaxed);
  }

  void release() {
    if (m_ref.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

private:
  std::atomic_int m_ref{1};
};

Approach 2:

struct RefCounting2 {
  void ref_up() {
    m_ref.fetch_add(1, std::memory_order_relaxed);
  }

  void release() {
    int refcnt = m_ref.fetch_sub(1, std::memory_order_relaxed) - 1;
    if (refcnt == 0)
      std::atomic_thread_fence(std::memory_order_acquire);

    if (refcnt > 0)
      return;
        
    delete this;
  }

private:
  std::atomic_int m_ref{1};
};

As you can see, the first one uses acquire-release memory order for decrementing the reference counter, while the second one uses a relaxed decrement plus an acquire fence to protect the non-atomic data. I wanted to know whether these two methods have the same effect or not. Is there any benefit of one over the other?
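
For concreteness, here is a hypothetical usage sketch (Widget and payload are made-up names, not part of either approach) showing the kind of non-atomic data the ordering has to protect:

#include <atomic>
#include <thread>

// Self-contained variant of Approach 1 with a non-atomic payload.
struct Widget {
  int payload = 0;  // plain, non-atomic shared data

  void ref_up() { m_ref.fetch_add(1, std::memory_order_relaxed); }

  void release() {
    if (m_ref.fetch_sub(1, std::memory_order_acq_rel) == 1)
      delete this;  // the deleting thread must observe all writes to payload
  }

private:
  std::atomic_int m_ref{1};
};

int main() {
  Widget* w = new Widget;   // refcount starts at 1
  w->ref_up();              // second reference for the worker thread
  std::thread t([w] {
    w->payload = 42;        // write shared data before dropping the reference
    w->release();
  });
  w->release();             // whichever release() runs last deletes w
  t.join();
}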

Answer by curiousguy

As you can see, the first one uses acquire-release memory order for decrementing the reference counter, while the second one uses a relaxed decrement plus an acquire fence to protect the non-atomic data.

You can't have acquire semantics without release semantics: you can only acquire (obtain a secure view of) what others have released. Release memory order is like "publish finished memory operations of the current thread" and acquire memory order is like "get the latest view of finished memory operations by another thread".
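
A minimal sketch of that pairing (a made-up flag/data pair, unrelated to the refcount code):

#include <atomic>

int data = 0;                    // plain, non-atomic payload
std::atomic<bool> ready{false};

void producer() {
  data = 42;                                      // write the payload
  ready.store(true, std::memory_order_release);   // publish it
}

void consumer() {
  while (!ready.load(std::memory_order_acquire))  // pairs with the release
    ;                                             // spin until published
  // The acquire synchronized-with the release, so this must read 42.
  int r = data;
  (void)r;
}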

[Why do you call your reset function release in a discussion of memory orders? Not nice!]

I wanted to know whether these two methods have the same effect or not. Is there any benefit of one over the other?

They don't have the same effect: the first one, the classical RC implementation (RefCounting1), is valid, while your optimized alternative (RefCounting2) is broken.

You have to express the minimum synchronization needed: the axioms of the reference-counted shared resource. The resource must be deallocated when the last manager-class destructor (or reset member function) runs, that is, after all the other destructors. There is an implied ordering: the event "RC reaches zero" must come after all other decrements. That doesn't mean anyone cares that each RC atomic operation is well ordered WRT all the others, just that the last one is the last one.

RefCounting1 makes sure that is the case by ordering all the reset operations, which is overkill.

The following should create the necessary ordering:

void reset() { // not "release", please
    int refcnt = m_ref.fetch_sub(1, std::memory_order_release) - 1;
    if (refcnt == 0) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
    }
}

Here the last acquire is paired with all previous releases.
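
Assembled into a full class (my combination of the reset above with the original relaxed increment, not code from the question):

#include <atomic>

struct RefCounting3 {
  void ref_up() {
    m_ref.fetch_add(1, std::memory_order_relaxed);
  }

  void reset() {
    // Every decrement releases; only the final one needs to acquire.
    if (m_ref.fetch_sub(1, std::memory_order_release) == 1) {
      std::atomic_thread_fence(std::memory_order_acquire);
      delete this;
    }
  }

private:
  std::atomic_int m_ref{1};
};

This mirrors the well-known pattern used, for example, by Rust's Arc: the acquire cost is paid once, by the deleting thread, instead of on every decrement.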

Note: the lack of ordering on the count increase can make the exact multiplicity of the value of use_count compiler- and machine-dependent in MT programs if one thread makes many local copies (and the compiler can accurately track these with escape analysis); in an extreme case, the compiler could remove redundant thread-local shared_ptr instances that do nothing but change the count, transforming these spread-out actions:

m_ref.fetch_add(1, std::memory_order_relaxed); // N times
m_ref.fetch_sub(1, std::memory_order_release); // N times

to:

/* m_ref.fetch_add(1, std::memory_order_relaxed); // 0 times */
m_ref.fetch_sub(0, std::memory_order_release); // 1 time

Assuming no operation with a memory order in between.

Note:

  • m_ref.fetch_sub(0, std::memory_order_release) can be compiled to a simple memory fence, but one might want to keep the explicit operation on a specific object in intermediate code for as long as possible, until all optimization phases involving it are finished;
  • the compiler can try to push m_ref.fetch_sub(0, std::memory_order_release) as late as possible in program order, until it reaches an already-needed emission of a release fence, and then simply remove the fetch_sub operation.

The optimization is trivially sound and clearly a win; the difficulty is mostly following all functions called to see that there is nothing "in between" that breaks the optimization.

Note: to avoid breaking progress bars and similar, or even more critical time-dependent programs (think: heartbeat), such optimizations should be avoided in functions doing heavy computations: anything that runs long enough for the reordering to be noticeable.

The possible optimization makes the value of use_count less precise, but not totally random and unreliable. There is a hard lower bound on use_count within any shared (between threads) shared_ptr family (those that are copies of each other and share a control block), if the program is correctly synchronized.

Contrapositive: if you can't prove there is a lower bound on such a weakly synchronized reference count in an MT program, your program may lack synchronization on its shared_ptr objects.

In other words: your program must contain synchronization to share shared_ptr families between threads, because the only way to do that is to share the value of one particular shared_ptr instance between threads.
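
For example (hypothetical; any mechanism providing the right happens-before works), passing the shared_ptr by value into a std::thread constructor is such a synchronization, because the constructor's completion synchronizes-with the start of the thread function:

#include <memory>
#include <thread>

int main() {
  auto sp = std::make_shared<int>(42);

  // The copy (and its increment) happens-before the thread function starts,
  // so both threads legitimately share the same shared_ptr family.
  std::thread t([sp] { int v = *sp; (void)v; });

  sp.reset();  // safe: the thread owns its own copy
  t.join();
}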

Answer by mpoeter

The first implementation is correct, while the second one is most likely incorrect. "Most likely" because, theoretically, there could be some synchronization code that is not shown here. But even then it would be bad design, since it relies on this external synchronization instead of encapsulating it. It is still highly unusual and confusing.

In general acquire/release always come in pairs. An acquire-fence by itself does not achieve anything - it requires a release-operation/release-fence to synchronize-with in order to establish a happens-before relation.

To understand why/where we need acquire/release semantics, we need to understand our requirements - which operations do we need to order?

A shared pointer allows multiple references to a shared object, potentially from multiple different threads (otherwise it would not make sense to make the ref counting thread-safe). All these threads may modify the object. Obviously these modifications must be synchronized somehow, but we don't care about the details here. However, destruction cannot be covered by the same synchronization method, since it is handled by the shared pointer itself. So what we have to ensure is that all modifications (by all threads) happen-before destruction of the object. This can be achieved by using acquire/release operations on the ref count. The one thread whose decrement causes the ref count to drop to zero has to synchronize-with all previous decrements. So that last thread has to use acquire, while all other threads have to use release. Since we do not know beforehand whether we will be the last thread, the simplest solution is to use memory_order_acq_rel for all. Alternatively, one can make this more explicit:

    if (m_ref.fetch_sub(1, std::memory_order_release) == 1) {
      m_ref.load(std::memory_order_acquire); // synchronize with all previous fetch_sub operations
      delete this;
    }

One often overlooked but nevertheless important detail is that this only works because we have a release sequence.

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation is an atomic read-modify-write operation.

Usually an acquire-load synchronizes-with the release-store from which it takes its value, i.e., this is a synchronization between two threads. In this case, however, we need to synchronize with all threads that at some point held a reference.

Consider the following modification order of our ref count:

T1: store(1)
T2: fetch_add(1, relaxed)
T1: fetch_sub(1, release) <- start of release sequence 1
T3: fetch_add(1, relaxed)
T2: fetch_sub(1, release) <- start of release sequence 2
T3: fetch_sub(1, release) <- start of release sequence 3

Thread T3 performs the last decrement, which causes the ref count to drop to zero. Since all modifications except the initial store are done via read-modify-write operations, every release-operation is the head of a release sequence that extends to the end of the modification order. In this example we have three active release sequences. When thread T3 then performs the acquire-load, it synchronizes-with the heads of all active release sequences, i.e., with all threads that previously performed a release-fetch_sub.
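
Here is a runnable sketch of that scenario (my own illustration; the names are made up). Each worker writes only its own slot, so the writers do not race with each other, and the release sequence guarantees that all slot writes are visible to whichever thread performs the final decrement:

#include <atomic>
#include <cassert>
#include <thread>

struct Shared {
  int slot[3] = {};            // per-thread data, no writer-writer race
  std::atomic_int m_ref{1};

  void ref_up() { m_ref.fetch_add(1, std::memory_order_relaxed); }

  void release() {
    if (m_ref.fetch_sub(1, std::memory_order_release) == 1) {
      m_ref.load(std::memory_order_acquire);     // sync with all decrements
      assert(slot[0] + slot[1] + slot[2] == 6);  // all writes are visible
      delete this;
    }
  }
};

int main() {
  Shared* s = new Shared;      // count starts at 1, held by main
  std::thread workers[3];
  for (int i = 0; i < 3; ++i) {
    s->ref_up();               // one reference per worker
    workers[i] = std::thread([s, i] {
      s->slot[i] = i + 1;      // mutate before dropping the reference
      s->release();
    });
  }
  s->release();                // drop main's reference
  for (auto& w : workers) w.join();
}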

I am not a fan of reasoning about reorderings or other potential compiler optimizations. The standard does not say anything about such optimizations; instead it defines an abstract machine with a defined (observable) behavior. With respect to concurrent execution, the formal concept that defines this behavior is the happens-before relation. Virtually all optimizations the compiler may perform are based on what is commonly referred to as the "as-if rule", which basically says that the compiler is free to perform all kinds of optimizations or code transformations, as long as it does not change the observable behavior.
In other words: if you correctly identify your required happens-before relations and ensure that they are established correctly, then you don't need to care about compiler optimizations at all.

If thinking about these optimizations helps with your mental model, that is fine. Just be aware that ultimately what counts is that the necessary happens-before relations are in place, because this is also the model that the compiler will use to analyze the code and decide which optimizations it may perform.

In the comments it is mentioned that the second implementation is used in OpenSSL, but there we find the following comment:

/*
 * Changes to shared structure other than reference counter have to be
 * serialized. And any kind of serialization implies a release fence. This
 * means that by the time reference counter is decremented all other
 * changes are visible on all processors. Hence decrement itself can be
 * relaxed. In case it hits zero, object will be destructed. Since it's
 * last use of the object, destructor programmer might reason that access
 * to mutable members doesn't have to be serialized anymore, which would
 * otherwise imply an acquire fence. Hence conditional acquire fence...
 */

A shared_ptr allows multiple threads to keep a reference to a shared resource, but modifications of that resource must still be synchronized. My understanding of this comment is that they rely on the release semantics of that synchronization, so that the acquire-fence can synchronize with it. I find this rather suspect, TBH. I haven't looked at the rest of the code, so it might actually be correct, but it is at least highly unusual!
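
For illustration, here is my reconstruction of the pattern that comment describes (hypothetical names, not actual OpenSSL code):

#include <atomic>
#include <mutex>

struct OpensslStyle {
  std::mutex m;          // external serialization of all mutations
  int data = 0;
  std::atomic_int m_ref{1};

  void modify(int v) {
    std::lock_guard<std::mutex> lock(m);
    data = v;            // the unlock at scope exit is the "release fence"
                         // the comment alludes to
  }

  void ref_up() { m_ref.fetch_add(1, std::memory_order_relaxed); }

  void release() {
    if (m_ref.fetch_sub(1, std::memory_order_relaxed) == 1) {
      // Conditional acquire fence, as in the quoted comment. Whether this
      // formally pairs with the mutex's release is exactly what I find
      // suspect about the construction.
      std::atomic_thread_fence(std::memory_order_acquire);
      delete this;
    }
  }
};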