I have seen two different approaches to writing a ref-counted class:
Approach 1:
#include <atomic>

struct RefCounting1 {
    void ref_up() {
        m_ref.fetch_add(1, std::memory_order_relaxed);
    }
    void release() {
        if (m_ref.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete this;
        }
    }
private:
    std::atomic_int m_ref{1};
};
Approach 2:
struct RefCounting2 {
    void ref_up() {
        m_ref.fetch_add(1, std::memory_order_relaxed);
    }
    void release() {
        int refcnt = m_ref.fetch_sub(1, std::memory_order_relaxed) - 1;
        if (refcnt == 0)
            std::atomic_thread_fence(std::memory_order_acquire);
        if (refcnt > 0)
            return;
        delete this;
    }
private:
    std::atomic_int m_ref{1};
};
As you can see, the first one uses acquire-release memory order for the decrement of the reference counter, while the second one uses a relaxed decrement plus an acquire fence to protect the non-atomic data. I want to know whether these two methods have the same effect. Is there any benefit of one over the other?
You can't have acquire semantics without release semantics: you can only acquire (obtain a secure view of) what others have released. Release memory order is like "publish finished memory operations of the current thread" and acquire memory order is like "get the latest view of finished memory operations by another thread".
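As a minimal sketch of that pairing (the names payload and ready are just illustrative):

#include <atomic>

int payload = 0;                 // non-atomic data to publish
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // finished memory operation
    ready.store(true, std::memory_order_release);  // publish it
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // get the published view
        // reading payload here is guaranteed to see 42
    }
}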
[Why do you call your reset function `release` in a discussion of memory orders? Not nice!]

They don't have the same effect: the first one, the classical RC implementation (`RefCounting1`), is valid, and your optimized alternative (`RefCounting2`) is broken: its relaxed decrement never releases the decrementing thread's writes, so the acquire fence in the deleting thread has no release operation to synchronize with.

You have to express the minimum synchronization needed, the axioms of the reference-counted shared resource: the resource must be deallocated when the last manager class destructor (or `reset` member function) runs, that is, after all the other destructors. There is an implied ordering: the event "RC reaches zero" must come after all the others. That doesn't mean anyone cares that each RC atomic operation is well ordered with respect to all the others, just that the last one is the last one. `RefCounting1` makes sure that is the case by ordering all the reset operations, which is overkill.

The following should create the necessary ordering:
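Here is a minimal sketch of that pattern, keeping the layout of the question's classes (the name RefCounting3 is just for illustration):

#include <atomic>

struct RefCounting3 {
    void ref_up() {
        // Increments only keep the object alive; they need no ordering.
        m_ref.fetch_add(1, std::memory_order_relaxed);
    }
    void release() {
        // The release decrement publishes this thread's writes; only the
        // thread that brings the count to zero pays for the acquire fence.
        if (m_ref.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;
        }
    }
private:
    std::atomic_int m_ref{1};
};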
Here the last acquire is paired with all previous releases.
Note: the lack of ordering on count increases can make the exact multiplicity of the value of `use_count` compiler- and machine-dependent in MT programs if one thread makes many local copies (and the compiler can accurately track these with escape analysis); in an extreme case, the compiler could remove additional redundant thread-local `shared_ptr` instances that do nothing but change the count, transforming spread-out actions as sketched below.
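For instance (a sketch, assuming a matched increment/decrement pair left over from a redundant local copy, consistent with the `fetch_sub(0, ...)` discussed next), the transformation would turn:

m_ref.fetch_add(1, std::memory_order_relaxed);  // redundant local copy created
// ... code performing no operation with a memory order ...
m_ref.fetch_sub(1, std::memory_order_release);  // redundant local copy destroyed

to:

// no net change to the count, but the release ordering is preserved
m_ref.fetch_sub(0, std::memory_order_release);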
Assuming no operation with a memory order in between.
Note: `m_ref.fetch_sub(0, std::memory_order_release)` can be compiled as a simple memory fence, but one might want to keep the explicit operation on a specific object in intermediate code for as long as possible, until all optimizing phases involving atomics are finished; the compiler can then move `m_ref.fetch_sub(0, std::memory_order_release)` as late as possible in program order, until it reaches an already-needed emission of a release fence, and so simply remove the `fetch_sub` operation. The optimization is trivially sound and clearly a win; the difficulty is mostly in following all the functions called to see that there is nothing "in between" that breaks the optimization.
Note: to avoid breaking progress bars and similar, and even more critical, time-dependent programs (think: heartbeat), such optimizations should be avoided in functions doing heavy computations, that is, anything that runs long enough for the reordering to be noticeable.
The possible optimization makes the value of `use_count` less precise, but not totally random and unreliable. There is a hard lower bound on `use_count` in any shared (between threads) `shared_ptr` family (those that are copies of each other and share a control block), if the program is correctly synchronized.

Contrapositive: if you can't prove there is a lower bound on such a weakly synchronized reference count in an MT program, your program may lack synchronization on `shared_ptr` objects.

In other words: your program must contain synchronization to share `shared_ptr` families between threads, because the only way to do that is to share the value of one particular `shared_ptr` instance between threads.
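For instance (a sketch), handing a copy to a new thread provides exactly that synchronization: the completion of the std::thread constructor synchronizes with the start of the lambda's body.

#include <cstdio>
#include <memory>
#include <thread>

int main() {
    auto p = std::make_shared<int>(42);
    // Thread creation synchronizes the sharing of p's value, so the family
    // has a hard lower bound: use_count() >= 2 while both copies are alive.
    std::thread t([q = p] { std::printf("%d\n", *q); });
    t.join();
}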