When linked "properly" (explained below), both function calls below block indefinitely in the pthread calls that implement cv.notify_one and cv.wait_for:
// let's call it odr.cpp, which forms libodr.so
#include <chrono>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

void Notify() {
    std::chrono::milliseconds(100);  // no effect: constructs and discards a duration (does not sleep)
    std::unique_lock<std::mutex> lock(mtx);
    ready = true;
    cv.notify_one();
}

void Get() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait_for(lock, std::chrono::milliseconds(300));
}
when the shared library above is used in the following application:
// let's call it test.cpp, which forms a.out
#include <iostream>
#include <thread>

// defined in odr.cpp (libodr.so)
void Notify();
void Get();

int main() {
    std::thread thr([&]() {
        std::cout << "Notify\n";
        Notify();
    });
    std::cout << "Before Get\n";
    Get();
    std::cout << "After Get\n";
    thr.join();
}
Problem reproduces only when linking libodr.so:
- with g++
- with gold linker
- providing -lpthread as a dependency
with the following versions of the relevant tools:
- Linux Mint 18.3 Sylvia
- binutils 2.26.1-1ubuntu1~16.04.6
- g++ 4:5.3.1-1ubuntu1
- libc6:amd64 2.23-0ubuntu10
so that we end up with:
- __pthread_key_create defined as a WEAK symbol in the PLT
- no libpthread.so as a dependency in the ELF
as shown here:
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
10: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create
On the other hand, with any of the following we experience no bug:
- clang++
- bfd linker
- no explicit -lpthread
- -lpthread but with -Wl,--no-as-needed
note: this time we have either:
- NOTYPE and no libpthread.so dependency
- WEAK and libpthread.so dependency
as shown here:
$ clang++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
24: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create@GLIBC_2.2.5 (7)
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=bfd -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
14: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __pthread_key_create
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
18: 0000000000000000 0 NOTYPE WEAK DEFAULT UND __pthread_key_create
$ g++ -fPIC -shared -o build/libodr.so build/odr.cpp.o -fuse-ld=gold -Wl,--no-as-needed -lpthread && readelf -d build/libodr.so | grep Shared && readelf -Ws build/libodr.so | grep -m1 __pthread_key_create && ./a.out
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
10: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_key_create@GLIBC_2.2.5 (4)
Complete example to compile/run can be found here: https://github.com/aurzenligl/study/tree/master/cpp-pthread
What breaks shlib using pthread when __pthread_key_create is WEAK and no libpthread.so dependency in ELF can be found? Does the dynamic linker take the pthread symbols from libc.so (stubs) instead of libpthread.so?
There's a lot happening here: differences between gcc and clang, differences between GNU ld and gold, the --as-needed linker flag, two different failure modes, and maybe even some timing issues.
Let's start with how to link a program using POSIX threads.
The compiler's -pthread flag is all you should need. It's a compiler flag, so you should use it both when compiling code that uses threads and when linking the final executable. When you use -pthread on the link step, the compiler will provide the -lpthread flag automatically, and in the right place in the link line.
Typically, you would only use it when linking the final executable, and not when linking a shared library. If you simply want to make your library thread safe, but don't want to force every program that uses your library to link with pthreads, you'd want to use a runtime check to see if the pthreads library is loaded, and call the pthread APIs only if it is. On Linux, this is typically done by checking a "canary" -- for example, make a weak reference to an arbitrary symbol like __pthread_key_create, which will only be defined if the library is loaded, and will have the value 0 if the program was linked without it.
In your case, however, your library libodr.so pretty much depends on threads, so it's reasonable to link it with the -pthread flag.
That brings us to the first failure mode: if you use g++ and gold for both link steps, the program throws std::system_error and says you need to enable multithreading. This is due to the --as-needed flag. GCC passes --as-needed to the linker by default, while clang (apparently) does not. With --as-needed, the linker will only record library dependencies that resolve a strong reference. Since all the references to pthread APIs are weak, none of them are sufficient to tell the linker that libpthread.so should be added to the dependency list (via a DT_NEEDED entry in the dynamic table). Changing to clang or adding a -Wl,--no-as-needed flag solves this problem, and the program will load the pthread library.
But, wait, why don't you need to do this when using the GNU linker? It uses the same rule: only a strong reference causes the library to be recorded as a dependency. The difference is that GNU ld also considers references from other shared libraries, while gold only considers references from regular object files. It turns out that the pthread library provides overriding definitions of several libc symbols, and there are strong references from libstdc++.so to some of those symbols (e.g., write). Those strong references are enough to get GNU ld to record libpthread.so as a dependency. This is more of an accident than design; I don't think changing gold to consider references from other shared libraries would actually be a robust fix. I think the proper solution is for GCC to put --no-as-needed in front of the -lpthread flag when you use -pthread.
This raises the question of why this issue doesn't come up all the time when using POSIX threads and the gold linker. But this is a small test program; a larger program is almost certain to contain strong references to some of those libc symbols that libpthread.so overrides.
Now let's look at the second failure mode, where both Notify() and Get() block indefinitely if you link libodr.so with g++, gold and -lpthread.
In Notify(), you're holding the lock through the end of the function, while you call cv.notify_one(). You really only need to hold the lock to set the ready flag; if we change it so that we release the lock before calling notify_one(), then the thread calling Get() will time out after 300 ms, and does not block. So it's really the call to notify_one() that's blocking, and the program is deadlocking because Get() is waiting on that same lock.
So why does it block only when __pthread_key_create is FUNC instead of NOTYPE? I think the type of the symbol is a red herring, and that the real problem is caused by the fact that gold doesn't record the symbol versions for references resolved by a library that isn't added as a needed library. The implementation of wait_for calls pthread_cond_timedwait, which has two versions in both libpthread and libc. It's possible that the loader is binding the reference to the wrong version, causing a deadlock by failing to unlock the mutex. I made a temporary patch to gold to record those versions, and that made the program work. Unfortunately, that's not a solution, as that patch can cause ld.so to crash under other circumstances.
I tried changing cv.wait_for(...) to cv.wait(lock, []{ return ready; }), and the program runs perfectly in all scenarios, which further suggests that the problem is with pthread_cond_timedwait.
The bottom line is that adding the --no-as-needed flag will fix the problem for this very small test case. Anything larger is likely to work without the extra flag, as you'll be increasing the odds of making a strong reference to a symbol in libpthread. (For example, adding a call to std::this_thread::sleep_for anywhere in odr.cpp adds a strong reference to nanosleep, which puts libpthread in the needed list.)
Update: I've verified that the failing program is linking to the wrong version of pthread_cond_timedwait. For glibc 2.3.2, the pthread_cond_t type was changed, and the old versions of the APIs that use the type were changed to dynamically allocate a new (bigger) structure and store a pointer to it in the original type. So now, if the consuming thread reaches cv.wait_for before the producing thread reaches cv.notify_one, the implementation of cv.wait_for calls the old version of pthread_cond_timedwait, which initializes what it thinks is an old pthread_cond_t in cv with a pointer to a new pthread_cond_t. After that, when the other thread reaches cv.notify_one, its implementation assumes that cv contains a new-style pthread_cond_t rather than a pointer to one, so it calls pthread_mutex_lock with the pointer to the new pthread_cond_t instead of the pointer to the mutex. It locks that would-be mutex, but it never gets unlocked because the other thread unlocks the real mutex.
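For reference, the two source-level variations mentioned above (releasing the lock before calling notify_one, and waiting with a predicate instead of wait_for) could look roughly like the sketch below. It is only an illustration of the shape of those changes, not the code from the linked repository: RunOnce is a hypothetical driver added here so the snippet can be exercised on its own, and Get returns a bool purely for the same reason. It does not replace the link-time fix.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
bool ready = false;

// Producer: hold the lock only while updating the flag, then notify.
void Notify() {
    {
        std::lock_guard<std::mutex> lock(mtx);
        ready = true;
    }  // lock released here, before the notification
    cv.notify_one();
}

// Consumer: the predicate overload of wait() re-checks `ready` under the
// lock, so it copes with spurious wakeups and with a notification that was
// sent before the wait started. Unlike wait_for, it has no timeout.
bool Get() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [] { return ready; });
    return ready;
}

// Hypothetical driver (an assumption for illustration, not in the original):
// resets the flag, runs one producer thread against one consumer call.
bool RunOnce() {
    ready = false;  // no other thread is running at this point
    std::thread producer(Notify);
    bool got = Get();
    producer.join();
    assert(ready);
    return got;
}
```

Calling RunOnce() from main should return true once the producer has set the flag. Build it with the -pthread flag on both the compile and link steps, as described above.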