ReentrantReadWriteLock hang when used within ConcurrentHashMap::compute()


TL;DR - In my application, many threads grab a ReentrantReadWriteLock in READ mode while inserting entries into a ConcurrentHashMap via its compute() method, and release the READ lock once the lambda passed to compute() has finished. A separate thread grabs the same ReentrantReadWriteLock in WRITE mode and very (very) quickly releases it. While all this is happening, the ConcurrentHashMap is resizing (growing AND shrinking). I encounter a hang, and I always see ConcurrentHashMap::transfer(), which is called during resizing, in the stack traces. All threads are blocked waiting to grab MY ReentrantReadWriteLock. Reproducer at: https://github.com/rumpelstiltzkin/jdk_locking_bug

Am I doing something wrong as per documented behavior, or is this a JDK bug? Note that I'm NOT asking for other ways to implement my application.


Details: Here's some context on why my application does what it does. The reproducer is a pared-down version that demonstrates the problem.

My application has a write-through cache. Entries are inserted into the cache with a timestamp recording when they were inserted, and a separate flusher thread iterates the cache to find entries created after the last time it persisted entries to disk, i.e. after last-flush-time. The cache is nothing but a ConcurrentHashMap.

Now, a race is possible: an entry gets constructed with timestamp tX, and while it is being inserted into the ConcurrentHashMap the flusher thread iterates the cache and does not find the entry (it is still being inserted, so not yet visible through the flusher thread's Map iterator). The flusher therefore does not persist it and bumps last-flush-time to tY, with tY > tX. The next time the flusher thread iterates the cache, it will not deem the tX-timestamped entry as needing to be flushed, so we miss persisting it. Eventually tX becomes a very old timestamp, the cache evicts the entry, and that update is permanently lost.

To get around this problem, threads that insert new entries grab a ReentrantReadWriteLock in READ mode inside the lambda that constructs the cache entry within ConcurrentHashMap::compute(), and the flusher thread grabs the same ReentrantReadWriteLock in WRITE mode when it takes its last-flush-time. This ensures that whenever the flusher thread grabs a timestamp, all objects already constructed are "visible" in the Map and have a timestamp <= last-flush-time. A minimal sketch of the pattern follows.
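The sketch below is illustrative only; the class, field, and method names are placeholders of mine, not the reproducer's actual code (see the linked repository for that).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the pattern described above; names and types are placeholders.
class WriteThroughCacheSketch {
    private final ConcurrentHashMap<String, Long> map = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private long lastFlushTime = 0L; // touched only by the flusher thread in this sketch

    // Updater threads: the READ lock is taken and released inside the compute() lambda,
    // i.e. while ConcurrentHashMap is holding its internal bin lock for that key.
    void update(String key) {
        map.compute(key, (k, v) -> {
            rwLock.readLock().lock();
            try {
                return System.nanoTime(); // entry is timestamped while the read lock is held
            } finally {
                rwLock.readLock().unlock();
            }
        });
    }

    // Flusher thread: briefly takes the WRITE lock when grabbing last-flush-time; the intent
    // is that all entries timestamped before this moment are visible to the iteration below.
    void flush() {
        long flushTime;
        rwLock.writeLock().lock();
        try {
            flushTime = System.nanoTime();
        } finally {
            rwLock.writeLock().unlock();
        }
        for (Map.Entry<String, Long> e : map.entrySet()) {
            if (e.getValue() > lastFlushTime) {
                persist(e.getKey(), e.getValue()); // persist entries newer than the previous flush
            }
        }
        lastFlushTime = flushTime;
    }

    private void persist(String key, long timestamp) {
        // write-through to disk in the real application
    }
}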


Reproduction on my system:

$> java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
$> ./runtest.sh 
seed is 1571855560640
Main spawning 100 readers
Main spawned 100 readers
Main spawning a writer
Main spawned a writer
Main waiting for threads ... <== hung

All threads (readers and the writer) blocked waiting for 0x00000000c6511648

$> ps -ef | grep java | grep -v grep
user   54896  54895  0 18:32 pts/1    00:00:07 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main

$> jstack -l 54896 > jstack.1

$> grep -B3 'parking to wait for  <0x00000000c6511648>' jstack.1  | grep tid | head -10
"WRITER" #109 ...
"READER_99" ...
...

'top' shows my java process has been sleeping for minutes (it uses a tiny bit of CPU incrementally, presumably for context switching and the like; see top's man page for more on why this happens):

$> top -p 54896
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                               
 54896 user      20   0 4630492 103988  12628 S   0.3  2.7   0:07.37 java -ea -cp target/*:target/lib/* com.hammerspace.jdk.locking.Main
1 Answer

Answer by Elisha Ebenezer

Note: what follows is a summary of observations, a suggested approach, and advice to file a bug with Oracle; it is not a complete solution.

Observations

  1. ConcurrentHashMap has its own built-in locking mechanism; you do not need to acquire a lock of your own for its operations.

  2. The Atomic* classes use lock-free atomic instructions, so there is no need to acquire a lock when dealing with them.

  3. In Cache.java, you obtain (your own) read lock for updating the cache (line 34) and (your own) write lock for reading from the map (line 58), yet acquire no lock when you actually remove the mapping (line 71).

  4. ConcurrentHashMap's iterators are weakly consistent and may not see your updates even after the insertion has completed; this is by design.

  5. I restored AtomicInteger, as I did not want to use Holder (which comes from JAX-WS), and I was not able to reproduce your thread blocking.

  6. Given that you start the read-lock-acquiring threads before the write-lock-acquiring thread, the writer may never get a chance to run, because there is always a batch of threads that have already acquired the read lock.

  7. Introducing a 1-second sleep in the Cache#update method after releasing the read lock gave the write-lock-acquiring thread a chance to run (a sketch of this change follows the list).

  8. I then reverted my changes and was able to reproduce your issue, but I did see a pattern:

    a. Using Holder for lockCount brought the system to a crawl almost immediately.

    b. Using AtomicInteger for lockCount kept it running for a few more seconds.

    c. Adding a console statement on lock acquisition and release that printed the runnable's id kept it running for another minute or two.

    d. Replacing that id with the current thread's name in the console output made the problem go away completely.

With this, it clearly looks like a timing issue: the readers and the writer race to acquire the read and write locks and end up waiting on each other in a deadlock, and the delay introduced by the extra statements reduces the chance of hitting it.
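For reference, here is a hypothetical sketch of the observation-7 experiment. The reproducer's real Cache#update may look different, and the exact placement of the sleep is an assumption; this version puts it after compute() returns.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only; field names and the placement of the sleep are assumptions.
class SleepExperimentSketch {
    private final ConcurrentHashMap<String, Long> map = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    void update(String key) {
        map.compute(key, (k, v) -> {
            rwLock.readLock().lock();
            try {
                return System.nanoTime();
            } finally {
                rwLock.readLock().unlock(); // read lock released inside the lambda, as in the question
            }
        });
        try {
            Thread.sleep(1000); // 1-second pause gives the WRITER a window to acquire the write lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}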

Suggested Approach

  1. Given that ConcurrentHashMap comes with its own locking mechanism, you can stop using your own reentrant lock while dealing with it.

  2. Update your code to give the write-lock acquirer a chance to run :) (one way of reading points 1 and 2 together is sketched below).

  3. Check your version of Java; I never reached a blocked state while running on Java 1.8.0_201.
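For illustration, one way to read points 1 and 2 together — a sketch with placeholder names, not code from the question or the reproducer — is to keep the visibility handshake with the flusher but acquire the read lock around compute() rather than inside the lambda, so the custom lock is never waited on while ConcurrentHashMap holds one of its internal bin locks.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only; names are placeholders. The read lock is taken BEFORE compute(), so the
// lambda passed to the map never blocks while a bin lock is held.
class LockAroundComputeSketch {
    private final ConcurrentHashMap<String, Long> map = new ConcurrentHashMap<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    void update(String key) {
        rwLock.readLock().lock(); // may park here, but no ConcurrentHashMap bin lock is held yet
        try {
            map.compute(key, (k, v) -> System.nanoTime()); // non-blocking lambda
        } finally {
            rwLock.readLock().unlock();
        }
    }
}

Whether this actually removes the hang in the reproducer is something to verify on your setup.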