We have a Java application that is crashing our Red Hat server (30 cores / 512 GB RAM) by consuming some (unknown?) resource, which prevents other components from creating new threads. We're currently working around this by killing the process that is spamming threads each time the problem appears, which is about every 15 days. We tried setting huge values in /etc/security/limits.conf, but we hit the problem well before reaching that limit.
I counted the threads the last time it happened using ps -efL | wc -l. Is 10,000 threads a lot for our beast, knowing that CPU/RAM consumption was low at that moment? I used gstack to try to figure out where it is stuck, but since it is a Java program I don't know whether the output is meaningful. I could identify a pattern, though: most of the 9,000 threads look like this:
Thread 9049 (Thread 0x7f43d5087700 (LWP 123925)):
#0 0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2 0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5 0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6 0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f43d722f1ad in clone () from /lib64/libc.so.6
Thread 9048 (Thread 0x7f43d4f86700 (LWP 123926)):
#0 0x00007f43d791e705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f43d6a94f33 in os::PlatformEvent::park() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#2 0x00007f43d6a58e67 in Monitor::IWait(Thread*, long) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#3 0x00007f43d6a59786 in Monitor::wait(bool, long, bool) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#4 0x00007f43d6c48e1b in GangWorker::loop() () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#5 0x00007f43d6a9bd48 in java_start(Thread*) () from /opt/3pp/jdk1.8.0_25/jre/lib/amd64/server/libjvm.so
#6 0x00007f43d791adf5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f43d722f1ad in clone () from /lib64/libc.so.6
Also, before killing the process I used gcore -o /tmp/dump.txt. Is that a correct way to get a core file of a Java process?
When I try to look at it using gdb, I get "no debugging symbols" and "not a core dump". Is this the right way to inspect this kind of file?
M1:~# gdb /opt/3pp/jre/bin/java /tmp/dump.txt.123913
GNU gdb (GDB)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/3pp/jre/bin/java...(no debugging symbols
"/tmp/dump.txt.123913" is not a core dump: File format not recognized
Missing separate debuginfos, use: debuginfo-install jre1.8.0_25-1.8.0_25-fcs.x86_64
Thanks in advance for your time.
It's not an insignificant number of threads, but no, 10K threads is not that many, especially for a 30-core machine. The 4-core Windows desktop I'm currently on has ~3K.
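To see which process actually owns those threads, rather than only the system-wide total from ps -efL | wc -l, something along these lines might help. It is only a sketch: it assumes a Linux /proc filesystem, and the function name top_thread_owners and its default of 5 rows are made up for illustration.

```shell
#!/bin/sh
# List the processes with the most threads, highest first.
# Reads the "Threads:" field that Linux exposes in /proc/<pid>/status.
# top_thread_owners is a hypothetical helper name, not a standard tool.
top_thread_owners() {
    limit="${1:-5}"   # number of processes to show (arbitrary default)
    for status in /proc/[0-9]*/status; do
        # A process may exit while we iterate, so read errors are ignored.
        awk '/^Name:/    { name = $2 }
             /^Pid:/     { pid = $2 }
             /^Threads:/ { n = $2 }
             END { if (n != "") print n, pid, name }' "$status" 2>/dev/null
    done | sort -rn | head -n "$limit"
}

# Columns printed: thread count, PID, process name.
top_thread_owners 5
```

If one PID accounts for most of the 10K threads, that is the process to point the Java tools at.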
I have never tried debugging Java using native thread stacks, but that stack trace looks to me like a "parked" thread: in other words, a thread in some thread pool that has nothing to do, so it's waiting for work. See this answer for more details.
It probably has some value, but I would suggest using Java-specific tools for the job. The first thing that comes to mind is jcmd, which comes with the JDK. Here's a link to get you started. Java 9's version has some nicer documentation, and is very similar.
What I'd specifically do is use the Thread.print command of jcmd to print Java-level stack traces, and GC.heap_dump to dump the entire Java heap into an .hprof file, which can later be analyzed by tools such as MAT.
If you're using a JDK 8 with "Commercial Features", you could also enable JFR (Java Flight Recorder), which tracks the execution of the process. The files created by JFR can be opened using either Oracle's "Mission Control" or an alternative "Mission Control", such as Azul's Zulu Mission Control.
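As a rough sketch of the jcmd steps above: the wrapper function names and the file paths are placeholders I made up, and 123913 is only the PID suggested by the core file name in the question. Thread.print and GC.heap_dump are the real jcmd commands; jcmd has to be on the PATH and run as the user that owns the JVM.

```shell
#!/bin/sh
# Capture a Java-level thread dump from a running JVM.
dump_threads() {   # usage: dump_threads <pid> <output-file>
    jcmd "$1" Thread.print > "$2"
}

# Dump the whole Java heap to an .hprof file (can be large and slow).
dump_heap() {      # usage: dump_heap <pid> <path-to.hprof>
    jcmd "$1" GC.heap_dump "$2"
}

# Summarise a saved thread dump: number of Java threads per state.
count_thread_states() {   # usage: count_thread_states <thread-dump-file>
    grep 'java.lang.Thread.State:' "$1" \
        | sed 's/^[[:space:]]*java.lang.Thread.State: //' \
        | sort | uniq -c | sort -rn
}

# Example (PID and paths are placeholders):
#   dump_threads 123913 /tmp/threads.txt
#   count_thread_states /tmp/threads.txt
#   dump_heap 123913 /tmp/heap.hprof
```

The summary only counts Java threads. JVM-internal threads such as GC workers are listed in the dump without a Thread.State line, so a large gap between this count and the native thread count is itself a useful hint.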
Finally, you could also try to connect to the process using jconsole, which is another tool that comes with the JDK.
Good luck.