We are running our Ruby on Rails application on AWS ECS. The application has multiple services per cluster, each running X number of tasks that are responsible for the jobs in their queue. For queueing we use Resque, which uses Redis as its database. Versions:
- Rails 5.2.6
- Ruby 2.5.9p229
- Resque 2.4.0
Most of the time everything works fine, but sometimes some of the tasks just get stuck. After some investigation I found the following.
htop output:
8 root 20 0 239M 137M 13928 S 0.0 1.8 0:04.73 `- /usr/local/bundle/bin/rake resque:workers QUEUE=import COUNT=1
11 root 20 0 1091M 377M 32640 S 0.0 4.9 0:21.33 | `- resque-2.4.0: Forked 27 at 1680393903
28 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.11 | | `- ruby-timer-thr
27 root 20 0 1027M 308M 20588 S 0.0 4.0 0:48.67 | | `- resque-2.4.0: Processing import since 1680
32 root 20 0 1027M 308M 20588 S 0.0 4.0 0:00.00 | | | `- connection_poo*
30 root 20 0 1027M 308M 20588 S 0.0 4.0 0:00.00 | | | `- connection_poo*
29 root 20 0 1027M 308M 20588 S 0.0 4.0 0:48.46 | | | `- ruby-timer-thr
26 root 20 0 1091M 377M 32640 S 0.0 4.9 0:08.19 | | `- worker.rb:527
23 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.03 | | `- ruby
20 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.00 | | `- jemalloc_bg_thd
19 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.35 | | `- connection_poo*
18 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.39 | | `- connection_poo*
17 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.49 | | `- connection_poo*
16 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.29 | | `- connection_poo*
15 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.43 | | `- connection_poo*
13 root 20 0 1091M 377M 32640 S 0.0 4.9 0:00.46 | | `- connection_poo*
10 root 20 0 239M 137M 13928 S 0.0 1.8 0:00.00 | `- tasks.rb:32
9 root 20 0 239M 137M 13928 S 0.0 1.8 0:00.00 | `- ruby-timer-thr
When I run strace on PID 27, it shows this:
futex(0x7f7027dc52d0, FUTEX_WAIT_PRIVATE, 2, NULL
My understanding is that the process is waiting on a lock that never gets released. Maybe one of the other processes or threads is holding it, or it is related to a remote connection (Redis, Postgres, etc.), but I'm not sure how to check that.
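One way I could imagine checking the remote-connection theory is to list what the stuck process still has open. Below is a minimal sketch, assuming Linux with /proc mounted and permission to read the target process's fd directory. The helper name open_descriptors is my own and is not part of Resque or Rails. Sockets show up as socket:[inode], so this only tells me that connections are open, not where they point.

```ruby
# Hypothetical helper: maps each open file descriptor of a process to its target.
# Assumes Linux with /proc mounted; returns {} if the directory cannot be read.
def open_descriptors(pid)
  fd_dir = "/proc/#{pid}/fd"
  Dir.children(fd_dir).each_with_object({}) do |fd, result|
    begin
      result[fd.to_i] = File.readlink(File.join(fd_dir, fd))
    rescue SystemCallError
      # The descriptor was closed between listing and reading; skip it.
    end
  end
rescue SystemCallError
  # No such process, no /proc, or no permission.
  {}
end

# Demonstrated on the current process; for the stuck job I would pass
# the PID shown in htop (27 in the output above).
open_descriptors(Process.pid).sort.each do |fd, target|
  puts "#{fd} -> #{target}"
end
```

Comparing the output of a healthy worker with a stuck one might at least show whether the Redis or Postgres sockets are still open.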
I also noticed that we have multiple 'ruby-timer-thr' entries. strace on PID 29 returns '-1 EAGAIN' every couple of minutes; I will post the entire message once some of the tasks get stuck again.
strace on PID 28 shows this every second:
restart_syscall(<... resuming interrupted read ...>) = 0
poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
poll([{fd=3, events=POLLIN}], 1, 100) = 0 (Timeout)
I sent SIGKILL to PID 29 (SIGTERM had no effect), after which the task created a new Resque fork and continued normally.
Does anyone have any idea what the problem is here, and is there any other way for me to debug this? Maybe there is a deadlock somewhere?
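One debugging idea I am considering is to have the worker print the backtrace of every Ruby thread when it receives a signal, so I can see where each thread is waiting. This is only a sketch under assumptions: the helper name thread_dump is hypothetical, TTIN is assumed to be unused by Resque in our setup, and the handler may never run if the process is blocked in native code while holding the interpreter lock.

```ruby
# Hypothetical helper: formats the backtrace of every live thread in this process.
def thread_dump
  Thread.list.map do |thread|
    header = "Thread #{thread.object_id} status=#{thread.status.inspect}"
    # Thread#backtrace returns nil for a thread that has already finished.
    frames = (thread.backtrace || ["(no backtrace available)"]).map { |line| "  #{line}" }
    ([header] + frames).join("\n")
  end.join("\n\n")
end

# Assumption: TTIN is not used for anything else in this process.
Signal.trap("TTIN") do
  # Write straight to stderr; a Logger takes a mutex, which is not allowed
  # inside a trap handler.
  $stderr.puts(thread_dump)
end
```

With this loaded in an initializer, running `kill -TTIN 27` against the stuck job process would, if the handler gets a chance to run, print one block per thread to the task's stderr.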