Describe the bug
There is a possible race condition in the ruby reaper, where a worker is started, the reaper runs right after and reaps the active job.
Expected behavior
Active workers shouldn't be reaped.
Current behavior
Active workers are reaped incorrectly.
Worker class
module Debugging
class ProcessWorker
include Sidekiq::Worker
sidekiq_options queue: "default", retry: 0, lock: :until_executed, on_conflict: :log
def perform
puts "Starting ProcessWorker"
sleep(5)
puts "Finishing ProcessWorker"
end
end
end
Config
SidekiqUniqueJobs.configure do |config|
config.reaper = :ruby
config.reaper_interval = 1
end
Additional context
I've adding some logging to outline where the problem occurs.
log_info("****** valid #{valid}") and log_info("****** workers #{workers}") was added between line 129 and 131 in RubyReaper#active?
I added logging in RubyReaper#belongs_to_job?
def belongs_to_job?(digest)
scheduled = scheduled?(digest)
retried = retried?(digest)
enqueued = enqueued?(digest)
active = active?(digest)
log_info("digest #{digest} scheduled #{scheduled}")
log_info("digest #{digest} retried #{retried}")
log_info("digest #{digest} enqueued #{enqueued}")
log_info("digest #{digest} active #{active}")
scheduled || retried || enqueued || active
end
digests.each { |digest| log_info("digest #{digest} to be deleted") } was added after line 74 in BatchDelete#call
Here's my log output:
2021-01-07T18:22:41.016Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:41.030Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa INFO: start
2021-01-07T18:22:41.335Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Starting ProcessWorker
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: ****** valid true
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: ****** workers {}
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c scheduled false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c retried false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c enqueued false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c active false
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Deleting batch with 1 digests
2021-01-07T18:22:42.053Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: digest uniquejobs:512613aedf8d8099a473a343e0bc352c to be deleted
2021-01-07T18:22:42.056Z 14493 TID-owkkfzd4l INFO: (2021-01-07 13:22:42 -0500) Execution successfully returned 1
2021-01-07T18:22:43.063Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:44.068Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:45.077Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:46.088Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
2021-01-07T18:22:46.336Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c INFO: Finishing ProcessWorker
2021-01-07T18:22:46.338Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa uniquejobs-server DIG-uniquejobs:512613aedf8d8099a473a343e0bc352c WARN: might need to be unlocked manually
2021-01-07T18:22:46.346Z 14493 TID-owkkfzqzt Debugging::ProcessWorker JID-7be382796c87374b89c0cefa INFO: done: 5.316 sec
2021-01-07T18:22:47.098Z 14493 TID-owkkfzd4l uniquejobs=orphan-reaper INFO: Nothing to delete; exiting.
Based on the logs, sidekiq logs that its starting ProcessWorker, then the first puts statement in ProcessWorker is logged, then the reaper starts running. When it checks for active jobs, there are no workers, even though the ProcessWorker has clearly started. Thus, active? comes back false, and the digest gets deleted.
I've exacerbated this problem by setting the reaper interval to 1 second, but I have had this occur in production with the interval set to the default of 6 minutes.
I've been able to replicate this issue about 50% of the time with the above settings/logging.
This issue doesn't seem to occur with the lua reaper. I'm testing our staging environment currently with the lua reaper to ensure it will work for us. Based on the documentation
The :ruby job is much slower but the :lua job locks redis while executing. While doing intense processing it is best to avoid locking redis with a lua script.
I was hesitant to enable because we do some pretty intense processing in sidekiq. I wasn't sure if the "intense processing" applied to what the reaper is doing or overall.
Is this expected behavior in the ruby reaper because it doesn't lock redis and we should use lua instead? Or is this actually a bug for the ruby reaper?
Describe the bug
There is a possible race condition in the ruby reaper, where a worker is started, the reaper runs right after and reaps the active job.
Expected behavior
Active workers shouldn't be reaped.
Current behavior
Active workers are reaped incorrectly.
Worker class
Config
Additional context
I've adding some logging to outline where the problem occurs.
log_info("****** valid #{valid}")andlog_info("****** workers #{workers}")was added between line 129 and 131 in RubyReaper#active?I added logging in
RubyReaper#belongs_to_job?digests.each { |digest| log_info("digest #{digest} to be deleted") }was added after line 74 in BatchDelete#callHere's my log output:
Based on the logs, sidekiq logs that its starting ProcessWorker, then the first puts statement in ProcessWorker is logged, then the reaper starts running. When it checks for active jobs, there are no workers, even though the ProcessWorker has clearly started. Thus,
active?comes back false, and the digest gets deleted.I've exacerbated this problem by setting the reaper interval to 1 second, but I have had this occur in production with the interval set to the default of 6 minutes.
I've been able to replicate this issue about 50% of the time with the above settings/logging.
This issue doesn't seem to occur with the lua reaper. I'm testing our staging environment currently with the lua reaper to ensure it will work for us. Based on the documentation
I was hesitant to enable because we do some pretty intense processing in sidekiq. I wasn't sure if the "intense processing" applied to what the reaper is doing or overall.
Is this expected behavior in the ruby reaper because it doesn't lock redis and we should use lua instead? Or is this actually a bug for the ruby reaper?