Describe the bug
The locks for some jobs were not removed after the jobs completed. This led to new jobs being rejected as duplicates even though no similar jobs remained in the queue.
The problem only affected some jobs, and only immediately after we released the change that added the lock to the worker. We suspect the bug is related to the many duplicate jobs that were already scheduled and enqueued at the time of the release.
Expected behavior
New jobs should not be rejected when they are unique.
Duplicate jobs that are already scheduled or enqueued when the lock is introduced should not cause locks to get stuck. Alternatively, if this is a known problem, the documentation should warn about it.
Current behavior
The situation is similar to #379.
We have a CRON running every 10 minutes, as well as other processes in the app, scheduling jobs like so:
InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_async(inbox_id: inbox.id)
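For context, the cron is a sidekiq-scheduler entry roughly like the sketch below. The schedule key and the fan-out job are illustrative, not our exact config (our real task iterates inboxes and enqueues one ProcessSendQueue per inbox):

# Illustrative sidekiq-scheduler setup; EnqueueSendQueueChecks is a
# hypothetical stand-in for our real fan-out task.
require 'sidekiq'
require 'sidekiq-scheduler'

Sidekiq.configure_server do |config|
  config.on(:startup) do
    Sidekiq.schedule = {
      'enqueue_send_queue_checks' => {
        'cron'  => '*/10 * * * *', # every 10 minutes
        'class' => 'InboxManagement::Job::EnqueueSendQueueChecks',
        'queue' => 'email'
      }
    }
    Sidekiq::Scheduler.reload_schedule! # sidekiq-scheduler 2.x API
  end
end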
The job also re-enqueues itself like so:
InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(number_of_sec, inbox_id: inbox.id, check_at: check_at)
As explained above, we noticed that a number of jobs had not run for hours after we introduced the lock. This was critical, so we didn't have much time to debug further. We took the following steps to solve the immediate problem:
We rolled back the release, manually removed the uniquejobs Redis keys (see the sketch below), and released again.
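For reference, removing the keys amounts to a scan-and-delete over the digest keys. A minimal sketch of that kind of cleanup, assuming the default uniquejobs: prefix and that every lock can safely be dropped:

# Sketch: delete all sidekiq-unique-jobs digest keys.
# Assumes the default "uniquejobs:" prefix; run only when you are sure
# no legitimate locks need to survive.
require 'sidekiq'

Sidekiq.redis do |conn|
  cursor = '0'
  loop do
    cursor, keys = conn.scan(cursor, match: 'uniquejobs:*', count: 1_000)
    conn.del(*keys) unless keys.empty?
    break if cursor == '0'
  end
end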
Jobs seem to be running fine for now. However, we're not 100% sure what caused the bug, and we definitely want to prevent jobs from getting stuck this way again.
Worker class
# frozen_string_literal: true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, unique_args: ->(args) { args }, on_conflict: :log

      # ------------------------------------------------------------------------------
      def perform(options)
        puts "Calling InboxManagement::Job::ProcessSendQueue with options #{options}"

        inbox_id = options.fetch('inbox_id')
        inbox = Inbox.find_by(id: inbox_id)
        throttle = inbox.get_throttling_time_in_sec
        check_at = Time.now.utc.to_i + throttle + 1

        if inbox.is_locked?
          # Too soon to process. Make sure there is at least 1 sendqueue processing item in the job queue
          delay = inbox.seconds_before_can_send + 1
          puts "Inbox(#{inbox.id}) #{inbox.email} - Too soon, retry in #{delay} seconds"
          inbox.unlock
          InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(delay, inbox_id: inbox.id, check_at: check_at)
          return
        end

        # --- Perform processing here ---

        # Always have a sendqueue check as last step
        puts "Inbox(#{inbox.id}) #{inbox.email} - schedule next processsendqueue in #{throttle}+1"
        inbox.unlock
        InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(throttle + 1, inbox_id: inbox.id, check_at: check_at)
      end
    end
  end
end
Additional context
gem 'sidekiq', '~> 6.1.2'
gem 'sidekiq-scheduler', '~> 2.2.2'
gem 'sidekiq-status', '~> 1.1.4'
gem 'sidekiq-unique-jobs', '7.0.0.beta27'
We also found many warnings like these in the server logs right after the release:
330 <190>1 2021-01-07T19:03:37.142273+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: d3ec01c74beff41ed222804c)
328 <190>1 2021-01-07T19:03:37.1459+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:37800753c47b7564a86721f0b28bf311 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:37800753c47b7564a86721f0b28bf311, job_id: b26be9241d357a2e043c89b3)
330 <190>1 2021-01-07T19:03:37.245831+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d569e7b8d189a774f2c2e67784c70285 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d569e7b8d189a774f2c2e67784c70285, job_id: c7e29af7ae83436fd3ccccc7)
330 <190>1 2021-01-07T19:03:37.250079+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: 3e8eb90413f35ebc7b6a14d7)
330 <190>1 2021-01-07T19:03:37.278398+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: b5430cfead1ab9cbfda7d698)
330 <190>1 2021-01-07T19:03:37.288918+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: 3d9191a2ca6333480c0a8ee0)
330 <190>1 2021-01-07T19:03:37.298174+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: 4d19b48aeb0e69413577ca04)
330 <190>1 2021-01-07T19:03:37.312854+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: f13c240ee433061cbcd60c13)
330 <190>1 2021-01-07T19:03:37.316287+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: a70cf0d6887a0920d7a1addc)
330 <190>1 2021-01-07T19:03:37.340184+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:f687ad9fffe7614804d7e54799106bbe WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:f687ad9fffe7614804d7e54799106bbe, job_id: a1981f5e045d2e02ee97a1ca)
330 <190>1 2021-01-07T19:03:37.388268+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:0bb75f49f2a1d29faafe872698124f9c WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:0bb75f49f2a1d29faafe872698124f9c, job_id: 39a716b48c4c466dc5025e81)
One of the jobs affected by the bug is listed there: uniquejobs:1c3f9b4b86ec4a62d650a48096f90d97. We found it odd that these warnings have no corresponding INFO log saying the job will be skipped, and that the lock type (until_executed) is not displayed before the digest.
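As a side note, the "Timed out after 0s" in those warnings appears to match the gem's default lock_timeout of 0 (don't wait for the lock). We haven't confirmed this is related, but one experiment we're considering is giving the client a short wait, along these lines:

# Experiment only, not a confirmed fix: let the client wait briefly for
# the primed token instead of the default 0 seconds (lock_timeout: 0).
sidekiq_options lock: :until_executed,
                lock_timeout: 2, # seconds to wait before giving up on the lock
                unique_args: ->(args) { args },
                on_conflict: :log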