Skip to content

Lock not getting properly cleared for some jobs #560

@giuliana-quickmail

Description

@giuliana-quickmail

Describe the bug

The locks for some jobs were not removed after completing the job. This lead to new jobs getting rejected despite being unique with no other similar jobs in the queue.

This problem only affected some jobs immediately after we did a release adding the lock on the worker. We suspect the bug is related to the fact that we had many duplicate jobs scheduled and enqueued at the time of the release.

Expected behavior

No new jobs should not be rejected when they are unique.

Having duplicate jobs scheduled/enqueued already when introducing the lock should not cause locks to get stuck. Alternatively, there should be a warning in the documentation if this is a known problem.

Current behavior

The situation is similar to #379.

We have a CRON running every 10 minutes, as well as other processes in the app, scheduling jobs like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_async(inbox_id: inbox.id)

The job also re-enqueues itself like so:

InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(number_of_sec, inbox_id: inbox.id, check_at: check_at)

As explained above, we noticed a number of jobs had not run for hours after introducing the lock. This was critical so we didn't have much time to debug further. We took the following steps to solve the immediate problem:
We rolled back the release, manually removed uniquejobs Redis keys, and released again.

Jobs seem to be running fine for now. However we're not 100% sure what caused the bug and definitely want to prevent jobs from getting stuck this way again.

Worker class

# frozen_string_literal: true

module InboxManagement
  module Job
    class ProcessSendQueue
      include Sidekiq::Worker

      sidekiq_options lock: :until_executed, unique_args: ->(args) { args }, on_conflict: :log

      # ------------------------------------------------------------------------------
      def perform(options)
        puts "Calling InboxManagement::Job::ProcessSendQueue with options #{options}"

        inbox_id = options.fetch('inbox_id')
        inbox = Inbox.find_by(id: inbox_id)

        throttle = inbox.get_throttling_time_in_sec
        check_at = Time.now.utc.to_i + throttle + 1

        if inbox.is_locked?
          # Too soon to process. Make sure there is at least 1 sendqueue processing item in the job queue
          delay = inbox.seconds_before_can_send + 1
          puts "Inbox(#{inbox.id}) #{inbox.email} - Too soon, retry in #{delay} seconds"
          inbox.unlock
          InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(delay, inbox_id: inbox.id, check_at: check_at)
          return
        end

        # --- Perform processing here ---

        # Always have a sendqueue check as last step
        puts "Inbox(#{inbox.id}) #{inbox.email} - schedule next processsendqueue in #{throttle}+1"
        inbox.unlock
        InboxManagement::Job::ProcessSendQueue.set(queue: 'email').perform_in(throttle + 1, inbox_id: inbox.id, check_at: check_at)
      end
    end
  end
end

Additional context

gem 'sidekiq', '~> 6.1.2'
gem 'sidekiq-scheduler', '~> 2.2.2'
gem 'sidekiq-status', '~> 1.1.4'
gem 'sidekiq-unique-jobs', '7.0.0.beta27'

We also found many warnings like these in the server logs right after the release:

330 <190>1 2021-01-07T19:03:37.142273+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: d3ec01c74beff41ed222804c)
328 <190>1 2021-01-07T19:03:37.1459+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:37800753c47b7564a86721f0b28bf311 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:37800753c47b7564a86721f0b28bf311, job_id: b26be9241d357a2e043c89b3)
330 <190>1 2021-01-07T19:03:37.245831+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d569e7b8d189a774f2c2e67784c70285 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d569e7b8d189a774f2c2e67784c70285, job_id: c7e29af7ae83436fd3ccccc7)
330 <190>1 2021-01-07T19:03:37.250079+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:d921d85f78af446472ea63f84f7fb8c6 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:d921d85f78af446472ea63f84f7fb8c6, job_id: 3e8eb90413f35ebc7b6a14d7)
330 <190>1 2021-01-07T19:03:37.278398+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: b5430cfead1ab9cbfda7d698)
330 <190>1 2021-01-07T19:03:37.288918+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:29ae8d6615f0432bce1c996e045f8dc9 WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:29ae8d6615f0432bce1c996e045f8dc9, job_id: 3d9191a2ca6333480c0a8ee0)
330 <190>1 2021-01-07T19:03:37.298174+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: 4d19b48aeb0e69413577ca04)
330 <190>1 2021-01-07T19:03:37.312854+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: f13c240ee433061cbcd60c13)
330 <190>1 2021-01-07T19:03:37.316287+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:e70378146afc44c9f43f3b54f1ad530b WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:e70378146afc44c9f43f3b54f1ad530b, job_id: a70cf0d6887a0920d7a1addc)
330 <190>1 2021-01-07T19:03:37.340184+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:f687ad9fffe7614804d7e54799106bbe WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:f687ad9fffe7614804d7e54799106bbe, job_id: a1981f5e045d2e02ee97a1ca)
330 <190>1 2021-01-07T19:03:37.388268+00:00 app sidekiq_next.2 - - pid=25 tid=oxf0cgpr5 uniquejobs=client =uniquejobs:0bb75f49f2a1d29faafe872698124f9c WARN: Timed out after 0s while waiting for primed token (digest: uniquejobs:0bb75f49f2a1d29faafe872698124f9c, job_id: 39a716b48c4c466dc5025e81)

One of the jobs affected by the bug is listed there, uniquejobs:1c3f9b4b86ec4a62d650a48096f90d97. We found it odd that these warnings do not have the corresponding INFO log saying job will be skipped, and that the lock type until_executed is not displayed before the digest.

Metadata

Metadata

Assignees

Labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions