The redis/aquire_lock.lua script sets two keys in redis for each unique job; the first has an expiration time:
if redis.pcall('set', unique_key, job_id, 'nx', 'ex', expires) then
  redis.pcall('hsetnx', 'uniquejobs', job_id, unique_key)
  return 1
else
  return 0
end
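Note that only the first of those two keys expires on its own; the matching field in the uniquejobs hash has no TTL. As a rough illustration (with made-up key and job id values, not the gem's actual naming), the script's effect amounts to:

-- illustration only, hypothetical names
redis.call('set', 'uniquejobs:abc123', 'jid-1', 'nx', 'ex', 30)   -- lock key, expires after 30 seconds
redis.call('hsetnx', 'uniquejobs', 'jid-1', 'uniquejobs:abc123')  -- hash field, only removed by an explicit HDEL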
The redis/release_lock.lua script contains this to delete the same two keys:
if redis.pcall('del', unique_key) then
  redis.pcall('hdel', 'uniquejobs', job_id)
  return 1
end
However, if the lock key has already expired by the time this release_lock script is called, the first redis.pcall returns false and the second one never executes. Because the matching field in the uniquejobs hash has no expiration of its own, the hash keeps those entries forever and grows without bound; in my case it ended up consuming all available memory and caused Sidekiq to reject new jobs even though the underlying queues were apparently empty.
One simple fix may be to always remove the entry from the uniquejobs hash before testing whether the (possibly already expired) lock key can also be deleted, but I'll leave it to someone who understands the locking mechanism better to judge whether that is safe:
redis.pcall('hdel', 'uniquejobs', job_id)
if redis.pcall('del', unique_key) then
  return 1
end
My workaround (to avoid having to change the library) is to add a configuration setting that increases the default lock expiration:
SidekiqUniqueJobs.configure do |config|
  config.default_queue_lock_expiration = 24 * 60 * 60
end
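Separately, for entries that have already leaked, a one-off cleanup could reclaim the memory. The following is only a sketch (it is not part of the gem, and it assumes the uniquejobs hash is still small enough for a single HGETALL to be acceptable): it removes every hash field whose lock key has already expired.

-- one-off cleanup sketch, not part of sidekiq-unique-jobs
local entries = redis.call('hgetall', 'uniquejobs')
for i = 1, #entries, 2 do
  local job_id, unique_key = entries[i], entries[i + 1]
  -- the lock key is gone (expired), so the hash entry is stale
  if redis.call('exists', unique_key) == 0 then
    redis.call('hdel', 'uniquejobs', job_id)
  end
end
return 1

Something like this could be run once with redis-cli --eval (or from a console) and repeated whenever the hash starts growing again.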
P.S.: the correct spelling of 'aquire' is 'acquire'