
Improve Ruby Reaper performance under heavy load #663

@francesmcmullin

Description

Is your feature request related to a problem? Please describe.
I find the Ruby reaper runs slowly and uses a lot of memory when I'm processing a large volume of jobs (hundreds of thousands in a few hours). I've updated our configuration so that if the reaper fails entirely, the reaper resurrector brings it back, and it mostly doesn't hit the timeout, but ideally it would run a little faster with a smaller footprint.
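
For context, our resurrector setup looks roughly like this (option names as I understand them from the gem's README; the values are illustrative, so worth double-checking against the version you run):

    SidekiqUniqueJobs.configure do |config|
      config.reaper                      = :ruby # the reaper discussed in this issue
      config.reaper_count                = 1_000 # max digests deleted per run
      config.reaper_interval             = 600   # seconds between reaper runs
      config.reaper_timeout              = 150   # seconds before a run is considered failed
      config.reaper_resurrector_enabled  = true  # restart the reaper if it dies
      config.reaper_resurrector_interval = 3600  # seconds between liveness checks
    end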

Describe the solution you'd like
I suggest that the reaper should work through the oldest digests first, and that it should avoid loading all digests into Ruby at once. Here's the current code I'm interested in:

    conn.zrevrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
      next if belongs_to_job?(digest)

      memo << digest
      break memo if memo.size >= reaper_count
    end

Currently, using zrevrange means we scan from the highest score to the lowest. Since the current timestamp is generally used as a digest's score, that means going from newest to oldest. It's certainly not perfect, but I suggest a better general heuristic when seeking stale digests is to go from oldest to newest, which we can do by using zrange instead of zrevrange.
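
This first change is a minimal sketch: swap zrevrange for zrange and keep everything else the same, so the scan starts at the lowest (oldest) scores:

    conn.zrange(digests.key, 0, -1).each_with_object([]) do |digest, memo|
      next if belongs_to_job?(digest)

      memo << digest
      break memo if memo.size >= reaper_count
    end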

Second, and perhaps more laborious to implement, I suggest paging through the digests rather than loading the whole set. It might look something like this:

    # Page through the digests sorted set instead of loading it all into Ruby at once.
    page = 0
    per = reaper_count * 2
    orphans = []
    # ZRANGE start and stop indexes are both inclusive, hence the - 1 on the stop.
    # Use a separate variable so the digests object we page over isn't shadowed.
    batch = conn.zrange(digests.key, page * per, (page + 1) * per - 1)

    until batch.empty?
      batch.each do |digest|
        next if belongs_to_job?(digest)

        orphans << digest
        break if orphans.size >= reaper_count
      end

      break if orphans.size >= reaper_count

      page += 1
      batch = conn.zrange(digests.key, page * per, (page + 1) * per - 1)
    end

    orphans
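
One trade-off worth noting: a larger per means fewer round trips to Redis but more digest strings held in Ruby at once. reaper_count * 2 is only a starting point; it assumes roughly half of each page still belongs to a live job.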

Describe alternatives you've considered
I've considered switching to the Lua reaper, but I was concerned about blocking Redis. I'm also thinking about changing some of our application logic so we don't lean quite so heavily on unique jobs, but that will take longer to develop.
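
For reference, switching reapers is a one-line configuration change (same caveat as above about checking option names against your version):

    SidekiqUniqueJobs.configure do |config|
      config.reaper = :lua # reaps inside Redis via a Lua script
    end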

Additional context
I'm happy to provide more detail on how we're using sidekiq-unique-jobs in case that's helpful. We tend to process large volumes of jobs (e.g., 300,000) in a short amount of time (e.g., 2 hours) and then have long periods with much less activity.
