Ludicrous Mode: Lockable resources queue contention reduction and script caching #966
Conversation
What is ludicrous speed?
------------------------

[Obligatory video of ludicrous speed](https://youtu.be/oApAdwuqtn8?si=uJHPP5OXz9GNllY0)

<details><summary>Ludicrous speed is faster than light speed</summary>

---

</details>

---

Background
----------

I am patching all plugins that affect Jenkins queue maintenance performance as part of an overall initiative to make Jenkins faster for everyone. Your plugin has been identified as adversely affecting queue performance, so I am opening a proposal pull request to fix the performance issue.

Please keep the `Ludicrous Mode:` prefix in the pull request title so that patches can be tracked across all plugins.

Patch details
-------------

`Queue.maintain()` is a critical path in Jenkins core. Any performance delay there can significantly slow how quickly Jenkins schedules work on agents. `Queue.maintain()` should execute in only a few milliseconds, even when agent counts reach into the thousands.

This patch provides three key features:

* Lightweight API calls invoked by `Queue.maintain()` return instantly.
* Heavyweight API calls (e.g. resource evaluation) run in the background, outside of the `Queue.maintain()` path.
* Because `Queue.maintain()` becomes cheap, `scheduleMaintenance()` can be called more aggressively as locked resources become available.

---

The lockable-resources-plugin hooks into the Queue via a `QueueTaskDispatcher` — its `canRun(Queue.Item)` method is called **under the Queue lock** for every item on every `Queue.maintain()` cycle. Unlike cloud plugins that use `RetentionStrategy` or `Cloud`, a `QueueTaskDispatcher` sits in the absolute innermost loop of Queue maintenance.

Three issues were found:

1. **No `scheduleMaintenance()` triggers anywhere.** When resources are freed (a build completes, a pipeline lock body finishes, or a user unreserves via the UI), items waiting in the Jenkins Queue for those resources wait up to 5 seconds for the next timer tick. The patch adds immediate `scheduleMaintenance()` calls at 6 resource-freeing events: `unlockResources()`, `unreserve()`, `reset()`, `LockRunListener.onCompleted()`, `LockRunListener.onDeleted()`, and `FreeDeadJobs.freePostMortemResources()`.
2. **`syncResources` lock contention under the Queue lock.** The plugin uses a `synchronized(syncResources)` monitor in `tryQueue()` that is also held by threads doing disk I/O (`save()`). When `onCompleted()` holds `syncResources` during a config write, `canRun()` under the Queue lock blocks waiting — transitively extending Queue lock hold time by disk write latency. The patch makes `save()` asynchronous with coalescing and narrows the `syncResources` scope in `tryQueue()` so that candidate resolution (label matching, Groovy script evaluation) runs outside the critical section.
3. **Groovy script and label expression evaluation under the Queue lock.** When jobs use `resourceMatchScript`, `canRun()` evaluates a Groovy script for every lockable resource on every cache miss. The existing `cachedCandidates` Guava cache mitigates repeated evaluations, but cache misses (the first call, or any call after a resource state change) are heavyweight. Similarly, `isValidLabel()` parses label expressions and allocates `LabelAtom` sets on every call. The patch adds per-resource TTL caches for both script evaluation results and label expression results.

Original Patch series
---------------------

Some background: this patch is part of a series (the ludicrous patch series) that enables my production instance to scale up to ludicrous numbers of agents.

- [ec2 plugin]
- [job-restrictions plugin]
- [leastload plugin]

AI Analysis
-----------

See the detailed [AI analysis for lockable-resources plugin][1]. You're welcome to copy any documentation from the AI analysis that you want included. I purposefully separated out the analysis in order to limit how much I am changing in your repository.

Testing done
------------

I rely on this plugin in production. I have not tested this patch yet. I plan to deploy it to my production instance after non-prod testing completes. I will follow up in this pull request when I'm successfully running this in production.

Advisory
--------

I plan to update Jenkins core with warnings when plugins adversely affect Jenkins queue performance. Please keep in mind that declining to merge changes along these lines will result in this plugin being called out for degrading Jenkins performance. This patch solves a critical performance issue in your plugin so that it does not drag down how fast Jenkins schedules work on agents.

Submitter checklist
-------------------

- [x] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [x] Link to relevant pull requests, esp. upstream and downstream changes
- [ ] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

[1]: https://github.com/samrocketman/jenkins-ai-analysis/tree/main/ludicrous-mode-analysis/lockable-resources-plugin
[ec2 plugin]: jenkinsci/ec2-plugin#2000
[job-restrictions plugin]: jenkinsci/job-restrictions-plugin#210
[leastload plugin]: jenkinsci/leastload-plugin#22

Co-authored-by: sgleske@integralads.com
Pull request overview
This PR proposes performance-focused changes to reduce Jenkins Queue maintenance contention caused by the lockable-resources plugin, primarily by triggering immediate queue maintenance on resource-freeing events, reducing synchronous disk I/O impact, and caching expensive evaluations.
Changes:
- Added `Queue.scheduleMaintenance()` triggers on multiple resource-freeing code paths.
- Introduced asynchronous/coalesced `save()` to avoid disk I/O extending `syncResources` lock hold time.
- Added per-resource TTL caches for Groovy script matching and label expression evaluation.
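The asynchronous/coalesced `save()` idea can be sketched with plain `java.util.concurrent` primitives. Everything below (`CoalescedSaver`, `requestSave`, `diskWrites`) is an illustrative stand-in, not the plugin's actual API:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: many rapid requestSave() calls within the coalescing
// window collapse into a single doSave() execution.
class CoalescedSaver {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "coalesced-saver");
                t.setDaemon(true);
                return t;
            });
    private final AtomicBoolean savePending = new AtomicBoolean(false);
    final AtomicInteger diskWrites = new AtomicInteger();
    private final long coalesceMs;

    CoalescedSaver(long coalesceMs) {
        this.coalesceMs = coalesceMs;
    }

    void requestSave() {
        // Only the first caller in a window schedules the write; subsequent
        // callers see savePending == true and return immediately.
        if (savePending.compareAndSet(false, true)) {
            executor.schedule(this::doSave, coalesceMs, TimeUnit.MILLISECONDS);
        }
    }

    private void doSave() {
        savePending.set(false);       // open the next coalescing window
        diskWrites.incrementAndGet(); // stand-in for the real config write
    }

    // Runs a burst of save requests and reports how many writes happened.
    static int demo() {
        try {
            CoalescedSaver saver = new CoalescedSaver(50);
            for (int i = 0; i < 1000; i++) {
                saver.requestSave();
            }
            Thread.sleep(300); // wait out the coalescing window
            saver.executor.shutdown();
            saver.executor.awaitTermination(5, TimeUnit.SECONDS);
            return saver.diskWrites.get();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that `save()` returning no longer means the state is on disk, which is exactly the durability concern raised in the review comments below.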
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/main/java/org/jenkins/plugins/lockableresources/util/Constants.java | Adds system properties controlling async save and cache TTL behavior. |
| src/main/java/org/jenkins/plugins/lockableresources/queue/LockRunListener.java | Schedules queue maintenance after build completion/deletion unlocks. |
| src/main/java/org/jenkins/plugins/lockableresources/LockableResourcesManager.java | Refactors tryQueue, adds queue-maintenance scheduling helper, and implements async/coalesced saving. |
| src/main/java/org/jenkins/plugins/lockableresources/LockableResource.java | Adds per-resource TTL caches for script and label evaluation. |
| src/main/java/org/jenkins/plugins/lockableresources/FreeDeadJobs.java | Schedules queue maintenance after freeing post-mortem resources. |
```java
public boolean scriptMatches(@NonNull SecureGroovyScript script, @CheckForNull Map<String, Object> params)
    throws ExecutionException {
  if (SCRIPT_CACHE_TTL_MS > 0) {
    String cacheKey = script.getScript();
    ConcurrentHashMap<String, CachedResult> cache = getScriptCache();
    CachedResult cached = cache.get(cacheKey);
    if (cached != null && !cached.isExpired(SCRIPT_CACHE_TTL_MS)) {
      return cached.value;
    }
    boolean result = evaluateScript(script, params);
    cache.put(cacheKey, new CachedResult(result));
    return result;
  }
  return evaluateScript(script, params);
}
```
There are no tests covering the new script/label TTL caching behavior. In particular, caching script results without considering params can change matching semantics across builds; adding unit/integration tests that queue multiple items with different parameters and a resourceMatchScript would help prevent regressions.
```java
public void unlockResources(List<LockableResource> resourcesToUnLock, Run<?, ?> build) {
  if (resourcesToUnLock == null || resourcesToUnLock.isEmpty()) {
    return;
  }
  synchronized (syncResources) {
    this.freeResources(resourcesToUnLock, build);

    while (proceedNextContext()) {
      // process as many contexts as possible
    }

    save();
  }
  scheduleQueueMaintenance();
}
```
unlockResources(..) now calls scheduleQueueMaintenance() unconditionally. However, this method is invoked from call sites that already hold synchronized(syncResources) (e.g., steal(..) and recycle(..)), so the queue maintenance trigger can still run while the plugin lock is held—contradicting scheduleQueueMaintenance()'s own Javadoc and risking lock-order inversion with the Jenkins Queue lock. Consider removing the scheduling side effect from unlockResources(..) (or splitting into an internal helper) and only calling scheduleQueueMaintenance() from outermost methods after all syncResources monitors have been released.
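The split the review suggests can be illustrated with a small, hypothetical sketch; the method names mirror the comment above, not the plugin's actual code, and `Thread.holdsLock` is used here only to make the lock-order property observable:

```java
// Hypothetical sketch of the suggested refactoring: the monitor-holding work
// lives in an internal helper, and the queue-maintenance trigger fires only
// after syncResources has been released.
class UnlockFlow {
    private final Object syncResources = new Object();
    boolean scheduledWhileLocked = false;
    int scheduleCalls = 0;

    void unlockResources() {
        synchronized (syncResources) {
            unlockResourcesInternal(); // free resources, proceed contexts, save
        }
        scheduleQueueMaintenance(); // side effect outside the monitor
    }

    private void unlockResourcesInternal() {
        // Callers that already hold syncResources (e.g. steal, recycle) would
        // invoke this helper directly and schedule maintenance themselves
        // once their own monitors are released.
    }

    private void scheduleQueueMaintenance() {
        // Record whether we would be touching the Jenkins Queue lock while
        // still holding syncResources (the inversion the review warns about).
        scheduledWhileLocked |= Thread.holdsLock(syncResources);
        scheduleCalls++;
    }
}
```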
```java
if (asyncSaveEnabled && saveCoalesceMs > 0) {
  if (getSavePending().compareAndSet(false, true)) {
    getSaveExecutor().schedule(this::doSave, saveCoalesceMs, TimeUnit.MILLISECONDS);
  }
} else {
  doSave();
}
```
With async-save enabled, save() may return before the configuration has been persisted. If Jenkins is shutting down or crashes within the coalescing window, the latest lock state changes may never be written. Consider flushing pending saves on shutdown (or when disabling async mode), or ensuring critical state changes force an immediate doSave() when the controller is terminating.
```diff
@@ -1401,9 +1468,20 @@
     if (enableSave == 0) return; // saving is disabled

     synchronized (syncResources) {
       if (BulkChange.contains(this)) return;

+      if (asyncSaveEnabled && saveCoalesceMs > 0) {
+        if (getSavePending().compareAndSet(false, true)) {
+          getSaveExecutor().schedule(this::doSave, saveCoalesceMs, TimeUnit.MILLISECONDS);
+        }
+      } else {
+        doSave();
+      }
     }
```
The async-save/coalescing behavior is new and not covered by tests (e.g., verifying that multiple rapid state changes result in exactly one persisted write after the coalesce delay, and that disabling async mode still saves immediately). Adding tests around save() would help ensure future changes don’t reintroduce Queue-lock contention or cause persistence regressions.
```java
public boolean scriptMatches(@NonNull SecureGroovyScript script, @CheckForNull Map<String, Object> params)
    throws ExecutionException {
  if (SCRIPT_CACHE_TTL_MS > 0) {
    String cacheKey = script.getScript();
    ConcurrentHashMap<String, CachedResult> cache = getScriptCache();
    CachedResult cached = cache.get(cacheKey);
    if (cached != null && !cached.isExpired(SCRIPT_CACHE_TTL_MS)) {
      return cached.value;
    }
    boolean result = evaluateScript(script, params);
    cache.put(cacheKey, new CachedResult(result));
    return result;
```
scriptMatches caches results by script.getScript() only, but the evaluation also depends on the provided params (build parameters/matrix combinations) and on bound resource fields (description/labels/note). This can return incorrect matches for different queue items within the TTL window. Include the relevant params (e.g., a stable hash of the map entries) and/or the resource fields in the cache key, or only cache when params is null/empty, and ensure caches are invalidated when bound fields change.
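One way to act on this suggestion is a composite cache key built from the script text plus an order-independent rendering of the parameter map. This is a sketch under that assumption, not the plugin's API:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of a params-aware cache key: combining the script text with a
// stable, sorted rendering of the parameter map prevents different parameter
// sets from colliding within the TTL window.
class ScriptCacheKey {
    static String of(String scriptText, Map<String, Object> params) {
        // TreeMap orders entries by key, so two equal maps always render
        // identically regardless of insertion order.
        String stableParams = (params == null || params.isEmpty())
                ? "" : new TreeMap<>(params).toString();
        // NUL separator keeps the script text and params from blending.
        return scriptText + '\u0000' + stableParams;
    }
}
```

Resource fields that the script reads (description, labels, note) would additionally need to either join the key or invalidate the cache on change, as the review notes.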
```java
/** Per-resource cache: Groovy script text -> (result, timestamp). */
private transient volatile ConcurrentHashMap<String, CachedResult> scriptCache;

/** Per-resource cache: label expression -> (result, timestamp). */
private transient volatile ConcurrentHashMap<String, CachedResult> labelCache;
```
scriptCache/labelCache are unbounded ConcurrentHashMaps and TTL is only checked on reads; expired entries are never evicted. In long-lived controllers with many distinct label expressions or scripts, these maps can grow without bound (per resource). Consider using a bounded cache with time-based eviction (e.g., Caffeine/Guava) or removing entries when they are detected as expired.
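A minimal fix within the current design is read-time eviction, sketched below with an injected clock so the behavior is deterministic; the class and method names are illustrative, and a bounded Caffeine/Guava cache with time-based eviction would be the fuller remedy:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of read-time eviction for a TTL cache: expired entries are removed
// when observed rather than lingering in the map forever.
class TtlBooleanCache {
    static final class Entry {
        final boolean value;
        final long timestampMs;

        Entry(boolean value, long timestampMs) {
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    private final ConcurrentHashMap<String, Entry> map = new ConcurrentHashMap<>();
    private final long ttlMs;

    TtlBooleanCache(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    /** Returns the cached value, or null on a miss; evicts expired entries. */
    Boolean get(String key, long nowMs) {
        Entry e = map.get(key);
        if (e == null) {
            return null;
        }
        if (nowMs - e.timestampMs > ttlMs) {
            map.remove(key, e); // evict instead of leaving the stale entry
            return null;
        }
        return e.value;
    }

    void put(String key, boolean value, long nowMs) {
        map.put(key, new Entry(value, nowMs));
    }

    int size() {
        return map.size();
    }
}
```

`remove(key, e)` only evicts the exact expired entry, so a concurrent `put` of a fresh value is never lost.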
```diff
@@ -47,7 +48,11 @@ public static void freePostMortemResources() {
             + " due post mortem job: "
             + resource.getBuildName());
         resource.recycle();
+        freedAny = true;
       }
     }
+    if (freedAny) {
+      LockableResourcesManager.scheduleQueueMaintenance();
+    }
```
freePostMortemResources() holds lrm.syncResources while calling resource.recycle(). Since recycle() calls back into LockableResourcesManager.recycle(..), which calls unlockResources(..) / unreserve(..) that now trigger scheduleQueueMaintenance(), this can end up attempting to acquire the Jenkins Queue lock while still holding syncResources (lock-order inversion / potential deadlock). Consider collecting the orphan resources under syncResources, releasing the lock, then recycling them and scheduling queue maintenance once afterwards.
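The collect-then-recycle shape the review suggests looks roughly like the following hypothetical sketch (all names are illustrative, and `Thread.holdsLock` is used only to make the property checkable):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the suggested shape: snapshot the orphaned resources under the
// monitor, release it, then recycle outside the lock and schedule queue
// maintenance once for the whole batch.
class PostMortemSweep {
    private final Object syncResources = new Object();
    private final List<String> resources = new ArrayList<>(List.of("r1", "r2", "r3"));
    final List<String> recycled = new ArrayList<>();
    boolean recycledWhileLocked = false;

    void sweep() {
        List<String> orphans;
        synchronized (syncResources) {
            orphans = new ArrayList<>(resources); // snapshot only, no callbacks
        }
        for (String r : orphans) {
            // recycle() can re-enter the manager and touch the Queue lock,
            // so it must not run while syncResources is held.
            recycledWhileLocked |= Thread.holdsLock(syncResources);
            recycled.add(r);
        }
        // one scheduleQueueMaintenance() here, after the whole batch
    }
}
```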
```java
private ScheduledExecutorService getSaveExecutor() {
  ScheduledExecutorService se = saveExecutor;
  if (se == null) {
    synchronized (this) {
      se = saveExecutor;
      if (se == null) {
        saveExecutor = se = Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "lockable-resources-async-save");
          t.setDaemon(true);
          return t;
        });
      }
    }
  }
  return se;
}
```
The async-save implementation creates a dedicated ScheduledExecutorService but there is no lifecycle management (shutdown) when Jenkins reloads configuration / the plugin is reloaded, which can leave daemon threads around. Prefer using Jenkins' shared timer (jenkins.util.Timer) or add explicit cleanup (e.g., via an @Initializer/@Terminator hook) so executors do not leak across reloads.
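Explicit lifecycle management could look like the sketch below: the same lazy double-checked creation, plus a `cleanup()` method that the real plugin would wire to a `@Terminator` (or reload) hook. Class and method names are assumptions for illustration:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Sketch of executor lifecycle management: lazy double-checked creation and
// an explicit cleanup() so the daemon thread does not leak across reloads.
class SaveExecutorHolder {
    private volatile ScheduledExecutorService saveExecutor;

    ScheduledExecutorService getSaveExecutor() {
        ScheduledExecutorService se = saveExecutor;
        if (se == null) {
            synchronized (this) {
                se = saveExecutor;
                if (se == null) {
                    saveExecutor = se = Executors.newSingleThreadScheduledExecutor(r -> {
                        Thread t = new Thread(r, "lockable-resources-async-save");
                        t.setDaemon(true);
                        return t;
                    });
                }
            }
        }
        return se;
    }

    // Would be invoked from a @Terminator hook in the real plugin.
    void cleanup() {
        synchronized (this) {
            if (saveExecutor != null) {
                // shutdown() still lets already-scheduled delayed tasks run
                // by default, so a pending coalesced save is not dropped.
                saveExecutor.shutdown();
                saveExecutor = null;
            }
        }
    }
}
```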
```java
// Resolve candidates outside syncResources when possible — label matching
// and Groovy script evaluation are heavyweight and should not extend the
// critical section.
List<LockableResource> candidates = null;
if (candidatesByScript || (requiredResources.label != null && !requiredResources.label.isEmpty())) {
  candidates = cachedCandidates.getIfPresent(queueItemId);
  if (candidates == null) {
    candidates = (systemGroovyScript == null)
        ? getResourcesWithLabel(requiredResources.label)
        : getResourcesMatchingScript(systemGroovyScript, params);
```
The comment says candidates are resolved “outside syncResources”, but both getResourcesWithLabel(..) and getResourcesMatchingScript(..) acquire synchronized(syncResources) internally, so label/script evaluation is still performed under the plugin lock. Either update the comment to reflect the actual behavior (lock is released and reacquired) or refactor to evaluate against a snapshot of resources so the heavyweight work can truly happen without holding syncResources.
I have deployed this to my staging environment and it seems performant.
Default `asyncSaveEnabled` to true and add a `@Terminator` shutdown hook that flushes any pending coalesced save synchronously before Jenkins exits. Also add a null guard in `uncacheIfFreeing` for defensive safety.
I verified that the added cache fields are marked `transient`, so the change is safe for XStream deserialization on Jenkins boot.
Finished staging testing, so we're rolling this out to our production. I consider the patches OK.
We're operating with this in production.
@samrocketman I am happy that someone else cares about this plugin. Please check the comments from Copilot (they seem to be correct) and add tests to improve code coverage. Thx
✅ Auto-approved: No review received within 3 days. Merging now.
@mPokornyETM I haven't forgotten, I just haven't had a chance to address it yet. A lot of my open source work is done in spare time, which sometimes has competing priorities.
@samrocketman it is approved and merged. I hope it was not a failure ;-) Please take some time and check the open items later to make this plugin awesome.