Skip to content

Ludicrous Mode: Lockable resources queue contention reduction and script caching#966

Merged
github-actions[bot] merged 5 commits intojenkinsci:masterfrom
samrocketman:ludicrous-mode
Mar 31, 2026
Merged

Ludicrous Mode: Lockable resources queue contention reduction and script caching#966
github-actions[bot] merged 5 commits intojenkinsci:masterfrom
samrocketman:ludicrous-mode

Conversation

@samrocketman
Copy link
Copy Markdown
Member

@samrocketman samrocketman commented Mar 25, 2026

What is ludicrous speed?

Obligatory video of ludicrous speed

Ludicrous speed is faster than light speed

spaceballs ludicrous speed


Background

I am patching all plugins which affect Jenkins queue maintenance performance as part of an overall initiative to make Jenkins faster for everyone.

Your plugin has been identified for adversely affecting queue performance and so I am opening a proposal pull request to fix the performance issue.

Please keep the Ludocrous Mode: prefix in the pull request title so that patches can be tracked across all plugins.

Patch details

Queue.maintian() is a critical path in Jenkins core. Any performance delays can significantly impact Jenkins scheduling work on agents. Benchmarking for Queue.maintain() should execute in only a few milliseconds even when Jenkins agents reach in the thousands.

This patch provides three key features:

  • Lightweight API calls called by Queue.maintain() should instantly return.
  • Heavyweight API calls are made in the background (e.g. resource evaluation) outside of the Queue.maintain() path.
  • Since Queue.maintain() has extreme performance, scheduleMaintenance() can be more aggressively called as locked resources become available.

The lockable-resources-plugin hooks into the Queue via a QueueTaskDispatcher — its canRun(Queue.Item) method is called under the Queue lock for every item on every Queue.maintain() cycle. Unlike cloud plugins that use RetentionStrategy or Cloud, a QueueTaskDispatcher sits in the absolute innermost loop of Queue maintenance.

Three issues were found:

  1. No scheduleMaintenance() triggers anywhere. When resources are freed (build completes, pipeline lock body finishes, user unreserves via UI), items waiting in the Jenkins Queue for those resources wait up to 5 seconds for the next timer tick. The patch adds immediate scheduleMaintenance() calls at 6 resource-freeing events: unlockResources(), unreserve(), reset(), LockRunListener.onCompleted(), LockRunListener.onDeleted(), and FreeDeadJobs.freePostMortemResources().

  2. syncResources lock contention under Queue lock. The plugin uses a synchronized(syncResources) monitor in tryQueue() that is also held by threads doing disk I/O (save()). When onCompleted() holds syncResources during a config write, canRun() under the Queue lock blocks waiting — transitively extending Queue lock hold time by disk write latency. The patch makes save() asynchronous with coalescing and narrows the syncResources scope in tryQueue() so that candidate resolution (label matching, Groovy script evaluation) runs outside the critical section.

  3. Groovy script and label expression evaluation under Queue lock. When jobs use resourceMatchScript, canRun() evaluates a Groovy script for every lockable resource on every cache miss. The existing cachedCandidates Guava cache mitigates repeated evaluations, but cache misses (first call, or after resource state change) are heavyweight. Similarly, isValidLabel() parses label expressions and allocates LabelAtom sets on every call. The patch adds per-resource TTL caches for both script evaluation results and label expression results.

Original Patch series

Some background. This patch is part of a series (ludicrous patch series) which enables my production instance to scale work up to ludicrous amounts of agents.

AI Analysis

See detailed AI analysis for lockable-resources plugin. You're welcome to copy any documentation from the AI analysis that you find you want included. I purposefully separated the analysis in order to limit how much I am changing in your repository.

Testing done

I do rely on this plugin in production. I have not tested this patch, yet. I plan to deploy it to my production instance after non-prod testing completes. I will follow up in this pull request when I'm successfully running this in production.

Advisory

I plan to update Jenkins core with warnings when there's plugins which adversely affect Jenkins queue performance. Please keep in mind not merging changes with this strategy will involve this plugin being called out for tanking Jenkins performance. This patch solves a critical performance issue for your plugin so that it does not tank Jenkins performance (in how fast Jenkins schedules work on agents).

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

…ipt caching

What is ludicrous speed?
------------------------

[Obligatory video of ludicrous speed](https://youtu.be/oApAdwuqtn8?si=uJHPP5OXz9GNllY0)

<details><summary>Ludicrous speed is faster than light speed</summary>

---

![spaceballs ludicrous speed](https://github.com/user-attachments/assets/803d4aa5-4677-40b9-b04b-d4f22ea2cbf3)

</details>

---

Background
----------

I am patching all plugins which affect Jenkins queue maintenance performance as part of an overall initiative to make Jenkins faster for everyone.

Your plugin has been identified for adversely affecting queue performance and so I am opening a proposal pull request to fix the performance issue.

Please keep the `Ludocrous Mode:` prefix in the pull request title so that patches can be tracked across all plugins.

Patch details
-------------

`Queue.maintian()` is a critical path in Jenkins core.  Any performance delays can significantly impact Jenkins scheduling work on agents.  Benchmarking for `Queue.maintain()` should execute in only a few milliseconds even when Jenkins agents reach in the thousands.

This patch provides three key features:

* Lightweight API calls called by `Queue.maintain()` should instantly return.
* Heavyweight API calls are made in the background (e.g. resource evaluation) outside of the `Queue.maintain()` path.
* Since `Queue.maintain()` has extreme performance, `scheduleMaintenance()` can be more aggressively called as locked resources become available.

---

The lockable-resources-plugin hooks into the Queue via a `QueueTaskDispatcher` — its `canRun(Queue.Item)` method is called **under the Queue lock** for every item on every `Queue.maintain()` cycle. Unlike cloud plugins that use `RetentionStrategy` or `Cloud`, a `QueueTaskDispatcher` sits in the absolute innermost loop of Queue maintenance.

Three issues were found:

1. **No `scheduleMaintenance()` triggers anywhere.** When resources are freed (build completes, pipeline lock body finishes, user unreserves via UI), items waiting in the Jenkins Queue for those resources wait up to 5 seconds for the next timer tick. The patch adds immediate `scheduleMaintenance()` calls at 6 resource-freeing events: `unlockResources()`, `unreserve()`, `reset()`, `LockRunListener.onCompleted()`, `LockRunListener.onDeleted()`, and `FreeDeadJobs.freePostMortemResources()`.

2. **`syncResources` lock contention under Queue lock.** The plugin uses a `synchronized(syncResources)` monitor in `tryQueue()` that is also held by threads doing disk I/O (`save()`). When `onCompleted()` holds `syncResources` during a config write, `canRun()` under the Queue lock blocks waiting — transitively extending Queue lock hold time by disk write latency. The patch makes `save()` asynchronous with coalescing and narrows the `syncResources` scope in `tryQueue()` so that candidate resolution (label matching, Groovy script evaluation) runs outside the critical section.

3. **Groovy script and label expression evaluation under Queue lock.** When jobs use `resourceMatchScript`, `canRun()` evaluates a Groovy script for every lockable resource on every cache miss. The existing `cachedCandidates` Guava cache mitigates repeated evaluations, but cache misses (first call, or after resource state change) are heavyweight. Similarly, `isValidLabel()` parses label expressions and allocates `LabelAtom` sets on every call. The patch adds per-resource TTL caches for both script evaluation results and label expression results.

Original Patch series
---------------------

Some background.  This patch is part of a series (ludicrous patch series) which enables my production instance to scale work up to ludicrous amounts of agents.

- [ec2 plugin]
- [job-restrictions plugin]
- [leastload plugin]

AI Analysis
-----------

See detailed [AI analysis for lockable-resources plugin][1].  You're welcome to copy any documentation from the AI analysis that you find you want included.  I purposefully separated the analysis in order to limit how much I am changing in your repository.

Testing done
------------

I do rely on this plugin in production.  I have not tested this patch, yet.  I plan to deploy it to my production instance after non-prod testing completes.  I will follow up in this pull request when I'm successfully running this in production.

Advisory
--------

I plan to update Jenkins core with warnings when there's plugins which adversely affect Jenkins queue performance.  Please keep in mind not merging changes with this strategy will involve this plugin being called out for tanking Jenkins performance.  This patch solves a critical performance issue for your plugin so that it does not tank Jenkins performance (in how fast Jenkins schedules work on agents).

Submitter checklist
-------------------

- [x] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [x] Link to relevant pull requests, esp. upstream and downstream changes
- [ ] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

[1]: https://github.com/samrocketman/jenkins-ai-analysis/tree/main/ludicrous-mode-analysis/lockable-resources-plugin
[ec2 plugin]: jenkinsci/ec2-plugin#2000
[job-restrictions plugin]: jenkinsci/job-restrictions-plugin#210
[leastload plugin]: jenkinsci/leastload-plugin#22

Co-authored-by: sgleske@integralads.com
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR proposes performance-focused changes to reduce Jenkins Queue maintenance contention caused by the lockable-resources plugin, primarily by triggering immediate queue maintenance on resource-freeing events, reducing synchronous disk I/O impact, and caching expensive evaluations.

Changes:

  • Added Queue.scheduleMaintenance() triggers on multiple resource-freeing code paths.
  • Introduced asynchronous/coalesced save() to avoid disk I/O extending syncResources lock hold time.
  • Added per-resource TTL caches for Groovy script matching and label expression evaluation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
src/main/java/org/jenkins/plugins/lockableresources/util/Constants.java Adds system properties controlling async save and cache TTL behavior.
src/main/java/org/jenkins/plugins/lockableresources/queue/LockRunListener.java Schedules queue maintenance after build completion/deletion unlocks.
src/main/java/org/jenkins/plugins/lockableresources/LockableResourcesManager.java Refactors tryQueue, adds queue-maintenance scheduling helper, and implements async/coalesced saving.
src/main/java/org/jenkins/plugins/lockableresources/LockableResource.java Adds per-resource TTL caches for script and label evaluation.
src/main/java/org/jenkins/plugins/lockableresources/FreeDeadJobs.java Schedules queue maintenance after freeing post-mortem resources.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 405 to +418
public boolean scriptMatches(@NonNull SecureGroovyScript script, @CheckForNull Map<String, Object> params)
throws ExecutionException {
if (SCRIPT_CACHE_TTL_MS > 0) {
String cacheKey = script.getScript();
ConcurrentHashMap<String, CachedResult> cache = getScriptCache();
CachedResult cached = cache.get(cacheKey);
if (cached != null && !cached.isExpired(SCRIPT_CACHE_TTL_MS)) {
return cached.value;
}
boolean result = evaluateScript(script, params);
cache.put(cacheKey, new CachedResult(result));
return result;
}
return evaluateScript(script, params);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no tests covering the new script/label TTL caching behavior. In particular, caching script results without considering params can change matching semantics across builds; adding unit/integration tests that queue multiple items with different parameters and a resourceMatchScript would help prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines 731 to 745
public void unlockResources(List<LockableResource> resourcesToUnLock, Run<?, ?> build) {
if (resourcesToUnLock == null || resourcesToUnLock.isEmpty()) {
return;
}
synchronized (syncResources) {
this.freeResources(resourcesToUnLock, build);

while (proceedNextContext()) {
// process as many contexts as possible
}

save();
}
scheduleQueueMaintenance();
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unlockResources(..) now calls scheduleQueueMaintenance() unconditionally. However, this method is invoked from call sites that already hold synchronized(syncResources) (e.g., steal(..) and recycle(..)), so the queue maintenance trigger can still run while the plugin lock is held—contradicting scheduleQueueMaintenance()'s own Javadoc and risking lock-order inversion with the Jenkins Queue lock. Consider removing the scheduling side effect from unlockResources(..) (or splitting into an internal helper) and only calling scheduleQueueMaintenance() from outermost methods after all syncResources monitors have been released.

Copilot uses AI. Check for mistakes.
Comment on lines +1473 to +1479
if (asyncSaveEnabled && saveCoalesceMs > 0) {
if (getSavePending().compareAndSet(false, true)) {
getSaveExecutor().schedule(this::doSave, saveCoalesceMs, TimeUnit.MILLISECONDS);
}
} else {
doSave();
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With async-save enabled, save() may return before the configuration has been persisted. If Jenkins is shutting down or crashes within the coalescing window, the latest lock state changes may never be written. Consider flushing pending saves on shutdown (or when disabling async mode), or ensuring critical state changes force an immediate doSave() when the controller is terminating.

Copilot uses AI. Check for mistakes.
Comment on lines 1462 to +1480
@@ -1401,9 +1468,20 @@

if (enableSave == 0) return; // saving is disabled

synchronized (syncResources) {
if (BulkChange.contains(this)) return;
if (BulkChange.contains(this)) return;

if (asyncSaveEnabled && saveCoalesceMs > 0) {
if (getSavePending().compareAndSet(false, true)) {
getSaveExecutor().schedule(this::doSave, saveCoalesceMs, TimeUnit.MILLISECONDS);
}
} else {
doSave();
}
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The async-save/coalescing behavior is new and not covered by tests (e.g., verifying that multiple rapid state changes result in exactly one persisted write after the coalesce delay, and that disabling async mode still saves immediately). Adding tests around save() would help ensure future changes don’t reintroduce Queue-lock contention or cause persistence regressions.

Copilot uses AI. Check for mistakes.
Comment on lines 405 to +416
public boolean scriptMatches(@NonNull SecureGroovyScript script, @CheckForNull Map<String, Object> params)
throws ExecutionException {
if (SCRIPT_CACHE_TTL_MS > 0) {
String cacheKey = script.getScript();
ConcurrentHashMap<String, CachedResult> cache = getScriptCache();
CachedResult cached = cache.get(cacheKey);
if (cached != null && !cached.isExpired(SCRIPT_CACHE_TTL_MS)) {
return cached.value;
}
boolean result = evaluateScript(script, params);
cache.put(cacheKey, new CachedResult(result));
return result;
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scriptMatches caches results by script.getScript() only, but the evaluation also depends on the provided params (build parameters/matrix combinations) and on bound resource fields (description/labels/note). This can return incorrect matches for different queue items within the TTL window. Include the relevant params (e.g., a stable hash of the map entries) and/or the resource fields in the cache key, or only cache when params is null/empty, and ensure caches are invalidated when bound fields change.

Copilot uses AI. Check for mistakes.
Comment on lines +112 to +116
/** Per-resource cache: Groovy script text -> (result, timestamp). */
private transient volatile ConcurrentHashMap<String, CachedResult> scriptCache;
/** Per-resource cache: label expression -> (result, timestamp). */
private transient volatile ConcurrentHashMap<String, CachedResult> labelCache;

Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scriptCache/labelCache are unbounded ConcurrentHashMaps and TTL is only checked on reads; expired entries are never evicted. In long-lived controllers with many distinct label expressions or scripts, these maps can grow without bound (per resource). Consider using a bounded cache with time-based eviction (e.g., Caffeine/Guava) or removing entries when they are detected as expired.

Copilot uses AI. Check for mistakes.
Comment on lines 34 to +56
@@ -47,7 +48,11 @@ public static void freePostMortemResources() {
+ " due post mortem job: "
+ resource.getBuildName());
resource.recycle();
freedAny = true;
}
}
if (freedAny) {
LockableResourcesManager.scheduleQueueMaintenance();
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

freePostMortemResources() holds lrm.syncResources while calling resource.recycle(). Since recycle() calls back into LockableResourcesManager.recycle(..), which calls unlockResources(..) / unreserve(..) that now trigger scheduleQueueMaintenance(), this can end up attempting to acquire the Jenkins Queue lock while still holding syncResources (lock-order inversion / potential deadlock). Consider collecting the orphan resources under syncResources, releasing the lock, then recycling them and scheduling queue maintenance once afterwards.

Copilot uses AI. Check for mistakes.
Comment on lines +1445 to +1456
private ScheduledExecutorService getSaveExecutor() {
ScheduledExecutorService se = saveExecutor;
if (se == null) {
synchronized (this) {
se = saveExecutor;
if (se == null) {
saveExecutor = se = Executors.newSingleThreadScheduledExecutor(r -> {
Thread t = new Thread(r, "lockable-resources-async-save");
t.setDaemon(true);
return t;
});
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The async-save implementation creates a dedicated ScheduledExecutorService but there is no lifecycle management (shutdown) when Jenkins reloads configuration / the plugin is reloaded, which can leave daemon threads around. Prefer using Jenkins' shared timer (jenkins.util.Timer) or add explicit cleanup (e.g., via an @Initializer/@Terminator hook) so executors do not leak across reloads.

Copilot uses AI. Check for mistakes.
Comment on lines +517 to +526
// Resolve candidates outside syncResources when possible — label matching
// and Groovy script evaluation are heavyweight and should not extend the
// critical section.
List<LockableResource> candidates = null;
if (candidatesByScript || (requiredResources.label != null && !requiredResources.label.isEmpty())) {
candidates = cachedCandidates.getIfPresent(queueItemId);
if (candidates == null) {
candidates = (systemGroovyScript == null)
? getResourcesWithLabel(requiredResources.label)
: getResourcesMatchingScript(systemGroovyScript, params);
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says candidates are resolved “outside syncResources”, but both getResourcesWithLabel(..) and getResourcesMatchingScript(..) acquire synchronized(syncResources) internally, so label/script evaluation is still performed under the plugin lock. Either update the comment to reflect the actual behavior (lock is released and reacquired) or refactor to evaluate against a snapshot of resources so the heavyweight work can truly happen without holding syncResources.

Copilot uses AI. Check for mistakes.
@samrocketman
Copy link
Copy Markdown
Member Author

I have deployed this to my staging environment and it seems performant.

with a @Terminator flush hook

Default asyncSaveEnabled to true and add a @Terminator shutdown hook
that flushes any pending coalesced save synchronously before Jenkins
exits.  Also add a null guard in uncacheIfFreeing for defensive safety.
@sgleske-ias
Copy link
Copy Markdown

I verified cache fields added are marked transient the change is safe for XStream deserialization on Jenkins boot.

@sgleske-ias
Copy link
Copy Markdown

Finished staging testing so we're rolling this out to our production. I consider the patches OK

@samrocketman
Copy link
Copy Markdown
Member Author

We're operating with this in production.

@mPokornyETM
Copy link
Copy Markdown
Contributor

@samrocketman I happy, that someone else care about this plugin. Please check the comments from copilot (seems to be correct) and add tests to improve code coverage. Thx

@github-actions
Copy link
Copy Markdown
Contributor

Auto-approved: No review received within 3 days. Merging now.

@github-actions github-actions Bot added the merged-without-review Auto-approve countdown label Mar 31, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-approved: no review received within 3 days of opening.

@github-actions github-actions Bot enabled auto-merge (squash) March 31, 2026 10:04
@mPokornyETM mPokornyETM removed the merged-without-review Auto-approve countdown label Mar 31, 2026
@samrocketman
Copy link
Copy Markdown
Member Author

samrocketman commented Mar 31, 2026

@mPokornyETM I haven't forgot just haven't had a chance to address, yet. A lot of my open source work is in spare time which sometimes has competing priorities.

@github-actions github-actions Bot merged commit cfceea4 into jenkinsci:master Mar 31, 2026
20 checks passed
@mPokornyETM
Copy link
Copy Markdown
Contributor

@samrocketman it is approved and merged. I hope it was not failure ;-) Please take a time and check the open things later to make this plugin awesome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants