7 changes: 7 additions & 0 deletions config/302-pac-configmap.yaml
@@ -173,6 +173,13 @@ data:
# Default: true
skip-push-event-for-pr-commits: "true"

# Selects the concurrency queue backend used by the watcher.
# "memory" keeps the existing in-process queue state.
# "lease" uses Kubernetes Leases plus PipelineRun claims to recover more safely
# from watcher restarts and cluster/API timing issues.
# Restart the watcher after changing this setting.
concurrency-backend: "memory"

# Configure a custom console here; the driver supports custom parameters from
# the Repo CR along with a few other template variables, see the documentation
# for more details
153 changes: 125 additions & 28 deletions docs/content/docs/advanced/concurrency.md
@@ -4,36 +4,133 @@ weight: 2
---
This page illustrates how Pipelines-as-Code manages concurrent PipelineRun execution. When you set a concurrency limit on a Repository CR, Pipelines-as-Code queues incoming PipelineRuns and starts them only when capacity allows.

The watcher supports two queue backends controlled by the global `concurrency-backend` setting in the `pipelines-as-code` ConfigMap:

- `memory` keeps queue state in the watcher process. This is the historical behavior and remains the default.
- `lease` stores queue coordination in Kubernetes using `Lease` objects and short-lived PipelineRun claims. This mode is more resilient when the watcher restarts or the cluster is slow to reconcile updates.

{{< tech_preview "Lease-backed concurrency backend" >}}

## Flow diagram

```mermaid
flowchart TD
A[Webhook event] --> B[Controller resolves Repository CR]
B --> C{concurrency_limit set?}
C -->|No| D[Create PipelineRun with state=started]
C -->|Yes| E[Create PipelineRun with state=queued and spec.status=pending]

D --> F[Watcher reconciles started PipelineRun]
E --> G[Watcher reconciles queued PipelineRun]

G --> H{Queue backend}
H -->|memory| I[Use in-process semaphore]
H -->|lease| J[Acquire per-repository Lease and inspect live PipelineRuns]

I --> K{Capacity available?}
J --> K
K -->|No| L[Keep PipelineRun queued]
K -->|Yes| M[Claim candidate and patch state=started]

M --> F
F --> N{PipelineRun done?}
N -->|No| F
N -->|Yes| O[Report final status]
O --> P[Release slot and try next queued run]
P --> G
```

## Backend selection

To enable Kubernetes-backed queue coordination, set the option in the `pipelines-as-code` ConfigMap:

```yaml
data:
concurrency-backend: "lease"
```

Restart the watcher after changing `concurrency-backend`; the backend is selected at startup.

When `lease` mode is enabled, Pipelines-as-Code still uses the existing `queued`, `started`, and `completed` PipelineRun states. The difference is that promotion of the next queued PipelineRun is serialized with a per-repository `Lease`, which reduces queue drift during cluster/API instability.

## How lease promotion works

When the watcher reconciles a queued PipelineRun under the `lease` backend, it follows this sequence:

1. Acquire the per-repository Kubernetes Lease (retrying up to 20 times with a 100 ms delay).
2. List live PipelineRuns for that repository.
3. Separate them into running, claimed, and claimable queued runs.
4. Compute available capacity: `concurrency_limit - running - claimed`.
5. Patch one or more queued runs with short-lived claim annotations (`queue-claimed-by`, `queue-claimed-at`).
6. Release the repository Lease.
7. Re-fetch the claimed run and patch it to `started`.

If promotion fails at step 7, the watcher records the failure on the PipelineRun, clears the claim, and another reconcile retries later.
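The capacity arithmetic in step 4 is small enough to sketch directly. A minimal Go illustration — the function name and shape are illustrative, not the watcher's actual code:

```go
package main

import "fmt"

// availableCapacity returns how many queued PipelineRuns may be claimed for
// promotion, given the repository's concurrency_limit and the counts of
// currently running and already-claimed runs. It never returns a negative
// number, even if the limit was lowered while runs were in flight.
func availableCapacity(limit, running, claimed int) int {
	free := limit - running - claimed
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// concurrency_limit: 3, one run started, one queued run already claimed.
	fmt.Println(availableCapacity(3, 1, 1)) // prints 1
	// A repository at (or over) its limit has no capacity.
	fmt.Println(availableCapacity(1, 1, 0)) // prints 0
}
```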

Claims expire after **30 seconds**. If a watcher crashes or stalls before completing promotion, another instance can pick up the run once the claim expires.

## Recovery loop

When the `lease` backend is active, the watcher starts a background recovery loop that runs every **31 seconds** (claim TTL + 1 s buffer). It looks for repositories where:

- there is no started PipelineRun
- there is no queued PipelineRun with an active (unexpired) claim
- there is still at least one recoverable queued PipelineRun

A queued PipelineRun is recoverable when it has `state=queued`, `spec.status=Pending`, is not done or cancelled, and has a valid `execution-order` annotation.

When a candidate is found, the recovery loop clears stale debug annotations and re-enqueues the oldest recoverable run so normal promotion logic runs again.
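The recovery loop's choice of candidate can be sketched as follows; the types and names are hypothetical, only the ordering rule comes from the text above:

```go
package main

import (
	"fmt"
	"sort"
)

// queuedRun is a reduced view of a recoverable queued PipelineRun.
type queuedRun struct {
	name  string
	order int // position parsed from the execution-order annotation
}

// oldestRecoverable picks the run the recovery loop would re-enqueue:
// the recoverable queued run with the lowest execution order.
func oldestRecoverable(runs []queuedRun) (string, bool) {
	if len(runs) == 0 {
		return "", false
	}
	sort.Slice(runs, func(i, j int) bool { return runs[i].order < runs[j].order })
	return runs[0].name, true
}

func main() {
	runs := []queuedRun{{"lint", 2}, {"build", 0}, {"test", 1}}
	name, ok := oldestRecoverable(runs)
	fmt.Println(name, ok) // build true
}
```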

## Debugging the Lease Backend

When `concurrency-backend: "lease"` is enabled, queued `PipelineRun`s expose queue debugging state directly in annotations:

- `pipelinesascode.tekton.dev/queue-decision`
- `pipelinesascode.tekton.dev/queue-debug-summary`
- `pipelinesascode.tekton.dev/queue-claimed-by`
- `pipelinesascode.tekton.dev/queue-claimed-at`
- `pipelinesascode.tekton.dev/queue-promotion-retries`
- `pipelinesascode.tekton.dev/queue-promotion-last-error`

This makes it possible to diagnose most queue issues with `kubectl` before looking at watcher logs.

### Useful commands

```bash
kubectl get pipelinerun -n <namespace> <name> -o jsonpath='{.metadata.annotations.pipelinesascode\.tekton\.dev/queue-decision}{"\n"}'
kubectl get pipelinerun -n <namespace> <name> -o jsonpath='{.metadata.annotations.pipelinesascode\.tekton\.dev/queue-debug-summary}{"\n"}'
kubectl describe pipelinerun -n <namespace> <name>
kubectl get events -n <namespace> --field-selector involvedObject.kind=Repository
```

### Queue decisions

- `waiting_for_slot`: the run is queued and waiting for repository capacity.
- `claim_active`: another watcher already holds an active short-lived claim on this run.
- `claimed_for_promotion`: this run has been claimed and is being promoted to `started`.
- `promotion_failed`: the watcher failed while promoting the run to `started`.
- `recovery_requeued`: the lease recovery loop noticed this run and enqueued it again.
- `missing_execution_order`: the run is queued but its `execution-order` annotation does not list the run itself.
- `not_recoverable`: the run is still `queued` but is no longer eligible for lease recovery.

### Events

The watcher also emits repository-scoped Kubernetes events for the most important transitions:

- `QueueClaimedForPromotion`
- `QueuePromotionFailed`
- `QueueRecoveryRequeued`
- `QueueLeaseAcquireTimeout`

### Troubleshooting

| Symptom | Queue decision | Likely cause | Action |
| --- | --- | --- | --- |
| Run stuck queued, nothing running | `waiting_for_slot` | Completed run was not cleaned up or finalizer is stuck | Check if a `started` PipelineRun still exists for the repo. If it is done but state was not updated, delete it or patch its state to `completed`. |
| Run stuck queued, another run is running | `waiting_for_slot` | Normal — the run is waiting for the active run to finish. | No action needed unless the running PipelineRun is itself stuck. |
| Run keeps cycling between queued and claimed | `claim_active` | Two watcher replicas are contending for the same run. | Wait for the claim to expire (30 s). If it persists, check watcher logs for lease acquisition errors. |
| Run shows promotion failures | `promotion_failed` | The watcher failed to patch the run to `started` (API error, webhook, or admission rejection). | Check `queue-promotion-last-error` and `queue-promotion-retries` annotations. Resolve the underlying API or admission error. |
| Run was recovered but is stuck again | `recovery_requeued` | The recovery loop re-enqueued the run but promotion failed again on the next attempt. | Check for repeated `QueuePromotionFailed` events on the repository. The underlying issue (permissions, resource quota, webhook) must be fixed. |
| Run is queued but marked not recoverable | `not_recoverable` | The run was cancelled, completed, or lost its `execution-order` annotation. | Inspect the PipelineRun — if it should still run, re-apply the `execution-order` annotation manually. |

If the queue decision and events do not explain the behavior, switch the watcher to debug logging and grep for the repository key and PipelineRun key. The lease backend logs include lease acquisition attempts, active claim evaluation, queue-state snapshots, and recovery loop selections.
15 changes: 15 additions & 0 deletions docs/content/docs/api/configmap.md
@@ -345,6 +345,20 @@ skip-push-event-for-pr-commits: "true"

{{< /param >}}

{{< param name="concurrency-backend" type="string" default="memory" id="param-concurrency-backend" >}}
Selects the queue coordination backend used by the watcher. Supported values:

- `memory`: in-process queue tracking. This is the default and matches the historical behavior.
- `lease`: Kubernetes-backed coordination using `Lease` objects and short-lived PipelineRun claims for improved recovery during watcher restarts or API instability. This backend is Technology Preview.

Restart the watcher after changing this setting.

```yaml
concurrency-backend: "memory"
```

{{< /param >}}

## Complete Example

```yaml
@@ -381,6 +395,7 @@ data:
remember-ok-to-test: "true"
require-ok-to-test-sha: "false"
skip-push-event-for-pr-commits: "true"
concurrency-backend: "memory"
```

## Updating configuration
107 changes: 97 additions & 10 deletions docs/content/docs/guides/repository-crd/concurrency.md
@@ -3,26 +3,113 @@ title: Concurrency
weight: 2
---

This page explains how to limit the number of concurrent PipelineRuns for a Repository CR and how to integrate with Kueue for Kubernetes-native job queueing.

Use `spec.concurrency_limit` on a Repository CR to cap how many `PipelineRun`s may run at once for that repository. This is useful when you need to control cluster usage, preserve ordering for related runs, or avoid a burst of webhook events starting too many `PipelineRun`s at once.
## Repository setting

Set the `concurrency_limit` field on the Repository CR:

```yaml
spec:
concurrency_limit: <number>
```

When a webhook event produces multiple `PipelineRun`s for the same repository:

- the controller creates them with an `execution-order` annotation
- runs that cannot start immediately are created as `state=queued` with Tekton `spec.status=pending`
- the watcher promotes queued runs to `state=started` only when repository capacity is available

If `concurrency_limit: 1`, only one run for that repository is active at a time and the rest stay queued until the watcher promotes them.
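For example, a Repository CR limited to one concurrent run might look like this (the name, namespace, and URL are placeholders):

```yaml
apiVersion: "pipelinesascode.tekton.dev/v1alpha1"
kind: Repository
metadata:
  name: my-repo
  namespace: my-namespace
spec:
  url: "https://github.com/org/repo"
  concurrency_limit: 1
```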

## End-to-end flow

1. The controller decides whether the repository is concurrency-limited.
2. If there is no limit, it creates `PipelineRun`s directly in `started`.
3. If there is a limit, it creates `PipelineRun`s in `queued` and records `execution-order`.
4. The watcher reconciles every `PipelineRun` that has a Pipelines-as-Code state annotation.
5. For queued runs, the watcher asks the selected queue backend whether a slot is available.
6. If a run is selected, the watcher patches it to `started`.
7. When a started run finishes, the watcher reports status and asks the backend for the next queued candidate.

## Queue flow diagram

```mermaid
flowchart TD
A[Webhook event] --> B[Controller resolves Repository CR]
B --> C{concurrency_limit set?}
C -->|No| D[Create PipelineRun with state=started]
C -->|Yes| E[Create PipelineRun with state=queued and spec.status=pending]

D --> F[Watcher reconciles started PipelineRun]
E --> G[Watcher reconciles queued PipelineRun]

G --> H{Queue backend}
H -->|memory| I[Use in-process semaphore]
H -->|lease| J[Acquire per-repository Lease and inspect live PipelineRuns]

I --> K{Capacity available?}
J --> K
K -->|No| L[Keep PipelineRun queued]
K -->|Yes| M[Claim candidate and patch state=started]

M --> F
F --> N{PipelineRun done?}
N -->|No| F
N -->|Yes| O[Report final status]
O --> P[Release slot and try next queued run]
P --> G
```

## Backend behavior

The watcher supports two queue backends controlled by the global `concurrency-backend` setting in the `pipelines-as-code` ConfigMap.

### `memory` backend

This is the default. With the `memory` backend:

- Each repository gets an in-memory semaphore in the watcher process.
- The watcher keeps separate running and pending queues.
- Startup rebuilds queue state from existing `started` and `queued` `PipelineRun`s.
- Coordination is local to that watcher process.

This backend is simple and fast, but it depends on watcher-local state remaining in sync with the cluster view.
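The in-process semaphore idea can be sketched with one buffered channel per repository; this illustrates the concept, not the watcher's actual data structure:

```go
package main

import "fmt"

// repoSemaphores hands out one buffered channel per repository; the channel
// capacity plays the role of the repository's concurrency_limit.
type repoSemaphores struct {
	limit int
	sems  map[string]chan struct{}
}

func newRepoSemaphores(limit int) *repoSemaphores {
	return &repoSemaphores{limit: limit, sems: map[string]chan struct{}{}}
}

// tryAcquire reports whether repo still has a free slot, taking one if so.
func (r *repoSemaphores) tryAcquire(repo string) bool {
	sem, ok := r.sems[repo]
	if !ok {
		sem = make(chan struct{}, r.limit)
		r.sems[repo] = sem
	}
	select {
	case sem <- struct{}{}:
		return true
	default:
		return false // at capacity: the PipelineRun stays queued
	}
}

// release frees a slot after a PipelineRun completes.
func (r *repoSemaphores) release(repo string) {
	if sem, ok := r.sems[repo]; ok {
		select {
		case <-sem:
		default: // nothing held; ignore spurious releases
		}
	}
}

func main() {
	s := newRepoSemaphores(1)
	fmt.Println(s.tryAcquire("org/repo")) // true: first run starts
	fmt.Println(s.tryAcquire("org/repo")) // false: second run stays queued
	s.release("org/repo")
	fmt.Println(s.tryAcquire("org/repo")) // true: slot freed
}
```

Because the state lives only in this process, a watcher restart loses it, which is exactly why startup has to rebuild the queues from existing `PipelineRun`s.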

### `lease` backend

{{< tech_preview "Lease-backed concurrency backend" >}}

- Each repository uses a Kubernetes `Lease` as a short critical section.
- The watcher recomputes queue state from live `PipelineRun`s rather than trusting only process memory.
- A queued run is considered temporarily reserved when it carries short-lived claim annotations (`queue-claimed-by` and `queue-claimed-at`). If the watcher crashes or stalls, another instance can recover after the claim expires.
- The watcher sorts candidates using the recorded `execution-order`, then falls back to creation time.
- A background recovery loop re-enqueues the oldest recoverable queued run when a repository has no active started run and no active claim.

This backend is designed for environments where the watcher may restart, the API server is slow, or promotion to `started` can fail transiently.

For debugging annotations, queue decisions, events, and the full promotion flow see [Advanced Concurrency]({{< relref "/docs/advanced/concurrency" >}}).

## Choosing the backend

Select the global backend in the Pipelines-as-Code ConfigMap:

```yaml
data:
concurrency-backend: "memory"
```

or:

```yaml
data:
concurrency-backend: "lease"
```

Changing this setting requires restarting the watcher so it can recreate the queue manager with the new backend.

For the global `concurrency-backend` setting itself, see [ConfigMap Reference]({{< relref "/docs/api/configmap" >}}).

## Kueue - Kubernetes-native Job Queueing

7 changes: 7 additions & 0 deletions hack/gh-workflow-ci.sh
@@ -200,6 +200,13 @@ get_tests() {

run_e2e_tests() {
set +x

# Enable lease-based concurrency backend for all E2E providers
kubectl -n pipelines-as-code patch configmap pipelines-as-code --type merge \
-p '{"data":{"concurrency-backend":"lease"}}'
kubectl -n pipelines-as-code rollout restart deployment/pipelines-as-code-watcher
kubectl -n pipelines-as-code rollout status deployment/pipelines-as-code-watcher --timeout=120s

target="${TEST_PROVIDER}"
export PAC_E2E_KEEP_NS=true

7 changes: 7 additions & 0 deletions pkg/apis/pipelinesascode/keys/keys.go
@@ -61,6 +61,13 @@ const (
LogURL = pipelinesascode.GroupName + "/log-url"
ExecutionOrder = pipelinesascode.GroupName + "/execution-order"
SCMReportingPLRStarted = pipelinesascode.GroupName + "/scm-reporting-plr-started"
QueueClaimedBy = pipelinesascode.GroupName + "/queue-claimed-by"
QueueClaimedAt = pipelinesascode.GroupName + "/queue-claimed-at"
QueueDecision = pipelinesascode.GroupName + "/queue-decision"
QueueDebugSummary = pipelinesascode.GroupName + "/queue-debug-summary"
QueuePromotionRetries = pipelinesascode.GroupName + "/queue-promotion-retries"
QueuePromotionBlocked = pipelinesascode.GroupName + "/queue-promotion-blocked"
QueuePromotionLastErr = pipelinesascode.GroupName + "/queue-promotion-last-error"
SecretCreated = pipelinesascode.GroupName + "/secret-created"
CloneURL = pipelinesascode.GroupName + "/clone-url"
// PublicGithubAPIURL default is "https://api.github.com" but it can be overridden by X-GitHub-Enterprise-Host header.