Fix K8s token refresh by caching K8sClient at executor level #6925

Merged
bentsherman merged 1 commit into master from fix/k8s-client-cache-at-executor
Mar 16, 2026

@pditommaso
Member

Summary

Fixes #6918 — supersedes #6920

Problem

PR #6742 added a Guava cache to K8sConfig.getClient() with a 50-minute expiry to refresh the service account token. However, the cache was never consulted after startup because:

  • K8sExecutor.register() called k8sConfig.getClient() once and stored the result in a private K8sClient field
  • K8sExecutor.getClient() returned that stored field directly
  • K8sTaskHandler stored executor.client in its constructor, bypassing getClient() entirely

This caused 401 Unauthorized errors after ~60 minutes on clusters with short-lived projected service account tokens (e.g. AKS, RKE2).
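
The three points above can be illustrated with a minimal Java sketch. It does not use the Nextflow classes: `TokenSource`, `StaleExecutor`, and `StaleTaskHandler` are hypothetical stand-ins, and the "client" is reduced to the token string it was built with. The sketch only shows why a refresh mechanism in the config layer has no effect once the result is captured in a field.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the rotating service account token on disk.
final class TokenSource {
    private final AtomicInteger version = new AtomicInteger();

    // Simulates the cluster rotating the projected token.
    void rotate() { version.incrementAndGet(); }

    // Simulates re-reading the token file.
    String read() { return "token-v" + version.get(); }
}

// Hypothetical stand-in for the executor before the fix.
final class StaleExecutor {
    private final TokenSource source;
    private String client; // captured once, like the private K8sClient field

    StaleExecutor(TokenSource source) { this.source = source; }

    // Called once at startup; this is the only time the token is read.
    void register() { client = source.read(); }

    // Returns the stored field directly, so no refresh can ever happen here.
    String getClient() { return client; }
}

// Hypothetical stand-in for a task handler that copies the client in its constructor.
final class StaleTaskHandler {
    private final String client;

    StaleTaskHandler(StaleExecutor executor) { this.client = executor.getClient(); }

    String clientInUse() { return client; }
}
```

In this sketch, calling `rotate()` on the token source after `register()` changes what `read()` returns, but both `StaleExecutor.getClient()` and any existing `StaleTaskHandler` keep using the token captured at startup.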

Why not #6920

PR #6920 correctly identified the root cause but had two issues:

  1. Created a new K8sClient on every getClient() call — the Guava cache in K8sConfig only cached ClientConfig, so every poll/submit/delete call triggered new K8sClient(...) which re-parses SSL certificates and initializes the trust manager
  2. Cache at the wrong level: K8sConfig is a configuration object; the client lifecycle belongs in the executor

Fix

  • Move the Guava cache from K8sConfig to K8sExecutor — cache the K8sClient itself (not just ClientConfig), avoiding redundant SSL setup on every API call
  • K8sConfig.getClient() becomes a plain factory method that creates a fresh ClientConfig (re-reading the token from disk)
  • K8sTaskHandler uses executor.getClient() instead of direct field access
  • Added K8sExecutorTest covering cache hit and expiration behavior
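
As a rough illustration of the approach in the list above, the following Java sketch caches the client object at the executor level and rebuilds it only after a configurable interval. It is not the PR's code: the PR uses a Guava cache, while this sketch swaps in a hand-rolled single-entry expiry so it needs only the standard library. `SketchClient`, `CachingExecutor`, the `builds` counter, and the injectable clock are all hypothetical names introduced for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical stand-in for a client that is costly to build (e.g. SSL setup).
final class SketchClient {
    final String token;

    SketchClient(String token) { this.token = token; }
}

// Hypothetical executor that owns the client lifecycle.
final class CachingExecutor {
    private final Supplier<String> tokenReader; // re-reads the token on each rebuild
    private final LongSupplier clock;           // current time in millis; injectable for tests
    private final long ttlMillis;               // refresh interval
    private SketchClient cached;
    private long createdAt;

    // Counts how many times a client was actually constructed.
    final AtomicInteger builds = new AtomicInteger();

    CachingExecutor(Supplier<String> tokenReader, LongSupplier clock, long ttlMillis) {
        this.tokenReader = tokenReader;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached client, rebuilding it once the entry is older than ttlMillis.
    synchronized SketchClient getClient() {
        long now = clock.getAsLong();
        if (cached == null || now - createdAt >= ttlMillis) {
            cached = new SketchClient(tokenReader.get());
            createdAt = now;
            builds.incrementAndGet();
        }
        return cached;
    }
}
```

Callers in this sketch always go through `getClient()` rather than holding a reference, so repeated calls within the interval reuse one instance, and the first call after the interval picks up whatever token the reader returns at that moment.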

Move the Guava cache from K8sConfig to K8sExecutor so that the
K8sClient object itself is cached (not just the ClientConfig).
This avoids re-creating K8sClient (including SSL setup) on every
invocation while still refreshing the service account token when
the configured interval expires.

Fixes #6918

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@netlify

netlify Bot commented Mar 16, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit ce25993
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69b7c0b7755ad50008a0313e

     */
    protected K8sClient getClient() {
-        client
+        clientCache.get('client', () -> new K8sClient(k8sConfig.getClient()))
Collaborator

Took me a sec to get this, so I'll leave a comment here for future readers. The client config (including the token) is cached, but the k8s client itself is recreated on every task submission. This means a new client is created every time, but it's very lightweight. While building a client, it uses the config, which is refreshed every 50 minutes (the default value).

This is all handled by getClient(), which reaches into the Guava cache instead of getting a fresh client. The Guava cache handles the use-or-get logic.

@bentsherman bentsherman merged commit 3d2e4c4 into master Mar 16, 2026
27 checks passed
@bentsherman bentsherman deleted the fix/k8s-client-cache-at-executor branch March 16, 2026 13:52

Development

Successfully merging this pull request may close these issues.

K8s executor caches client token at wrong layer — PR #6742 token refresh never triggered
