fix(sagemaker): Add proactive credential refresh for SSO-based SageMaker Space connections #8755

Open
amzn-kvk wants to merge 8 commits into aws:master from amzn-kvk:fix/sso-credential-refresh-test

@amzn-kvk

Problem

When connecting to a SageMaker Space via IAM Identity Center (SSO), the session disconnects after ~1 hour. Active Jupyter kernels are killed and the user must re-authenticate via browser and reload the window.

The root cause is that persistLocalCredentials() snapshots STS credentials once at connection time and writes them as static strings to ~/.aws/.sagemaker-space-profiles. The detached server's resolveCredentialsFor() returns these values as-is with no expiry check or re-derivation. After ~1 hour (default STS credential TTL), the SSM tunnel drops and reconnection fails because the mapping file contains expired credentials.

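This only affects SSO connections. IAM connections resolve credentials dynamically through fromIni() on each request, and SMUS connections already refresh the mapping file through startProactiveCredentialRefresh().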

Solution

Add SsoCredentialRefresher that follows the proven SMUS ProjectRoleCredentialsProvider.startProactiveCredentialRefresh() pattern:

  • Checks every 10 seconds using setTimeout (handles sleep/resume correctly)
  • Refreshes when credentials expire within 5 minutes (safety buffer)
  • Writes fresh credentials to ~/.aws/.sagemaker-space-profiles via setSpaceSsoProfile()

persistLocalCredentials() now starts the refresher for SSO connections after the initial credential write. The detached server and SSH ProxyCommand are unchanged - they continue reading from the mapping file, which now stays fresh.
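
The refresh loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in credentialMapping.ts: `TimedCredentials`, `needsRefresh`, `RefresherSketch`, and the injected `fetchFresh` and `persist` callbacks are hypothetical names. In the PR, the write step corresponds to setSpaceSsoProfile().

```typescript
// Hypothetical credential shape; the real types in the Toolkit may differ.
interface TimedCredentials {
    accessKeyId: string
    secretAccessKey: string
    sessionToken: string
    expiration: Date
}

const checkIntervalMs = 10 * 1000 // check every 10 seconds
const expiryBufferMs = 5 * 60 * 1000 // refresh when 5 minutes or less remain

/** Returns true when the credentials expire within the safety buffer. */
function needsRefresh(creds: TimedCredentials, now: number, bufferMs: number = expiryBufferMs): boolean {
    return creds.expiration.getTime() - now <= bufferMs
}

class RefresherSketch {
    private timer: ReturnType<typeof setTimeout> | undefined
    private stopped = true

    constructor(
        private current: TimedCredentials,
        // Assumed callbacks: fetch new credentials, then write them to the mapping file.
        private readonly fetchFresh: () => Promise<TimedCredentials>,
        private readonly persist: (creds: TimedCredentials) => Promise<void>
    ) {}

    start(): void {
        this.stopped = false
        this.schedule()
    }

    stop(): void {
        this.stopped = true
        if (this.timer !== undefined) {
            clearTimeout(this.timer)
            this.timer = undefined
        }
    }

    private schedule(): void {
        // Each tick re-arms itself and compares against wall-clock time, so the
        // first tick after a sleep/resume sees that the credentials are stale.
        this.timer = setTimeout(() => void this.tick(), checkIntervalMs)
    }

    private async tick(): Promise<void> {
        try {
            if (needsRefresh(this.current, Date.now())) {
                this.current = await this.fetchFresh()
                await this.persist(this.current)
            }
        } catch (err) {
            console.error('credential refresh failed; retrying on the next tick', err)
        } finally {
            if (!this.stopped) {
                this.schedule()
            }
        }
    }
}
```

Keeping the expiry check in a pure function makes the "refresh on expiry" and "skip when fresh" cases easy to test without real timers.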

Files changed:

  • credentialMapping.ts - Added SsoCredentialRefresher class; wired into persistLocalCredentials() for SSO
  • ssoCredentialRefresh.test.ts - 4 tests covering refresh on expiry, skip when fresh, end-to-end read, and SSO/SMUS parity

  • Treat all work as PUBLIC. Private feature/x branches will not be squash-merged at release time.
  • Your code changes must meet the guidelines in CONTRIBUTING.md.
  • License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@amzn-kvk amzn-kvk requested a review from a team as a code owner April 21, 2026 10:27
@amazon-inspector-ohio

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done

@github-actions

  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

@amazon-inspector-ohio

✅ I finished the code review, and didn't find any security or code quality issues.

@amzn-kvk amzn-kvk changed the title fix: Add proactive credential refresh for SSO-based SageMaker Space connections fix(sagemaker): Add proactive credential refresh for SSO-based SageMaker Space connections Apr 21, 2026
@amzn-kvk amzn-kvk force-pushed the fix/sso-credential-refresh-test branch 2 times, most recently from ddbcf01 to dc450a2 Compare April 21, 2026 11:18
@yuriivv

yuriivv commented Apr 21, 2026

Contributor

/retryBuilds

Comment thread packages/core/src/awsService/sagemaker/credentialMapping.ts
@yuriivv yuriivv force-pushed the fix/sso-credential-refresh-test branch from af06e7d to ebebf4d Compare April 21, 2026 15:09
When connecting to a SageMaker Space via IAM Identity Center (SSO),
the Toolkit snapshots STS credentials once at connection time and
never refreshes them. After ~1 hour (default STS credential TTL),
the SSM tunnel drops and reconnection fails because the detached
server reads expired credentials from the mapping file.

This is an asymmetry in the codebase:
- IAM connections: detached server calls fromIni() which resolves
  credentials dynamically on each request
- SMUS connections: persistSmusProjectCreds() calls
  startProactiveCredentialRefresh() which periodically writes fresh
  credentials to the mapping file
- SSO connections: persistLocalCredentials() writes once and returns
  with no refresh mechanism

These tests document the gap. All three will pass once
startProactiveCredentialRefresh() is wired up for SSO connections
using the SSO token cache and GetRoleCredentials as the credential
source.
Add SsoCredentialRefresher that follows the proven SMUS
startProactiveCredentialRefresh() pattern:
- 10s check interval using setTimeout (handles sleep/resume)
- 5min safety buffer before credential expiry
- Writes fresh credentials to ~/.aws/.sagemaker-space-profiles

persistLocalCredentials() now starts the refresher for SSO connections
after the initial credential write. This prevents the ~1h disconnect
caused by stale STS credentials in the mapping file.

The detached server and SSH ProxyCommand are unchanged - they continue
reading credentials from the mapping file, which now stays fresh.
…isposables

Register a disposable in the SageMaker activation subscriptions that
calls stopAllSsoCredentialRefreshers() on extension deactivation,
ensuring all refresh timers are cleaned up on reload or uninstall.
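
A minimal sketch of this cleanup pattern, assuming a module-level registry of active refreshers. The names `Stoppable`, `registerRefresher`, `stopAllRefreshers`, and `refresherCleanup` are hypothetical stand-ins; the PR's actual function is stopAllSsoCredentialRefreshers(), and its internals may differ.

```typescript
// Minimal local shapes so the sketch runs outside VS Code.
// vscode.Disposable exposes the same dispose() contract.
interface Disposable {
    dispose(): void
}

interface Stoppable {
    stop(): void
}

// Hypothetical registry keyed by connection identifier.
const activeRefreshers = new Map<string, Stoppable>()

function registerRefresher(connectionKey: string, refresher: Stoppable): void {
    // Stop any refresher already tracked for this connection to avoid duplicate timers.
    activeRefreshers.get(connectionKey)?.stop()
    activeRefreshers.set(connectionKey, refresher)
}

function stopAllRefreshers(): void {
    for (const refresher of activeRefreshers.values()) {
        refresher.stop()
    }
    activeRefreshers.clear()
}

/** Returns a disposable that stops every tracked refresher when disposed. */
function refresherCleanup(): Disposable {
    return { dispose: stopAllRefreshers }
}
```

In an extension, the returned object would typically be pushed onto the activation context's subscriptions so that VS Code disposes it on deactivation.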
@amzn-kvk amzn-kvk force-pushed the fix/sso-credential-refresh-test branch from 40a9b96 to 0105627 Compare April 23, 2026 10:17
Move markPending() before open(url) in getSessionAsync handler to
prevent browser callback from arriving before pending status is written.
Add downgrade guard in SessionStore.markPending() to never overwrite
a 'fresh' status with 'pending'.

The race condition caused the PS1 reconnect script to poll indefinitely
with 204 responses on Windows, where the browser auth redirect completes
faster than the async file I/O in markPending.
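
The two parts of this fix can be illustrated with an in-memory sketch. It assumes only the two statuses named above; the real SessionStore persists to a file asynchronously and may track more states. `SessionStoreSketch` and `beginReauth` are hypothetical names.

```typescript
type SessionStatus = 'pending' | 'fresh'

class SessionStoreSketch {
    private readonly statuses = new Map<string, SessionStatus>()

    markPending(connectionId: string): void {
        // Downgrade guard: a late markPending must not overwrite 'fresh'.
        if (this.statuses.get(connectionId) === 'fresh') {
            return
        }
        this.statuses.set(connectionId, 'pending')
    }

    markFresh(connectionId: string): void {
        this.statuses.set(connectionId, 'fresh')
    }

    getStatus(connectionId: string): SessionStatus | undefined {
        return this.statuses.get(connectionId)
    }
}

/**
 * Ordering fix: record 'pending' before the browser is opened, so the auth
 * callback cannot arrive first. With async file I/O, the markPending call
 * would be awaited before opening the URL.
 */
function beginReauth(store: SessionStoreSketch, connectionId: string, openBrowser: () => void): void {
    store.markPending(connectionId)
    openBrowser()
}
```

The guard makes the store tolerant of out-of-order writes, while the reordering removes the most likely source of them.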

🤖 Assisted by AI

aws#8755
…er stop

Add structured logging to detached server with timestamp, method, path,
port, and response status. Sensitive params (token, ws_url, session) are
redacted. Helps diagnose reconnection issues without exposing secrets.
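
One way such redaction might look, as a sketch only. The parameter names come from the commit message; `redactUrl`, `formatLogLine`, the `REDACTED` placeholder, and the log line layout are assumptions, and the real log format may differ.

```typescript
// Query parameters named in the commit message as sensitive.
const sensitiveParams = ['token', 'ws_url', 'session']

/** Returns the request path and query with sensitive parameter values masked. */
function redactUrl(rawUrl: string): string {
    // The base URL is only needed to parse a relative request path.
    const url = new URL(rawUrl, 'http://localhost')
    for (const name of sensitiveParams) {
        if (url.searchParams.has(name)) {
            url.searchParams.set(name, 'REDACTED')
        }
    }
    return `${url.pathname}${url.search}`
}

/** Builds one structured log line; the layout here is illustrative. */
function formatLogLine(method: string, rawUrl: string, port: number, status: number, now: Date = new Date()): string {
    return `${now.toISOString()} ${method} ${redactUrl(rawUrl)} port=${port} status=${status}`
}
```

Masking values rather than dropping the parameters keeps the log line useful for checking which parameters a request carried.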

Handle EPERM error gracefully in stopLocalServer() on Windows where the
server process may be elevated or in a different security context.
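
A hedged sketch of graceful EPERM handling. The function name `stopServerProcess`, its boolean return value, and the injectable `kill` parameter are assumptions for illustration; the actual stopLocalServer() may be structured differently.

```typescript
type KillFn = (pid: number) => void

/**
 * Attempts to stop the process and returns true on success.
 * Returns false when the OS denies permission (EPERM), for example when the
 * target runs elevated. Any other error is rethrown.
 */
function stopServerProcess(pid: number, kill: KillFn = (p) => { process.kill(p) }): boolean {
    try {
        kill(pid)
        return true
    } catch (err) {
        const code = (err as { code?: string }).code
        if (code === 'EPERM') {
            console.warn(`no permission to stop process ${pid}; leaving it running`)
            return false
        }
        throw err
    }
}
```

Injecting the kill function keeps the error path testable without spawning a real process.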

🤖 Assisted by AI

aws#8755
2 participants