feat(sagemaker): support VS Code window restore for SSH connections #8804

Draft
aws-ajangg wants to merge 1 commit into aws:master from aws-ajangg:feat-restore

aws-ajangg commented Jun 9, 2026

Contributor

Problem

When VS Code restarts and attempts to reconnect to a SageMaker/HyperPod remote SSH session, the connection fails because:

  1. Environment variables (SAGEMAKER_LOCAL_SERVER_FILE_PATH, AWS_SSM_CLI, SAGEMAKER_SERVER_SCRIPT_PATH, SAGEMAKER_NODE_PATH) are not preserved across window restores
  2. The detached server process may have exited and cannot be restarted without knowing the node and script paths
  3. On Windows, remote.SSH.useLocalServer was set to false, which prevents VS Code from restoring remote windows

Solution

Make the SSH connect scripts self-healing so they can recover from a stale environment:

  • Connect scripts (bash/PowerShell): Resolve missing env vars by deriving paths from the script directory and the server info JSON. If the detached server isn't running, restart it using nodePath/serverScriptPath from the info file, waiting up to 10s for it to come online. Locate the SSM plugin from a known relative path when AWS_SSM_CLI is unset.
  • Detached server: Writes nodePath and serverScriptPath into the info JSON so scripts can restart it autonomously.
  • SSH settings (Windows): Set useLocalServer=true for SageMaker/HyperPod hosts to enable window restore. Configure the system SSH path (C:\Windows\System32\OpenSSH\ssh.exe) when unset.
  • Server stop handling: Treat EPERM the same as ESRCH when stopping old server processes to avoid hard failures when the PID is owned by another session.
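
The env-var recovery described in the first bullet can be sketched in TypeScript as a pure function. This is a rough illustration, not the PR's script logic: the variable names and the `nodePath`/`serverScriptPath` fields come from the description above, while `resolveConnectEnv`, `ScriptDirFallbacks`, and the placeholder paths are hypothetical, and it assumes `SAGEMAKER_LOCAL_SERVER_FILE_PATH` points at the info JSON.

```typescript
// Fields the detached server is described as writing into the info JSON.
interface ServerInfo {
    nodePath?: string
    serverScriptPath?: string
}

// The four variables the PR description lists as lost on window restore.
interface ConnectEnv {
    SAGEMAKER_LOCAL_SERVER_FILE_PATH?: string
    AWS_SSM_CLI?: string
    SAGEMAKER_SERVER_SCRIPT_PATH?: string
    SAGEMAKER_NODE_PATH?: string
}

// Hypothetical: paths a connect script would derive from its own directory.
interface ScriptDirFallbacks {
    infoFilePath: string
    ssmCliPath: string
}

// Returns a copy of `env` with unset (or empty) variables filled in.
// Values already present in `env` always win over the fallbacks.
function resolveConnectEnv(env: ConnectEnv, info: ServerInfo, fallbacks: ScriptDirFallbacks): ConnectEnv {
    return {
        SAGEMAKER_LOCAL_SERVER_FILE_PATH: env.SAGEMAKER_LOCAL_SERVER_FILE_PATH || fallbacks.infoFilePath,
        AWS_SSM_CLI: env.AWS_SSM_CLI || fallbacks.ssmCliPath,
        SAGEMAKER_SERVER_SCRIPT_PATH: env.SAGEMAKER_SERVER_SCRIPT_PATH || info.serverScriptPath,
        SAGEMAKER_NODE_PATH: env.SAGEMAKER_NODE_PATH || info.nodePath,
    }
}

// Window restore case: nothing was inherited, so every value is recovered.
const restored = resolveConnectEnv(
    {},
    { nodePath: '/placeholder/node', serverScriptPath: '/placeholder/server.js' },
    { infoFilePath: '/placeholder/info.json', ssmCliPath: '/placeholder/session-manager-plugin' }
)
```

Keeping the resolution separate from process spawning makes the fallback order easy to unit test; restarting the server and polling for up to 10s would sit on top of the resolved values.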

Testing

  • Added unit tests for startVscodeRemote covering Windows-specific settings behavior (useLocalServer, remotePlatform, SSH path) for all SageMaker host prefixes (sm_, smc_, smhp_, smhpc_) and non-SageMaker hosts
  • Added unit tests for getSmSsmEnv verifying the new env vars (SAGEMAKER_SERVER_SCRIPT_PATH, SAGEMAKER_NODE_PATH) are correctly set
  • Updated stopLocalServer tests for the new EPERM handling behavior
  • Tested locally on Windows and macOS: verified window restore reconnects successfully after VS Code restart
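
The Windows settings behavior exercised by these tests could look roughly like the sketch below. The host prefixes, `remote.SSH.useLocalServer`, and the system SSH path are taken from the description; `windowsRestoreSettings` is a hypothetical helper, `remote.SSH.path` is assumed to be the setting key for the SSH executable, and leaving non-SageMaker or non-Windows hosts untouched is also an assumption. The `remotePlatform` handling mentioned above is not modeled.

```typescript
// Host prefixes listed in the PR's testing notes.
const sageMakerHostPrefixes = ['sm_', 'smc_', 'smhp_', 'smhpc_']

// System OpenSSH client path named in the PR description.
const systemSshPath = 'C:\\Windows\\System32\\OpenSSH\\ssh.exe'

function isSageMakerHost(hostname: string): boolean {
    return sageMakerHostPrefixes.some((prefix) => hostname.startsWith(prefix))
}

// Returns the settings to write before connecting.
// Assumption: other hosts and other platforms get no changes at all.
function windowsRestoreSettings(
    hostname: string,
    platform: string,
    currentSshPath?: string
): Record<string, string | boolean> {
    const updates: Record<string, string | boolean> = {}
    if (platform !== 'win32' || !isSageMakerHost(hostname)) {
        return updates
    }
    // Needed so VS Code can restore the remote window after a restart.
    updates['remote.SSH.useLocalServer'] = true
    // Only fill in the SSH path when the user has not configured one.
    if (!currentSshPath) {
        updates['remote.SSH.path'] = systemSshPath
    }
    return updates
}
```

Returning a plain record of updates, rather than writing settings directly, keeps the decision logic testable without a VS Code instance.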

@amazon-inspector-ohio

⏳ I'm reviewing this pull request for security vulnerabilities and code quality issues. I'll provide an update when I'm done.

github-actions Bot commented Jun 9, 2026

  • This pull request implements a feat or fix, so it must include a changelog entry (unless the fix is for an unreleased feature). Review the changelog guidelines.
    • Note: beta or "experiment" features that have active users should announce fixes in the changelog.
    • If this is not a feature or fix, use an appropriate type from the title guidelines. For example, telemetry-only changes should use the telemetry type.

@amazon-inspector-ohio

✅ I finished the code review, and didn't find any security or code quality issues.

Enable SageMaker/HyperPod SSH connections to survive VS Code restarts
by making the connect scripts self-healing when environment variables
are unavailable during window restore.

- Connect scripts (bash/PowerShell) resolve missing env vars from the
  server info JSON and script directory, restart the detached server if
  it's not running, and locate the SSM plugin from a known relative path
- Detached server writes nodePath and serverScriptPath to info JSON so
  scripts can restart it autonomously
- Set useLocalServer=true for SageMaker hosts on Windows (enables restore)
  and configure system SSH path when unset
- Handle EPERM the same as ESRCH when stopping the old server process
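
The EPERM handling in the last item can be illustrated with a small TypeScript sketch. It is not the PR's `stopLocalServer` code: `stopOldServer`, `KillFn`, and the return values are hypothetical, and the kill function is injected so the example has no Node.js dependency (in Node it could be `(pid) => process.kill(pid)`).

```typescript
// Injected so the sketch stays independent of Node's `process` global.
type KillFn = (pid: number) => void

// Extracts a string `code` property from an unknown thrown value, if any.
function errorCode(err: unknown): string | undefined {
    if (typeof err === 'object' && err !== null && 'code' in err) {
        const code = (err as { code: unknown }).code
        return typeof code === 'string' ? code : undefined
    }
    return undefined
}

// Tries to stop the old server process.
// ESRCH (no such process) and EPERM (operation not permitted, for example
// a PID owned by another session) are both treated as "nothing to stop".
function stopOldServer(pid: number, kill: KillFn): 'stopped' | 'ignored' {
    try {
        kill(pid)
        return 'stopped'
    } catch (err) {
        const code = errorCode(err)
        if (code === 'ESRCH' || code === 'EPERM') {
            return 'ignored'
        }
        // Anything else is unexpected and should still surface.
        throw err
    }
}
```

Any other error code is rethrown, so only the two cases named in the commit message are downgraded from hard failures.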

zachliu commented Jun 16, 2026

Copy link
Copy Markdown

@aws-ajangg hello 👋 Is there a workaround in the meantime? We are having the same issue: VS Code always tries to reconnect to a stale session. How do we close the session cleanly to force a fresh start?

2 participants