Lifecycle scripts should support API-Driven Configuration without provisioning_parameters.json #1028

@KeitaW

Summary

The HyperPod documentation now recommends API-Driven Configuration over the legacy provisioning_parameters.json approach for Slurm cluster setup. However, the base lifecycle scripts in 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ still require provisioning_parameters.json to function, creating a contradiction between the recommended API-driven path and the scripts that users are directed to use.

Current Behavior

  • lifecycle_script.py reads provisioning_parameters.json and passes its contents to downstream scripts (mount_fsx.sh, start_slurm.sh, setup_mariadb_accounting.sh, etc.)
  • on_create.sh expects provisioning_parameters.json to exist alongside the lifecycle scripts in S3
  • Users following the API-Driven Configuration path (using SlurmConfig and InstanceStorageConfigs in the CreateCluster API) still need to provide a provisioning_parameters.json file if they use the base lifecycle scripts

Expected Behavior

When using API-Driven Configuration, the lifecycle scripts should be able to derive all necessary configuration from resource_config.json (auto-generated by HyperPod) and the API-injected instance metadata, without requiring a separate provisioning_parameters.json.
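As a rough illustration of what "derive from resource_config.json" could look like, here is a minimal sketch that groups instance groups by Slurm node type. The schema is an assumption: it supposes each entry under `InstanceGroups` carries the `SlurmConfig` object passed to CreateCluster, with a `NodeType` field. The actual layout HyperPod writes may differ, and `derive_slurm_roles` is a hypothetical helper, not something in the current scripts.

```python
from collections import defaultdict
from typing import Any


def derive_slurm_roles(resource_config: dict[str, Any]) -> dict[str, list[str]]:
    """Group instance-group names by Slurm node type.

    ASSUMPTION: each entry in "InstanceGroups" has a "SlurmConfig" object
    with a "NodeType" field when API-Driven Configuration is in use.
    """
    roles: dict[str, list[str]] = defaultdict(list)
    for group in resource_config.get("InstanceGroups", []):
        slurm_config = group.get("SlurmConfig")
        if not slurm_config:
            continue  # this group was not configured through the API
        roles[slurm_config["NodeType"]].append(group["Name"])
    return dict(roles)


# Hypothetical resource_config.json content, for illustration only.
example = {
    "InstanceGroups": [
        {"Name": "controller-machine", "SlurmConfig": {"NodeType": "Controller"}},
        {"Name": "worker-group-1", "SlurmConfig": {"NodeType": "Compute"}},
        {"Name": "worker-group-2", "SlurmConfig": {"NodeType": "Compute"}},
    ]
}
print(derive_slurm_roles(example))
```

With a mapping like this, the role information that lifecycle_script.py currently takes from provisioning_parameters.json could come from the API-provided data instead.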

Relevant Documentation

Suggested Changes

  1. Update lifecycle_script.py to detect whether API-Driven Configuration is in use (e.g., check for SlurmConfig in resource_config.json) and fall back to provisioning_parameters.json only when API-driven config is absent
  2. Update mount_fsx.sh to read FSx mount information from instance metadata or resource_config.json when it is supplied through InstanceStorageConfigs
  3. Update start_slurm.sh to derive Slurm node assignments from resource_config.json when SlurmConfig is provided via the API
  4. Maintain backward compatibility: provisioning_parameters.json should continue to work for existing users
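The detection and fallback in items 1 and 4 could be sketched roughly as follows. This is not the existing lifecycle_script.py logic. `uses_api_driven_config` and `load_cluster_config` are hypothetical names, and the check assumes that `SlurmConfig` appears per instance group in resource_config.json, which would need to be confirmed against what HyperPod actually generates.

```python
import json
from pathlib import Path
from typing import Any


def uses_api_driven_config(resource_config: dict[str, Any]) -> bool:
    """Return True if any instance group carries a SlurmConfig.

    ASSUMPTION: API-Driven Configuration surfaces as a "SlurmConfig" key
    on entries of "InstanceGroups" in resource_config.json.
    """
    groups = resource_config.get("InstanceGroups", [])
    return any("SlurmConfig" in group for group in groups)


def load_cluster_config(
    resource_config_path: str, provisioning_params_path: str
) -> tuple[str, dict[str, Any]]:
    """Prefer API-driven config; fall back to provisioning_parameters.json."""
    resource_config = json.loads(Path(resource_config_path).read_text())
    if uses_api_driven_config(resource_config):
        return "api-driven", resource_config

    # Legacy path: keep existing clusters working unchanged.
    params_path = Path(provisioning_params_path)
    if not params_path.is_file():
        raise FileNotFoundError(
            f"No SlurmConfig found in {resource_config_path} and "
            f"{provisioning_params_path} does not exist"
        )
    return "legacy", json.loads(params_path.read_text())
```

Returning the mode alongside the config would let downstream steps (mount_fsx.sh, start_slurm.sh) branch explicitly rather than guessing which source they were given.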

Files Affected

  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh
