Summary
The HyperPod documentation now recommends API-Driven Configuration over the legacy provisioning_parameters.json approach for Slurm cluster setup. However, the base lifecycle scripts in 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ still require provisioning_parameters.json to function, creating a contradiction between the recommended API-driven path and the scripts that users are directed to use.
Current Behavior
- lifecycle_script.py reads provisioning_parameters.json and passes its contents to downstream scripts (mount_fsx.sh, start_slurm.sh, setup_mariadb_accounting.sh, etc.)
- on_create.sh expects provisioning_parameters.json to exist alongside the lifecycle scripts in S3
- Users following the API-Driven Configuration path (using SlurmConfig and InstanceStorageConfigs in the CreateCluster API) still need to provide a provisioning_parameters.json file if they use the base lifecycle scripts
Expected Behavior
When using API-Driven Configuration, the lifecycle scripts should be able to derive all necessary configuration from resource_config.json (auto-generated by HyperPod) and the API-injected instance metadata, without requiring a separate provisioning_parameters.json.
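Relevant Documentation
- The HyperPod documentation labels the provisioning_parameters.json approach as "Legacy Configuration"
- The HyperPod documentation states that the provisioning_parameters.json file is not required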
Suggested Changes
- Update lifecycle_script.py to detect whether API-Driven Configuration is in use (e.g., check for SlurmConfig in resource_config.json) and fall back to provisioning_parameters.json only when API-driven config is absent
- Update mount_fsx.sh to read FSx mount information from instance metadata or resource_config.json when available via InstanceStorageConfigs
- Update start_slurm.sh to derive Slurm node assignments from resource_config.json when SlurmConfig is provided via the API
- Maintain backward compatibility: provisioning_parameters.json should continue to work for existing users
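The detection-and-fallback logic in the first and last bullets could look roughly like the sketch below. It is not the current lifecycle_script.py code. It assumes that API-driven clusters expose a SlurmConfig key somewhere inside resource_config.json, and the real schema may differ. The helper names (contains_key, choose_config_source, load_cluster_config) are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Optional


def contains_key(node: Any, key: str) -> bool:
    """Return True if `key` appears anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        return key in node or any(contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_key(item, key) for item in node)
    return False


def choose_config_source(resource_config: dict, legacy_params: Optional[dict]) -> str:
    """Decide which configuration source the lifecycle scripts should use."""
    # Assumption: the presence of "SlurmConfig" signals API-Driven Configuration.
    if contains_key(resource_config, "SlurmConfig"):
        return "api"
    # Backward compatibility: fall back to the legacy file when it is provided.
    if legacy_params is not None:
        return "legacy"
    raise ValueError(
        "No SlurmConfig in resource_config.json and no provisioning_parameters.json"
    )


def load_cluster_config(resource_config_path: Path, provisioning_path: Path) -> dict:
    """Load both files (the legacy one is optional) and report the chosen source."""
    resource_config = json.loads(resource_config_path.read_text())
    legacy_params = None
    if provisioning_path.exists():
        legacy_params = json.loads(provisioning_path.read_text())
    return {
        "source": choose_config_source(resource_config, legacy_params),
        "resource_config": resource_config,
        "provisioning_parameters": legacy_params,
    }
```

Keeping the decision in a pure function (choose_config_source) makes the fallback order easy to unit test without touching the filesystem. Downstream scripts such as mount_fsx.sh and start_slurm.sh could then branch on the returned source value.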
Files Affected
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh
1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh