Motivation
The current awsome-distributed-training repository has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:
- Numbered prefixes impose a rigid ordering that no longer reflects how users discover content. Directories like 1.architectures/, 2.ami_and_containers/, 3.test_cases/ suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).
- Training and inference are split across two separate repositories. The awsome-inference repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training and inference on the same cluster shouldn't need two repos.
- The test_cases/ directory contains only training workloads today, with no clear place for inference examples. As inference content grows, we need an explicit separation.
Proposed Change
1. Remove numeric prefixes from all directories
| Current | Proposed |
| --- | --- |
| 0.docs/ | docs/ |
| 1.architectures/ | architectures/ |
| 2.ami_and_containers/ | ami_and_containers/ |
| 3.test_cases/ | test_cases/ |
| 4.validation_and_observability/ | validation_and_observability/ |
| micro-benchmarks/ | micro-benchmarks/ (unchanged) |
Also remove numbering from subdirectories:
| Current | Proposed |
| --- | --- |
| architectures/0.common/ | architectures/common/ |
| architectures/1.vpc_network/ | architectures/vpc_network/ |
| architectures/2.aws-parallelcluster/ | architectures/aws-parallelcluster/ |
| ... | ... |
| validation_and_observability/1.pytorch-env-validation/ | validation_and_observability/pytorch-env-validation/ |
| ... | ... |
Additionally, to explicitly indicate that 5.sagemaker-hyperpod is for HyperPod Slurm, we rename it to sagemaker-hyperpod-slurm.
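The renames in the two tables above follow one mechanical rule, so the old-to-new mapping could be generated rather than typed by hand. The following is a minimal sketch in Python, assuming every prefix is a run of digits followed by a dot at the start of a path component. The names `proposed_path` and `SPECIAL_CASES` are illustrative and do not exist in the repository.

```python
import re
from pathlib import PurePosixPath

# Leading "<digits>." on a single path component, e.g. "1." in "1.architectures".
_NUMERIC_PREFIX = re.compile(r"^\d+\.")

# Renames that are more than a prefix strip (taken from the proposal text).
# The dict itself is a hypothetical helper, not an existing repo artifact.
SPECIAL_CASES = {"5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm"}


def proposed_path(current: str) -> str:
    """Map a current repository directory path to its proposed name."""
    parts = []
    for part in PurePosixPath(current).parts:
        if part in SPECIAL_CASES:
            parts.append(SPECIAL_CASES[part])
        else:
            # Components without a numeric prefix pass through unchanged.
            parts.append(_NUMERIC_PREFIX.sub("", part))
    return str(PurePosixPath(*parts))


if __name__ == "__main__":
    for path in ("0.docs", "1.architectures/0.common", "micro-benchmarks"):
        print(f"{path} -> {proposed_path(path)}")
```

The pattern would also match file names that happen to start with digits and a dot, so a sketch like this should be applied to directory paths only.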
2. Merge awsome-inference into this repository
Content from aws-samples/awsome-inference would be absorbed:
| awsome-inference source | Destination in this repo |
| --- | --- |
| 1.infrastructure/ | architectures/ (merge with existing infra, deduplicate VPC/EKS) |
| 2.projects/* | test_cases/inference/<project>/ |
| 3.use-cases/* | test_cases/inference/<use-case>/ or a top-level use-cases/ if distinct enough |
The awsome-inference repo would be archived after the merge, with the README pointing here.
3. Split test_cases/ into training/ and inference/
test_cases/
├── training/
│ ├── pytorch/
│ │ ├── FSDP/
│ │ ├── ddp/
│ │ ├── deepspeed/
│ │ ├── torchtitan/
│ │ ├── nvrx/
│ │ ├── picotron/
│ │ └── ...
│ ├── megatron/
│ │ ├── megatron-lm/
│ │ ├── nemo/
│ │ └── ...
│ └── jax/
│ └── ...
├── inference/
│ ├── sglang/
│ ├── trtllm/
│ ├── nims/
│ ├── dynamo/
│ ├── ray-service/
│ ├── triton-trtllm-sagemaker/
│ └── ...
└── README.md
Proposed final top-level structure
awsome-distributed-training/
├── architectures/
├── ami_and_containers/
├── test_cases/
│ ├── training/
│ └── inference/
├── validation_and_observability/
├── micro-benchmarks/
├── docs/
└── README.md
Impact & Scope
- Affected areas: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
- Breaking changes: Yes — all directory paths change. This is a one-time, coordinated change.
- Migration needed: Yes.
- GitHub redirects do not work for path renames within a repo; only repo-level transfers get auto-redirects.
- All cross-references in READMEs, workshop guides, and external docs need updating.
- CI workflows (.github/workflows/) that use path filters will need path updates.
- git mv keeps file contents intact, so blame and log --follow can still trace history across the rename; we should avoid re-creating files from scratch.
- Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
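Because there are no path redirects to fall back on, stale references have to be found by searching. The following is a rough sketch of such a check, assuming an old path can be recognized by one of the five numbered top-level directory names from the first rename table. The function names and the default file suffixes are hypothetical choices, not existing tooling in this repo.

```python
import re
from pathlib import Path

# Old top-level directory names, taken from the rename table in this proposal.
OLD_TOP_LEVEL = (
    "0.docs",
    "1.architectures",
    "2.ami_and_containers",
    "3.test_cases",
    "4.validation_and_observability",
)

# An old name that is not embedded in a longer identifier on either side.
_STALE = re.compile(
    r"(?<![\w.-])(?:" + "|".join(re.escape(n) for n in OLD_TOP_LEVEL) + r")(?![\w-])"
)


def find_stale_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that mention an old top-level path."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if _STALE.search(line)
    ]


def scan_tree(root: Path, suffixes=(".md", ".yml", ".yaml")) -> dict[Path, list[tuple[int, str]]]:
    """Scan text files under root and map each file to its stale references."""
    hits = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            found = find_stale_references(text)
            if found:
                hits[path] = found
    return hits
```

Running `scan_tree(Path("."))` from the repository root after the rename would list README and workflow lines that still point at the old layout. External docs and blog posts are out of reach of a check like this and still need the manual tracking described in this section.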
HyperPod Console dependency (hard blocker)
The SageMaker HyperPod console (more specifically, its CloudFormation templates, here and here) reads lifecycle scripts directly from this repository using the current directory paths (e.g., 1.architectures/5.sagemaker-hyperpod/...). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.
The proposed approach uses releases as stable reference points to decouple the service from main branch paths during the transition:
| Step | Action | Outcome |
| --- | --- | --- |
| 1. Create a release | Tag the current state of awsome-distributed-training (e.g., v2.0.0-pre-reorg) before any renames | Provides a permanent, immutable snapshot with the old directory structure that the service can pin to |
| 2. Update service → pinned release | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of main | Service is decoupled from main: renames on main no longer affect production |
| 3. Reorganize the repository | Perform all renames, merges, and restructuring on main | main now has the new directory structure; the pinned release is unaffected |
| 4. Update service → new paths | HyperPod team updates the console to reference the new paths on main (or a post-reorg release tag) | Service is back on main with the new structure |
This approach ensures zero downtime for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
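To illustrate why pinning decouples the service from main, here is a small sketch that builds the reference a consumer would resolve at each step. It assumes the templates fetch scripts through raw.githubusercontent.com URLs, which this proposal does not confirm. The owner `example-org` and the file name `example_script.sh` are placeholders, not real values.

```python
from urllib.parse import quote


def raw_url(owner: str, repo: str, ref: str, path: str) -> str:
    """Build a raw-content URL for a file at a given ref (branch, tag, or commit SHA)."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{quote(path.lstrip('/'))}"


# Placeholder values: the real owner and script paths are not taken from the templates.
OWNER, REPO = "example-org", "awsome-distributed-training"
OLD_PATH = "1.architectures/5.sagemaker-hyperpod/example_script.sh"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm/example_script.sh"

# Today: follows main, so it breaks as soon as the rename lands.
today = raw_url(OWNER, REPO, "main", OLD_PATH)
# Steps 1-2: pinned to the pre-reorg tag, unaffected by later commits on main.
pinned = raw_url(OWNER, REPO, "v2.0.0-pre-reorg", OLD_PATH)
# Step 4: back on main (or a post-reorg tag) with the new layout.
after = raw_url(OWNER, REPO, "main", NEW_PATH)
```

A Git tag can technically be moved or deleted by someone with write access, so pinning to the tagged commit SHA would be the stricter option if the service needs a guarantee rather than a convention.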
Alternative: Migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup
A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into aws/sagemaker-hyperpod-cluster-setup, which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:
- Eliminate the cross-repo dependency: the scripts and the templates that reference them would live in the same repo, owned by the same team.
- Decouple this repo from production service concerns: awsome-distributed-training becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
- Make future reorganizations safe: no need to coordinate with the HyperPod team for future directory changes in this repo.
If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup → update CloudFormation templates to reference the new location → reorganize this repo freely.
Migration plan (high level)
1. Create a tracking issue for every external document/workshop that references current paths.
2. Create a release (e.g., v2.0.0-pre-reorg) to snapshot the current directory structure.
3. Coordinate with the HyperPod service team to update console CloudFormation templates to reference the release tag instead of main (hard blocker: steps 4–5 cannot proceed until this is deployed).
4. Perform all renames via git mv in a single PR to preserve history.
5. Merge awsome-inference content in a follow-up PR (or same PR if manageable).
6. Update all internal README links, CI path filters, and Makefile references.
7. Coordinate with the HyperPod service team to update console references from the release tag to the new paths on main.
8. Archive awsome-inference with a pointer to this repo.
9. Update external docs/workshops.
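For the rename step, nested numbered directories mean the order of the git mv calls matters: once a parent is renamed, the old paths of its children no longer exist. The following sketch orders the renames deepest-first so that each command changes only the last path component. The helper names are hypothetical, and the 5.sagemaker-hyperpod special case is deliberately left out and would need its own entry.

```python
import re
import shlex
from pathlib import PurePosixPath

# Leading "<digits>." on a directory name, e.g. "0." in "0.common".
_NUMERIC_PREFIX = re.compile(r"^\d+\.")


def plan_git_mv(directories: list[str]) -> list[tuple[str, str]]:
    """Return (source, destination) pairs that are valid when run in order.

    Deeper directories come first. Each step renames only the final path
    component, so parents keep their old names until their own turn.
    """
    ordered = sorted(
        directories,
        key=lambda d: len(PurePosixPath(d).parts),
        reverse=True,
    )
    plan = []
    for directory in ordered:
        path = PurePosixPath(directory)
        new_name = _NUMERIC_PREFIX.sub("", path.name)
        if new_name != path.name:  # skip directories that have no prefix
            plan.append((str(path), str(path.with_name(new_name))))
    return plan


def as_commands(plan: list[tuple[str, str]]) -> list[str]:
    """Render the plan as shell-safe git mv command lines."""
    return [f"git mv {shlex.quote(src)} {shlex.quote(dst)}" for src, dst in plan]
```

Given `["1.architectures", "1.architectures/0.common"]`, the plan first moves `1.architectures/0.common` to `1.architectures/common` and only then moves `1.architectures` to `architectures`. Printing the commands for review before running them keeps the single-PR rename auditable.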
Alternatives Considered
- Keep numbering, just merge inference: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.
- Merge both repos into a new repo with a new name: Avoids breaking existing links to this repo, but loses the GitHub star count, issue history, and contributor graph. The awsome-distributed-training name is well-known enough to keep.
- Monorepo with workspaces (training/ and inference/ at root level): Considered splitting at the root rather than under test_cases/. Rejected because architectures, AMIs, and validation tooling are shared across training and inference; only the test cases themselves differ.
- Do nothing: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.
Feedback Period
4 weeks
CC List