[RFC]: Repository reorganization — remove numbering, merge awsome-inference, split test_cases into training/inference #1056

@KeitaW

Motivation

The current awsome-distributed-training repository has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:

  1. Numbered prefixes impose a rigid ordering that no longer reflects how users discover content. Directories like 1.architectures/, 2.ami_and_containers/, 3.test_cases/ suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).

  2. Training and inference are split across two separate repositories. The awsome-inference repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training and inference on the same cluster shouldn't need two repos.

  3. The test_cases/ directory contains only training workloads today, with no clear place for inference examples. As inference content grows, we need an explicit separation.

Proposed Change

1. Remove numeric prefixes from all directories

| Current | Proposed |
| --- | --- |
| 0.docs/ | docs/ |
| 1.architectures/ | architectures/ |
| 2.ami_and_containers/ | ami_and_containers/ |
| 3.test_cases/ | test_cases/ |
| 4.validation_and_observability/ | validation_and_observability/ |
| micro-benchmarks/ | micro-benchmarks/ (unchanged) |

Also remove numbering from subdirectories:

| Current | Proposed |
| --- | --- |
| architectures/0.common/ | architectures/common/ |
| architectures/1.vpc_network/ | architectures/vpc_network/ |
| architectures/2.aws-parallelcluster/ | architectures/aws-parallelcluster/ |
| ... | ... |
| validation_and_observability/1.pytorch-env-validation/ | validation_and_observability/pytorch-env-validation/ |
| ... | ... |

Additionally, to explicitly indicate that 5.sagemaker-hyperpod is for HyperPod Slurm, we rename it to sagemaker-hyperpod-slurm.
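
The renaming rule above can be sketched as a small path-mapping function. This is a minimal illustration rather than the actual migration tooling: the function name `proposed_path` is hypothetical, and it assumes every path component of the form `<digits>.<name>` is a directory to be renamed (a file whose name happens to start with digits and a dot would need to be excluded by the caller). It covers only the prefix removal and the HyperPod rename, not the later training/inference split.

```python
import re
from pathlib import PurePosixPath

# Leading digits followed by a dot, e.g. "1." in "1.architectures".
_NUMERIC_PREFIX = re.compile(r"^\d+\.")

# Renames that go beyond dropping the prefix (taken from the RFC text).
_SPECIAL_CASES = {"5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm"}


def proposed_path(current: str) -> str:
    """Map a current repo-relative directory path to its proposed location.

    Hypothetical helper: applies the rule to every path component, so it
    should be called with directory paths only.
    """
    parts = []
    for part in PurePosixPath(current).parts:
        if part in _SPECIAL_CASES:
            parts.append(_SPECIAL_CASES[part])
        else:
            parts.append(_NUMERIC_PREFIX.sub("", part))
    return str(PurePosixPath(*parts))


if __name__ == "__main__":
    for old in ("0.docs/", "1.architectures/5.sagemaker-hyperpod", "micro-benchmarks/"):
        print(f"{old} -> {proposed_path(old)}")
```

Keeping the mapping as a pure function means the same rule can drive the renames themselves and any later link rewriting, so the two cannot drift apart.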

2. Merge awsome-inference into this repository

Content from aws-samples/awsome-inference would be absorbed:

| awsome-inference source | Destination in this repo |
| --- | --- |
| 1.infrastructure/ | architectures/ (merge with existing infra, deduplicate VPC/EKS) |
| 2.projects/* | test_cases/inference/<project>/ |
| 3.use-cases/* | test_cases/inference/<use-case>/ or a top-level use-cases/ if distinct enough |

The awsome-inference repo would be archived after the merge, with the README pointing here.
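
As a rough sketch of the mapping above, the function below returns a destination for an awsome-inference path together with a flag for entries that the RFC says need a human decision (infrastructure deduplication, and the still-open placement of use cases). The function name `inference_destination` and the example project names are hypothetical; the use-case branch assumes the `test_cases/inference/` option purely for illustration.

```python
from pathlib import PurePosixPath


def inference_destination(source: str) -> tuple[str, bool]:
    """Return (destination, needs_manual_review) for an awsome-inference path.

    Hypothetical helper mirroring the RFC's source/destination mapping.
    """
    parts = PurePosixPath(source).parts
    if not parts:
        raise ValueError("empty path")
    top, rest = parts[0], parts[1:]

    if top == "1.infrastructure":
        # Merged into existing infra; VPC/EKS deduplication is manual work.
        return str(PurePosixPath("architectures", *rest)), True
    if top == "2.projects":
        return str(PurePosixPath("test_cases", "inference", *rest)), False
    if top == "3.use-cases":
        # The RFC leaves open a top-level use-cases/ instead; flag for review.
        return str(PurePosixPath("test_cases", "inference", *rest)), True
    raise ValueError(f"no mapping defined for top-level directory: {top}")


if __name__ == "__main__":
    # "sglang" is an illustrative project name, not a confirmed source path.
    print(inference_destination("2.projects/sglang"))
```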

3. Split test_cases/ into training/ and inference/

test_cases/
├── training/
│   ├── pytorch/
│   │   ├── FSDP/
│   │   ├── ddp/
│   │   ├── deepspeed/
│   │   ├── torchtitan/
│   │   ├── nvrx/
│   │   ├── picotron/
│   │   └── ...
│   ├── megatron/
│   │   ├── megatron-lm/
│   │   ├── nemo/
│   │   └── ...
│   └── jax/
│       └── ...
├── inference/
│   ├── sglang/
│   ├── trtllm/
│   ├── nims/
│   ├── dynamo/
│   ├── ray-service/
│   ├── triton-trtllm-sagemaker/
│   └── ...
└── README.md

Proposed final top-level structure

awsome-distributed-training/
├── architectures/
├── ami_and_containers/
├── test_cases/
│   ├── training/
│   └── inference/
├── validation_and_observability/
├── micro-benchmarks/
├── docs/
└── README.md

Impact & Scope

  • Affected areas: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
  • Breaking changes: Yes — all directory paths change. This is a one-time, coordinated change.
  • Migration needed: Yes.
    • GitHub redirects do not work for path renames within a repo; only repo-level transfers get auto-redirects.
    • All cross-references in READMEs, workshop guides, and external docs need updating.
    • CI workflows (.github/workflows/) that use path filters will need path updates.
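    • Git stores no rename metadata; it infers renames from content similarity. Moving files with git mv, without editing their contents in the same commit, keeps history traceable via git log --follow and git blame; we should avoid re-creating files from scratch.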
    • Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
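
Because in-repo renames get no redirects, stale references have to be found by searching. The following is a minimal sketch of such a check, assuming it is enough to look for the five numbered top-level directory names listed in this RFC; the function name `find_stale_references` and the sample text are illustrative only.

```python
import re

# Top-level directories being renamed, per the table in "Proposed Change".
_OLD_TOP_LEVEL = (
    "0.docs",
    "1.architectures",
    "2.ami_and_containers",
    "3.test_cases",
    "4.validation_and_observability",
)

# Match an old directory name that is not part of a longer token:
# not preceded by a word character or dot, not followed by a word
# character or hyphen.
_STALE = re.compile(
    r"(?<![\w.])(?:"
    + "|".join(re.escape(name) for name in _OLD_TOP_LEVEL)
    + r")(?![\w-])"
)


def find_stale_references(text: str) -> list[tuple[int, str]]:
    """Return (line number, old directory name) for each stale reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _STALE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        "See [FSDP](3.test_cases/pytorch/FSDP).\n"
        "No paths here.\n"
        "https://example.com/tree/main/1.architectures/2.aws-parallelcluster\n"
    )
    for lineno, name in find_stale_references(sample):
        print(f"line {lineno}: {name}")
```

A check like this could run over READMEs and workflow files after the rename; it only detects the numbered top-level names, so numbered subdirectories under an already-renamed parent would need a broader pattern.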

HyperPod Console dependency (hard blocker)

The SageMaker HyperPod console (more specifically, its CloudFormation templates, here and here) reads lifecycle scripts directly from this repository using the current directory paths (e.g., 1.architectures/5.sagemaker-hyperpod/...). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.

The proposed approach uses releases as stable reference points to decouple the service from main branch paths during the transition:

| Step | Action | Outcome |
| --- | --- | --- |
| 1. Create a release | Tag the current state of awsome-distributed-training (e.g., v2.0.0-pre-reorg) before any renames | Provides a permanent, immutable snapshot with the old directory structure that the service can pin to |
| 2. Update service → pinned release | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of main | Service is decoupled from main; renames on main no longer affect production |
| 3. Reorganize the repository | Perform all renames, merges, and restructuring on main | main now has the new directory structure; the pinned release is unaffected |
| 4. Update service → new paths | HyperPod team updates the console to reference the new paths on main (or a post-reorg release tag) | Service is back on main with the new structure |

This approach ensures zero downtime for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
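
The zero-downtime argument can be checked with a toy model of the four steps. The sketch below is an assumption-laden simplification: it treats each Git ref as a set of directory paths and the service as a single (ref, path) pointer, which is not how the console actually resolves lifecycle scripts. All names (`simulate_transition`, `skip_pinning`) are hypothetical.

```python
OLD_PATH = "1.architectures/5.sagemaker-hyperpod"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm"
TAG = "v2.0.0-pre-reorg"  # example tag name from the RFC


def simulate_transition(skip_pinning: bool = False) -> list[tuple[str, str, bool]]:
    """Replay the four steps; record (step, service ref, path resolves?)."""
    refs = {"main": {OLD_PATH}}  # ref name -> paths that exist at that ref
    service = {"ref": "main", "path": OLD_PATH}  # what the console reads
    log = []

    def record(step: str) -> None:
        resolves = service["path"] in refs[service["ref"]]
        log.append((step, service["ref"], resolves))

    # Step 1: the tag snapshots main before any renames.
    refs[TAG] = set(refs["main"])
    record("1. create release")

    # Step 2: the service is pinned to the tag (optionally skipped to
    # show what goes wrong without it).
    if not skip_pinning:
        service["ref"] = TAG
    record("2. pin service to release")

    # Step 3: main is reorganized; the tag keeps the old layout.
    refs["main"] = {NEW_PATH}
    record("3. reorganize main")

    # Step 4: the service moves to the new path on main.
    service.update(ref="main", path=NEW_PATH)
    record("4. point service at new paths")
    return log


if __name__ == "__main__":
    for step, ref, ok in simulate_transition():
        print(f"{step:32} ref={ref:18} resolves={ok}")
```

Running the model with `skip_pinning=True` makes the path unresolvable at step 3, which is why pinning is treated as a hard blocker for the renames.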

Alternative: Migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup

A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into aws/sagemaker-hyperpod-cluster-setup, which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:

  • Eliminate the cross-repo dependency — the scripts and the templates that reference them would live in the same repo, owned by the same team.
  • Decouple this repo from production service concerns: awsome-distributed-training becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
  • Make future reorganizations safe — no need to coordinate with the HyperPod team for future directory changes in this repo.

If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to sagemaker-hyperpod-cluster-setup → update CloudFormation templates to reference the new location → reorganize this repo freely.

Migration plan (high level)

  1. Create a tracking issue for every external document/workshop that references current paths.
  2. Create a release (e.g., v2.0.0-pre-reorg) to snapshot the current directory structure.
  3. Coordinate with HyperPod service team to update console CloudFormation templates to reference the release tag instead of main (hard blocker — steps 4–5 cannot proceed until this is deployed).
  4. Perform all renames via git mv in a single PR to preserve history.
  5. Merge awsome-inference content in a follow-up PR (or same PR if manageable).
  6. Update all internal README links, CI path filters, and Makefile references.
  7. Coordinate with HyperPod service team to update console references from the release tag to the new paths on main.
  8. Archive awsome-inference with a pointer to this repo.
  9. Update external docs/workshops.

Alternatives Considered

  1. Keep numbering, just merge inference: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.

  2. Merge both repos into a new repo with a new name: Avoids breaking existing links to this repo, but loses the GitHub star count, issue history, and contributor graph. The awsome-distributed-training name is well-known enough to keep.

  3. Monorepo with workspaces (training/ and inference/ at root level): Considered splitting at the root rather than under test_cases/. Rejected because architectures, AMIs, and validation tooling are shared across training and inference — only the test cases themselves differ.

  4. Do nothing: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.

Feedback Period

4 weeks

CC List
