[RFC]: Repository reorganization — remove numbering, merge awsome-inference, split test_cases into training/inference

## Motivation

The current `awsome-distributed-training` repository has grown to cover a broad set of distributed ML workloads on AWS, but the directory layout carries legacy decisions that make it harder to navigate and contribute to:

1. **Numbered prefixes impose a rigid ordering** that no longer reflects how users discover content. Directories like `1.architectures/`, `2.ami_and_containers/`, `3.test_cases/` suggest a sequential workflow, but most users land directly on a specific test case or architecture via search or a shared link. The numbering also creates friction when adding new top-level directories (what number should it get?).

2. **Training and inference are split across two separate repositories.** The [`awsome-inference`](https://github.com/aws-samples/awsome-inference) repo covers inference workloads (SGLang, TRT-LLM, NIMs, Ray Serve, Dynamo, etc.) with its own infrastructure and project structure. Maintaining two repos leads to duplicated IaC (both have VPC/EKS setup), divergent conventions, and a fragmented contributor/user experience. Users doing both training *and* inference on the same cluster shouldn't need two repos.

3. **The `test_cases/` directory contains only training workloads** today, with no clear place for inference examples. As inference content grows, we need an explicit separation.

## Proposed Change

### 1. Remove numeric prefixes from all directories

| Current | Proposed |
|---|---|
| `0.docs/` | `docs/` |
| `1.architectures/` | `architectures/` |
| `2.ami_and_containers/` | `ami_and_containers/` |
| `3.test_cases/` | `test_cases/` |
| `4.validation_and_observability/` | `validation_and_observability/` |
| `micro-benchmarks/` | `micro-benchmarks/` (unchanged) |

Also remove numbering from subdirectories:

| Current | Proposed |
|---|---|
| `architectures/0.common/` | `architectures/common/` |
| `architectures/1.vpc_network/` | `architectures/vpc_network/` |
| `architectures/2.aws-parallelcluster/` | `architectures/aws-parallelcluster/` |
| ... | ... |
| `validation_and_observability/1.pytorch-env-validation/` | `validation_and_observability/pytorch-env-validation/` |
| ... | ... |

Additionally, to make explicit that `5.sagemaker-hyperpod` covers HyperPod with Slurm, we rename it to `sagemaker-hyperpod-slurm`.
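
The renaming rule in the tables above can be sketched as a small helper. This is a minimal illustration, not tooling that exists in the repository: the function name, the regular expression, and the choice to leave names such as `3.10-env` untouched are assumptions. It is meant for directory paths; applying it to file names would also strip prefixes from numbered scripts, which this RFC does not propose.

```python
import re

# Matches an ordering prefix such as "1." or "12." at the start of a path
# component, but only when a non-digit follows, so a name that begins with
# a version number (e.g. "3.10-env") is left alone.
_NUMERIC_PREFIX = re.compile(r"^\d+\.(?=\D)")

# Renames that go beyond prefix removal (taken from the RFC text).
_SPECIAL_CASES = {"5.sagemaker-hyperpod": "sagemaker-hyperpod-slurm"}


def strip_numeric_prefixes(path: str) -> str:
    """Return `path` with the numeric prefix removed from every component."""
    trailing_slash = path.endswith("/")
    components = [c for c in path.split("/") if c]
    renamed = [
        _SPECIAL_CASES.get(c, _NUMERIC_PREFIX.sub("", c)) for c in components
    ]
    return "/".join(renamed) + ("/" if trailing_slash else "")


if __name__ == "__main__":
    for old in ("1.architectures/0.common/", "micro-benchmarks/"):
        print(old, "->", strip_numeric_prefixes(old))
```

Keeping the special cases in a separate dictionary makes the one non-mechanical rename easy to review on its own.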

### 2. Merge `awsome-inference` into this repository

Content from [`aws-samples/awsome-inference`](https://github.com/aws-samples/awsome-inference) would be absorbed:

| awsome-inference source | Destination in this repo |
|---|---|
| `1.infrastructure/` | `architectures/` (merge with existing infra, deduplicate VPC/EKS) |
| `2.projects/*` | `test_cases/inference/<project>/` |
| `3.use-cases/*` | `test_cases/inference/<use-case>/` or a top-level `use-cases/` if distinct enough |

The `awsome-inference` repo would be archived after the merge, with the README pointing here.
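
As a rough sketch of the mapping table above, the function below translates a path in `awsome-inference` to its proposed destination. The function name and the `use_cases_top_level` flag are hypothetical; the flag only models the still-open choice for `3.use-cases/`. The project directory `sglang` in the usage lines is an assumed name. The `1.infrastructure/` case returns a path under `architectures/`, but the actual merge would still need manual deduplication of the VPC/EKS templates.

```python
from pathlib import PurePosixPath


def map_inference_path(src: str, use_cases_top_level: bool = False) -> str:
    """Map a path inside awsome-inference to its proposed location here."""
    parts = PurePosixPath(src).parts
    if not parts:
        raise ValueError("empty path")
    head, rest = parts[0], parts[1:]

    if head == "1.infrastructure":
        base = ("architectures",)
    elif head == "2.projects":
        base = ("test_cases", "inference")
    elif head == "3.use-cases":
        # Open question in the RFC: nest under inference/ or use a
        # top-level use-cases/ directory.
        base = ("use-cases",) if use_cases_top_level else ("test_cases", "inference")
    else:
        raise ValueError(f"no mapping defined for top-level entry {head!r}")

    return str(PurePosixPath(*base, *rest))


if __name__ == "__main__":
    print(map_inference_path("2.projects/sglang/README.md"))
    print(map_inference_path("3.use-cases/example", use_cases_top_level=True))
```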

### 3. Split `test_cases/` into `training/` and `inference/`

```
test_cases/
├── training/
│   ├── pytorch/
│   │   ├── FSDP/
│   │   ├── ddp/
│   │   ├── deepspeed/
│   │   ├── torchtitan/
│   │   ├── nvrx/
│   │   ├── picotron/
│   │   └── ...
│   ├── megatron/
│   │   ├── megatron-lm/
│   │   ├── nemo/
│   │   └── ...
│   └── jax/
│       └── ...
├── inference/
│   ├── sglang/
│   ├── trtllm/
│   ├── nims/
│   ├── dynamo/
│   ├── ray-service/
│   ├── triton-trtllm-sagemaker/
│   └── ...
└── README.md
```
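
One way to keep this split from eroding over time is a lightweight layout check, for example run in CI over the list of tracked files. The sketch below is an assumption about how such a check could look, not an existing workflow; the function name and the sample file paths are hypothetical.

```python
from pathlib import PurePosixPath

ALLOWED_WORKLOADS = {"training", "inference"}
ALLOWED_TOP_LEVEL_FILES = {"README.md"}


def misplaced_paths(paths):
    """Return paths under test_cases/ that sit outside training/ and inference/."""
    misplaced = []
    for p in paths:
        parts = PurePosixPath(p).parts
        # Only entries strictly below test_cases/ are checked.
        if len(parts) < 2 or parts[0] != "test_cases":
            continue
        if len(parts) == 2 and parts[1] in ALLOWED_TOP_LEVEL_FILES:
            continue
        if parts[1] in ALLOWED_WORKLOADS:
            continue
        misplaced.append(p)
    return misplaced


if __name__ == "__main__":
    tracked = [
        "test_cases/training/pytorch/FSDP/README.md",
        "test_cases/README.md",
        "test_cases/pytorch/ddp/train.py",  # old location, should be flagged
    ]
    print(misplaced_paths(tracked))
```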

### Proposed final top-level structure

```
awsome-distributed-training/
├── architectures/
├── ami_and_containers/
├── test_cases/
│   ├── training/
│   └── inference/
├── validation_and_observability/
├── micro-benchmarks/
├── docs/
└── README.md
```

## Impact & Scope

- **Affected areas**: Every directory, all internal links, README references, CI workflows, external documentation, blog posts, and workshop materials that reference current paths.
- **Breaking changes**: Yes — all directory paths change. This is a one-time, coordinated change.
- **Migration needed**: Yes.
  - GitHub does **not** redirect old file or directory paths after a rename within a repo; only repository renames and transfers are redirected automatically.
  - All cross-references in READMEs, workshop guides, and external docs need updating.
  - CI workflows (`.github/workflows/`) that use path filters will need path updates.
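  - Git tracks content rather than renames, so `git mv` is only a convenience: history stays followable (`git log --follow`, `git blame`) because Git detects renames by content similarity. We should move files unchanged rather than re-creating or heavily editing them in the same commit.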
  - Consider adding a compatibility script or symlinks in a transitional release (though symlinks don't render on GitHub).
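
To scope the link updates listed above, a simple scan for old top-level directory names can list where stale references remain in READMEs or workflow files. The sketch below is illustrative only; the function name and the sample text are assumptions. It only looks for the five numbered top-level directories, so numbered subdirectories referenced under an already-renamed parent would need a second pass.

```python
import re

# Old top-level directory names from the rename table in this RFC.
OLD_TOP_LEVEL = (
    "0.docs",
    "1.architectures",
    "2.ami_and_containers",
    "3.test_cases",
    "4.validation_and_observability",
)

# An old name counts only when it is not glued to a longer identifier on
# either side (so "11.architectures" or "1.architectures-old" are ignored).
_STALE = re.compile(
    r"(?<![\w.-])(?:" + "|".join(re.escape(d) for d in OLD_TOP_LEVEL) + r")(?![\w-])"
)


def find_stale_references(text):
    """Return (line_number, old_directory) pairs for each stale reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _STALE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = "See [VPC](1.architectures/1.vpc_network/README.md).\nNew: architectures/common"
    print(find_stale_references(sample))
```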

### HyperPod Console dependency (hard blocker)

The **SageMaker HyperPod console reads lifecycle scripts directly from this repository using the current directory paths** (e.g., `1.architectures/5.sagemaker-hyperpod/...`). More specifically, the references live in its CloudFormation templates for [EKS](https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/main/eks/cloudformation/lifecycle-script-template.yaml#L14) and [Slurm](https://github.com/aws/sagemaker-hyperpod-cluster-setup/blob/main/slurm/cloudformation/slurm-lifecycle-script-template.yaml#L28). Renaming these directories would break the production HyperPod service's ability to locate and execute lifecycle scripts.

The proposed approach uses **releases as stable reference points** to decouple the service from `main` branch paths during the transition:

| Step | Action | Outcome |
|---|---|---|
| **1. Create a release** | Tag the current state of `awsome-distributed-training` (e.g., `v2.0.0-pre-reorg`) before any renames | Provides a **permanent, immutable snapshot** with the old directory structure that the service can pin to |
| **2. Update service → pinned release** | HyperPod team updates the console CloudFormation templates to reference lifecycle scripts from the release tag instead of `main` | Service is **decoupled from `main`** — renames on `main` no longer affect production |
| **3. Reorganize the repository** | Perform all renames, merges, and restructuring on `main` | `main` now has the new directory structure; the pinned release is unaffected |
| **4. Update service → new paths** | HyperPod team updates the console to reference the new paths on `main` (or a post-reorg release tag) | Service is back on `main` with the new structure |

This approach ensures **zero downtime** for HyperPod users — at no point are the paths the service references invalid. The release tag serves as a safety net: even if step 4 is delayed, the service continues to work against the pinned release.
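
The ordering argument can be checked with a toy model. The sketch below is a simplified simulation, not a description of how the console actually fetches scripts: each Git ref is modeled as a set of directories, the service as a `(ref, path)` pair, and all names are illustrative. It records whether the service's reference resolves after each step, and the `pin_first=False` variant shows what happens if the rename lands before the service is pinned.

```python
OLD_PATH = "1.architectures/5.sagemaker-hyperpod"
NEW_PATH = "architectures/sagemaker-hyperpod-slurm"
TAG = "v2.0.0-pre-reorg"


def simulate_transition(pin_first=True):
    """Return (step, ref, resolves) tuples for the transition sequence."""
    refs = {"main": {OLD_PATH}}   # ref name -> directories present at that ref
    service = ("main", OLD_PATH)  # what the console templates point at
    history = []

    def record(step):
        ref, path = service
        history.append((step, ref, path in refs[ref]))

    if pin_first:
        # Step 1: the tag is a frozen copy of main before any rename.
        refs[TAG] = frozenset(refs["main"])
        record("1. create release")
        # Step 2: the service switches from main to the tag.
        service = (TAG, OLD_PATH)
        record("2. pin service to release")

    # Step 3: main gets the new layout; the tag is untouched.
    refs["main"] = {NEW_PATH}
    record("3. reorganize main")

    # Step 4: the service moves to the new path on main.
    service = ("main", NEW_PATH)
    record("4. point service at new paths")
    return history


if __name__ == "__main__":
    for step, ref, ok in simulate_transition():
        print(f"{step:32s} ref={ref:18s} resolves={ok}")
```

With `pin_first=True` every recorded step resolves. With `pin_first=False` the reference is broken between steps 3 and 4, which is the window the release tag is meant to close.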

**Alternative: Migrate lifecycle scripts to `sagemaker-hyperpod-cluster-setup`**

A cleaner long-term option is to move the lifecycle scripts out of this repository entirely and into [`aws/sagemaker-hyperpod-cluster-setup`](https://github.com/aws/sagemaker-hyperpod-cluster-setup/tree/main), which is the official repo for HyperPod cluster setup and already contains the CloudFormation templates that reference them. This would:

- **Eliminate the cross-repo dependency** — the scripts and the templates that reference them would live in the same repo, owned by the same team.
- **Decouple this repo from production service concerns** — `awsome-distributed-training` becomes purely a collection of reference examples and best practices, not a dependency of a production AWS service.
- **Make future reorganizations safe** — no need to coordinate with the HyperPod team for future directory changes in this repo.

If this approach is chosen, the migration sequence becomes: migrate lifecycle scripts to `sagemaker-hyperpod-cluster-setup` → update CloudFormation templates to reference the new location → reorganize this repo freely.

### Migration plan (high level)

1. Create a tracking issue for every external document/workshop that references current paths.
2. **Create a release** (e.g., `v2.0.0-pre-reorg`) to snapshot the current directory structure.
3. **Coordinate with HyperPod service team** to update console CloudFormation templates to reference the release tag instead of `main` (hard blocker — steps 4–5 cannot proceed until this is deployed).
4. Perform all renames via `git mv` in a single PR to preserve history.
5. Merge `awsome-inference` content in a follow-up PR (or in the same PR if manageable).
6. Update all internal README links, CI path filters, and Makefile references.
7. **Coordinate with HyperPod service team** to update console references from the release tag to the new paths on `main`.
8. Archive `awsome-inference` with a pointer to this repo.
9. Update external docs/workshops.
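
For step 4, the order of the `git mv` calls matters once nested directories are involved: after a parent is renamed, its children must be addressed under the new parent path. The sketch below plans the moves parent-first. It is a hypothetical helper, assumes every numbered ancestor of a listed directory is itself in the list, and leaves out the `sagemaker-hyperpod-slurm` special case for brevity.

```python
import re

# Ordering prefix such as "1." at the start of a component (not "3.10-...").
_NUMERIC_PREFIX = re.compile(r"^\d+\.(?=\D)")


def plan_git_mv(directories):
    """Return ordered (src, dst) pairs for `git mv`, parents before children."""
    cleaned = [d.strip("/") for d in directories]
    moves = []
    # Shallow paths first, so each child is addressed under its renamed parent.
    for old in sorted(cleaned, key=lambda d: (d.count("/"), d)):
        parts = old.split("/")
        new_parts = [_NUMERIC_PREFIX.sub("", p) for p in parts]
        src = "/".join(new_parts[:-1] + parts[-1:])  # renamed parents, old leaf
        dst = "/".join(new_parts)
        if src != dst:
            moves.append((src, dst))
    return moves


if __name__ == "__main__":
    dirs = ["1.architectures/0.common/", "1.architectures/", "micro-benchmarks/"]
    for src, dst in plan_git_mv(dirs):
        print(f"git mv {src} {dst}")
```

The printed commands can be reviewed before anything is executed, which fits a one-time, coordinated rename better than running the moves directly.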

## Alternatives Considered

1. **Keep numbering, just merge inference**: Solves fragmentation but retains the awkward numbering convention. Since we're already doing a disruptive rename for the merge, removing numbers at the same time minimizes total disruption.

2. **Merge both repos into a new repo with a new name**: Avoids breaking existing links to *this* repo, but loses the GitHub star count, issue history, and contributor graph. The `awsome-distributed-training` name is well-known enough to keep.

3. **Monorepo with workspaces (training/ and inference/ at root level)**: Considered splitting at the root rather than under `test_cases/`. Rejected because architectures, AMIs, and validation tooling are shared across training and inference — only the test cases themselves differ.

4. **Do nothing**: Maintain two repos, accept the divergence. This gets worse over time as more inference content is added and infrastructure is duplicated.

## Feedback Period

4 weeks

## CC List
