Pull requests: awslabs/awsome-distributed-training
Pull requests list
fix(verl/rlvr): Add EFA host mounts, fix data format bug, and add optimized GRPO recipe
#1062
opened Apr 9, 2026 by
paragao
Contributor
Utils script to create users on all nodes (login, controller and compute), run from any node
#1061
opened Apr 8, 2026 by
mayankgupta14
Collaborator
7 tasks
Bump transformers from 4.53.0 to 5.0.0rc3 in /3.test_cases/pytorch/distillation/src
dependencies
Pull requests that update a dependency file
python
Pull requests that update python code
#1060
opened Apr 8, 2026 by
dependabot
bot
Bump transformers from 4.53.0 to 5.0.0rc3 in /3.test_cases/pytorch/FSDP/src
dependencies
Pull requests that update a dependency file
python
Pull requests that update python code
#1059
opened Apr 8, 2026 by
dependabot
bot
Bump transformers from 4.48.0 to 5.0.0rc3 in /3.test_cases/pytorch/nvrx
dependencies
Pull requests that update a dependency file
python
Pull requests that update python code
#1057
opened Apr 8, 2026 by
dependabot
bot
Add veRL GRPO training recipe for gpt-oss-20b on g5.12xlarge
#1054
opened Apr 4, 2026 by
nkumaraws
Contributor
Bump requests from 2.32.3 to 2.33.0 in /3.test_cases/pytorch/nvrx
dependencies
Pull requests that update a dependency file
python
Pull requests that update python code
#1036
opened Mar 25, 2026 by
dependabot
bot
Add V-JEPA 2 (Meta FAIR) distributed training test case
#1035
opened Mar 23, 2026 by
paragao
Contributor
feat: Add observability IAM permissions for RIG cluster execution role
#1030
opened Mar 20, 2026 by
Madhubalasri-B
Collaborator
Add DeepSpeed CI regression tests for QLoRA and GPT-103B
#1029
opened Mar 20, 2026 by
paragao
Contributor
Add NeMo RL GRPO training on P5en with EFA RDMA
#1025
opened Mar 17, 2026 by
dmvevents
5 of 7 tasks
fix: overhaul CI workflows for FSDP regression tests
#1024
opened Mar 17, 2026 by
paragao
Contributor
Updating hyperpod-elastic-agent (HPEA) to v1.1.2 to support torch v2.6+
#1022
opened Mar 13, 2026 by
aravneelaws
Contributor
7 tasks done
Add OSMO AMR Navigation test case
#1018
opened Mar 12, 2026 by
KeitaW
Collaborator
1 of 3 tasks
docs: add Instance Compatibility Guide with per-test-case configuration tables
#1017
opened Mar 11, 2026 by
nkumaraws
Contributor
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS
#1010
opened Mar 9, 2026 by
dmvevents
6 tasks
Add optional Training Plan support for HyperPod instance groups
#1004
opened Feb 26, 2026 by
newabdosheham
Updating CF stack for GB200 local zone deployments
#968
opened Feb 17, 2026 by
KeitaW
Collaborator
Syntax improvements and code quality enhancements for EFA node exporter
#966
opened Feb 17, 2026 by
KeitaW
Collaborator