Commit b4c4488

Jwilber/update esm2 native te.yaml (#1197)
Update the `esm2_native_te` config to have better comments and structure, with better support for overriding script keys and for specifying the observability fields to track. Also propagate `cfg.num_nodes` to the `parallelism` key in `LeptonJobUserSpec` (this was unclear in the Python SDK instructions).

## Summary by CodeRabbit

- New Features
  - Expanded recipe surface with explicit identifiers (variant, framework, precision, parallelism strategy), TE/FP8 toggles, `wandb_init_args`, `script_args`, a products list, and `run_script` orchestration.
- Improvements
  - Script-level controls for train steps, micro-batch size, warmup, and checkpointing; run commands receive tracking and checkpoint parameters.
  - Job parallelism now follows configured node/device counts; job names are prefixed for clearer tracing; enhanced experiment tracking (mode/project/group/job_type/name).
- Removed
  - Removed a legacy top-level config key from the base configuration.
- Chores
  - Removed non-essential debug shell output.

Signed-off-by: Jared Wilber <jwilber@nvidia.com>
1 parent: c5455dd

5 files changed, 130 additions and 42 deletions

ci/lepton/model_convergence/configs/base.yaml

0 additions and 2 deletions

```diff
@@ -1,5 +1,3 @@
-node_group_name: nv-int-multiteam-nebius-h200-01
-
 container:
   image: nvcr.io/nvidia/pytorch:25.06-py3
   registry_auth: lepton-nvidia
```

ci/lepton/model_convergence/configs/recipes/esm2_accelerate_te.yaml

55 additions and 14 deletions

```diff
@@ -3,47 +3,88 @@ defaults:
   - /base
   - _self_
 
-# lepton info
+branch: jwilber/add-accelerate-l1-3b-config
+
+############################################################
+# lepton job info
+############################################################
 node_group: yo-bom-lepton-001
 num_nodes: 2
 device_type: gpu
-num_devices: 2
+num_devices: 8
 gpu_type: h100-sxm
-total_gpus: ${multiply:${num_devices},${num_nodes}}
 resource_shape: "${device_type}.${num_devices}x${gpu_type}"
 
+############################################################
 # recipe identifiers
+# mostly used for logging and observability
+############################################################
 recipe_subdir: esm2_accelerate_te
 model_type: esm2
+variant: train  # train, finetune
+
+# Core identifiers for filtering
+framework: native  # native, accelerate
+parallelism_strategy: fsdp2  # ddp, fsdp2, mfsdp
+precision: fp8  # likely bf16 or fp8
+te_enabled: true
+fp8_enabled: true
+
+# Catchall for additional features/configs
+extras: []  # e.g. [thd]
+
+############################################################
+# wandb info (total_gpus used for group name)
+############################################################
+# `total_gpus` calculated from lepton job info above
+total_gpus: ${multiply:${num_devices},${num_nodes}}
 
-# wandb
 wandb_init_args:
   project: "test_convergence__recipes__${sanitize:${branch}}"
-  group: "${model_type}__${task_cmd}__${total_gpus}__${sanitize:${gpu_type}}"
+  group: "${model_type}__${task_cmd}__${total_gpus}gpus__${sanitize:${gpu_type}}"
   job_type: "${recipe_subdir}"
   name: null
 
+############################################################
+# task commands
+# shared across all products (if not explicitly overridden)
+############################################################
+# task_cmd: train_fsdp2  # mfsdp
+task_cmd: train
+
+# script overrides
+# these should match the keys in the recipe's config file
+# model_tag: nvidia/esm2_t36_3B_UR50D
+
+micro_batch_size: 4
+# num_warmup_steps: 20_000
 # config overrides
 trainer:
   report_to: "wandb"
 
-# train specific commands
-task_cmd: train
-stop_after_n_steps: 10
+stop_after_n_steps: 100
 
-# configs to run
+############################################################
+# Each product is a different config to run, alongside
+# config-specific arguments. Must have a `wandb_name`.
+############################################################
 products:
-  - config: L0_sanity
+  - config: L1_3B
+    acc_config: default
     wandb_name: "${config}__${now:%Y%m%d-%H%M%S}__${gitsha:}"
 
-# training script to run
+############################################################
+# run script
+# This gets called right after `checkout_script` in the base config.
+############################################################
 run_script: |
-  accelerate launch --config_file accelerate_config/default.yaml \
+  accelerate launch --config_file accelerate_config/${acc_config}.yaml \
     ${task_cmd}.py \
     --config-name=${config} \
     stop_after_n_steps=${stop_after_n_steps} \
-    wandb_init_args.mode=${wandb_init_args.mode} \
+    +wandb_init_args.mode=${wandb_init_args.mode} \
     +wandb_init_args.project=${wandb_init_args.project} \
     +wandb_init_args.group=${wandb_init_args.group} \
     +wandb_init_args.job_type=${wandb_init_args.job_type} \
-    wandb_init_args.name=${wandb_name}
+    wandb_init_args.name=${wandb_name} \
+    trainer.per_device_train_batch_size=${micro_batch_size}
```
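The `run_script` above switches several overrides to Hydra's `+` prefix. In Hydra's command-line grammar, a bare `key=value` override must target a key that already exists in the composed config, while `+key=value` appends a new key. A rough plain-Python sketch of that rule (the `apply_override` helper is hypothetical, not Hydra code):

```python
def apply_override(cfg: dict, arg: str) -> None:
    """Apply one Hydra-style 'a.b=v' or '+a.b=v' override to a nested dict."""
    key, value = arg.split("=", 1)
    append = key.startswith("+")     # '+' means: add a key the config lacks
    key = key.lstrip("+")
    *parents, leaf = key.split(".")
    node = cfg
    for p in parents:
        node = node.setdefault(p, {})
    if not append and leaf not in node:
        # Hydra rejects a bare override of a missing key and suggests '+key'
        raise KeyError(f"unknown key {key!r}; use '+{key}' to add it")
    node[leaf] = value

cfg = {"wandb_init_args": {"name": None}}
apply_override(cfg, "wandb_init_args.name=my-run")   # key exists: plain override
apply_override(cfg, "+wandb_init_args.mode=online")  # key absent: needs the '+'
print(cfg["wandb_init_args"])
```

This is why `wandb_init_args.name` stays unprefixed (the recipe config defines `name: null`) while `mode`, `project`, `group`, and `job_type` need `+`.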

ci/lepton/model_convergence/configs/recipes/esm2_native_te.yaml

73 additions and 18 deletions

```diff
@@ -3,46 +3,101 @@ defaults:
   - /base
   - _self_
 
-# lepton info
+############################################################
+# lepton job info
+############################################################
 node_group: yo-bom-lepton-001
 num_nodes: 1
 device_type: gpu
 num_devices: 2
 gpu_type: h100-sxm
-total_gpus: ${multiply:${num_devices},${num_nodes}}
 resource_shape: "${device_type}.${num_devices}x${gpu_type}"
 
+############################################################
 # recipe identifiers
+# mostly used for logging and observability
+############################################################
 recipe_subdir: esm2_native_te
 model_type: esm2
+variant: train  # train, finetune
+
+# Core identifiers for filtering
+framework: native  # native, accelerate
+parallelism_strategy: fsdp2  # ddp, fsdp2, mfsdp
+precision: fp8  # likely bf16 or fp8
+te_enabled: true
+fp8_enabled: true
+
+# Catchall for additional features/configs
+extras: []  # e.g. [thd]
+
+############################################################
+# wandb info (total_gpus used for group name)
+############################################################
+# `total_gpus` calculated from lepton job info above
+total_gpus: ${multiply:${num_devices},${num_nodes}}
 
-# wandb
 wandb_init_args:
   project: "test_convergence__recipes__${sanitize:${branch}}"
-  group: "${model_type}__${task_cmd}__${total_gpus}__${sanitize:${gpu_type}}"
+  group: "${model_type}__${task_cmd}__${total_gpus}gpus__${sanitize:${gpu_type}}"
   job_type: "${recipe_subdir}"
   name: null
 
-# train specific commands
-task_cmd: train_fsdp2 # mfsdp
-num_train_steps: 100
+############################################################
+# task commands
+# shared across all products (if not explicitly overridden)
+############################################################
+
+# script overrides
+# these should match the keys in the recipe's config file
+model_tag: nvidia/esm2_t36_3B_UR50D
+# task_cmd: train_fsdp2  # mfsdp
+num_train_steps: 10_000
+micro_batch_size: 16
+num_warmup_steps: 20_000
 
-# configs to run
+# checkpoint controls
+ckpt_dir: ""
+save_checkpoints: false
+save_final_model: false
+resume_from_checkpoint: false
+use_distributed_checkpoint_fsdp2: false
+save_every_n_steps: 50
+
+############################################################
+# Each product is a different config to run, alongside
+# config-specific arguments. Must have a `wandb_name`.
+############################################################
 products:
-  - config: L0_sanity
+  - config: L1_3B
+    task_cmd: train_fsdp2
+    wandb_name: "${config}__${now:%Y%m%d-%H%M%S}__${gitsha:}"
+  - config: L1_3B
+    task_cmd: train_mfsdp
+    wandb_name: "${config}__${now:%Y%m%d-%H%M%S}__${gitsha:}"
+    micro_batch_size: 2
+  - config: L1_3B
+    task_cmd: train_ddp
     wandb_name: "${config}__${now:%Y%m%d-%H%M%S}__${gitsha:}"
-    # resource_shape: gpu.2xh200
-  # # - config: L1_3B
-  # #   resource_shape: gpu.2xh200
-  # # - config: L1_15B_perf_test
 
-# training script to run
+############################################################
+# run script
+# This gets called right after `checkout_script` in the base config.
+############################################################
 run_script: |
   torchrun ${task_cmd}.py \
     --config-name ${config}.yaml \
-    num_train_steps=${num_train_steps} \
-    wandb_init_args.mode=${wandb_init_args.mode} \
-    +wandb_init_args.project=${wandb_init_args.project} \
+    +wandb_init_args.mode=${wandb_init_args.mode} \
+    wandb_init_args.project=${wandb_init_args.project} \
     +wandb_init_args.group=${wandb_init_args.group} \
     +wandb_init_args.job_type=${wandb_init_args.job_type} \
-    wandb_init_args.name=${wandb_name}
+    wandb_init_args.name=${wandb_name} \
+    num_train_steps=${num_train_steps} \
+    dataset.micro_batch_size=${micro_batch_size} \
+    lr_scheduler_kwargs.num_warmup_steps=${num_warmup_steps} \
+    checkpoint.ckpt_dir=${ckpt_dir} \
+    checkpoint.save_final_model=${save_final_model} \
+    checkpoint.resume_from_checkpoint=${resume_from_checkpoint} \
+    checkpoint.save_every_n_steps=${save_every_n_steps} \
+    +checkpoint.save_checkpoints=${save_checkpoints} \
+    +checkpoint.use_distributed_checkpoint_fsdp2=${use_distributed_checkpoint_fsdp2}
```
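The recipe leans on OmegaConf-style interpolation (`${var}`, plus custom resolvers like `${multiply:...}` and `${sanitize:...}`), and on per-product keys overriding the shared script keys. A plain-Python stand-in for what those pieces resolve to in this file (the dict-merge semantics are an assumption about how the launcher combines shared keys with each `products` entry):

```python
# Values from the "lepton job info" section of esm2_native_te.yaml
num_nodes, num_devices = 1, 2
device_type, gpu_type = "gpu", "h100-sxm"

# total_gpus: ${multiply:${num_devices},${num_nodes}}
total_gpus = num_devices * num_nodes

# resource_shape: "${device_type}.${num_devices}x${gpu_type}"
resource_shape = f"{device_type}.{num_devices}x{gpu_type}"

# Each `products` entry is (assumed to be) merged over the shared script keys,
# so per-product values win -- e.g. the train_mfsdp entry's micro_batch_size: 2
# overrides the shared micro_batch_size: 16.
shared = {"micro_batch_size": 16, "num_train_steps": 10_000}
product = {"config": "L1_3B", "task_cmd": "train_mfsdp", "micro_batch_size": 2}
merged = {**shared, **product}

print(total_gpus, resource_shape, merged["micro_batch_size"])
```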

ci/lepton/model_convergence/scripts/launch_job.py

2 additions and 2 deletions

```diff
@@ -106,7 +106,7 @@ def launch_single_job(client, cfg: DictConfig):
             command=command,
         ),
         completions=cfg.num_nodes,
-        parallelism=1,
+        parallelism=cfg.num_nodes,
         envs=env_vars,
         image_pull_secrets=[cfg.container.registry_auth],
         mounts=mounts,
@@ -182,7 +182,7 @@ def main(cfg: DictConfig):
 
         # Create job name as base_recipe_name-config (e.g., "geneformer-10m")
         config_name = product_dict["config"].replace("_", "-").replace("/", "-")
-        product_cfg.job_name = f"{base_recipe_name}-{config_name}".lower()
+        product_cfg.job_name = f"convtest-{base_recipe_name}-{config_name}".lower()
 
         print(f"\n[{i}/{len(cfg.products)}] Launching: {product_cfg.job_name}")
```

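The second hunk prefixes job names with `convtest-` for clearer tracing. A quick sketch of the resulting string handling, using the `L1_3B` product from this commit (the value of `base_recipe_name` is an assumed example; the script derives it elsewhere):

```python
# Assumed example value; in launch_job.py this comes from the recipe name.
base_recipe_name = "esm2-native-te"

# Job names must be DNS-friendly: swap '_' and '/' for '-', then lowercase.
config_name = "L1_3B".replace("_", "-").replace("/", "-")
job_name = f"convtest-{base_recipe_name}-{config_name}".lower()
print(job_name)  # → convtest-esm2-native-te-l1-3b
```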
ci/lepton/model_convergence/scripts/wrap_template.sh

0 additions and 6 deletions

```diff
@@ -20,12 +20,6 @@ __SCRIPT__
 RC=$?
 set -e
 
-echo "pwd"
-pwd
-
-echo "ls"
-ls
-
 echo "commit in bionemo-framework"
 (cd bionemo-framework && git log -1 || true)
 # Always grab the exact commit currently checked out in the framework repo
```
