
Add V-JEPA 2 (Meta FAIR) distributed training test case #1035

Open
paragao wants to merge 15 commits into main from feat/vjepa2-distributed-training

Contributor

@paragao paragao commented Mar 23, 2026

Summary

  • Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-parameter self-supervised video model as a new PyTorch distributed training test case
  • Includes Slurm (Pyxis/Enroot) and Kubernetes (PyTorchJob) deployment manifests
  • Benchmarked on 8x p5en.48xlarge (64x NVIDIA H200 GPUs)

What is V-JEPA 2?

V-JEPA 2 is Meta FAIR's self-supervised video model that learns visual representations by predicting masked video patches. It achieves state-of-the-art results on motion understanding and human action anticipation benchmarks. The ViT-g/16 variant has 1.03B encoder parameters.

Files Added

3.test_cases/pytorch/vjepa2/
├── vjepa2.Dockerfile                      # NVIDIA PyTorch 25.03 base (CUDA 12.8, Python 3.12)
├── README.md                              # Full walkthrough with benchmark results
├── slurm/
│   ├── benchmark_training.sbatch          # 200-iter benchmark (8 nodes)
│   ├── launch_training.sbatch             # Full 800-epoch pre-training
│   └── download_dataset.sbatch            # SSv2 dataset preparation
├── kubernetes/
│   └── vjepa2-benchmark.yaml              # PyTorchJob for EKS clusters
├── configs/
│   ├── benchmark-vitg-8nodes.yaml         # Quick benchmark config
│   └── pretrain-vitg-256px-16f.yaml       # Full pre-training config
└── scripts/
    ├── run_train.py                       # Thin srun-compatible launcher
    ├── generate_synthetic_dataset.py      # Synthetic video generator
    ├── prepare_ssv2.py                    # SSv2 CSV preparation
    ├── parse_benchmark.py                 # Log parser for throughput/MFU
    └── test_decord.py                     # Verify decord video loading

Key Technical Details

Launch pattern: V-JEPA 2 uses srun directly (not srun + torchrun). The run_train.py launcher calls app.vjepa.train.main() directly, which reads SLURM_LOCALID/SLURM_NTASKS/SLURM_PROCID for distributed setup. This avoids a bug in app/main.py where its subprocess launcher passes world_size=1 regardless of SLURM configuration.
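The mapping from Slurm's per-task variables to distributed-training ranks can be sketched as follows. This is a minimal illustration, not the code in run_train.py or app.vjepa.train; the DistEnv type and the dist_env_from_slurm helper are hypothetical names introduced here.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    """Hypothetical container for the values a distributed init needs."""
    rank: int        # global rank, taken from SLURM_PROCID
    world_size: int  # total number of tasks, taken from SLURM_NTASKS
    local_rank: int  # rank within the node, taken from SLURM_LOCALID


def dist_env_from_slurm(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Derive ranks from the variables srun exports for each task."""
    try:
        return DistEnv(
            rank=int(env["SLURM_PROCID"]),
            world_size=int(env["SLURM_NTASKS"]),
            local_rank=int(env["SLURM_LOCALID"]),
        )
    except KeyError as missing:
        # Outside srun these variables are absent, so fail loudly instead of
        # silently falling back to a single-process world.
        raise RuntimeError(f"{missing.args[0]} is not set; launch with srun") from None


if __name__ == "__main__":
    # Simulated environment: task 9 of 64 (8 nodes x 8 GPUs), second GPU on its node.
    fake = {"SLURM_PROCID": "9", "SLURM_NTASKS": "64", "SLURM_LOCALID": "1"}
    print(dist_env_from_slurm(fake))
```

Because srun starts one task per GPU and exports these variables to each task, no torchrun rendezvous layer is needed to learn the world size.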

Dataset: Supports both Something-Something v2 (SSv2) real data and synthetic generated videos for benchmarking.

Benchmark Results (8x p5en.48xlarge, 64x H200)

Metric             Value
-----------------  ------------------
Global batch size  1,536
Precision          BF16
Peak GPU memory    ~32.9 GB / 143 GB

Testing

Validated on ParallelCluster with 8x p5en.48xlarge nodes running Slurm + Pyxis/Enroot with EFA networking. Job ran 200 iterations to completion with all 64 ranks correctly initialized via NCCL over EFA.

paragao added 2 commits March 23, 2026 12:24
Add V-JEPA 2 (Meta FAIR) ViT-g/16 1B-param self-supervised video model as
a new PyTorch test case with Slurm and Kubernetes support.

Includes:
- Dockerfile based on nvcr.io/nvidia/pytorch:25.03-py3 (CUDA 12.8 + Python 3.12)
- Slurm sbatch scripts for benchmark (200 iters) and full pre-training (800 epochs)
- Kubernetes PyTorchJob manifest for EKS clusters
- Thin srun-compatible launcher (run_train.py) that calls app.vjepa.train.main()
  directly, avoiding the subprocess world_size=1 bug in app/main.py
- Synthetic dataset generator for benchmarking without SSv2 download
- SSv2 dataset preparation scripts and decord verification
- YAML configs for ViT-g/16 with DDP, BF16, and activation checkpointing
…ining

Add V-JEPA 2.1 (Meta FAIR) ViT-g/16 1B-param benchmark alongside the existing
V-JEPA 2 test case. V-JEPA 2.1 introduces Dense Predictive Loss, Deep
Self-Supervision (4 intermediate layers), doubled predictor depth (24 vs 12),
and image+video co-training with 50/50 rank split.

Includes:
- Dockerfile and Enroot container setup (shared base with V-JEPA 2)
- Slurm sbatch scripts with /workspace code overlay for latest vjepa2 repo
- Kubernetes PyTorchJob manifest for EKS clusters
- Synthetic image generator for co-training benchmarks
- run_train.py launcher using app.scaffold.main() for dynamic dispatch
- YAML configs with img_data, img_mask, and rank_ratio settings

Key discovery: the container must have the latest vjepa2 repo code (post
March 2026) for app/vjepa_2_1/ to be available. The sbatch scripts mount
updated code at /workspace to overlay the container's stale PYTHONPATH.
@paragao paragao force-pushed the feat/vjepa2-distributed-training branch from 11b8971 to 92abb8c Compare March 23, 2026 12:27
Collaborator

@KeitaW KeitaW left a comment


Review Batch 1/3 — Structure & Repository Hygiene

Thanks for this thorough contribution, Paulo! The utility scripts and READMEs are excellent quality. I have some structural and reproducibility findings below.

Significant code duplication between vjepa2/ and vjepa2.1/

These two directories share a large amount of identical code:

  • scripts/generate_synthetic_dataset.py — identical (same git blob 74f922445)
  • scripts/parse_benchmark.py — identical (same git blob 957b9efdf)
  • scripts/prepare_ssv2.py — identical (same git blob 633288d17)
  • scripts/test_decord.py — identical (same git blob 4881d1647)
  • scripts/run_train.py — nearly identical (V-JEPA 2.1 adds 4 lines)
  • Dockerfiles — nearly identical structure
  • Slurm sbatch scripts — same structure, differing only in paths/config references

The repo convention says to "extend the existing test case — add platform-specific subdirectories, parameterize scripts for additional models, or add configuration variants — rather than creating a parallel directory tree with duplicated Dockerfiles, training scripts, and utilities."

I'd suggest consolidating into a single vjepa2/ directory that supports both V-JEPA 2 and 2.1 via different configs. The run_train.py launcher already dispatches based on the app field in the config (vjepa vs vjepa_2_1), so both versions can share the same launcher, scripts, Dockerfile, and sbatch templates. The V-JEPA 2.1 additions (image co-training, synthetic image generator) would simply add to the existing directory.

Missing license headers on README and config files

Both README.md files and all 4 configs/*.yaml files are missing license headers. The Slurm scripts, Python files, K8s manifests, and Dockerfiles all have them, so this is just an oversight. I'd suggest adding the standard header as a YAML comment in configs and HTML comment in READMEs.

Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Comment thread 3.test_cases/pytorch/vjepa2/kubernetes/vjepa2-benchmark.yaml Outdated
Collaborator

@KeitaW KeitaW left a comment


Review Batch 2/3 — Deployment Pipeline

Comment thread 3.test_cases/pytorch/vjepa2/kubernetes/vjepa2-benchmark.yaml Outdated
Comment thread 3.test_cases/pytorch/vjepa2/vjepa2.Dockerfile Outdated
Collaborator

@KeitaW KeitaW left a comment


Review Batch 3/3 — Documentation Consistency


Things That Look Great

  • Comprehensive utility scripts: The synthetic data generators (video and image), SSv2 CSV preparer, benchmark log parser, and decord test script form a complete toolkit that makes this test case truly self-contained.
  • Excellent README documentation: Both READMEs walk through every step from dataset prep to result parsing, with clear architecture notes explaining the srun direct launch pattern and why app/main.py doesn't work with SLURM.
  • Smart launch pattern: Using app.scaffold.main() to dispatch based on the config's app field is elegant and avoids the world_size=1 bug in app/main.py.
  • Proper license headers on most files: Scripts, Dockerfiles, sbatch files, and K8s manifests all have the standard copyright header.
  • HyperPod auto-resume detection: The if [ -d "/opt/sagemaker_cluster" ] pattern in sbatch scripts correctly detects HyperPod clusters and enables auto-resume.
  • Both Slurm and Kubernetes deployment paths: Providing PyTorchJob manifests alongside Slurm scripts makes this accessible to EKS-based clusters too.
  • Well-structured config separation: Benchmark configs (200 iterations, no checkpointing) vs. full pre-training configs (800+ epochs, regular checkpoints) give users clear starting points for different use cases.
  • V-JEPA 2.1 comparison table: The feature comparison table in the V-JEPA 2.1 README clearly explains what changed between versions.

Comment thread 3.test_cases/pytorch/vjepa2/README.md Outdated
Collaborator

@KeitaW KeitaW left a comment


A few comments

paragao added 13 commits March 23, 2026 16:40
The Dockerfile-based container (pytorch:25.03-py3) ships NCCL 2.25 and
an older aws-ofi-nccl plugin that are incompatible with B200 EFA
networking.  The B200 scripts use a NeMo container with NCCL 2.29+ and
a matching OFI/EFA/libfabric stack instead, with V-JEPA dependencies
installed to shared storage and added to PYTHONPATH at runtime.

Benchmarking on B200 revealed that 5,000 synthetic samples caused frequent
data loader re-initialization between epochs, inflating V-JEPA 2.1 iteration
times by nearly 4x (15,300 ms vs 4,075 ms with 50K samples). V-JEPA 2 was less
affected but still improved from 1,637 ms to 1,457 ms.

Changes:
- Default synthetic video count: 5,000 -> 50,000 (both V-JEPA 2 and 2.1)
- Default synthetic image count: 5,000 -> 50,000 (V-JEPA 2.1)
- Add OpenCV (cv2) fallback for video generation in environments without ffmpeg
- Add dataset sizing guidance to benchmark configs and READMEs
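
A rough way to see why the sample count matters is to count how often the data loader has to restart during a 200-iteration benchmark. The sketch below assumes the global batch size of 1,536 reported for the H200 run also applies here and that incomplete batches are dropped; neither is confirmed for the B200 runs, and both helper names are hypothetical.

```python
import math


def iters_per_epoch(num_samples: int, global_batch: int) -> int:
    """Full batches per pass over the dataset (assumes partial batches are dropped)."""
    return num_samples // global_batch


def epochs_needed(total_iters: int, num_samples: int, global_batch: int) -> int:
    """Number of passes, and hence loader restarts, needed to serve total_iters."""
    return math.ceil(total_iters / iters_per_epoch(num_samples, global_batch))


GLOBAL_BATCH = 1536  # assumption: value from the H200 benchmark table
BENCH_ITERS = 200    # benchmark length stated in the PR description

for num_samples in (5_000, 50_000):
    print(
        num_samples,
        iters_per_epoch(num_samples, GLOBAL_BATCH),
        epochs_needed(BENCH_ITERS, num_samples, GLOBAL_BATCH),
    )
# 5000 3 67
# 50000 32 7
```

Under these assumptions, 5,000 samples give only 3 iterations per epoch, so the loader is rebuilt about 67 times in 200 iterations, compared with about 7 times for 50,000 samples.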

Add rank-selective nsys profiling infrastructure:
- nsys_wrapper.sh: profiles only rank 0 via SLURM_PROCID check
- nsys_profile_b200.sbatch: configurable via NSYS_PROFILE_DIR and CONFIG
  env vars to save each optimization phase to a separate folder
- Document profiling workflow in both READMEs
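
The rank-selective idea can be sketched in Python as a function that prefixes the training command with `nsys profile` on one rank only. This is an illustration of the pattern, not the contents of nsys_wrapper.sh (a shell script in the PR); wrap_with_nsys and the report naming are hypothetical.

```python
import os
from typing import List, Mapping


def wrap_with_nsys(cmd: List[str], env: Mapping[str, str], out_dir: str,
                   profiled_rank: int = 0) -> List[str]:
    """Prefix cmd with `nsys profile` on one rank; return it unchanged elsewhere."""
    rank = int(env.get("SLURM_PROCID", "0"))
    if rank != profiled_rank:
        # All other ranks run the training command without the profiler.
        return list(cmd)
    report = os.path.join(out_dir, f"rank{rank}")  # hypothetical report name
    return ["nsys", "profile", "-o", report] + list(cmd)


if __name__ == "__main__":
    train_cmd = ["python", "run_train.py"]
    out_dir = os.environ.get("NSYS_PROFILE_DIR", "profiling")
    print(wrap_with_nsys(train_cmd, os.environ, out_dir))
```

Profiling a single rank keeps the report size manageable and avoids adding profiler overhead to every process in the job.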
…n checkpointing

Provide an optimized benchmark config for B200 GPUs:
- compile_model: true for fused kernels (~20% GPU speedup)
- use_activation_checkpointing: false (uses ~95 GB of GPU memory instead of ~33 GB)
- num_workers: 20 for higher data prefetch
Tested at 1,125 ms/iter vs 1,457 ms baseline (23% improvement).

BF16 has the same dynamic range as FP32, so GradScaler's loss scaling
is pure overhead. Monkey-patch GradScaler to enabled=False in both
run_train.py launchers when meta.dtype is bfloat16, eliminating the
scale/unscale/step/update cycle per iteration.
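
One way such a monkey-patch could look is sketched below. It is not the code in run_train.py: force_disabled and patch_grad_scaler_for_bf16 are hypothetical names, and the sketch assumes the training code constructs its scaler through torch.cuda.amp.GradScaler, which may not match the upstream import path.

```python
def force_disabled(scaler_cls):
    """Return a subclass of scaler_cls whose constructor always gets enabled=False."""

    class DisabledScaler(scaler_cls):
        def __init__(self, *args, **kwargs):
            # Override whatever the caller asked for. Assumes `enabled` is
            # passed by keyword (or omitted), as is usual for GradScaler.
            kwargs["enabled"] = False
            super().__init__(*args, **kwargs)

    return DisabledScaler


def patch_grad_scaler_for_bf16(dtype_name: str) -> bool:
    """Swap in the disabled scaler when the configured dtype is bfloat16."""
    if dtype_name.lower() not in ("bfloat16", "bf16"):
        return False
    import torch  # imported lazily so the helper above works without PyTorch

    # Assumption: training code looks the class up under this name.
    torch.cuda.amp.GradScaler = force_disabled(torch.cuda.amp.GradScaler)
    return True
```

A patch like this only takes effect if it runs before the training module binds the GradScaler name, so it would need to be applied before that module is imported.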
…higher throughput

Replace DDP with FSDP (SHARD_GRAD_OP / ZeRO-2) for the encoder and
target_encoder in V-JEPA 2.1, sharding gradients and optimizer states
across ranks. This saves ~15 GB/GPU, enabling activation checkpointing
to be disabled on B200 GPUs for higher throughput. The predictor remains
DDP-wrapped (small model, needs find_unused_parameters).
…_train.py

- Fix compile_model placement: move from meta: to model: section where
  upstream train.py actually reads it (was silently never enabled)
- Add env-var-driven optimizations to run_train.py: fused AdamW, TF32,
  compile mode override, gradient_as_bucket_view, prefetch_factor
- Add B200 optimization sweep sbatch (Phase A-D with nsys profiling)
- Add nsys profiling sbatch scripts for H200 (vjepa2 and vjepa2.1)
- Fix container-workdir from /vjepa2 to /workspace in benchmark sbatch
- Add .gitignore to exclude benchmarks/ and profiling/ from repo
…u_type flag

The 6*N*D FLOP formula overestimates training FLOPs by ~2x for JEPA architectures
because the context encoder only processes visible tokens (~15% of the sequence)
while the target encoder runs forward-only (no backward pass).

Replace with samples/sec as the primary throughput metric. Add --gpu_type flag
(h200/b200) with correct BF16 peak specs (989.4 / 2250 TFLOPS). Fix V-JEPA 2.1
script title and update README parse examples to use new flag.
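
A back-of-the-envelope check of the ~2x figure, together with the replacement metric, might look like the sketch below. It is a rough model under stated assumptions (per-token cost of 6N for forward plus backward and 2N for forward only, predictor and attention's quadratic term ignored), not the logic of parse_benchmark.py; both function names are hypothetical.

```python
def naive_over_jepa_flops(visible_fraction: float) -> float:
    """Ratio of the 6*N*D estimate to a rough JEPA estimate.

    Context encoder: forward + backward (6*N per token) on visible tokens only.
    Target encoder: forward only (2*N per token) on the full sequence.
    """
    jepa_cost_per_token = 6.0 * visible_fraction + 2.0
    return 6.0 / jepa_cost_per_token


def samples_per_sec(global_batch: int, iter_ms: float) -> float:
    """Throughput metric that does not depend on a FLOP model."""
    return global_batch / (iter_ms / 1000.0)


if __name__ == "__main__":
    # ~15% visible tokens, as stated in the commit message.
    print(round(naive_over_jepa_flops(0.15), 2))  # 2.07
    # Hypothetical iteration time of 1,000 ms with the global batch of 1,536.
    print(samples_per_sec(1536, 1000.0))  # 1536.0
```

With 15% of tokens visible, this simple model puts the 6*N*D estimate at about 2.07 times the JEPA estimate, in line with the ~2x overestimate described above.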
…ctness, pinned versions

- Remove ,eth from NCCL_SOCKET_IFNAME exclusion list for correct TCP bootstrap
- Add missing MIT-0 license headers to .gitignore, README.md, and config YAMLs
- Change set -ex to set -euo pipefail in all sbatch and shell scripts
- Pin EFA_INSTALLER_VERSION to 1.47.0 in both Dockerfiles (was 'latest')
- Replace :latest image tags with :vjepa2 and :vjepa2.1 in K8s manifests
- Use yaml.SafeLoader instead of yaml.FullLoader in run_train.py and run_train_fsdp.py
…lone

- Pin all pip packages to tested versions from cluster container freeze
- Pin vjepa2 git clone to commit 204698b4 (latest as of Mar 23, 2026)
- Fix stale repository URL from aws-samples to awslabs in both READMEs
@paragao
Contributor Author

paragao commented Apr 15, 2026

@KeitaW please check that everything has been properly addressed so we can merge this.
