Summary
We are seeing a reproducible CUDA illegal memory access when running Evo2 with FP8 enabled under 4-GPU DeepSpeed training on H100.
Two control configurations work; only the combination of Evo2 FP8 and multi-GPU DeepSpeed fails:
- Single-GPU Evo2 FP8: works
- 4-GPU DeepSpeed with EVO2_DISABLE_FP8=1: works
- 4-GPU DeepSpeed with Evo2 FP8 enabled: fails during the first training step
This suggests the issue is specific to Evo2 FP8 / TransformerEngine execution in distributed multi-GPU training.
Environment
- GPU: NVIDIA H100 80GB HBM3
- Python: 3.11.15
- PyTorch: 2.7.1+cu128
- CUDA: 12.8
- evo2: 0.5.5
- transformer_engine: 2.3.0
- transformer_engine_torch: 2.3.0
- flash-attn: 2.8.0.post2
- deepspeed: 0.14.4
- lightning: 2.3.3
Note: Evo2 prints the following warning at import time:
Supported flash-attn versions are >= 2.1.1, <= 2.7.4.post1. Found flash-attn 2.8.0.post2.
Reproduction
The failing run uses 4 H100 GPUs with DeepSpeed stage 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python -m torch.distributed.run \
--standalone \
--nnodes=1 \
--nproc_per_node=4 \
train.py \
... \
--devices 4 \
--strategy deepspeed_stage_2 \
--max_steps 1
EVO2_DISABLE_FP8 is unset.
The run successfully loads the models, initializes DeepSpeed/NCCL, and enters training, then fails on the first step.
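For context, the Lightning Trainer configuration behind these flags looks roughly like the sketch below; the module and data are trivial stand-ins (hypothetical names), and only the `devices`, `strategy`, and `max_steps` values mirror the failing run.

```python
# Minimal sketch of the Trainer configuration used in our entrypoint.
# TinyModule is a hypothetical stand-in for the Evo2-based LightningModule
# in models/dna_llm.py; only the Trainer flags mirror the failing run.
import torch
import lightning as L

class TinyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

if __name__ == "__main__":
    trainer = L.Trainer(
        accelerator="gpu",
        devices=4,                     # matches --devices 4
        strategy="deepspeed_stage_2",  # matches --strategy deepspeed_stage_2
        max_steps=1,                   # matches --max_steps 1
    )
    data = torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)
    trainer.fit(TinyModule(), data)
```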
Observed Error
The root error happens inside TransformerEngine GEMM during Evo2 forward:
Training: 0%| | 0/8804
File ".../train.py", line 532, in training_step
File ".../models/dna_llm.py", line 789, in forward
File ".../evo2/models.py", line 103, in forward
File ".../vortex/model/model.py", line 490, in proj_norm
File ".../vortex/model/layers.py", line 82, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 1289, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 276, in forward
File ".../transformer_engine/pytorch/cpp_extensions/gemm.py", line 111, in general_gemm
RuntimeError: CUDA error: an illegal memory access was encountered
The same error appears across multiple ranks; NCCL cleanup errors appear afterward and seem secondary.
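To help isolate the failure, the sketch below exercises the same TransformerEngine FP8 Linear forward/backward path under torchrun, independent of Evo2 and DeepSpeed. The layer size, input shape, and FP8 recipe are illustrative assumptions, not Evo2's actual configuration.

```python
# te_fp8_repro.py (hypothetical name): standalone FP8 te.Linear GEMM under NCCL.
# Launch: torchrun --standalone --nproc_per_node=4 te_fp8_repro.py
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same kind of FP8 linear layer as the one that crashes in vortex/model/layers.py.
    linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

    # FP8 GEMMs need dimensions divisible by 16; 64 x 4096 satisfies this.
    inp = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = linear(inp)
    out.float().sum().backward()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("FP8 GEMM forward/backward completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this script also crashes on the same machine, the problem is more likely in the TransformerEngine/driver stack than in Evo2 itself.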
Control Runs
Single-GPU FP8 works
CUDA_VISIBLE_DEVICES=4
# EVO2_DISABLE_FP8 unset
# max_steps=1
Result:
train/loss_step=2.840
Trainer.fit stopped: max_steps=1 reached
4-GPU with FP8 disabled works
CUDA_VISIBLE_DEVICES=0,1,2,3
EVO2_DISABLE_FP8=1
# max_steps=1
Result:
train/loss_step=3.250
Trainer.fit stopped: max_steps=1 reached
Expected Behavior
Evo2 FP8 forward should either work under multi-GPU DeepSpeed training, or the documentation should specify the supported distributed-training configuration and recommended package versions.
Workaround
Setting the following environment variable avoids the crash:
EVO2_DISABLE_FP8=1
This allows 4-GPU DeepSpeed training to proceed, but disables Evo2 FP8.
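For completeness, this is roughly how the flag can be applied from the training entrypoint instead of the shell; we set it before importing evo2 on the assumption that FP8 is configured at import/model-construction time.

```python
# Workaround sketch: disable Evo2 FP8 before evo2 is imported.
import os
os.environ.setdefault("EVO2_DISABLE_FP8", "1")  # avoids the illegal memory access, disables FP8

import evo2  # imported only after the flag is set
```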
Questions
- Is Evo2 FP8 currently expected to support multi-GPU distributed training with DeepSpeed?
- Is there a recommended version combination for torch, transformer_engine, flash-attn, and CUDA on H100?
- Could the unsupported flash-attn==2.8.0.post2 version plausibly trigger this TransformerEngine FP8 GEMM failure, or is this likely independent?