Evo2 FP8 TransformerEngine Linear hits CUDA illegal memory access under 4-GPU DeepSpeed on H100 #220

Reported by @ShiLong-CN

Summary

We are seeing a reproducible CUDA illegal memory access when running Evo2 with FP8 enabled under 4-GPU DeepSpeed training on H100.

The two control cases work; only the full combination fails:

  • Single-GPU Evo2 FP8: works
  • 4-GPU DeepSpeed with EVO2_DISABLE_FP8=1: works
  • 4-GPU DeepSpeed with Evo2 FP8 enabled: fails during the first training step

This suggests the issue is specific to Evo2 FP8 / TransformerEngine execution in distributed multi-GPU training.

Environment

  • GPU: NVIDIA H100 80GB HBM3
  • Python: 3.11.15
  • PyTorch: 2.7.1+cu128
  • CUDA: 12.8
  • evo2: 0.5.5
  • transformer_engine: 2.3.0
  • transformer_engine_torch: 2.3.0
  • flash-attn: 2.8.0.post2
  • deepspeed: 0.14.4
  • lightning: 2.3.3

Note: Evo2 prints the following warning at import time:

Supported flash-attn versions are >= 2.1.1, <= 2.7.4.post1. Found flash-attn 2.8.0.post2.

Reproduction

The failing run uses 4 H100 GPUs with DeepSpeed ZeRO stage 2:

CUDA_VISIBLE_DEVICES=4,5,6,7 \
python -m torch.distributed.run \
  --standalone \
  --nnodes=1 \
  --nproc_per_node=4 \
  train.py \
  ... \
  --devices 4 \
  --strategy deepspeed_stage_2 \
  --max_steps 1

EVO2_DISABLE_FP8 is unset.

The run successfully loads the models, initializes DeepSpeed/NCCL, and enters training, then fails on the first step.
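
To help localize the fault, a standalone multi-GPU sketch that exercises the same te.Linear -> general_gemm FP8 path without Evo2 or DeepSpeed might look like the following (assumes transformer_engine 2.3.0 APIs; the layer shape and FP8 recipe are illustrative placeholders, not values taken from Evo2):

# te_fp8_repro.py -- hypothetical standalone sketch, not part of Evo2.
# Launch: torchrun --standalone --nproc_per_node=4 te_fp8_repro.py
import os

import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Same module type as in the failing traceback (te.Linear -> general_gemm).
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16,
                  device="cuda")
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# DelayedScaling is TE's stock FP8 recipe; Evo2 may configure it differently.
with te.fp8_autocast(enabled=True,
                     fp8_recipe=DelayedScaling(fp8_format=Format.HYBRID)):
    y = layer(x)
y.float().sum().backward()

torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: FP8 forward/backward OK")
dist.destroy_process_group()

If this passes on 1 rank but fails on 4, that would point at TransformerEngine's multi-rank FP8 path (e.g. amax reduction) rather than Evo2 or DeepSpeed specifically.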

Observed Error

The root error occurs inside the TransformerEngine GEMM during the Evo2 forward pass (traceback condensed):

Training:   0%|          | 0/8804

File ".../train.py", line 532, in training_step
File ".../models/dna_llm.py", line 789, in forward
File ".../evo2/models.py", line 103, in forward
File ".../vortex/model/model.py", line 490, in proj_norm
File ".../vortex/model/layers.py", line 82, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 1289, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 276, in forward
File ".../transformer_engine/pytorch/cpp_extensions/gemm.py", line 111, in general_gemm

RuntimeError: CUDA error: an illegal memory access was encountered

The same error appears across multiple ranks; NCCL cleanup errors appear afterward and seem secondary.
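
Since illegal memory accesses are reported asynchronously, the trace above may blame a later kernel than the one that faulted. A standard PyTorch/CUDA debugging step (not Evo2-specific) is to rerun with synchronous launches:

# Set at the very top of train.py, before any CUDA work, so the
# traceback points at the actual faulting kernel.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"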

Control Runs

Single-GPU FP8 works

CUDA_VISIBLE_DEVICES=4
# EVO2_DISABLE_FP8 unset
# max_steps=1

Result:

train/loss_step=2.840
Trainer.fit stopped: max_steps=1 reached

4-GPU with FP8 disabled works

CUDA_VISIBLE_DEVICES=0,1,2,3
EVO2_DISABLE_FP8=1
# max_steps=1

Result:

train/loss_step=3.250
Trainer.fit stopped: max_steps=1 reached

Expected Behavior

Evo2 FP8 forward should either work under multi-GPU DeepSpeed training, or the documentation should specify the supported distributed-training configuration and recommended package versions.

Workaround

Setting the following environment variable avoids the crash:

EVO2_DISABLE_FP8=1

This allows 4-GPU DeepSpeed training to proceed, but disables Evo2 FP8.
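
If the flag needs to be set from Python rather than the shell, it presumably has to happen before evo2 is imported; this sketch assumes the flag is read at import/initialization time, and the model name is illustrative (taken from the Evo2 README):

import os

# Assumption: EVO2_DISABLE_FP8 is consulted when evo2 initializes,
# so it must be set before the import below.
os.environ["EVO2_DISABLE_FP8"] = "1"

from evo2 import Evo2

model = Evo2("evo2_7b")  # illustrative model name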

Questions

  1. Is Evo2 FP8 currently expected to support multi-GPU distributed training with DeepSpeed?
  2. Is there a recommended version combination for torch, transformer_engine, flash-attn, and CUDA on H100?
  3. Could the unsupported flash-attn==2.8.0.post2 version plausibly trigger this TransformerEngine FP8 GEMM failure, or is this likely independent?
