Summary
We are seeing a reproducible CUDA illegal memory access when running Evo2 with FP8 enabled under 4-GPU DeepSpeed training on H100.
Two control configurations work; only the combination of Evo2 FP8 and multi-GPU DeepSpeed fails:
- Single-GPU Evo2 FP8: works
- 4-GPU DeepSpeed with EVO2_DISABLE_FP8=1: works
- 4-GPU DeepSpeed with Evo2 FP8 enabled: fails during the first training step
This suggests the issue is specific to Evo2 FP8 / TransformerEngine execution in distributed multi-GPU training.
Environment
- GPU: NVIDIA H100 80GB HBM3
- Python: 3.11.15
- PyTorch: 2.7.1+cu128
- CUDA: 12.8
- evo2: 0.5.5
- transformer_engine: 2.3.0
- transformer_engine_torch: 2.3.0
- flash-attn: 2.8.0.post2
- deepspeed: 0.14.4
- lightning: 2.3.3
Note: Evo2 prints the following warning at import time:
Supported flash-attn versions are >= 2.1.1, <= 2.7.4.post1. Found flash-attn 2.8.0.post2.
Reproduction
The failing run uses 4 H100 GPUs with DeepSpeed stage 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python -m torch.distributed.run \
--standalone \
--nnodes=1 \
--nproc_per_node=4 \
train.py \
... \
--devices 4 \
--strategy deepspeed_stage_2 \
--max_steps 1
EVO2_DISABLE_FP8 is unset.
The run successfully loads the models, initializes DeepSpeed/NCCL, and enters training, then fails on the first step.
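For context, the Lightning Trainer configuration behind these flags looks roughly like the sketch below; the module and data are trivial stand-ins (hypothetical names), and only the `devices`, `strategy`, and `max_steps` values mirror the failing run.

```python
# Minimal sketch of the Trainer configuration used in our entrypoint.
# TinyModule is a hypothetical stand-in for the Evo2-based LightningModule
# in models/dna_llm.py; only the Trainer flags mirror the failing run.
import torch
import lightning as L

class TinyModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).pow(2).mean()

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)

if __name__ == "__main__":
    trainer = L.Trainer(
        accelerator="gpu",
        devices=4,                     # matches --devices 4
        strategy="deepspeed_stage_2",  # matches --strategy deepspeed_stage_2
        max_steps=1,                   # matches --max_steps 1
    )
    data = torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)
    trainer.fit(TinyModule(), data)
```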
Observed Error
The root error happens inside TransformerEngine GEMM during Evo2 forward:
Training: 0%| | 0/8804
File ".../train.py", line 532, in training_step
File ".../models/dna_llm.py", line 789, in forward
File ".../evo2/models.py", line 103, in forward
File ".../vortex/model/model.py", line 490, in proj_norm
File ".../vortex/model/layers.py", line 82, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 1289, in forward
File ".../transformer_engine/pytorch/module/linear.py", line 276, in forward
File ".../transformer_engine/pytorch/cpp_extensions/gemm.py", line 111, in general_gemm
RuntimeError: CUDA error: an illegal memory access was encountered
The same error appears across multiple ranks; NCCL cleanup errors appear afterward and seem secondary.
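To help isolate the failure, the sketch below exercises the same TransformerEngine FP8 Linear forward/backward path under torchrun, independent of Evo2 and DeepSpeed. The layer size, input shape, and FP8 recipe are illustrative assumptions, not Evo2's actual configuration.

```python
# te_fp8_repro.py (hypothetical name): standalone FP8 te.Linear GEMM under NCCL.
# Launch: torchrun --standalone --nproc_per_node=4 te_fp8_repro.py
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same kind of FP8 linear layer as the one that crashes in vortex/model/layers.py.
    linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
    fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

    # FP8 GEMMs need dimensions divisible by 16; 64 x 4096 satisfies this.
    inp = torch.randn(64, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = linear(inp)
    out.float().sum().backward()
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print("FP8 GEMM forward/backward completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this script also crashes on the same machine, the problem is more likely in the TransformerEngine/driver stack than in Evo2 itself.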
Control Runs
Single-GPU FP8 works
CUDA_VISIBLE_DEVICES=4
# EVO2_DISABLE_FP8 unset
# max_steps=1
Result:
train/loss_step=2.840
Trainer.fit stopped: max_steps=1 reached
4-GPU with FP8 disabled works
CUDA_VISIBLE_DEVICES=0,1,2,3
EVO2_DISABLE_FP8=1
# max_steps=1
Result:
train/loss_step=3.250
Trainer.fit stopped: max_steps=1 reached
Expected Behavior
Evo2 FP8 forward should either work under multi-GPU DeepSpeed training, or the documentation should specify the supported distributed-training configuration and recommended package versions.
Workaround
Setting the following environment variable avoids the crash:
EVO2_DISABLE_FP8=1
This allows 4-GPU DeepSpeed training to proceed, but disables Evo2 FP8.
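For completeness, this is roughly how the flag can be applied from the training entrypoint instead of the shell; we set it before importing evo2 on the assumption that FP8 is configured at import/model-construction time.

```python
# Workaround sketch: disable Evo2 FP8 before evo2 is imported.
import os
os.environ.setdefault("EVO2_DISABLE_FP8", "1")  # avoids the illegal memory access, disables FP8

import evo2  # imported only after the flag is set
```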
Questions
- Is Evo2 FP8 currently expected to support multi-GPU distributed training with DeepSpeed?
- Is there a recommended version combination for torch, transformer_engine, flash-attn, and CUDA on H100?
- Could the unsupported flash-attn==2.8.0.post2 version plausibly trigger this TransformerEngine FP8 GEMM failure, or is this likely independent?