R operations do not overlap with C operations in Smart Schedule #213

**Describe the bug**
I am using the Megatron-LM v2.5 patch and run the following command:
`FMOE_FASTER_SHADOW_ENABLE=1 FMOE_FASTER_SCHEDULE_ENABLE=1 FMOE_FASTER_GROUP_SIZE=4  bash pretrain_gpt_distributed.sh`
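I am training GPT-2 + MoE on a single node with 8 GPUs, with 16 experts in total. The profiler shows that each GPU holds 2 experts, the work is split into 2 groups, and each expert runs twice: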
![image](https://github.com/user-attachments/assets/7a02fc4e-58a2-4b73-9035-088831d526f8)
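However, the 4 R operations only run together after the C operations of all experts have finished, instead of overlapping with them: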
![image](https://github.com/user-attachments/assets/4fb86fbd-49c5-4e00-9343-20037fec883c)
Why does this happen? I would be very grateful to anyone who can answer this.
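
For reference, here is a minimal sketch of the arithmetic behind the counts I see in the profiler. It is not FastMoE code: `schedule_counts` is a hypothetical helper, and it assumes that `FMOE_FASTER_GROUP_SIZE` is the number of GPUs per group and that each local expert runs one C operation (followed by one R operation) per group.

```python
# Hypothetical helper, not part of FastMoE: it only reproduces the counts
# observed in the profiler under the assumptions stated above.
def schedule_counts(world_size, total_experts, group_size):
    experts_per_gpu = total_experts // world_size  # experts held by each GPU
    n_groups = world_size // group_size            # groups in the schedule
    # Assumption: every local expert computes once per group (one C op),
    # and every C op is paired with one R op.
    c_ops_per_gpu = experts_per_gpu * n_groups
    r_ops_per_gpu = c_ops_per_gpu
    return {
        "experts_per_gpu": experts_per_gpu,
        "n_groups": n_groups,
        "runs_per_expert": n_groups,
        "c_ops_per_gpu": c_ops_per_gpu,
        "r_ops_per_gpu": r_ops_per_gpu,
    }


# My setup: 8 GPUs, 16 experts, FMOE_FASTER_GROUP_SIZE=4
counts = schedule_counts(world_size=8, total_experts=16, group_size=4)
print(counts)
```

Under these assumptions the sketch gives 2 experts per GPU, 2 groups, 2 runs per expert, and 4 R operations per GPU, which matches the trace. So the number of operations looks right; only their placement in time (R after all C, rather than overlapped) is unexpected.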


**Platform**
 - Device: [e.g. NVIDIA A100]
 - CUDA version: [12.1]
 - NCCL version: [2.18.1]
 - PyTorch version: [2.1.0]



