
Architecture Improvement: Feedforward network part #45

@LoserCheems

Description

Shared Expert Isolation from DeepSeek has shown that:

  • Increasing the granularity of routed experts improves effectiveness
  • A shared expert with resident (always-on) activation reduces redundancy among routed experts

However, the classical routing strategy of looping over the selected experts greatly increases invalid runtime as granularity grows.
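For illustration, the classical scheme above can be sketched as follows (a minimal sketch with hypothetical names: a plain top-k softmax router over full FFN experts, plus a resident shared expert; the per-token loop over selected experts is the part whose cost grows with granularity):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classic_moe_forward(x, shared_w, expert_ws, router_w, k):
    """Shared expert (always active) + top-k routed experts.

    x:         (d,)               token hidden state
    shared_w:  (d, d)             shared expert weight, always applied
    expert_ws: (num_experts, d, d) routed expert weights
    router_w:  (d, num_experts)   router projection
    """
    logits = x @ router_w                # (num_experts,) router scores
    topk = np.argsort(logits)[-k:]       # indices of the k best experts
    gates = softmax(logits[topk])        # renormalized gate weights
    out = x @ shared_w                   # resident shared expert
    # Classical loop traversal: one full expert matmul per selected expert.
    # Finer granularity means more (smaller) experts and more iterations here.
    for idx, g in zip(topk, gates):
        out = out + g * (x @ expert_ws[idx])
    return out
```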

Parameter-efficient experts composed of embeddings can greatly increase expert granularity without incurring much invalid computation.

CDMoE combines the characteristics of shared experts and parameter-efficient experts, but it has not yet been verified under large-scale tensor parallelism; we may need to develop an embedding-parallel strategy.
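A CDMoE-style layer might then look like the sketch below (hypothetical names, not the paper's exact formulation: each fine-grained expert is a pair of embedding rows acting as a rank-1 FFN, retrieved by key similarity rather than materialized in a loop over full expert weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cdmoe_forward(x, shared_w, keys, down_emb, up_emb, k):
    """Shared expert + k embedding-based parameter-efficient experts.

    x:         (d,)             token hidden state
    shared_w:  (d, d)           resident shared expert, always applied
    keys:      (num_experts, d) retrieval key per fine-grained expert
    down_emb:  (num_experts, d) down-projection embedding table
    up_emb:    (num_experts, d) up-projection embedding table
    """
    scores = keys @ x                        # similarity of x to every expert key
    topk = np.argpartition(scores, -k)[-k:]  # retrieve k experts by score
    gates = softmax(scores[topk])            # gate weights over retrieved experts
    h = np.maximum(down_emb[topk] @ x, 0.0)  # (k,) rank-1 down-projections + ReLU
    routed = (gates * h) @ up_emb[topk]      # weighted sum of up-projection rows
    return x @ shared_w + routed             # shared expert always contributes
```

Because each retrieved expert touches only two embedding rows, granularity can grow by enlarging the embedding tables while the per-token routed compute stays O(k·d) beyond the key scoring; the embedding tables are also the natural axis along which an embedding-parallel strategy would shard.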

paper: https://arxiv.org/pdf/2412.11834

Metadata

Status: Done
Milestone: No milestone
Development: No branches or pull requests