Shared Expert Isolation in DeepSeek has shown that:
- Increasing the granularity of routed experts improves effectiveness
- Keeping a shared expert that is always activated reduces redundancy between experts
However, with the classical loop-traversal routing strategy, finer granularity greatly increases wasted runtime.
Parameter-efficient experts built from embedding lookups can greatly increase expert granularity without incurring much wasted computation.
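A minimal sketch of the idea, using numpy with made-up sizes and names (`down_keys`, `up_values`, `embedding_expert_layer` are all hypothetical): each "expert" is a single row in two embedding tables, so routing over many fine-grained experts collapses into one matmul plus a top-k gather instead of a loop over expert MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 1024, 4

# Hypothetical tables: each expert is one row.
# down_keys doubles as routing keys and down-projection;
# up_values is the matching up-projection.
down_keys = rng.standard_normal((n_experts, d_model))
up_values = rng.standard_normal((n_experts, d_model))

def embedding_expert_layer(x):
    # x: (d_model,) hidden state of one token
    scores = down_keys @ x                          # one matmul, no expert loop
    idx = np.argpartition(scores, -top_k)[-top_k:]  # indices of top-k experts
    gate = np.exp(scores[idx] - scores[idx].max())
    gate /= gate.sum()                              # softmax over selected experts
    h = np.maximum(scores[idx], 0.0)                # ReLU on selected activations
    return (gate * h) @ up_values[idx]              # gather + combine: (d_model,)

x = rng.standard_normal(d_model)
y = embedding_expert_layer(x)
print(y.shape)  # (16,)
```

Only the `top_k` selected rows of `up_values` are ever touched, which is why granularity (`n_experts`) can grow without proportionally growing compute.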
CDMoE combines the characteristics of shared experts and parameter-efficient experts, but it has not yet been verified under large-scale tensor parallelism; we may need to develop an embedding-parallel strategy.
paper: https://arxiv.org/pdf/2412.11834