Shared Expert Isolation in DeepSeek has shown that:
- Increasing the granularity of routed experts improves effectiveness
- Keeping a shared expert that is always activated reduces redundancy between experts
However, with the classical loop-traversal routing strategy, finer granularity greatly increases wasted runtime.
Parameter-efficient experts built from embedding lookups can greatly increase expert granularity without incurring much wasted computation.
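A minimal sketch of the idea, using numpy with made-up sizes and names (`down_keys`, `up_values`, `embedding_expert_layer` are all hypothetical): each "expert" is a single row in two embedding tables, so routing over many fine-grained experts collapses into one matmul plus a top-k gather instead of a loop over expert MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 1024, 4

# Hypothetical tables: each expert is one row.
# down_keys doubles as routing keys and down-projection;
# up_values is the matching up-projection.
down_keys = rng.standard_normal((n_experts, d_model))
up_values = rng.standard_normal((n_experts, d_model))

def embedding_expert_layer(x):
    # x: (d_model,) hidden state of one token
    scores = down_keys @ x                          # one matmul, no expert loop
    idx = np.argpartition(scores, -top_k)[-top_k:]  # indices of top-k experts
    gate = np.exp(scores[idx] - scores[idx].max())
    gate /= gate.sum()                              # softmax over selected experts
    h = np.maximum(scores[idx], 0.0)                # ReLU on selected activations
    return (gate * h) @ up_values[idx]              # gather + combine: (d_model,)

x = rng.standard_normal(d_model)
y = embedding_expert_layer(x)
print(y.shape)  # (16,)
```

Only the `top_k` selected rows of `up_values` are ever touched, which is why granularity (`n_experts`) can grow without proportionally growing compute.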
CDMoE combines the characteristics of shared experts and parameter-efficient experts, but it has not yet been verified under large-scale tensor parallelism; we may need to develop an embedding-parallel strategy.
paper: https://arxiv.org/pdf/2412.11834