ICLR 2025, Jin et al.: MoE++, zero-computation experts

Motivation: Standard MoE activates a fixed number of experts per token, no matter how easy or hard the token is.
Key Insight: MoE++ makes the number of FFN experts applied to each token adaptive.
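A toy sketch of what "adaptive" could mean here, assuming plain top-k routing over a pool that mixes FFN experts with cheap non-FFN experts. The expert names, the scores, and the helpers `top_k` and `ffn_count` are all made up for illustration and are not taken from the paper.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the names of the k highest-scoring experts."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:k]


def ffn_count(selected: List[str]) -> int:
    """Count how many of the selected experts are real FFN experts."""
    return sum(1 for name in selected if name.startswith("ffn"))


# Hypothetical router scores for two tokens (illustrative numbers only).
hard_token = {"ffn0": 0.9, "ffn1": 0.7, "zero": 0.1, "copy": 0.2}
easy_token = {"ffn0": 0.3, "ffn1": 0.1, "zero": 0.2, "copy": 0.8}

for label, scores in [("hard", hard_token), ("easy", easy_token)]:
    selected = top_k(scores, k=2)
    print(label, selected, "FFN experts used:", ffn_count(selected))
```

Both tokens still pick k = 2 experts, but the easy token spends one of its slots on a cheap expert, so it runs only one FFN.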
Method: Three key contributions.
1. Zero-computation experts (in addition to the normal FFN experts):
   - discard the input: E(x) = 0
   - copy the input: E(x) = x (“skip”)
   - constant: E(x) = \alpha_{a} x + \alpha_{b} v_{\theta}
2. Pathway-aware router (with an additional loss term in which a \tau_{\theta} is learned to decide ...)
3. Something else I missed.
Zero-computation experts are a simple way to handle easy tokens quickly, and adding these new experts is relatively low-cost.
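The three zero-computation experts can be sketched in a few lines of plain Python. This is only an illustration of the formulas in these notes: the function names are made up, vectors are plain lists, and \alpha_{a}, \alpha_{b} and v_{\theta} are passed in as fixed numbers because the notes do not say how they are produced.

```python
from typing import List

Vector = List[float]


def zero_expert(x: Vector) -> Vector:
    """Discard the input: E(x) = 0."""
    return [0.0 for _ in x]


def copy_expert(x: Vector) -> Vector:
    """Pass the input through unchanged: E(x) = x (a skip connection)."""
    return list(x)


def constant_expert(x: Vector, v: Vector, alpha_a: float, alpha_b: float) -> Vector:
    """Blend the input with a vector v: E(x) = alpha_a * x + alpha_b * v."""
    if len(x) != len(v):
        raise ValueError("x and v must have the same length")
    return [alpha_a * xi + alpha_b * vi for xi, vi in zip(x, v)]


x = [1.0, -2.0, 0.5]
v = [0.0, 1.0, 1.0]  # hypothetical stand-in for the learned vector v_theta

print(zero_expert(x))                                   # [0.0, 0.0, 0.0]
print(copy_expert(x))                                   # [1.0, -2.0, 0.5]
print(constant_expert(x, v, alpha_a=0.5, alpha_b=0.5))  # [0.5, -0.5, 0.75]
```

None of the three involves a matrix multiplication over the hidden dimension, which is why they are so much cheaper than an FFN expert.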