Motivation
A fixed number of experts is activated per token.

Key Insight
MoE++ allows the number of activated experts to be adaptive.

Method
Three key contributions:
- zero-computation experts (used alongside the normal FFN experts):
  - zero: discards the input, $E(x) = 0$
  - copy: passes the input through, $E(x) = x$ ("skip")
  - constant: $E(x) = \alpha_{a} x + \alpha_{b} v_{\theta}$
- pathway-aware router (with additional loss augmentation, where we learn a $\tau_{\theta}$ to decide something else I missed)

Zero-computation experts are a simple way to handle easy tokens quickly.
Adding the new experts is relatively low cost.
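
The three zero-computation experts listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MoE++ implementation: the function names (`zero_expert`, `copy_expert`, `make_constant_expert`, `combine`) are hypothetical, `alpha_a` and `alpha_b` are treated as given scalars because these notes do not say how they are produced, and `combine` is a generic gate-weighted sum rather than the pathway-aware router.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Sequence[float]], Vector]


def zero_expert(x: Sequence[float]) -> Vector:
    """Discard the input: E(x) = 0."""
    return [0.0 for _ in x]


def copy_expert(x: Sequence[float]) -> Vector:
    """Pass the input through unchanged ("skip"): E(x) = x."""
    return list(x)


def make_constant_expert(alpha_a: float, alpha_b: float,
                         v: Sequence[float]) -> Expert:
    """Build E(x) = alpha_a * x + alpha_b * v.

    Assumption: alpha_a and alpha_b are fixed scalars here, and v stands in
    for the learned vector v_theta.
    """
    v_const = list(v)

    def constant_expert(x: Sequence[float]) -> Vector:
        if len(x) != len(v_const):
            raise ValueError("x and v must have the same length")
        return [alpha_a * xi + alpha_b * vi for xi, vi in zip(x, v_const)]

    return constant_expert


def combine(x: Sequence[float],
            gated_experts: Sequence[Tuple[float, Expert]]) -> Vector:
    """Gate-weighted sum of expert outputs (illustrative, not the real router)."""
    out = [0.0] * len(x)
    for gate, expert in gated_experts:
        y = expert(x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    token = [2.0, 4.0]
    constant = make_constant_expert(0.5, 2.0, [1.0, 1.0])
    print(zero_expert(token))
    print(copy_expert(token))
    print(constant(token))
    print(combine(token, [(0.25, copy_expert), (0.75, zero_expert)]))
```

None of these experts performs a matrix multiplication, which is why routing an easy token to them costs far less than routing it to an FFN expert.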
