Motivation: Can we extract embeddings directly from the routing weights of an MoE forward pass, rather than from the residual stream as is traditionally done?

Key Insight: Residual states and routing weights offer complementary strengths as semantic search embeddings: when one method fails, the other is more likely to succeed.

Method: Create an aggregate embedding:
\begin{equation} E_{j} = X_{j} + \alpha W_{j} \end{equation}
where $X_{j}$ is the residual, $W_{j}$ is the routing weight associated with that residual, and $\alpha$ is a coefficient that weights the routing-weight term.
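As a minimal sketch of the combination above, the following Python computes $E_{j}$ elementwise for a single item. It assumes, as the sum requires, that the residual and routing-weight vectors have already been brought to the same length; how that is done is not specified in these notes. The function name `aggregate_embedding` and the example values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def aggregate_embedding(
    residual: Sequence[float],
    routing: Sequence[float],
    alpha: float,
) -> List[float]:
    """Return E_j = X_j + alpha * W_j, computed elementwise.

    Assumption: `residual` (X_j) and `routing` (W_j) have the same length,
    since the sum is otherwise undefined.
    """
    if len(residual) != len(routing):
        raise ValueError("residual and routing vectors must have the same length")
    return [x + alpha * w for x, w in zip(residual, routing)]


# Illustrative values only; they do not come from any experiment.
example = aggregate_embedding([1.0, 2.0], [0.5, 0.25], alpha=2.0)
print(example)  # [2.0, 2.5]
```

Setting `alpha` to 0 recovers the plain residual embedding, so the coefficient controls how much the routing weights contribute.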