Sparse routing of tokens through a gating network into top-k experts.
For sparse / efficient LLM papers (Switch Transformer, GShard, Mixtral-style).
Add an inset showing the auxiliary load-balancing loss alongside a histogram of expert utilization across a batch. Include a short equation: L_aux = alpha * sum_e f_e * P_e, where f_e is the fraction of tokens dispatched to expert e and P_e is the mean router probability assigned to expert e over the batch.
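A minimal sketch of the routing and loss the figure describes, in NumPy. Function and variable names are hypothetical; the loss follows the equation above (note that the Switch Transformer paper additionally scales it by the number of experts).

```python
import numpy as np

def route_and_aux_loss(logits, k=2, alpha=0.01):
    """Top-k token routing plus the auxiliary load-balancing loss.

    logits: [num_tokens, num_experts] raw gating-network outputs.
    Returns the chosen expert ids, f_e, P_e, and L_aux.
    """
    T, E = logits.shape
    # Softmax over the expert dimension -> router probabilities.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Each token is dispatched to its top-k experts.
    topk = np.argsort(-probs, axis=1)[:, :k]
    # f_e: fraction of token assignments dispatched to expert e.
    counts = np.bincount(topk.ravel(), minlength=E)
    f = counts / (T * k)
    # P_e: mean router probability assigned to expert e over the batch.
    P = probs.mean(axis=0)
    # L_aux = alpha * sum_e f_e * P_e (Switch also multiplies by E).
    l_aux = alpha * float(np.dot(f, P))
    return topk, f, P, l_aux
```

The histogram inset would plot `counts` (or `f`) per expert; a perfectly balanced router gives f_e = P_e = 1/E for every expert, which is the minimum of the loss.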