I recently stumbled upon a research paper called "REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression". The paper showed that you could remove up to 50% of experts from massive Sparse Mixture of Experts (SMoE) models with barely any performance loss. My first thought was: "Wait, if half the model is redundant, why even train it that big in the first place?"
That question sent me down a rabbit hole, and this post is basically me working through what I learned.
First, What Even Are SMoE Models?
Before I could understand the pruning results, I had to actually understand the architecture. I'd heard about Mixture of Experts before but never really looked into how they work. Here's what I understood from some basic research:
Normal dense models (like GPT) work pretty straightforwardly:
- Every token goes through the same feedforward layers
- All parameters get activated for every token
- Simple, but expensive at scale
SMoE models do something different:
- Each layer has multiple "expert" networks (like 8 separate feedforward networks)
- A router/gating network looks at each token and decides which 1-2 experts should handle it
- Only the chosen experts process that token
- The outputs get combined with a weighted sum
So if you have 8 experts of 10B parameters each, that's 80B total parameters, but only ~20B are active for any given token (if using top-2 routing). This is how models like DeepSeek or Qwen can be so large but still efficient.
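To make the routing concrete, here's a toy PyTorch sketch of a top-2 MoE layer. Everything in it (the ToyMoELayer name, the 8 experts, the 64-dimensional tokens) is made up for illustration; real implementations add load-balancing losses, batched expert dispatch, and far bigger experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A deliberately tiny MoE layer: 8 expert FFNs, top-2 routing."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One small feedforward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.router(x)                # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)       # weights for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e   # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([16, 64])
# Only 2 of the 8 experts run for each token, so roughly 2/8 of the expert
# parameters are "active" per token -- the 80B-total vs ~20B-active math above.
```

The double loop is intentionally naive so the routing logic stays visible; production kernels dispatch tokens to experts in batches instead.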
Back to the Pruning Paper: The Results That Confused Me
The REAP paper showed results for models like Qwen3-Coder-480B where they could prune 25% or even 50% of experts and maintain almost identical performance on most benchmarks. Some observations:
- At 25% pruning: barely any degradation
- At 50% pruning: most benchmarks only dropped 1-2%
- Some tasks (like agentic benchmarks) were more sensitive; these tended to be categories that require multi-step reasoning, as opposed to tasks that can be completed in a single generation (like multiple-choice questions)
This is where my confusion started. If you can throw away half the model with minimal loss, why did they train it so big? Isn't that just wasted compute?
Questions I Had (And What I Learned)
Q1: Does having more experts during training mean each expert becomes less redundant?
I found a paper that addresses this question, titled "Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning". In short: yes.
This was a key realization for me. When you train with:
- Few experts (like 16): Each expert has to be a generalist. They all learn to answer various kinds of questions.
- Many experts (like 128): Each expert can specialize narrowly. One might become an expert at coding while another becomes great at math, with little capability overlap between them.
More experts = more narrow specialization = less redundancy between them.
This means the experts with the lowest activation frequency (the ones that tend to get pruned) are actually the most specialized ones. They handle edge cases and niche patterns that don't show up often in the calibration or evaluation data (assuming the model is pruned with the REAP method).
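To make the "lowest activation frequency" idea concrete, here's a rough sketch of usage-based expert scoring: run a calibration set through the router, accumulate how often (and how strongly) each expert gets picked, and mark the lowest-scoring experts for removal. To be clear, this is a simplified stand-in and not the actual REAP criterion; the score_experts function and all sizes here are made up for illustration.

```python
# Simplified sketch of usage-based expert scoring for pruning.
# NOT the exact REAP criterion -- just the general idea of measuring how much
# each expert is actually used on a calibration set and dropping the rest.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_experts(router, n_experts, top_k, calibration_batches):
    usage = torch.zeros(n_experts)               # accumulated router weight per expert
    for x in calibration_batches:                # x: (n_tokens, d_model)
        top_w, top_idx = router(x).topk(top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)         # weights of the chosen experts
        for slot in range(top_k):
            usage.index_add_(0, top_idx[:, slot], top_w[:, slot])
    return usage                                  # low score = rarely / weakly used

# Hypothetical setup: an 8-expert router over 64-dim tokens, random "calibration" data.
router = torch.nn.Linear(64, 8)
calib = [torch.randn(32, 64) for _ in range(10)]
scores = score_experts(router, n_experts=8, top_k=2, calibration_batches=calib)
keep = scores.topk(k=4).indices                   # keep the 4 most-used experts (50% pruning)
print("experts to prune:", sorted(set(range(8)) - set(keep.tolist())))
```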
Q2: So as model size and number of experts increase, does pruning tolerance decrease?
That same research paper shows that:
- Models with 128 experts can only be pruned by ~50% before major degradation
- Models with 32 experts can be pruned by ~75% and still perform relatively well
Why? Because smaller models have more redundant, generalist experts. If you remove one, the others can cover for it since they have overlapping skills.
Larger models have highly specialized experts. Remove one, and you've lost a unique capability that no other expert has.
But here's the thing: even though smaller models can be pruned more aggressively, the larger model at 50% pruning still outperforms the smaller model at 25% pruning.
Q3: If pruning works so well, why train bigger models at all? Why not just train at the size I want to deploy?
This is the question that really made everything click for me.
You can't skip the large training phase. Here's why:
When you train with 480B parameters and lots of experts:
- Competition between experts drives specialization
- More diverse gradient pathways during learning
- Richer representations emerge across the expert pool
- The model learns better patterns overall
Then when you prune to 240B:
- You keep the BEST experts from that larger pool (mechanically, something like the sketch after these lists)
- Those remaining experts learned better specializations during training
- They benefit from having competed with more experts during training
If you just trained 240B from scratch:
- Each expert is forced to be more general from the start
- Less competition, less specialization
- Lower quality ceiling even when fully trained
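Mechanically, "keeping the best experts" just means building a smaller layer that contains only the kept expert networks and the matching rows of the router's output projection. Here's a rough sketch of that step; prune_moe_layer is a hypothetical helper, and a real pipeline would do this per layer across the whole model, driven by a proper scoring criterion rather than a hard-coded keep list.

```python
# Rough sketch of the mechanical pruning step: keep only the selected experts
# and the matching rows of the router. Toy code, not a real pruning pipeline.
import torch
import torch.nn as nn

def prune_moe_layer(experts, router, keep_idx):
    """experts: nn.ModuleList, router: nn.Linear(d_model, n_experts),
    keep_idx: 1-D LongTensor of expert indices to keep."""
    kept_experts = nn.ModuleList([experts[i] for i in keep_idx.tolist()])
    new_router = nn.Linear(router.in_features, len(keep_idx))
    with torch.no_grad():
        # Each router output row scores one expert, so keep only those rows.
        new_router.weight.copy_(router.weight[keep_idx])
        new_router.bias.copy_(router.bias[keep_idx])
    return kept_experts, new_router

# Hypothetical usage: keep experts {0, 2, 5, 7} out of 8.
experts = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
router = nn.Linear(64, 8)
kept, small_router = prune_moe_layer(experts, router, torch.tensor([0, 2, 5, 7]))
print(len(kept), small_router.out_features)  # 4 4
```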
The analogy that helped me: if you have 100 people competing to come up with the best ideas versus only 50 people competing, the larger pool tends to produce better results, since competition pushes quality up. Even if you then remove half the people from that group of 100, the ones who remain are still high quality (assuming they don't become complacent).
What This All Means
After going through all this, the "paradox" resolved itself. It's not actually wasteful to train with more experts than you deploy. The training capacity is what unlocks the performance ceiling. Training big and pruning aggressively is a viable strategy.
The Bigger Picture
This exploration taught me something important about deep learning: capacity during learning matters even if you don't need that capacity during deployment.
It's similar to how:
- You train a large teacher model even if you distill it into a smaller student for deployment
- You need high-precision training even if you deploy in INT8
- You need many training iterations even if you only use the final checkpoint
The journey matters, not just the destination.
I'm still pretty early in understanding all the nuances of SMoE architectures—there's way more to explore around routing strategies, load balancing, expert merging techniques, and how this all scales to even larger models. But this deep dive into pruning helped me understand why model makers are still pushing for bigger models despite compression working so well.
If you're also learning about this stuff and have insights or corrections, I'd love to hear them!
These are my notes and understanding. If I got something wrong or oversimplified, please let me know—I'm still learning!