Four steps. Zero communication during training.
The entire KALAVAI protocol fits in a paragraph. A coordinator distributes a shared base checkpoint. Each contributor fine-tunes their copy on their own domain — independently, asynchronously, on whatever hardware they have. Nobody shares data, gradients, or activations. When everyone is done, they submit their checkpoints. A lightweight router (a single linear layer, trained for 500 steps on mixed data) learns which expert is best for which token. At inference, all specialists run in parallel and the router combines their outputs.
That's it. The mechanism is the protocol, not the infrastructure. Standard PyTorch. Standard HuggingFace. No custom CUDA kernels, no distributed training framework, no LoRA, no adapters.
```python
# Everyone starts from the same model
base = load("pythia-410m", revision="step10000")

# Each person trains on their domain (independently, no communication)
specialist_code = train(copy(base), code_data, steps=2000)
specialist_science = train(copy(base), science_data, steps=2000)
specialist_fiction = train(copy(base), fiction_data, steps=2000)

# A router learns who's good at what (500 steps, one linear layer)
router = nn.Linear(hidden_size, 3, bias=False)
fused = MoE(specialists=[specialist_code, specialist_science, specialist_fiction],
            router=router)
train_router(fused, mixed_data, steps=500)

# Result: fused model beats every individual specialist
# (+14.2% over best specialist, +14.5% over monolithic training)
```
Core Results
Consistent gains at 410M and 1B. Reduced at 6.9B.
The fused model beats the best individual specialist at every tested scale. The gains at 410M and 1B are robust across three random seeds with near-zero variance. The 6.9B result is smaller; we attribute this to a much weaker per-parameter training signal (17× the parameters of the 410M runs, but only a quarter of the fine-tuning steps).
Beats equal-compute monolithic training
The natural objection: just train one model on all the data for the same total compute. We tested this directly. A single model fine-tuned on mixed data for 6,000 steps (equal to 3 specialists × 2,000 steps) achieves +6.7% over base. KALAVAI achieves +14.5% over that monolithic model.
| Method | Eval loss (↓ better) | vs. Base | vs. Monolithic |
|---|---|---|---|
| Base model | 2.248 | — | — |
| Monolithic (6,000 steps mixed) | 2.098 | +6.7% | — |
| Best specialist (code) | 2.089 | +7.1% | +0.4% |
| Weight averaging | 2.158 | +4.0% | — |
| Wider model (3.5× params) | 2.120 | +5.9% | — |
| KALAVAI MoE | 1.793 | +20.2% | +14.5% |
1B results: replication holds
The 1B scale replicates the 410M result with essentially zero variance: +14.8% over the best specialist across three seeds, standard deviation 0.003%. The equal-compute comparison also holds at this scale, with KALAVAI beating monolithic training by roughly +14.5%.
6.9B: the scale boundary
At 6.9B parameters the improvement contracts to +2.4%. The mechanism still works — the MoE is definitively better than every individual specialist — but the magnitude is much smaller. The most likely explanation: at 6.9B, each specialist had only 500 fine-tuning steps (budget-constrained by A100 runtime), leaving insufficient domain signal for strong specialisation. A follow-up experiment (B1: 6.9B step-budget sweep) will test whether more training recovers the larger improvement seen at 410M/1B.
Three Governing Conditions
When fusion works — and when it doesn't
The paper's core contribution isn't "MoE is good." It's an empirical characterisation of the conditions under which post-hoc fusion of independently trained specialists succeeds or fails. We identify three.
1. Shared initialisation is necessary
Specialists must start from the same checkpoint. The shared starting point preserves representational compatibility — specialists diverge in what they learn, but the geometry of their representations stays aligned enough for a router to combine them. Initialising from different checkpoints breaks this: their representational spaces are no longer aligned, and the router cannot learn coherent dispatch.
Practical implication: The cooperative coordinator must distribute a single canonical checkpoint. All specialists must start from exactly the same revision — same weights, same tokenizer, same architecture. This is the one non-negotiable constraint of the protocol.
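As a sketch of how a coordinator or contributor might verify this constraint before training begins, the hypothetical helper below (not part of the KALAVAI code) compares two models' state dicts tensor by tensor, with toy `nn.Linear` modules standing in for full checkpoints:

```python
import torch
import torch.nn as nn

def same_initialisation(model_a: nn.Module, model_b: nn.Module) -> bool:
    """Check that two contributors start from byte-identical weights.

    Illustrative helper: a real check would also compare tokenizer files
    and the checkpoint revision string.
    """
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    return sd_a.keys() == sd_b.keys() and all(
        torch.equal(sd_a[k], sd_b[k]) for k in sd_a
    )

torch.manual_seed(0)
a = nn.Linear(8, 8)
torch.manual_seed(0)
b = nn.Linear(8, 8)  # same seed, so identical weights
c = nn.Linear(8, 8)  # fresh RNG draw, so different weights

print(same_initialisation(a, b), same_initialisation(a, c))  # True False
```

With HuggingFace models, pinning everyone to the same `revision` argument in `from_pretrained` serves the same purpose.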
2. Frozen layers become necessary beyond ~10,000 steps
At short training horizons (≤5,000 steps), freezing layers is optional — more plastic representations allow specialists to diverge more effectively. But beyond approximately 10,000 steps, unfrozen specialists over-specialise and become harder to fuse. Freezing the first K layers provides a structural anchor that preserves routing compatibility.
| Steps | Freeze=0 | Freeze=4 | Winner |
|---|---|---|---|
| 500 | +9.9% | +8.9% | Freeze=0 |
| 1,000 | +12.5% | +11.3% | Freeze=0 |
| 2,000 | +15.1% | +13.9% | Freeze=0 |
| 5,000 | +16.4% | +15.8% | Freeze=0 |
| 10,000 | +15.4% | +15.6% | Freeze=4 ← crossover |
| 20,000 | +13.6% | +14.8% | Freeze=4 |
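The freezing pattern itself is a few lines. Here is a minimal sketch using a toy stack of blocks in place of real transformer layers; `freeze_first_k` is an illustrative helper, not the paper's code (for a HuggingFace GPTNeoX/Pythia model the blocks live at `model.gpt_neox.layers`):

```python
import torch.nn as nn

def freeze_first_k(blocks, k: int) -> None:
    """Freeze the first k blocks as a structural anchor for routing compatibility."""
    for block in list(blocks)[:k]:
        for p in block.parameters():
            p.requires_grad = False

# Toy stack of 12 blocks standing in for a model's transformer layers.
blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))
freeze_first_k(blocks, 4)

trainable = [i for i, b in enumerate(blocks)
             if all(p.requires_grad for p in b.parameters())]
print(trainable)  # [4, 5, 6, 7, 8, 9, 10, 11]
```

The optimiser then only sees parameters with `requires_grad=True`, so the frozen early layers stay identical across all specialists.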
3. All specialists must run at inference
This is the paper's most surprising result. A domain classifier with 99.3% accuracy, routing each input to a single specialist, produces −21.1% degradation relative to base. The MoE running all three specialists and combining outputs produces +14.1% improvement. Same routing accuracy — opposite results. The 35 percentage point gap is the difference between a system that works and one that's worse than doing nothing.
Why does single-expert dispatch fail? Each specialist forgets what it wasn't trained on. The code specialist's loss on science data is worse than the base model. When only one specialist runs, out-of-domain tokens have no fallback. Joint inference restores coverage by letting the router suppress out-of-domain specialists token by token.
The mechanism in one image
The cross-domain evaluation matrix shows why fusion works. Each specialist is best on its own domain (the diagonal) and worst on the others. The MoE router dispatches each token to the right diagonal entry, recovering all specialist gains simultaneously.
Ablations
What doesn't matter (and what does)
Router architecture doesn't matter
A uniform router (fixed 1/N weights, no training) achieves +6.7%. A trained linear router achieves +14.2%. A 2-layer MLP achieves +14.2%. The gap between uniform and learned routing is +7.5pp — entirely explained by the router's ability to suppress out-of-domain specialists. The specific function class is irrelevant; the minimum bar is learnable suppression.
Freeze depth sweep
We swept freeze depth from 0 (no frozen layers) to 12 (half the model). At 2,000 steps, freeze=0 wins. At 10,000 steps, freeze=4 and freeze=8 both outperform freeze=0. Practical guideline: training under 5,000 steps — skip freezing. Over 10,000 steps — freeze 4–8 layers.
Specialist count scales gracefully
Three, four, and five specialists all achieve approximately +14.1% with near-zero variance. The mechanism doesn't degrade as you add more contributors. The two-specialist configuration scores slightly higher only because its evaluation problem is narrower.
The mechanism survives base model maturity
Fusion improvement is consistent across Pythia checkpoints from 3.5% to 100% of pre-training at 410M (+13.4% to +15.0%) and 1B (+13.8% to +15.9%). Qwen-1.5B at full training is the boundary condition: −1.0%, because its pre-training corpus already covers the specialist domains — leaving no room for meaningful specialisation.
Extra parameters don't explain the gains
Two baselines rule out "more parameters" as the explanation. A wider single model with 3.5× the parameters achieves only +5.9%. A multi-head baseline with identical parameter count to the MoE but hard single-expert routing achieves −21.1%. The gain comes from cooperative specialisation plus joint inference, not raw capacity.
The Router in Action
Token-level routing, not document classification
On hybrid-domain prompts, the router switches experts mid-sentence. The prompt "Derive the equation for protein folding using Python pandas" forces a domain switch within a single sentence: science tokens ("derive," "equation," "protein," "folding") should activate the science specialist; code tokens ("Python," "pandas") should activate the code specialist. The router discovers this structure from the training signal alone — no supervision, no domain labels, no explicit boundaries.
The pattern is robust across multiple hybrid prompt types, including narrative/science, technical/narrative, and multi-domain sentences with three or more domain switches.
Router confidence distribution
In practice, the router operates as a near-hard switch: the highest-weight expert receives over 95% of the routing weight on more than 99.7% of tokens. Crucially, hard routing (argmax dispatch) and soft routing (learned softmax weights) produce identical perplexity — confirming that the value is selection and suppression, not the specific weighting scheme.
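The concentration statistic is easy to compute. The sketch below uses synthetic, artificially peaked routing logits (an assumption for illustration, not the trained router's actual outputs) to measure the fraction of near-hard tokens and to build the one-hot weights that argmax dispatch would substitute for the soft ones:

```python
import torch

torch.manual_seed(0)

# Toy routing logits for 10,000 tokens over 3 specialists; the large scale
# factor mimics the peaked distributions a trained router produces.
logits = torch.randn(10_000, 3) * 10
weights = torch.softmax(logits, dim=-1)

# Fraction of tokens where the top specialist receives >95% of the weight.
near_hard = (weights.max(dim=-1).values > 0.95).float().mean().item()

# Hard dispatch: replace the soft weights with a one-hot on the argmax.
one_hot = torch.zeros_like(weights).scatter_(
    1, weights.argmax(dim=-1, keepdim=True), 1.0)

print(f"{near_hard:.1%} of tokens are near-hard")
```

When `near_hard` is close to 1, swapping `weights` for `one_hot` barely changes the fused output, which is why hard and soft routing tie on perplexity.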
Beyond Perplexity
Downstream benchmarks
Perplexity improvements are clear and consistent. Downstream benchmark accuracy is more modest — less than 1pp on standard tasks at 1B. This is expected at these scales: perplexity and benchmark accuracy don't reliably track below 7B parameters. The KALAVAI paper is explicit about this gap.
On the benchmark gap. Standard commonsense benchmarks (ARC, HellaSwag) are factual Q&A, while the training domains are code, science, and fiction — a partial mismatch. A cooperative aligned to benchmark-relevant domains (e.g., world knowledge, reasoning, factual recall) would present a stronger evaluation of downstream gain potential.
Training Dynamics
How specialists diverge during training
Each specialist rapidly improves on its own domain while degrading on out-of-domain content. This divergence is exactly what makes the MoE valuable — if specialists didn't degrade on other domains, the router would have nothing useful to select between. The code specialist's catastrophic forgetting of science is a feature, not a bug.
Boundary Conditions
What we don't claim
The paper is explicit about five things it does not claim.
No inference efficiency. The fused model runs all N specialists in parallel. For N=3, inference overhead is approximately 2.5×. This is a training-time democratisation that trades inference cost for training accessibility.
No universal architecture generality. All primary results use Pythia. Qwen-1.5B at full training shows −1.0% — when the base model's pre-training already covers the specialist domains, fine-tuning produces insufficient differentiation.
No guaranteed downstream gains. Perplexity improvements are clear; benchmark accuracy improvements are modest (<1pp at 1B). Perplexity and downstream accuracy don't reliably track at these scales.
No real cooperative demonstrated. All experiments are simulated cooperatives on single machines. Heterogeneous hardware, asynchronous submission, and contributor reliability are open engineering problems.
No frontier-scale evidence. The 6.9B result (+2.4%) suggests scale-dependent sensitivity requiring further investigation.
Live Experiments
Phase A: the NeurIPS gate
Three experiments stand between this work and NeurIPS submission. Results will be committed to the repository and will update the paper before submission. Gate criterion: if A1 shows a >5% advantage for the MoE over monolithic training at 1B, and A3 shows clear degradation with mismatched checkpoints, the paper proceeds to NeurIPS 2026; otherwise, COLM/TMLR.
After A1+A3: If both gate criteria are met, two further experiments are planned — B1 (6.9B step-budget sweep: does more training recover the full improvement at large scale?) and C1 (heterogeneous cooperative: does fusion survive different batch sizes, learning rates, and training durations per contributor?).
What This Makes Possible
Five problems cooperative training solves
The protocol's zero-communication property — contributors share only a starting checkpoint and a final trained checkpoint, never data — opens applications that are structurally impossible with synchronous training or federated learning.
A hospital network that can't share patient data
Five hospitals, each with thousands of records in different specialties — cardiology, oncology, pediatrics, radiology, neurology. Privacy laws prevent pooling. Today, each can fine-tune a model on their own data, but it only knows their specialty.
With KALAVAI, each hospital trains a specialist on their private data that never leaves their servers. They share only the trained checkpoint — not a single patient record. The fused model understands all five specialties. No data was shared. No privacy was violated.
This was impossible before. Federated learning requires gradient sharing during training, which leaks information. KALAVAI requires zero communication during training — only the final checkpoint is shared.
Endangered languages get a real language model
There are maybe 50,000 pages of digitised text in Yoruba. Or Quechua. Or Tamil literary archives. Not enough for any organisation to justify training a model. But a university in Nigeria trains a Yoruba specialist. A lab in Peru trains a Quechua specialist. A team in Chennai trains a classical Tamil specialist. A group in Wales trains a Welsh specialist.
Each trains on whatever they have — even a few thousand pages is enough to produce meaningful specialisation from a shared base model. The fused model handles all four languages plus the base model's English. No single institution had enough data or compute to build this. The cooperative did.
KALAVAI changes the economics from "one organisation needs all the data and all the compute" to "each community contributes what they have."
A legal AI built across jurisdictions
Indian contract law, UK common law, US constitutional law, EU regulatory law, Brazilian civil law — each a specialised domain with its own corpus and reasoning patterns. No single firm has expertise across all five. Today you either build a generic legal model that's mediocre everywhere, or a narrow one that knows one jurisdiction deeply.
With KALAVAI, a firm in Mumbai trains the Indian law specialist. A firm in London trains the UK specialist. A firm in São Paulo trains the Brazilian specialist. The fused model can analyse a cross-border contract touching Indian, UK, and EU law — routing the relevant clauses to the relevant specialist. Each firm contributed domain expertise without sharing proprietary case databases with competitors.
Scientific research across fields that don't talk to each other
A climate science lab trains on atmospheric modelling. A marine biology lab trains on ocean ecosystems. A geology department trains on seismology. An economics department trains on resource economics. Individually, each model knows its field. Fused, the model can reason about questions at the intersection — "how do seismic events in the Pacific affect marine ecosystems and what are the economic implications for coastal fisheries?"
A new kind of interdisciplinary tool that emerges from collaboration without anyone needing to be interdisciplinary themselves.
A country builds its own sovereign AI without a hyperscaler
A small country — Sri Lanka, Estonia, Rwanda — wants a national language model that understands their language, laws, culture, geography, educational curriculum. They can't afford to train from scratch. They can't rely on OpenAI or Google to prioritise Sinhala or Kinyarwanda.
With KALAVAI, the country's university trains a language specialist. The ministry of justice trains a legal specialist on national law. The education department trains on school textbooks. The health ministry trains on local medical guidelines. Each institution uses the GPUs they already have.
The fused model is a national AI that no foreign company built and no foreign company controls. Digital sovereignty through cooperative intelligence. It was not feasible before because no single institution in a small country has the compute or data. KALAVAI turns that constraint from a blocker into an irrelevance.
Reproduce It
30 minutes. One GPU. The whole protocol.
```bash
git clone https://github.com/mechramc/Kalavai.git
cd Kalavai
pip install transformers datasets torch accelerate
python experiments/kalavai_pythia_experiment.py
```
Requires any GPU with 24GB+ VRAM (RTX 3090, 4090, 5090, A100, or equivalent). Produces trained specialists, fused MoE, all evaluation numbers, and figures. Expected output: +14.2% ± 0.016% on held-out mixed evaluation.
| Script | Scale | Hardware | Time | Expected output |
|---|---|---|---|---|
| `kalavai_pythia_experiment.py` | 410M | Any 24GB GPU | ~30 min | +14.2% ± 0.016% |
| `kalavai_pythia_1b_experiment.py` | 1B | Any 24GB GPU | ~2 hours | +14.8% ± 0.003% |
| `kalavai_pythia_6b_experiment.py` | 6.9B | A100 80GB | ~8 hours | +2.43% ± 0.00% |
| `kalavai_training_duration_crossover.py` | 410M | Any 24GB GPU | ~4 hours | Crossover at 10k steps |
| `kalavai_domain_classifier_baseline.py` | 410M | Any 24GB GPU | ~45 min | −21.1% (classifier) |
Every experiment is a self-contained Python file. No config files, no YAML. Read the script, understand the experiment, run it. 322 automated audit checks verify every result before any paper-ready number is reported.