Four steps. Zero communication during training.
The entire KALAVAI protocol fits in a paragraph. A coordinator distributes a shared base checkpoint. Each contributor fine-tunes their copy on their own domain — independently, asynchronously, on whatever hardware they have. Nobody shares data, gradients, or activations. When everyone is done, they submit their checkpoints. A lightweight router (a single linear layer, trained for 500 steps on mixed data) learns which expert is best for which token. At inference, all specialists run in parallel and the router combines their outputs.
That's it. The mechanism is the protocol, not the infrastructure. Standard PyTorch. Standard HuggingFace. No custom CUDA kernels, no distributed training framework, no LoRA, no adapters.
```python
# Everyone starts from the same model
base = load("pythia-410m", revision="step10000")

# Each person trains on their domain (independently, no communication)
specialist_code = train(copy(base), code_data, steps=2000)
specialist_science = train(copy(base), science_data, steps=2000)
specialist_fiction = train(copy(base), fiction_data, steps=2000)

# A router learns who's good at what (500 steps, one linear layer)
router = nn.Linear(hidden_size, 3, bias=False)
fused = MoE(specialists=[specialist_code, specialist_science, specialist_fiction],
            router=router)
train_router(fused, mixed_data, steps=500)

# Result: fused model beats every individual specialist
# (+14.2% over best specialist, +14.5% over monolithic training)
```
Core Results
Consistent gains at 410M and 1B. Reduced at 6.9B.
The fused model beats the best individual specialist at every tested scale. The gains at 410M and 1B are robust across three random seeds with near-zero variance. The 6.9B result is smaller; we attribute this to a much weaker per-parameter training signal (17× the parameters of the 410M runs, but only a quarter of the fine-tuning steps).
Beats equal-compute monolithic training
The natural objection: just train one model on all the data for the same total compute. We tested this directly. A single model fine-tuned on mixed data for 6,000 steps (equal to 3 specialists × 2,000 steps) achieves +6.7% over base. KALAVAI achieves +14.5% over that monolithic model.
| Method | Eval loss (↓ better) | vs. Base | vs. Monolithic |
|---|---|---|---|
| Base model | 2.248 | — | — |
| Monolithic (6,000 steps mixed) | 2.098 | +6.7% | — |
| Best specialist (code) | 2.089 | +7.1% | +0.4% |
| Weight averaging | 2.158 | +4.0% | — |
| Wider model (3.5× params) | 2.120 | +5.9% | — |
| KALAVAI MoE | 1.793 | +20.2% | +14.5% |
1B results: replication holds
The 1B scale replicates the 410M result with essentially zero variance: +14.8% over the best specialist across three seeds, standard deviation 0.003%. The equal-compute comparison also holds at this scale, with KALAVAI beating monolithic training by roughly +14.5%.
6.9B: the scale boundary
At 6.9B parameters the improvement contracts to +2.4%. The mechanism still works — the MoE is definitively better than every individual specialist — but the magnitude is much smaller. The most likely explanation: at 6.9B, each specialist had only 500 fine-tuning steps (budget-constrained by A100 runtime), leaving insufficient domain signal for strong specialisation. A follow-up experiment (B1: 6.9B step-budget sweep) will test whether more training recovers the larger improvement seen at 410M/1B.
Three Governing Conditions
When fusion works — and when it doesn't
The paper's core contribution isn't "MoE is good." It's an empirical characterisation of the conditions under which post-hoc fusion of independently trained specialists succeeds or fails. We identify three.
1. Shared initialisation is necessary
Specialists must start from the same checkpoint. The shared starting point preserves representational compatibility — specialists diverge in what they learn, but the geometry of their representations stays aligned enough for a router to combine them. Initialising from different checkpoints breaks this: their representational spaces are no longer aligned, and the router cannot learn coherent dispatch.
Practical implication: The cooperative coordinator must distribute a single canonical checkpoint. All specialists must start from exactly the same revision — same weights, same tokenizer, same architecture. This is the one non-negotiable constraint of the protocol.
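As a sketch of how a coordinator or contributor might verify this constraint before training begins, the hypothetical helper below (not part of the KALAVAI code) compares two models' state dicts tensor by tensor, with toy `nn.Linear` modules standing in for full checkpoints:

```python
import torch
import torch.nn as nn

def same_initialisation(model_a: nn.Module, model_b: nn.Module) -> bool:
    """Check that two contributors start from byte-identical weights.

    Illustrative helper: a real check would also compare tokenizer files
    and the checkpoint revision string.
    """
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    return sd_a.keys() == sd_b.keys() and all(
        torch.equal(sd_a[k], sd_b[k]) for k in sd_a
    )

torch.manual_seed(0)
a = nn.Linear(8, 8)
torch.manual_seed(0)
b = nn.Linear(8, 8)  # same seed, so identical weights
c = nn.Linear(8, 8)  # fresh RNG draw, so different weights

print(same_initialisation(a, b), same_initialisation(a, c))  # True False
```

With HuggingFace models, pinning everyone to the same `revision` argument in `from_pretrained` serves the same purpose.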
2. Frozen layers become necessary beyond ~10,000 steps
At short training horizons (≤5,000 steps), freezing layers is optional — more plastic representations allow specialists to diverge more effectively. But beyond approximately 10,000 steps, unfrozen specialists over-specialise and become harder to fuse. Freezing the first K layers provides a structural anchor that preserves routing compatibility.
| Steps | Freeze=0 | Freeze=4 | Winner |
|---|---|---|---|
| 500 | +9.9% | +8.9% | Freeze=0 |
| 1,000 | +12.5% | +11.3% | Freeze=0 |
| 2,000 | +15.1% | +13.9% | Freeze=0 |
| 5,000 | +16.4% | +15.8% | Freeze=0 |
| 10,000 | +15.4% | +15.6% | Freeze=4 ← crossover |
| 20,000 | +13.6% | +14.8% | Freeze=4 |
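The freezing pattern itself is a few lines. Here is a minimal sketch using a toy stack of blocks in place of real transformer layers; `freeze_first_k` is an illustrative helper, not the paper's code (for a HuggingFace GPTNeoX/Pythia model the blocks live at `model.gpt_neox.layers`):

```python
import torch.nn as nn

def freeze_first_k(blocks, k: int) -> None:
    """Freeze the first k blocks as a structural anchor for routing compatibility."""
    for block in list(blocks)[:k]:
        for p in block.parameters():
            p.requires_grad = False

# Toy stack of 12 blocks standing in for a model's transformer layers.
blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(12))
freeze_first_k(blocks, 4)

trainable = [i for i, b in enumerate(blocks)
             if all(p.requires_grad for p in b.parameters())]
print(trainable)  # [4, 5, 6, 7, 8, 9, 10, 11]
```

The optimiser then only sees parameters with `requires_grad=True`, so the frozen early layers stay identical across all specialists.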
3. All specialists must run at inference
This is the paper's most surprising result. A domain classifier with 99.3% accuracy, routing each input to a single specialist, produces −21.1% degradation relative to base. The MoE running all three specialists and combining outputs produces +14.1% improvement. Same routing accuracy — opposite results. The 35 percentage point gap is the difference between a system that works and one that's worse than doing nothing.
Why does single-expert dispatch fail? Each specialist forgets what it wasn't trained on. The code specialist's loss on science data is worse than the base model. When only one specialist runs, out-of-domain tokens have no fallback. Joint inference restores coverage by letting the router suppress out-of-domain specialists token by token.
The mechanism in one image
The cross-domain evaluation matrix shows why fusion works. Each specialist is best on its own domain (the diagonal) and worst on the others. The MoE router dispatches each token to the right diagonal entry, recovering all specialist gains simultaneously.
Ablations
What doesn't matter (and what does)
Router architecture doesn't matter
A uniform router (fixed 1/N weights, no training) achieves +6.7%. A trained linear router achieves +14.2%. A 2-layer MLP achieves +14.2%. The gap between uniform and learned routing is +7.5pp — entirely explained by the router's ability to suppress out-of-domain specialists. The specific function class is irrelevant; the minimum bar is learnable suppression.
Freeze depth sweep
We swept freeze depth from 0 (no frozen layers) to 12 (half the model). At 2,000 steps, freeze=0 wins. At 10,000 steps, freeze=4 and freeze=8 both outperform freeze=0. Practical guideline: training under 5,000 steps — skip freezing. Over 10,000 steps — freeze 4–8 layers.
Specialist count scales gracefully
Three, four, and five specialists all achieve approximately +14.1% with near-zero variance. The mechanism doesn't degrade as you add more contributors. The two-specialist configuration scores slightly higher only because its evaluation problem is narrower.
The mechanism survives base model maturity
Fusion improvement is consistent across Pythia checkpoints from 3.5% to 100% of pre-training at 410M (+13.4% to +15.0%) and 1B (+13.8% to +15.9%). Qwen-1.5B at full training is the boundary condition: −1.0%, because its pre-training corpus already covers the specialist domains — leaving no room for meaningful specialisation.
Extra parameters don't explain the gains
Two baselines rule out "more parameters" as the explanation. A wider single model with 3.5× the parameters achieves only +5.9%. A multi-head baseline with identical parameter count to the MoE but hard single-expert routing achieves −21.1%. The gain comes from cooperative specialisation plus joint inference, not raw capacity.
The Router in Action
Token-level routing, not document classification
On hybrid-domain prompts, the router switches experts mid-sentence. The prompt "Derive the equation for protein folding using Python pandas" forces a domain switch within a single sentence: science tokens ("derive," "equation," "protein," "folding") should activate the science specialist; code tokens ("Python," "pandas") should activate the code specialist. The router discovers this structure from the training signal alone — no supervision, no domain labels, no explicit boundaries.
The pattern is robust across multiple hybrid prompt types, including narrative/science, technical/narrative, and multi-domain sentences with three or more domain switches.
Router confidence distribution
In practice, the router operates as a near-hard switch: the highest-weight expert receives over 95% of the routing weight on more than 99.7% of tokens. Crucially, hard routing (argmax dispatch) and soft routing (learned softmax weights) produce identical perplexity — confirming that the value is selection and suppression, not the specific weighting scheme.
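The concentration statistic is easy to compute. The sketch below uses synthetic, artificially peaked routing logits (an assumption for illustration, not the trained router's actual outputs) to measure the fraction of near-hard tokens and to build the one-hot weights that argmax dispatch would substitute for the soft ones:

```python
import torch

torch.manual_seed(0)

# Toy routing logits for 10,000 tokens over 3 specialists; the large scale
# factor mimics the peaked distributions a trained router produces.
logits = torch.randn(10_000, 3) * 10
weights = torch.softmax(logits, dim=-1)

# Fraction of tokens where the top specialist receives >95% of the weight.
near_hard = (weights.max(dim=-1).values > 0.95).float().mean().item()

# Hard dispatch: replace the soft weights with a one-hot on the argmax.
one_hot = torch.zeros_like(weights).scatter_(
    1, weights.argmax(dim=-1, keepdim=True), 1.0)

print(f"{near_hard:.1%} of tokens are near-hard")
```

When `near_hard` is close to 1, swapping `weights` for `one_hot` barely changes the fused output, which is why hard and soft routing tie on perplexity.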
Beyond Perplexity
Downstream benchmarks
Perplexity improvements are clear and consistent. Downstream benchmark accuracy is more modest — less than 1pp on standard tasks at 1B. This is expected at these scales: perplexity and benchmark accuracy don't reliably track below 7B parameters. The KALAVAI paper is explicit about this gap.
On the benchmark gap. Standard commonsense benchmarks (ARC, HellaSwag) are factual Q&A, while the training domains are code, science, and fiction — a partial mismatch. A cooperative aligned to benchmark-relevant domains (e.g., world knowledge, reasoning, factual recall) would present a stronger evaluation of downstream gain potential.
Training Dynamics
How specialists diverge during training
Each specialist rapidly improves on its own domain while degrading on out-of-domain content. This divergence is exactly what makes the MoE valuable — if specialists didn't degrade on other domains, the router would have nothing useful to select between. The code specialist's catastrophic forgetting of science is a feature, not a bug.
Boundary Conditions
What we don't claim
The paper is explicit about five things it does not claim.
No inference efficiency. The fused model runs all N specialists in parallel. For N=3, inference overhead is approximately 2.5×. This is a training-time democratisation that trades inference cost for training accessibility.
No universal architecture generality. All primary results use Pythia. Qwen-1.5B at full training shows −1.0% — when the base model's pre-training already covers the specialist domains, fine-tuning produces insufficient differentiation.
No guaranteed downstream gains. Perplexity improvements are clear; benchmark accuracy improvements are modest (<1pp at 1B). Perplexity and downstream accuracy don't reliably track at these scales.
No real cooperative demonstrated. All experiments are simulated cooperatives on single machines. Heterogeneous hardware, asynchronous submission, and contributor reliability are open engineering problems.
No frontier-scale evidence. The 6.9B result (+2.4%) suggests scale-dependent sensitivity requiring further investigation.
Live Experiments
Phase A: the NeurIPS gate
Three experiments stand between this work and NeurIPS submission. Results will be committed to the repository and will update the paper before submission. Gate criterion: if A1 shows a >5% advantage for the MoE over monolithic training at 1B, and A3 shows clear degradation with mismatched checkpoints, the paper proceeds to NeurIPS 2026; otherwise, COLM/TMLR.
After A1+A3: If both gate criteria are met, two further experiments are planned — B1 (6.9B step-budget sweep: does more training recover the full improvement at large scale?) and C1 (heterogeneous cooperative: does fusion survive different batch sizes, learning rates, and training durations per contributor?).
What This Makes Possible
Five problems cooperative training solves
The protocol's zero-communication property — contributors share only a starting checkpoint and a final trained checkpoint, never data — opens applications that are structurally impossible with synchronous training or federated learning.
A hospital network that can't share patient data
Five hospitals, each with thousands of records in different specialties — cardiology, oncology, pediatrics, radiology, neurology. Privacy laws prevent pooling. Today, each can fine-tune a model on their own data, but it only knows their specialty.
With KALAVAI, each hospital trains a specialist on their private data that never leaves their servers. They share only the trained checkpoint — not a single patient record. The fused model understands all five specialties. No data was shared. No privacy was violated.
This was impossible before. Federated learning requires gradient sharing during training, which leaks information. KALAVAI requires zero communication during training — only the final checkpoint is shared.
Endangered languages get a real language model
There are maybe 50,000 pages of digitised text in Yoruba. Or Quechua. Or Tamil literary archives. Not enough for any organisation to justify training a model. But a university in Nigeria trains a Yoruba specialist. A lab in Peru trains a Quechua specialist. A team in Chennai trains a classical Tamil specialist. A group in Wales trains a Welsh specialist.
Each trains on whatever they have — even a few thousand pages is enough to produce meaningful specialisation from a shared base model. The fused model handles all four languages plus the base model's English. No single institution had enough data or compute to build this. The cooperative did.
KALAVAI changes the economics from "one organisation needs all the data and all the compute" to "each community contributes what they have."
A legal AI built across jurisdictions
Indian contract law, UK common law, US constitutional law, EU regulatory law, Brazilian civil law — each a specialised domain with its own corpus and reasoning patterns. No single firm has expertise across all five. Today you either build a generic legal model that's mediocre everywhere, or a narrow one that knows one jurisdiction deeply.
With KALAVAI, a firm in Mumbai trains the Indian law specialist. A firm in London trains the UK specialist. A firm in São Paulo trains the Brazilian specialist. The fused model can analyse a cross-border contract touching Indian, UK, and EU law — routing the relevant clauses to the relevant specialist. Each firm contributed domain expertise without sharing proprietary case databases with competitors.
Scientific research across fields that don't talk to each other
A climate science lab trains on atmospheric modelling. A marine biology lab trains on ocean ecosystems. A geology department trains on seismology. An economics department trains on resource economics. Individually, each model knows its field. Fused, the model can reason about questions at the intersection — "how do seismic events in the Pacific affect marine ecosystems and what are the economic implications for coastal fisheries?"
A new kind of interdisciplinary tool that emerges from collaboration without anyone needing to be interdisciplinary themselves.
A country builds its own sovereign AI without a hyperscaler
A small country — Sri Lanka, Estonia, Rwanda — wants a national language model that understands their language, laws, culture, geography, educational curriculum. They can't afford to train from scratch. They can't rely on OpenAI or Google to prioritise Sinhala or Kinyarwanda.
With KALAVAI, the country's university trains a language specialist. The ministry of justice trains a legal specialist on national law. The education department trains on school textbooks. The health ministry trains on local medical guidelines. Each institution uses the GPUs they already have.
The fused model is a national AI that no foreign company built and no foreign company controls. Digital sovereignty through cooperative intelligence. It was not feasible before because no single institution in a small country has the compute or data. KALAVAI turns that constraint from a blocker into an irrelevance.
Reproduce It
30 minutes. One GPU. The whole protocol.
```bash
git clone https://github.com/mechramc/Kalavai.git
cd Kalavai
pip install transformers datasets torch accelerate
python experiments/kalavai_pythia_experiment.py
```
Requires any GPU with 24GB+ VRAM (RTX 3090, 4090, 5090, A100, or equivalent). Produces trained specialists, fused MoE, all evaluation numbers, and figures. Expected output: +14.2% ± 0.016% on held-out mixed evaluation.
| Script | Scale | Hardware | Time | Expected output |
|---|---|---|---|---|
| `kalavai_pythia_experiment.py` | 410M | Any 24GB GPU | ~30 min | +14.2% ± 0.016% |
| `kalavai_pythia_1b_experiment.py` | 1B | Any 24GB GPU | ~2 hours | +14.8% ± 0.003% |
| `kalavai_pythia_6b_experiment.py` | 6.9B | A100 80GB | ~8 hours | +2.43% ± 0.00% |
| `kalavai_training_duration_crossover.py` | 410M | Any 24GB GPU | ~4 hours | Crossover at 10k steps |
| `kalavai_domain_classifier_baseline.py` | 410M | Any 24GB GPU | ~45 min | −21.1% (classifier) |
Every experiment is a self-contained Python file. No config files, no YAML. Read the script, understand the experiment, run it. 322 automated audit checks verify every result before any paper-ready number is reported.