Nous Research TST Review 2026: Token Superposition Training

What is Token Superposition Training (TST)?

On May 14, 2026, Nous Research (the team behind Hermes Agent, which has 140K+ stars) proposed Token Superposition Training (TST), a pre-training method reported to cut LLM training cost by roughly 60% while also improving performance [citation:9].

How TST Works

TST uses a coarse-to-fine learning strategy. In early training, the model reads groups of consecutive tokens as "superposed tokens" — like skimming a paragraph rather than reading word-by-word. Later, training switches back to standard next-token prediction for fine-grained learning [citation:9].

Two stages:

Superposition Phase: Groups s consecutive tokens into a bag, averages their embeddings into a single superposed token, and predicts which tokens appear in the next group, without regard to their order
Recovery Phase: Switches back to standard autoregressive next-token training to build generation ability
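The superposition phase described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Nous Research implementation: the function names, the dropped ragged tail, and the choice of a count-normalized distribution over the next bag as the order-free target are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def group_tokens(token_ids: Sequence[int], s: int) -> List[List[int]]:
    """Split a sequence into consecutive bags of s tokens.

    Assumption: a ragged tail shorter than s is simply dropped.
    """
    usable = len(token_ids) - len(token_ids) % s
    return [list(token_ids[i:i + s]) for i in range(0, usable, s)]


def superpose(bag: Sequence[int], embedding: Sequence[Vector]) -> Vector:
    """Average the embedding vectors of every token in one bag."""
    dim = len(embedding[0])
    return [sum(embedding[t][d] for t in bag) / len(bag) for d in range(dim)]


def bag_target(bag: Sequence[int], vocab_size: int) -> Vector:
    """Order-free target: probability mass spread over the tokens in the bag.

    Assumption: each occurrence contributes 1/len(bag); the source only
    says the model predicts which tokens appear, not their order.
    """
    target = [0.0] * vocab_size
    for t in bag:
        target[t] += 1.0 / len(bag)
    return target


def make_superposition_examples(
    token_ids: Sequence[int], embedding: Sequence[Vector], s: int
) -> Tuple[List[Vector], List[Vector]]:
    """Pair each superposed input bag with the target built from the next bag."""
    bags = group_tokens(token_ids, s)
    inputs = [superpose(b, embedding) for b in bags[:-1]]
    targets = [bag_target(b, len(embedding)) for b in bags[1:]]
    return inputs, targets


# Toy usage: a 4-token vocabulary with 2-dimensional embeddings, bag size 2.
toy_embedding = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
toy_inputs, toy_targets = make_superposition_examples([0, 1, 2, 3, 1, 1], toy_embedding, s=2)
```

With bag size s, the model sees a sequence that is s times shorter than the raw token stream, which is where the compute saving in this phase would come from.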

Key Results

On a 10B-A1B MoE model (10B total parameters, about 1B active per token; similar to Qwen3 10B):

Baseline: 1.05T tokens, 12,311 B200-hours → Loss 2.252
TST: 2T tokens, 4,768 B200-hours (38.7% of baseline) → Loss 2.236 (lower than baseline) [citation:9]
Speedup: ~2.5x for same loss target
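The percentages quoted in this review follow directly from the two GPU-hour figures above. The short check below recomputes them; the function and variable names are illustrative only, and the inputs are the numbers reported in this section.

```python
def compute_savings(baseline_hours: float, tst_hours: float) -> dict:
    """Derive cost ratio, reduction, and speedup from two GPU-hour totals."""
    ratio = tst_hours / baseline_hours
    return {
        "fraction_of_baseline": ratio,        # TST cost as a share of baseline
        "reduction": 1.0 - ratio,             # share of GPU-hours saved
        "speedup": baseline_hours / tst_hours,
    }


# Figures reported above: 12,311 vs 4,768 B200-hours.
savings = compute_savings(baseline_hours=12_311, tst_hours=4_768)

print(f"fraction of baseline: {savings['fraction_of_baseline']:.1%}")  # 38.7%
print(f"reduction:            {savings['reduction']:.1%}")             # 61.3%
print(f"speedup:              {savings['speedup']:.2f}x")              # 2.58x
```

The "approximately 60%" and "~2.5x" figures are therefore rounded forms of about 61.3% and 2.58x. Note that the TST run also ends at a slightly lower loss, so the comparison is not at exactly equal loss.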

Key Advantages

No Architecture Changes: Final model is standard LLM — no inference changes required [citation:9]
No Tokenizer Changes: Works with existing tokenizers
About 2.5x Faster: Reaches the same loss target in roughly 40% of the GPU-hours
Better Metrics: Lower loss and improved downstream scores (HellaSwag, ARC, MMLU)

Why This Matters

Most efficiency methods change the model architecture or the tokenizer, which forces the surrounding ecosystem (inference engines, tooling, fine-tuning pipelines) to be adapted again. TST keeps the added complexity inside training only. The final model is a standard LLM that can be deployed on existing infrastructure [citation:9].

Optimal Hyperparameters

Bag size: 4-8 tokens
Superposition steps ratio: 20-40% of total training
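To make these two settings concrete, here is a small sketch of how a training loop might decide which phase a given step belongs to. The function names and the hard switch at a single step are assumptions for illustration; the source only gives the ranges above.

```python
def superposition_steps(total_steps: int, ratio: float) -> int:
    """Number of initial steps spent in the superposition phase."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return round(total_steps * ratio)


def phase_at(step: int, total_steps: int, ratio: float) -> str:
    """Return the phase name for a 0-indexed training step."""
    if step < superposition_steps(total_steps, ratio):
        return "superposition"
    return "recovery"


def tokens_per_position(phase: str, bag_size: int) -> int:
    """How many raw tokens one input position covers in each phase."""
    return bag_size if phase == "superposition" else 1


# Example settings inside the ranges quoted above (hypothetical run length).
TOTAL_STEPS = 10_000
RATIO = 0.3      # 30% of steps in the superposition phase
BAG_SIZE = 4     # 4 tokens per superposed token

switch_step = superposition_steps(TOTAL_STEPS, RATIO)
```

In this example the first 3,000 steps would read bags of 4 tokens, and the remaining 7,000 would use standard next-token prediction.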

Pricing

Research release: the paper is available on arXiv, and an implementation is described as coming soon.

Pros

About 60% fewer GPU-hours for pre-training in the reported 10B-A1B run
No architecture changes: the final model is a standard LLM
About 2.5x speedup for the same loss target
Improves downstream metrics instead of trading them for speed
Research-backed with rigorous experiments

Cons

Research paper only, with no production implementation yet
Must switch back to standard training (two-phase process)
Optimal hyperparameters require tuning
Benefit magnitude varies by model scale

Who Should Use It?

Perfect for: AI researchers, LLM training teams, and companies pre-training their own models looking to reduce costs.

Verdict

TST is the most practical LLM training efficiency method since DeepSeek. Because it requires no architecture or tokenizer changes and still improves metrics, it is unusually valuable for teams that pre-train their own models [citation:9].

Rating: 4.7/5 - The breakthrough LLM pre-training needed.
