A procedural dataset that encodes the breadth of human scientific knowledge as step-by-step reasoning problems. The goal is not a benchmark. The goal is to teach machines how humans think, discover, and invent.
From counting to self-awareness. From following procedures to creating them.
The state space holds more distinct problem instances than there are atoms in the observable universe. Every model, at every scale, must learn the algorithms.
| Model | Parameters | Can Memorise | Coverage | Verdict |
|---|---|---|---|---|
| GPT-2 | 124M | ~134,000 | 10⁻⁷⁶ | MUST REASON |
| Llama-2 7B | 7B | ~7.5M | 10⁻⁷⁴ | MUST REASON |
| Llama-2 70B | 70B | ~75M | 10⁻⁷³ | MUST REASON |
| GPT-4 (est.) | ~1.8T | ~1.9B | 10⁻⁷² | MUST REASON |
| Llama-3.1 405B | 405B | ~438M | 10⁻⁷² | MUST REASON |
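
The Coverage column can be read as the fraction of the state space a model could memorise outright, expressed as a power of ten. The sketch below reproduces that column under one loud assumption: the state-space size (`ASSUMED_STATE_SPACE`) is not stated in this document and is back-derived here from the GPT-2 row. The function name `coverage_exponent` is likewise illustrative.

```python
import math

# ASSUMPTION: state-space size back-derived from the GPT-2 row
# (~134,000 instances at a coverage of about 10^-76).
# The document itself does not state this number.
ASSUMED_STATE_SPACE = 1.34e81

# "Can Memorise" figures copied from the table above.
MEMORISABLE = {
    "GPT-2": 134_000,
    "Llama-2 7B": 7.5e6,
    "Llama-2 70B": 75e6,
    "GPT-4 (est.)": 1.9e9,
    "Llama-3.1 405B": 438e6,
}


def coverage_exponent(memorisable, state_space=ASSUMED_STATE_SPACE):
    """Return the nearest power-of-ten exponent of memorisable / state_space."""
    return round(math.log10(memorisable / state_space))


exponents = {name: coverage_exponent(n) for name, n in MEMORISABLE.items()}
for name, exp in exponents.items():
    print(f"{name}: ~10^{exp}")
```

Even the largest row covers a vanishing fraction of the space, which is the point of the Verdict column.
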
The entire curriculum is 1.85 MB of compressed algorithms, yet it produces terabytes of unique instances: a compression ratio of 1,250,000:1. The only winning strategy is to learn the algorithms.
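
As a quick sanity check on those two figures, the arithmetic can be written out. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB); the variable names are illustrative.

```python
# Figures quoted in the text.
CURRICULUM_MB = 1.85          # size of the compressed algorithms
COMPRESSION_RATIO = 1_250_000 # generated bytes per stored byte

# ASSUMPTION: decimal units, 1 TB = 1,000,000 MB.
generated_mb = CURRICULUM_MB * COMPRESSION_RATIO
generated_tb = generated_mb / 1_000_000

print(f"{generated_tb:.2f} TB of generated instances")
```

Under that assumption the ratio corresponds to a little over 2 TB, consistent with "terabytes".
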
The breadth of formalised human knowledge, encoded as reasoning problems.
135 tokens. Every character is its own token. No BPE. No subword merging. Digits stay atomic. LaTeX stays intact. The model learns to read and write mathematical notation as a native language.
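
A character-level tokenizer of this kind can be sketched in a few lines. The class name `CharTokenizer` and the alphabet below are hypothetical: the real 135-token vocabulary is not listed here, so a small stand-in alphabet is used purely for illustration.

```python
from __future__ import annotations


class CharTokenizer:
    """Character-level tokenizer: one vocabulary entry per character."""

    def __init__(self, alphabet: str):
        # Deduplicate while preserving order so token ids are stable.
        self.itos = list(dict.fromkeys(alphabet))
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}

    def encode(self, text: str) -> list[int]:
        # No merging: each character maps to exactly one id.
        # Raises KeyError for characters outside the alphabet.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


# HYPOTHETICAL alphabet, not the project's actual 135-token vocabulary.
alphabet = "0123456789+-*/=^_{}\\ ()abcdefghijklmnopqrstuvwxyz"
tok = CharTokenizer(alphabet)

text = "\\frac{12}{3}=4"
ids = tok.encode(text)
```

Because there is no subword merging, `len(ids)` always equals `len(text)`, a multi-digit number such as `12` is encoded as two separate digit tokens, and LaTeX commands round-trip unchanged through `decode`.
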