Check out the (early) project and source code on GitHub.
The idea
Some problems are easy in one direction and hard in the other. Taking a derivative is mechanical; finding an antiderivative can require genuine creativity. Verifying a proof step by step is easy; discovering the proof is hard.
RPSDG (Reverse-Process Synthetic Data Generation) exploits this asymmetry. Run the easy direction with full step-by-step work, then reverse the result to get a hard problem with a known solution. You end up with process-supervised training data: not just the answer, but the entire derivation.
Richard Sutton’s “The Bitter Lesson” argues that methods that scale with compute and data will eventually win. The bottleneck is high-quality data. A lot of the world’s data is latent: the processes that generated it are not written down. In math, the way a proof was discovered is usually hidden behind a polished presentation. RPSDG is one way to manufacture that hidden process data.
Derivatives to Integrals
Computing derivatives is mechanical. Integration often is not. That asymmetry gives us a data pipeline.
Start with known functions. Pick functions \( f(x) \) with closed-form derivatives: polynomials, trig, exponentials, logarithms. Vary complexity.
Differentiate with full work shown. Take the derivative of \( f(x) \) to get \( f'(x) \), recording every step.
Reverse the process. Now \( f'(x) \) is the problem and \( f(x) \) is the solution. The recorded steps, read backward, give you a worked integration example.
By composing functions of varying complexity, you get integration problems of graduated difficulty. The training data comes with step-by-step solutions for free, because you generated it by running the easy direction.
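The pipeline above can be sketched in a few lines with sympy. This is a minimal illustration, not the project's actual implementation: the function names (`make_integration_problem`) and the seed pool are made up for the example, and step-by-step recording is elided to keep it short.

```python
# Sketch of the derivatives-to-integrals pipeline (illustrative only).
import random
import sympy as sp

x = sp.Symbol("x")

# A small pool of seed functions f(x) with closed-form derivatives.
SEEDS = [x**3, sp.sin(x), sp.exp(x), x * sp.log(x)]

def make_integration_problem(rng: random.Random):
    """Run the easy direction (differentiation), then reverse it."""
    f = rng.choice(SEEDS) * rng.choice(SEEDS)  # compose seeds for variety
    f_prime = sp.diff(f, x)                    # mechanical, always succeeds
    # Reversed: f_prime is now the integration problem, f the known solution.
    return f_prime, f

rng = random.Random(0)
problem, solution = make_integration_problem(rng)

# The solution is cheap to verify: differentiate it and compare.
assert sp.simplify(sp.diff(solution, x) - problem) == 0
```

A real pipeline would also record each differentiation rule as it fires (product rule, chain rule, and so on), since that recorded trace, read backward, is the process supervision.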
Proofs by Random Walk
The same idea works for theorem proving. Generating proofs is hard. Verifying them is (comparatively) easy.
Random walks in expression space. Start with a random expression \( e_{\text{start}} \). Apply rewrite rules \( r_1, r_2, \ldots, r_n \) to get a chain of intermediate expressions ending at \( e_{\text{end}} \).
Read off the theorem. The pair \( (e_{\text{start}}, e_{\text{end}}) \) is a theorem. The chain of rewrites is its proof.
Reverse when useful. Running the chain backward works too, especially when a complex step in one direction (integration) becomes simple in the other (differentiation).
Scale it. Random starting points and random rewrite sequences give you a diverse set of theorems and proofs automatically. No human has to come up with the theorem first.
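The random-walk idea can be sketched the same way. Here the "rewrite rules" are equality-preserving sympy transformations; a real system would use a richer rule set and record which rule fired at each step. Everything here is an illustrative assumption, not the project's code.

```python
# Sketch of theorem generation by random walk in expression space.
import random
import sympy as sp

x = sp.Symbol("x")

# Equality-preserving rewrite rules: each maps an expression to an
# equivalent one, so every step of the walk is a valid proof step.
RULES = [sp.expand, sp.factor, sp.expand_trig, sp.trigsimp]

def random_walk(e_start, rng: random.Random, n_steps: int = 4):
    """Apply random rewrite rules, keeping the whole chain as the proof."""
    chain = [e_start]
    for _ in range(n_steps):
        rule = rng.choice(RULES)
        chain.append(rule(chain[-1]))
    return chain

e_start = (x + 1) ** 2 * sp.sin(2 * x)
chain = random_walk(e_start, random.Random(1))
e_end = chain[-1]

# Theorem: e_start == e_end. The chain is its proof, and verification
# (checking each rewrite, or the endpoints) is cheap.
assert sp.simplify(e_start - e_end) == 0
```

Because every rule preserves equality, the pair of endpoints is a theorem by construction; no human has to state it first, which is exactly what makes the scheme scale.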
What this gets you
The training data has process supervision baked in. Every example includes intermediate steps, not just the final answer. That should help LLMs learn multi-step reasoning rather than pattern-matching to answers.
It also gives you something like explainability for free: the model’s training data literally consists of step-by-step solutions, so the model has a better chance of producing step-by-step reasoning at inference time.
Limitations and next steps
This is early work. The data generation pipeline exists, but I have not yet run the full fine-tuning experiments and benchmarks. The planned pipeline is:
- Data generation with graduated difficulty (curriculum learning)
- Fine-tuning transformer-based LMs on the generated data
- Self-supervised learning experiments
- Evaluation against standard math reasoning benchmarks
Further out, I want to explore reinforcement learning that rewards multi-step reasoning even when the solution is not known in advance but can be verified.