I finally got around to trying ChatGPT this week. People have been talking about it for weeks, but I was buried in cancer treatment, chemo recovery, surgery prep, and thesis work on Weibull distributions. I had no bandwidth for keeping up with ML.
When I finally tried it, my reaction was not surprise at the technology itself.
It was: “This makes sense. The pieces were all there.”
Why I Missed It
GPT-3 came out in 2020. I was dealing with a stage 3 cancer diagnosis, chemotherapy, mathematical statistics coursework, thesis research on masked failure data, and surgery and recovery.
I had no attention left for tracking ML developments. The world moved on. I was focused on survival. That is fine.
The Theoretical Foundation
I have been interested in Marcus Hutter and Ray Solomonoff’s work for years.
Solomonoff induction is the theoretical foundation: optimal prediction is compression. Intelligence is sequence prediction. The shortest program that generates your observations is the best predictor of what comes next.
Hutter’s AIXI formalized this: intelligence is optimal compression-based prediction with resource bounds.
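The "prediction is compression" idea can be made concrete with a toy experiment: score candidate continuations by how well a general-purpose compressor handles them. This is only a sketch, not Solomonoff induction proper (which is uncomputable); here zlib stands in for the ideal compressor, and the history string is an arbitrary example.

```python
import zlib

def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8")))

def predict_next(history: str, candidates: str) -> str:
    """Pick the candidate symbol whose continuation compresses best.

    A crude stand-in for Solomonoff induction: the continuation that
    adds the least description length is treated as the most probable.
    Ties go to the earlier candidate.
    """
    return min(candidates, key=lambda c: compressed_size(history + c))

history = "abab" * 20
print(predict_next(history, "ab"))  # continuing the pattern favors 'a'
```

The compressor exploits the regularity in the history, so the continuation that extends the pattern costs fewer (or no more) bits, which is exactly the "shortest program predicts best" intuition in miniature.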
Back during my CS master’s, I proposed working on sequence prediction as a thesis topic, inspired by Solomonoff. The professor was not interested. I ended up doing encrypted search instead.
But the intuition stuck with me: prediction is compression is intelligence.
The Bitter Lesson
Rich Sutton’s “The Bitter Lesson” laid it out plainly: scaling compute and data beats clever algorithms. The lesson from 70 years of AI research is that general methods which use computation win. Hand-crafted features lose. Search and learning scale. Everything else does not.
I read that essay and found it compelling. But there is a difference between understanding theory and watching it play out at scale. OpenAI was actually doing the scaling while I was working on other problems.
ImageNet Should Have Been the Signal
In retrospect, ImageNet being solved by deep neural networks in 2012 was the canary. A simple architecture (CNNs), massive data, lots of compute. Superhuman image classification.
That was the proof: scale works. More layers, more data, more GPU hours.
GPT is the same pattern. Simple architecture (transformers). Massive data (internet-scale text). Enormous compute (thousands of GPUs). Result: something that looks disturbingly intelligent.
Connecting the Dots
The theoretical framework was already in place. Solomonoff said intelligence is compression. Hutter said optimal prediction with bounded resources. Sutton said scaling beats cleverness.
The empirical evidence accumulated. ImageNet showed scale solved vision. AlphaGo showed scale plus self-play solved Go. GPT-2 showed scale made coherent text generation.
A model at GPT-3.5's level is the natural next step in that pattern. But when you are focused on survival and thesis work, you do not always see the broader trajectory.
The Broader Implication
If sequence prediction is intelligence, and we have scaled it to this level, what comes next?
GPT-3.5 is already disturbingly capable. It writes code, explains concepts, engages in dialogue, reasons about abstractions, generates plausible arguments. It is not AGI. But it is further along the path than I expected to see in 2022.
What This Changes
I need to pay attention to AI developments again. Not just theory, but what is actually being deployed.
If this trajectory continues, GPT-4 will be better. Multimodal models will integrate vision, text, and code. Reasoning capabilities will improve. Alignment problems will become urgent.
The theoretical questions I care about (suffering in artificial minds, s-risks, value alignment) are no longer distant abstractions. They are engineering problems that need solutions now.
Full Circle
I proposed working on sequence prediction for my master's thesis around 2012. The professor was not interested. Too speculative.
Ten years later, sequence prediction is the foundation of a technology that might change everything. The theoretical clarity was already there. What was needed was compute, data, and engineering, which is what OpenAI brought.
Cancer, Compute, and Mortality
There is a strange parallel. I have spent the last two years studying failure distributions while my own failure distribution became personal. Now I am watching AI systems get better at an exponential rate while my own compute substrate degrades.
The math of both is well-understood. Weibull distributions for biological failure. Scaling laws for neural network performance. But experiencing them is different from deriving them.
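Both pieces of math fit in a few lines. The Weibull hazard h(t) = (k/λ)(t/λ)^(k−1) rises with t whenever the shape k > 1, which is the standard model of wear-out failure; Kaplan-style scaling laws take the power-law form L(C) = (C₀/C)^α, with loss falling smoothly as compute grows. The parameters below are purely illustrative, not fitted to any real data:

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k/lam) * (t/lam)**(k-1).
    With shape > 1 the hazard rises with age: wear-out failure."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def power_law_loss(compute: float, c0: float, alpha: float) -> float:
    """Kaplan-style scaling law L(C) = (c0 / C)**alpha:
    loss decreases smoothly as a power of compute."""
    return (c0 / compute) ** alpha

# Illustrative parameters only. With shape=2 the hazard grows linearly in t;
# with alpha=0.05 the loss curve bends but never stops improving.
for t in (10, 20, 40):
    print(f"hazard at t={t}: {weibull_hazard(t, shape=2.0, scale=50.0):.4f}")
for c in (1e3, 1e6, 1e9):
    print(f"loss at C={c:.0e}: {power_law_loss(c, c0=1.0, alpha=0.05):.3f}")
```

One curve monotonically rises, the other monotonically falls; that asymmetry is the whole parallel.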
What Now
I am returning to AI research. Not full-time. I am still finishing the math degree, still dealing with cancer.
But I need to understand how these models actually work, what their failure modes are, whether they can be aligned, what risks they pose.
Solomonoff and Hutter gave us the theory. OpenAI gave us the engineering. Now we need to figure out what we have actually built.