I am an engineer and entrepreneur trying to make sure AI is developed without killing everybody. I founded and was CEO of the startup Ripe Robotics from 2019 to 2024. | hunterjay.com
HunterJay
HunterJay’s Shortform
The Low Hanging Fruit of AI Self Improvement
How I Handle Automated Programming
I broadly agree, and I also think this explains why it sucks now (the models aren’t capable of doing this explicitly very well yet) but could be extremely good in the future (it directly utilises the intelligence we’re already training, and therefore should improve with it automatically).
I might have misunderstood, but I disagree that you cannot learn things like playing chess in-context. I can write up the rules to a new made up game, give Claude some examples of what those games look like, and then play the game against Claude and he will be able to play it (somewhat).
He’ll still make mistakes sometimes, but if he was able to write down those mistakes (and as he gets smarter, understand and write down the reasons he made those mistakes and how to do better in the future), then when I start a fresh instance and load in those notes again, the new instance will play even better. Depending on how good Claude is at understanding his failures, and how good he is at writing down the lessons in a way that the new instance can generalise from, you could (effectively) have a system that keeps getting better at the game.
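The loop I have in mind can be sketched in a few lines. This is an illustrative sketch only: `llm` and `play_game` here stand in for a hypothetical completion call and game harness, not any real API.

```python
def play_and_learn(llm, play_game, rules, n_rounds):
    """Accumulate notes across fresh instances of the same model."""
    notes = ""
    for _ in range(n_rounds):
        # Each round is a fresh instance, seeded only with the rules and notes.
        transcript = play_game(llm, rules, notes)
        # The model reviews its own mistakes and rewrites the notes, so the
        # next fresh instance starts from the lessons of this one.
        notes = llm(
            f"Rules:\n{rules}\n\nTranscript:\n{transcript}\n\n"
            f"Previous notes:\n{notes}\n\n"
            "List the mistakes made above and write improved notes "
            "for the next game."
        )
    return notes
```

The system's skill ceiling then comes from how good the note-writing step is, which is exactly the part that improves automatically with model intelligence.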
Do you disagree on this? Am I missing something about your claim here? Apologies if my response isn’t on point, I’m somewhat unsure what you are trying to say here.
I agree sample efficiency is an issue. In-context learning is wildly more sample efficient than fine-tuning, but it could be better. The trouble here is if you need some minimum number of samples to understand and generalise something properly, you might hit the context limit first (or you might not recall properly across the whole window, also another open problem).
Regarding recurrent state models, as far as I understand, this is effectively doing the same thing as having a KV-cache with memories, except that in the KV-cache case, the model must explicitly write out what to include, whereas state space models have some method built into their architecture and learnt during training. Am I correct on that? I suspect this is an advantage of the KV-cache memory method, if you believe the models will keep getting more intelligent regardless, because it lets us use that intelligence to decide what to remember—somewhat analogous to a human deciding what to study, or dwelling on things that then get committed to memory.
Having the model explicitly write what to include also sidesteps the problem of designing a good state space learning mechanism, since we’re just relying on intelligence we can train in known ways to decide what information to retain moving forward.
I see your point, and definitely agree that learning / deep understanding as a human is different to having access to lots of notes, but I think that the huge context + speed of reading + new model releases might be enough to get nearly all the benefits that require humans to use our long term memory.
For example, could this requirement in humans just be that we can store so little in conscious thought? If I can only hold 10 things ‘front of mind’ at a time, to do anything complex I am forced to rely on my larger long term memory to surface relevant things. If I could hold 1,000,000 things front of mind, maybe I wouldn’t need that at all. This is also somewhat justified by in-context learning (arguably) doing the same thing as gradient descent, though that is debated.
Regarding notes, I think the same difference in scale applies, as well as a qualitative difference. A human can only read maybe 10 words a second, but an AI can read on the order of 100,000. I argue that makes it closer to a human “reading” from their memory compared with a human reading notes, especially since all of the notes are loaded into the context window / working memory of the AI. That, I think, is more analogous to the way human memories are used than to how humans use notes, since the human reading can’t hold more than a sentence or so in working memory at a time without compressing it.
Thanks for replying. I definitely do expect real continual learning to be developed too, to be clear. I don’t know on what timeline, but if there’s any benefit to be gained by it, it will remain an R&D target and eventually be cracked, possibly by automated R&D. My main argument is that theoretical breakthroughs aren’t required to get most of the supposed benefits of continual learning.
I think context engineering is a fair description of what I am talking about, yes! Except that this is explicitly the subset of that where we are getting the AI to intelligently handle its own context. I hadn’t heard that about fine-tuning narrow skills, interesting if true, do you have a source going into that?
Regarding proprietary data, I am talking about a system where the information (memories + documentation) is kept out of the weights, and out of the training, which seems like it’s much better for proprietary data. The data never has to be shared, and to get the same performance again, you just load the memories into context without needing to fine-tune. Did I misunderstand you here?
Regarding the training, I’m not actually suggesting training on data produced at runtime, at least not in any way that is different to what happens today—I’m saying that you can take the post-training already being done (provide the model with some task, reward success) and expand it to allow models to learn to pass information between runs (provide the model with some task, let it write notes after, run another task with those notes, then reward based on the success on both tasks combined).
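As a sketch of that expanded reward, where `run_task` and `write_notes` are hypothetical stand-ins for the task harness and the note-writing step (not any real training code):

```python
def episode_reward(model, task_a, task_b, run_task, write_notes):
    """Reward signal for learning to pass useful notes between runs."""
    score_a = run_task(model, task_a, notes="")
    # After the first task, the model chooses what to pass forward.
    notes = write_notes(model, task_a)
    # The second run starts fresh except for the notes in its context.
    score_b = run_task(model, task_b, notes=notes)
    # Rewarding the combined score trains the note-writing itself:
    # notes that help the second run get reinforced.
    return score_a + score_b
```

The point is that nothing here changes the underlying training machinery; it only changes what counts as one episode.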
Interesting thoughts on replay with human memories, I think I agree. It effectively means humans are selecting what to remember using our full(?) intelligence, which is an interesting thing to think about in light of having the LLMs select what to remember by writing notes (and thinking about why designing state space models to learn to choose what to keep implicitly rather than explicitly has been so hard).
Prosaic Continual Learning
You’re correct, sorry for being confusing. Tracing through:
My understanding of steering is that you can add a steering vector to an activation vector at some layer, which causes the model outputs to be ‘steered’ in that direction. I.e.:
Record layer ℓ’s activations when outputting “I am very happy”, get vector a_happy.
Record layer ℓ’s activations when outputting “I am totally neutral”, get vector a_neutral.
Subtract a_neutral from a_happy to get steering vector s = a_happy − a_neutral, the difference between ‘happy’ and ‘neutral’ outputs.
Add c·s to the activations at layer ℓ to steer the model into acting more happy, where c is some scalar.
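A minimal sketch of that procedure, assuming a hypothetical `get_activations(text, layer)` helper that returns the activation vector at a given layer (the helper and names are illustrative, not a real library):

```python
import numpy as np

def steering_vector(get_activations, layer):
    # Record activations for the two contrasting prompts.
    a_happy = get_activations("I am very happy", layer)
    a_neutral = get_activations("I am totally neutral", layer)
    # The steering vector is the difference between them.
    return a_happy - a_neutral

def steer(activation, s, c=1.0):
    # Add the scaled steering vector to the activation at that layer.
    return activation + c * s
```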
The tensor network architecture is scale invariant, which (by my understanding) means that scaling the activation vector at any layer maintains the relative magnitude of the activations at any later layer.
(Dumb) I thought that this meant that adding a steering vector c·s and adding a steering vector 2c·s would preserve the relative magnitude of the activations later in the network. That is, that scaling the steering vector would be scale invariant too. But that’s not the case — we’re changing the direction of the (activation vector + steering vector) when we increase the magnitude of the steering vector.
That’s pretty much all I was trying to correct in my response. When I was talking about entire layer / not entire layer, I was just trying to say you can’t pretend that adding a steering vector is actually just scaling the activation vector even if it is parallel in some dimensions. It’s a trivial point I was just thinking through aloud. Like:
If you have activation vector a = (a_1, a_2, …, a_n):
You are scale invariant if you multiply a by a scalar k: k·a = (k·a_1, k·a_2, …, k·a_n).
Which is the same as pointwise multiplication by the vector (k, k, …, k).
But you can’t just say, “Well, I’m only going to scale part of vector a, and since it’s scaling, that means it maintains scale invariance”, because it’s not just scaling, and that’s a dumb thing to say — pointwise multiplication by (k, 1, …, 1) is not equal to k′·a for any scalar k′.
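A toy numpy check of that point: scaling the whole vector preserves its direction, but scaling only one component does not (it is equivalent to adding a steering vector, not to scalar multiplication).

```python
import numpy as np

def direction(v):
    # Unit vector: scalar multiplication preserves this, partial scaling does not.
    return v / np.linalg.norm(v)

a = np.array([1.0, 2.0])

# Multiplying by a scalar is pointwise multiplication by (3, 3): direction unchanged.
assert np.allclose(direction(a), direction(3.0 * a))

# Scaling only the first component is pointwise multiplication by (3, 1): direction changes.
partial = a * np.array([3.0, 1.0])
assert not np.allclose(direction(a), direction(partial))
```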
So basically you can ignore that, I was just slowly thinking through the maths to come to trivial conclusions.
Your claim here is different and good, and points to another useful thing about bilinear layers. As far as I can tell — you are saying you can decompose the effect of the steering vector into separable terms purely from the weights, whereas with ReLU you can’t do this because you don’t know which gates will flip. Neat!
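A toy illustration of the ReLU problem as I understand it: whether a unit is on or off depends on the input, so the effect of adding a steering vector can’t be separated out from the weights alone the way it can for a linear map.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

a = np.array([-0.5, 1.0])   # activation vector
s = np.array([1.0, 1.0])    # steering vector

# For a linear map the steering effect separates: W(a + s) = Wa + Ws.
W = np.array([[1.0, 2.0], [3.0, 4.0]])
assert np.allclose(W @ (a + s), W @ a + W @ s)

# With ReLU it does not: adding s flips the first unit from off to on,
# so relu(a + s) != relu(a) + relu(s) in general.
assert not np.allclose(relu(a + s), relu(a) + relu(s))
```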
With more thinking, I was broadly wrong here:
- If you add a steering vector, it’s not just scaling, so scale invariance doesn’t make a difference.
- If you scale an existing activation vector which makes up the entirety of one of the layers, the only effect would be to change the absolute magnitudes going into the softmax (since scale invariance means the relative magnitude at each position is the same). That could have some minor effect—changing the probability distribution to be sharper or flatter, but that’s all.
- If you scale some existing activation which is not an entire layer, then it’s no longer scale invariant anymore either, it’s kind of like adding a steering vector with zero magnitude in the other dimensions.
There is still a weak advantage for steering vectors in a tensor network because the change is going to be smooth, rather than discrete (since we’re not flipping gates on and off), but basically I was just confused here, sorry about that.
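The softmax-sharpening point in the second bullet above can be checked directly: scaling the logits preserves their ordering but makes the output distribution more peaked.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([1.0, 2.0, 3.0])

p1 = softmax(logits)
p2 = softmax(2.0 * logits)  # a scale-invariant network passes the scale through to here

# The ordering is unchanged, but the distribution is sharper (more peaked).
assert np.argmax(p1) == np.argmax(p2)
assert p2.max() > p1.max()
```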
Another year has passed, 27 months total. Time for another review!
2023 Predictions for 2024, reassessed:
First, some predictions for 2024 were wrong because they hadn’t happened yet. Of those, let’s see how wrong I was—did they happen in 2025 instead?
“Agents can do basic tasks on computers—like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things”
<1 year off | I rated this as ‘Debatable’ last year, based on Claude with Computer Use and Figure’s robot control. Today, Claude for Chrome and Claude Code, Codex, etc. can clearly do these tasks sans robotics control. The robotics control piece has remained elusive and harder to claim; however, Tesla and Figure robots were deployed in some factory use, and use some general purpose transformers for part of their stack, so I think it can be claimed as “actually useful for some of these things” now.
“Context windows are no longer an issue for text generation tasks. Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.”
<1 year off | I rated this as ‘Mostly False’ for 2024, because although context windows had increased from ~8k to ~128k, I felt there were still limitations. They are now at ~1m, so if you are still having trouble with your text generation tasks, I would say it’s not because of the context window. Gemini 1.5 Pro also had a 1m context window in 2024, though I didn’t think it was effectively usable context at the time.
“GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.”
>1 year off, debatable framing | I still rate this as mostly false, although searching chats and the codebase works perfectly, and in effect is the same thing. The capability is there, just different to how I framed it in 2023.
“Online learning begins—GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).”
>1 year off | I still rate this as false; although current frontier models do help train their successors in several ways, we definitely do not run daily updates to the model.
“AI selection of what data to train on is used to improve datasets in general—training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.”
<1 year off | This is true today! AIs curate and filter their own data, and curriculum design is a larger part of efficient training. It is all done with AI assistance.
Overall Score for 2023 predictions about 2024:
Leans True 7
Too Vague 1
Leans False but <1 Year Late 3
Leans False and >1 Year Late 2
2023 Predictions for 2025:
“AI agents are used in basic robotics—like LLM driven delivery robots and (in demos of) household and factory robots, like the Tesla Bot. Multimodal models basically let them work out of the box, although not 100% reliably yet.”
Leans False | Demos of household and factory robots using LLMs have certainly happened, though the vibes of this prediction are overestimating progress in robotics. Tesla Optimus and Figure AI had small deployments in car factories, but not commercially, matching the prediction. Delivery robots did not use LLMs in any way, as far as I am aware, even for conversing with the sender or receiver, or for reasoning through high level path planning. Real-time multimodal AIs like Gemini Flash can live stream and respond to video and audio out of the box, but not reliably enough to be used as part of a robotics stack directly.
“Trends continue from the previous year:
The time horizons agents can work on increase
True | 100% true, and now well tracked by the famous METR chart.
LLMs improve on traditional LLM tasks
True | Trivially true, and an extremely obvious prediction. Comparing o1 to Opus 4.5, SWE-bench went from ~49% to ~81%, AIME maths went from ~74% to ~100%, and so on.
Smaller models get more capable
True | Claude Haiku 4.5 matches or exceeds Claude Opus 3 on many benchmarks, despite (apparently) being far smaller.
The best models get bigger.”
Debatable | Estimates of the most capable models’ parameter counts are higher than in previous years, but not by much, and the scaling up of parameters has not been a major source of performance improvements, against my broader expectations. It’s a technically correct prediction that missed the actual trend.
“AI curated and generated data becomes far more common than previously, especially for aligning models.”
True | Synthetic data became the default approach for LLMs, and is used deeply for alignment training through Constitutional AI. Deepseek’s R1 famously used pure RL for reasoning, and distilled to smaller models using the generated reasoning traces.
“Virtual environments become more common for training general purpose models, combined with traditional LLM training.”
Debatable | Robotics training in simulation is booming, as are world models as a research area, like Genie 3; however, they aren’t commonly used for training frontier LLMs. Those are trained in and for environments like browsers and terminals, with other tools which take actions for them, but that stretches the definition of ‘virtual environment’ a little.
“Code writing AI (just LLMs with context and finetuning) are capable of completely producing basic apps, solving most basic bugs, and working with human programmers very well—it’s pair programming with an AI, with the AI knowing all of the low level details (a savant who has memorised the docs and can use them perfectly, and can see the entire codebase at once), and the human keeping track of the higher level plan and goals. The AI can also be used to recommend architectures and approaches, of course, and gradually does more and more between human inputs.”
True | I barely directly write code myself anymore, as Opus 4.5 in Claude Code did nearly all of it for me by December 2025. I do still need to track the higher level plan, architecture, and goals, as predicted.
“If there ever feels like a lull in progress, it will be in this period leading up to models capable enough for robotics control, long time frame agents, and full form video generation, which I don’t expect to happen in an large scale way in 2025.”
Leans True | There was talk of feeling like there was a lull in early-mid 2025, around the time leading up to and including GPT-5’s launch, and it is correct that robotics control, long time horizon agents (more than a few hours), and full form video generation haven’t taken off yet.
“Possibly GPT-6 or equivalent is released, but more likely continuous improvements to GPT-5 carry forward. There’s not a super meaningful difference at this point, with online learning continually improving existing models.”
Leans False | No online learning in the sense I meant it, though we do have a focus on better post-training leading to many model releases sharing the same pre-trained base. I also think it is surprisingly debatable whether the jumps in the chain GPT-3 --> GPT-4 --> o1 --> Opus 4.5 were roughly equal in size (and hence whether Opus 4.5 is GPT-6 equivalent from 2023’s perspective), though I would still assess it as ‘probably not / leans false’.
Overall Score for 2023 predictions about 2025:
Leans True || True 6 (3 non-trivial)
Debatable 2
Leans False || False 2
Overall, I think my predictions matched my calibration about them, and potentially did even slightly better than my 40%–70% claim. The biggest mistake was predicting some form of continual learning, and by far the biggest omission was to say nothing about reasoning models, which would become the dominant paradigm a year after my original write-up. I did talk about runtime search in July 2024, several months before o1-preview was announced, but completely missed it in 2023.
I was pretty accurately calibrated on AI software capabilities, but too bullish on robotics. In hindsight, the ‘lull’ prediction for 2025 was probably the most unlikely one to get mostly correct, and I think compared to (my memory of) other predictions at the time I correctly downweighted video generation and upweighted coding automation.
See you all next year!
Also piggybacking, if anybody is Sydney-based or visiting Sydney, you are welcome to work out of the SydneyAISafetySpace.org (SASS) for free.
The fact that tensor network architectures are scale-invariant seems underappreciated for useful steering. If my understanding is correct, it would mean that scaling the steering vector should cause the same pathways through the model to be activated, whereas without this we could be activating a totally different pathway, and get much less predictable behaviour.
Correction Below
Claude Opus 4.6 is Driven
I agree, I definitely underestimated video. Before publishing, I had a friend review my predictions and they called out video as being too low, and I adjusted upward in response and still underestimated it.
I’d now agree with 2026 or 2027 for coherent feature film length video, though I’m not sure if it would be at feature film artistic quality (including plot). I also agree with Her-like products in the next year or two!
Personally I would still expect cloud compute to be used for robotics, but only in ways where latency doesn’t matter (like a planning and reasoning system on top of a smaller local model, doing deeper analysis like “There’s a bag on the floor by the door. Ordinarily it should be put away, but given that it wasn’t there 5 minutes ago, it might be actively used right now, so I should leave it...”). I’m not sure the privacy concerns will trump convenience, like with phones.
I also now think virtual agents will start to become a big thing in 2025 and 2026, doing some kinds of remote work, or sizable chunks of existing jobs autonomously (while still not being able to automate most jobs end to end)!
One year and 3 months on, I’m reviewing my predictions! Overall, I mark 13 predictions as true or mostly true, 6 as false or mostly false, and 3 as debatable.
Rest of 2023
Small improvements to LLMs
Google releases something competitive to ChatGPT.
Mostly True | Google had already released Bard at the time, which sucked, but this was upgraded to Gemini and relaunched in December 2023. Gemini Ultra wasn’t released until February 2024 though, so points off for that.
Anthropic and OpenAI slightly improve GPT-4 and Claude2
True | GPT-4 Turbo and Claude 2.1 were both released in November 2023.
Meta or another group releases better open source models, up to around GPT-3.5 level.
False | Llama 2 had already been released at this time, and was nearly as good as GPT-3.5, but no other GPT-3.5-or-better open source models came out in 2023.
Small improvements to Image Generation
Dalle3 gets small improvements.
Debatable | This is a really lukewarm prediction. Small changes were made to Dalle3 in the rest of 2023, integrating with GPT-4 prompting, for example, though there were complaints they made it worse in an attempt to avoid copyright issues when it was integrated with Bing.
Google or Meta releases something similar to Dalle3, but not as good.
Mostly True | Google released Imagen 2 in December 2023, which was about as good as DALL-E 3. I don’t know how much I should penalise myself for it being about as good, rather than ‘not as good’.
Slight improvements to AI generated videos.
Basic hooking up of Dalle3 to video generation with tagged on software, not really good consumer stuff yet. Works in an interesting way, like Dalle1, but not useful for much yet.
True | Lots of people played around with making videos by stepping through frames made in DALL-E 3, and they mostly weren’t very good! Pika 1.0 came out in December 2023, but it also wasn’t that great.
Further experiments hooking LLMs up to robotics/cars, but nothing commercial released.
True | Figure AI is the most notable example of hooking up LLMs to robotics, and they did some experiments in late 2023 with GPT-4. As far as I know there wasn’t any commercial release of an LLM-enabled robot anywhere.
Small improvements in training efficiency and data usage, particularly obviously in smaller models becoming more capable than older, larger ones.
True | Mistral 7B was notable here, being smaller and more capable than some of the earlier, much larger models like BLOOM 176B (as far as I can tell).
Since those ‘Rest of 2023’ predictions were only for three months in the future, most of them were very trivial to get right—of course models would get better! Let’s see how predictions further out did:
2024:
GPT-5 or equivalent is released.
It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
Mostly True | While they aren’t named GPT-5, the best released models today are as big an improvement over GPT-4 as GPT-4 was over GPT-3.5, as far as benchmarks can tell. Here’s a comparison table of GPT-3.5 and GPT-4, compared with the best released open weights model (DeepSeek V3), the best released closed weights models (Claude Sonnet 3.5 (New) and o1), and the best known unreleased model (o3).
| Benchmark | GPT-3.5 | GPT-4 | DeepSeek-V3 (Open Weights) | Sonnet 3.5 (New) | o1 | o3 |
|---|---|---|---|---|---|---|
| Context Length | 16k | 8k | 128k | 200k | 128k | / |
| HumanEval | 48.1% | 67% | / | 93.7% | / | / |
| ARC-AGI | <5% [1] | <5% [1] | / | 20.3% | 32% | 88% |
| SWE-bench Verified | 0.4% [2] | 2.8% / 22.4% [3] | 42.0% | 49.0% / 53.0% [4] | 48.9% | 71.7% |
| Codeforces [5] | 260 (~1.5%) | 392 (4.0%) | ~1550 (51.6%) | ~1150 (20.3%) | 1891 (~91.0%) | 2727 (~99.3%) |
| GPQA Diamond | / | 33.0% | 59.1% | 58.0% / 65.0% [6] | 78.0% | 87.7% |
| MATH | / | 52.9% | 90.2% | 78.3% | 94.8% | / |
| MMLU | 70.0% | 86.4% | 88.5% | 88.3% | 92.3% | / |
| DROP | 64.9 | 80.9 | 91.6 | 87.1 | / | / |
| GSM8K | 57.1% | 92.0% | / | 96.4% | / | / |

[1] From ARC Prize: “In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%.”
[2] 0.4% with RAG, tested October 2023
[3] 2.8% with RAG, 22.4% with ‘SWE-agent’ structure, tested April 2024.
[4] 49.0% in launch paper, 53.0% on SWE-Verified’s leaderboard with OpenHands + CodeAct v2.1
[5] Sometimes scores were given as a rating, and sometimes as a percentile. They have been converted to match.
[6] 58% published on Epoch AI, 65% claimed in release paper. Likely different assessment (CoT, best of N, etc).
https://x.com/OpenAI/status/1870186518230511844
https://openai.com/index/learning-to-reason-with-llms/
https://www.anthropic.com/news/3-5-models-and-computer-use
https://arxiv.org/pdf/2303.08774v5
https://arcprize.org/guide
--
Can do pretty much any task when guided by a person, but still gets things wrong sometimes.
Debatable | It’s too vague to measure (“pretty much” and “wrong sometimes”—seriously, what was I thinking). It doesn’t feel like the models can do “any task” in a way that GPT-4 couldn’t, but at the same time “pretty much” every benchmark for LLMs has been saturated, and I ask Claude for help with nearly everything. Agentic tasks can’t be done, but that’s covered by other predictions, and this prediction is about being “guided by a person”.
Multimodal inputs, browsing, and agents based on it are all significantly better.
Mostly True | The agent structures as well as the models have improved significantly, as you can see by the same models doing much better on SWE-Bench under newer structures, and by newer models still beating older ones.
Agents can do basic tasks on computers—like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things.
Debatable | I could see this being graded either way, depending on specific metrics. Claude with Computer Use can do all of these things (sans robotics control) but isn’t really useful. The individual tasks are usefully done by a mix of Gemini, ChatGPT, and Figure’s (GPT-4o?) robot control, but they aren’t really agents.
Robotics and long-horizon agents still don’t work well enough for production. Things fall apart if the agent has to do something with too many branching possibilities or on time horizons beyond half an hour or so. This time period / complexity quickly improves as low-hanging workarounds are added.
Mostly True | There are some production uses for Figure and Tesla’s robots, but these are more similar to traditional industrial robots doing a narrow task than to an agent.
Context windows are no longer an issue for text generation tasks.
Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.
Mostly False | Context windows aren’t nearly as limiting as they were in October 2023, growing from ~8k to ~128k, with RAG and other techniques helping models intelligently search files and add them to their own context, but it’s definitely not solved. Long outputs like novels still suck, and long inputs like giant codebases or regulations still lead to models missing key details a lot of the time.
GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.
Mostly False | Although it is close—Cursor has coding agents that can intelligently search the codebase for the files they need based on a provided task and add them to their own context, and ChatGPT has a memory feature (which doesn’t work super well). Neither of these is the same thing as just having the previous chats and codebase in context though.
This is later applied to agent usage, and agents quickly improve to become useful, in the same way that LLMs weren’t useful for everyday work until ChatGPT.
Mostly False | Agents are not yet useful, outside of some narrow coding agents.
Online learning begins—GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).
False | As far as I’m aware, nothing like this is happening.
AI selection of what data to train on is used to improve datasets in general—training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.
Mostly False | The trend has continued to move towards quality over quantity for training data, but I’m not aware of anybody specifically using existing LLMs to select / rank / weight training data automatically. I’m also now aware that high quality data was already being repeated more often in training sets. I don’t think anything is happening with a dynamic learning rate based on anything other than the loss.
Autonomous generation of data is used more extensively, especially for aligning base models, or for training models smaller than the best ones (by using data generated by larger models).
True | But also fairly trivial; it’s super well known that people are training models off the filtered outputs of earlier ones, and in general synthetic data is working really well, especially for instruction tuning and for ground-truthed domains like maths and coding.
Code writing is much better, and tie-ins to Visual Studio are better than GPT-4 is today, as well as having much better context.
True | Cursor, a fork of Visual Studio, has pretty capable agents built in that use any model available via API that you like, and they work a lot better than manually pasting problems into ChatGPT did in October of 2023.
Open source models as capable as GPT-4 become available.
True | Deepseek V3 is open weights* and has performance exceeding GPT-4 on most benchmarks. As is Mistral Large 2, and Llama 3.1 405B.
* It’s not entirely open source, as in, the code and data needed to train a copy is not available. But that’s not how ‘open source’ is being used regarding model weights, although I am personally trying to use clearer language now.
Training and runtime efficiency improves by at least a factor of two, while hardware continues improvements on trend. This is because of a combination of—datasets improved by AI curation and generation, improved model architecture, and improvements in hyperparameter selection, including work similar to the optimisations gained from discovering Chinchilla scaling laws.
True | Deepseek V3 stands out here—using only 37B active parameters (in a MoE architecture with 671B total), it achieves performance better than GPT-4’s, which is estimated to have more than 1700B. Deepseek V3 was also trained with only 2048 H800 GPUs for 2 months, compared with GPT-4’s estimated 15000 A100 GPUs for 3 months, around ten times the GPU-months.
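For a rough sense of scale, using the figures above (and ignoring per-chip differences between H800s and A100s):

```python
gpt4_gpu_months = 15000 * 3   # estimated A100s x months of training
dsv3_gpu_months = 2048 * 2    # H800s x months of training

ratio = gpt4_gpu_months / dsv3_gpu_months  # roughly 11x the raw GPU-months
```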
You might be right—and whether the per-dollar gains were higher or lower than expected would be interesting to know—but I just don’t have any good information on this! If I’d thought of the possibility, I would have added it in Footnote 23 as another speculation, but I don’t think what I said is misleading or wrong.
For what it’s worth, in a one year review from Jacob Steinhardt, increased investment isn’t mentioned as an explanation for why the forecasts undershot.
Superintelligent AI is possible in the 2020s
10x per year for compute seems high to me. Naïvely I would expect the price/performance of compute to double every 1-2 years as it has been forever, with overall compute available for training big models being a function of that + increasing investment in the space, which could look more like one-time jumps. (I.e. a 10x jump in compute in 2024 may happen because of increased investment, but a 100x increase by 2025 seems unlikely.) But I am somewhat uncertain of this.
For parameters, I definitely think the largest models will keep getting bigger, and for compute to be the big driver of that—but also I would expect improvements like mixture of experts models to continue, which effectively allow more parameters with less compute (because not all of the parameters are used at all times). Other techniques, like RLHF, also improve the subjective performance of models without increasing their size (i.e. getting them to do useful things rather than only predict what next word is most likely).
I guess my prediction here would be simply that things like this continue, so that in 2025 if you have X compute, you could get a better model in 2025 than you could in 2023. But you also could have 5x to 50x more compute in 2025, so you have the sum of those improvements!
It’s obviously far cheaper to play with smaller models, so I expect lots of improvements will initially appear in models small-for-their-time.
Just my thoughts!
Regarding AGI race dynamics—I wonder if there’s an intuition pump for ‘time vs competitor’ preference?
For example, to me, based on my current knowledge, I think Anthropic reaching RSI before the next best company (Deepmind, maybe?) is worth about two years of time. (I.e. I estimate equal safety-relevant outcomes from Claude hitting RSI in 2027 as from Gemini hitting RSI in 2029).
That’s a super weird framework, and I just made up that two years number, but I think maybe helps me reason through preferences.
The neat thing about the framework is that it’s p(doom) agnostic. It’s about relative performance between AGI projects and expectations for how much safety work will reduce it in the near future, absolute numbers not needed.
It also lets you give clear, recordable, updatable beliefs. So, spitballing:
Anthropic—Leader
Deepmind -- +2 years for equal safety
OpenAI -- +2.5 years
SSI -- +2.5 years
Deepseek -- +3 years
Zai -- +4 years
Xai -- +5 years
Alibaba -- +5 years
Meta -- +6 years
Again I want to stress these are vibes, not a considered opinion. I expect to change my mind quickly once challenged with evidence my guesses are wrong.