One year and 3 months on, I’m reviewing my predictions! Overall, I mark 13 predictions as true or mostly true, 6 as false or mostly false, and 3 as debatable.
Rest of 2023
Small improvements to LLMs
Google releases something competitive to ChatGPT.
Mostly True | Google had already released Bard at the time, which sucked, but this was upgraded to Gemini and relaunched in December 2023. Gemini Ultra wasn’t released until February 2024 though, so points off for that.
Anthropic and OpenAI slightly improve GPT-4 and Claude 2
True | GPT-4 Turbo and Claude 2.1 were both released in November 2023.
Meta or another group releases better open source models, up to around GPT-3.5 level.
False | Llama 2 had already been released at this time, and was nearly as good as GPT-3.5, but no other GPT-3.5-or-better open source models came out in 2023.
Small improvements to Image Generation
Dalle3 gets small improvements.
Debatable | This is a really lukewarm prediction. Small changes were made to DALL-E 3 over the rest of 2023 (integration with GPT-4 prompting, for example), though there were complaints that the changes made to avoid copyright issues made it worse when it was integrated into Bing.
Google or Meta releases something similar to Dalle3, but not as good.
Mostly True | Google released Imagen 2 in December 2023, which was about as good as DALL-E 3. I don’t know how much I should penalise myself for it being about as good, rather than ‘not as good’.
Slight improvements to AI generated videos.
Basic hooking up of Dalle3 to video generation with tacked-on software, not really good consumer stuff yet. Works in an interesting way, like Dalle1, but not useful for much yet.
True | Lots of people played around with making videos by stepping through frames made in DALL-E 3, and they mostly weren’t very good! Pika 1.0 came out in December 2023, but it also wasn’t that great.
Further experiments hooking LLMs up to robotics/cars, but nothing commercial released.
True | Figure AI is the most notable example of hooking up LLMs to robotics, and they did some experiments in late 2023 with GPT-4. As far as I know there wasn’t any commercial release of an LLM-enabled robot anywhere.
Small improvements in training efficiency and data usage, particularly obviously in smaller models becoming more capable than older, larger ones.
True | Mistral 7B was notable here, being smaller and more capable than some of the earlier, much larger models like BLOOM 176B (as far as I can tell).
Since those ‘Rest of 2023’ predictions were only for three months in the future, most of them were very trivial to get right—of course models would get better! Let’s see how predictions further out did:
2024:
GPT-5 or equivalent is released.
It’s as big a jump on GPT-4 as GPT-4 was on GPT-3.5.
Mostly True | While they aren’t named GPT-5, the best released models today are as big an improvement over GPT-4 as GPT-4 was over GPT-3.5, as far as benchmarks can tell. Here’s a comparison table of GPT-3.5 and GPT-4 against the best released open-weights model (DeepSeek V3), the best released closed-weights models (Claude Sonnet 3.5 (New) and o1), and the best known unreleased model (o3).
| Benchmark | GPT-3.5 | GPT-4 | DeepSeek-V3 (Open Weights) | Sonnet 3.5 (New) | o1 | o3 |
|---|---|---|---|---|---|---|
| Context Length | 16k | 8k | 128k | 200k | 128k | / |
| HumanEval | 48.1% | 67% | / | 93.7% | / | / |
| ARC-AGI | <5% [1] | <5% [1] | / | 20.3% | 32% | 88% |
| SWE-bench Verified | 0.4% [2] | 2.8%, 22.4% [3] | 42.0% | 49.0%, 53.0% [4] | 48.9% | 71.7% |
| Codeforces [5] | 260 (~1.5%) | 392 (4.0%) | ~1550 (51.6%) | ~1150 (20.3%) | 1891 (~91.0%) | 2727 (~99.3%) |
| GPQA Diamond | / | 33.0% | 59.1% | 58.0%, 65.0% [6] | 78.0% | 87.7% |
| MATH | / | 52.9% | 90.2% | 78.3% | 94.8% | / |
| MMLU | 70.0% | 86.4% | 88.5% | 88.3% | 92.3% | / |
| DROP | 64.9 | 80.9 | 91.6 | 87.1 | / | / |
| GSM8K | 57.1% | 92.0% | / | 96.4% | / | / |
[1] From ARC Prize: “In this method, contestants use a traditional LLM (like GPT-4) and rely on prompting techniques to solve ARC-AGI tasks. This was found to perform poorly, scoring <5%.”
[2] 0.4% with RAG, tested October 2023.
[3] 2.8% with RAG, 22.4% with ‘SWE-agent’ structure, tested April 2024.
[4] 49.0% in the launch paper, 53.0% on the SWE-bench Verified leaderboard with OpenHands + CodeAct v2.1.
[5] Scores were sometimes given as a rating and sometimes as a percentile; they have been converted to match (a short sketch of the conversion follows these notes).
[6] 58% published by Epoch AI, 65% claimed in the release paper. Likely a different assessment setup (CoT, best of N, etc.).
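The conversion in [5] is nothing fancy: a percentile here is just the fraction of rated competitors whose rating falls below the given one. A minimal sketch, with a made-up ratings sample standing in for the real Codeforces rating distribution:

```python
# Rough illustration of converting a Codeforces rating to a percentile:
# the percentile is the fraction of rated users whose rating falls below it.
# The sample ratings here are made up; a real conversion uses the actual
# Codeforces rating distribution.

def rating_to_percentile(rating, all_ratings):
    below = sum(1 for r in all_ratings if r < rating)
    return 100 * below / len(all_ratings)

if __name__ == "__main__":
    sample_ratings = [600, 800, 1100, 1200, 1400, 1500, 1700, 1900, 2100, 2400]
    print(rating_to_percentile(1550, sample_ratings))  # ~60.0 on this toy sample
```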
Can do pretty much any task when guided by a person, but still gets things wrong sometimes.
Debatable | It’s too vague to measure (“pretty much” and “wrong sometimes”—seriously, what was I thinking). It doesn’t feel like the models can do “any task” in a way that GPT-4 couldn’t, but at the same time “pretty much” every benchmark for LLMs has been saturated, and I ask Claude for help with nearly everything. Agentic tasks can’t be done, but that’s covered by other predictions, and this prediction is about being “guided by a person”.
Multimodal inputs, browsing, and agents based on it are all significantly better.
Mostly True | The agent structures as well as the models have improved significantly, as you can see by the same models doing much better on SWE-Bench under newer structures, and by newer models still beating older ones.
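By “agent structure” I mean the scaffolding wrapped around the model rather than the model itself. Stripped down, it’s a loop like the sketch below; call_model and the single tool are hypothetical stand-ins, not any specific product’s API.

```python
# Stripped-down sketch of an agent scaffold: the model proposes a tool call,
# the scaffold executes it, and the result is fed back in, until the model
# says it is done. call_model() and the tools are stand-ins, not a real API.

def call_model(history: list[str]) -> str:
    # Stand-in for an LLM call; a real scaffold would send `history` to a model.
    return "read_file README.md" if len(history) == 1 else "done"

TOOLS = {"read_file": lambda path: f"(contents of {path})"}

def run_agent(task: str, max_steps: int = 5) -> list[str]:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = call_model(history)
        if action == "done":
            break
        name, arg = action.split(maxsplit=1)
        history.append(f"{action} -> {TOOLS[name](arg)}")
    return history

if __name__ == "__main__":
    print(run_agent("summarise the repository"))
```

Better scaffolds mostly differ in which tools they expose and how they trim and organise that history, which is why the same base model can score very differently under different structures.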
Agents can do basic tasks on computers—like filling in forms, working in excel, pulling up information on the web, and basic robotics control. This reaches the point where it is actually useful for some of these things.
Debatable | I could see this being graded either way, depending on specific metrics. Claude with Computer Use can do all of these things (sans robotics control) but isn’t really useful. The individual tasks are usefully done by a mix of Gemini, ChatGPT, and Figure’s (GPT-4o?) robot control, but they aren’t really agents.
Robotics and long-horizon agents still don’t work well enough for production. Things fall apart if the agent has to do something with too many branching possibilities or on time horizons beyond half an hour or so. This time period / complexity quickly improves as low-hanging workarounds are added.
Mostly True | There are some production uses for Figure and Tesla’s robots, but these are more similar to traditional industrial robots doing a narrow task than to an agent.
Context windows are no longer an issue for text generation tasks.
Algorithmic improvements, or summarisation and workarounds, better attention on infinite context windows, or something like that solves the problem pretty much completely from a user’s perspective for the best models.
Mostly False | Context windows aren’t nearly as limiting as they were in October 2023, having grown from ~8k to ~128k, with RAG and other techniques helping models intelligently search files and add them to their own context, but it’s definitely not solved. Long outputs like novels still suck, and long inputs like giant codebases or regulations still lead to models missing key details a lot of the time.
GPT-5 has the context of all previous chats, Copilot has the entire codebase as context, etc.
Mostly False | Although it is close: Cursor has coding agents that can intelligently search the codebase for the files they need based on a provided task and add them to their own context (a toy sketch of that retrieval idea follows this group of predictions), and ChatGPT has a memory feature (which doesn’t work super well). Neither of these is the same thing as just having the previous chats and codebase in context, though.
This is later applied to agent usage, and agents quickly improve to become useful, in the same way that LLMs weren’t useful for everyday work until ChatGPT.
Mostly False | Agents are not yet useful, outside of some narrow coding agents.
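For what the retrieval workarounds above amount to: score stored chunks of text against the task and paste the best matches into the model’s context. A minimal sketch, with a toy word-overlap score standing in for a real embedding model:

```python
# Minimal sketch of retrieval-augmented context: score stored chunks against a
# query and prepend the best matches to the prompt. A toy word-overlap score
# stands in for a real embedding model here.

def score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    best = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:top_k]
    context = "\n".join(best)
    return f"Context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    docs = [
        "The refund policy allows returns within 30 days.",
        "Shipping takes 5 to 7 business days.",
        "Gift cards cannot be refunded.",
    ]
    print(build_prompt("what is the refund policy for gift cards", docs))
```

The point is that the model never sees the whole corpus, only whatever the retrieval step happens to surface, which is exactly why this falls short of “the entire codebase as context”.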
Online learning begins—GPT-5 or equivalent improves itself slowly, autonomously, but not noticeably faster than current models are improved with human effort and a training step. It does something like select its own data to train on from all of the inputs and outputs it has received, and is trained on this data autonomously and regularly (daily or more often).
False | As far as I’m aware, nothing like this is happening.
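To be concrete about what I was predicting (and what, as far as I know, nobody is publicly running): a loop roughly like the hypothetical sketch below, where the model’s own logged interactions are filtered and periodically trained on. Both score_quality and fine_tune are stand-ins.

```python
# Hypothetical sketch of the "online learning" loop the prediction imagined:
# the model periodically picks the best of its own logged interactions and
# fine-tunes on them. Nothing here corresponds to a real deployed system;
# score_quality() and fine_tune() are stand-ins.
import random

def score_quality(example: dict) -> float:
    # Stand-in: a real system might use the model itself, or user feedback,
    # to judge which logged interactions are worth training on.
    return random.random()

def fine_tune(model, examples: list[dict]):
    # Stand-in for an actual fine-tuning / gradient-update step.
    print(f"fine-tuning on {len(examples)} self-selected examples")
    return model

def daily_update(model, interaction_log: list[dict], keep_fraction: float = 0.1):
    ranked = sorted(interaction_log, key=score_quality, reverse=True)
    selected = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return fine_tune(model, selected)

if __name__ == "__main__":
    log = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(100)]
    model = daily_update(model=None, interaction_log=log)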
AI selection of what data to train on is used to improve datasets in general—training for one epoch on all data becomes less common, as some high quality or relevant parts of giant sets are repeated more often or allowed larger step size.
Mostly False | The trend has continued to move towards quality over quantity for training data, but I’m not aware of anybody specifically using existing LLMs to automatically select, rank, or weight training data. I’m also now aware that high-quality data was already being repeated more often in training sets. I don’t think anything is happening with a dynamic learning rate based on anything other than the loss.
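For reference, “repeated more often” doesn’t need to be anything exotic; the simplest version is quality-weighted sampling when assembling the training mix, as in this toy sketch (the documents and scores are made up):

```python
# Toy sketch of quality-weighted sampling when assembling a training mix:
# higher-scored documents are drawn (and therefore repeated) more often.
# The documents and quality scores are made up for illustration.
import random

docs = ["textbook chapter", "curated forum answer", "boilerplate spam page"]
quality = [0.9, 0.6, 0.05]  # hypothetical quality scores

random.seed(0)
training_stream = random.choices(docs, weights=quality, k=10)
print(training_stream)  # the spam page appears rarely, the textbook often
```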
Autonomous generation of data is used more extensively, especially for aligning base models, or for training models smaller than the best ones (by using data generated by larger models).
True | But also fairly trivial: it’s super well known that people are training models off the filtered outputs of earlier ones, and in general synthetic data is working really well, especially for instruction tuning and for ground-truthed domains like maths and coding.
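The ground-truthed case is the easiest to picture: have a bigger model generate candidate solutions, keep only the ones whose final answer checks out, and fine-tune the smaller model on the survivors. A minimal sketch, with a stubbed-out generator standing in for the large model:

```python
# Minimal sketch of building synthetic training data from a larger model for a
# ground-truthed domain (arithmetic here): generate candidates, keep only those
# whose final answer matches, and train the smaller model on the survivors.
# generate_candidates() is a stand-in for sampling a large model.
import random

def generate_candidates(question: str, n: int = 4) -> list[str]:
    # Stand-in: pretend the large model sometimes gets 17 + 25 wrong.
    return [f"17 + 25 = {random.choice([42, 42, 41, 43])}" for _ in range(n)]

def answer_of(solution: str) -> int:
    return int(solution.split("=")[-1])

def build_dataset(question: str, true_answer: int) -> list[dict]:
    kept = [s for s in generate_candidates(question) if answer_of(s) == true_answer]
    return [{"prompt": question, "completion": s} for s in kept]

if __name__ == "__main__":
    random.seed(1)
    data = build_dataset("What is 17 + 25?", true_answer=42)
    print(f"{len(data)} verified examples to fine-tune the smaller model on")
```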
Code writing is much better, and tie-ins to Visual Studio are better than GPT-4 is today, as well as having much better context.
True | Cursor, a fork of Visual Studio Code, has pretty capable agents built in that use any model you like that’s available via API, and they work a lot better than manually pasting problems into ChatGPT did in October of 2023.
Open source models as capable as GPT-4 become available.
True | DeepSeek V3 is open weights* and has performance exceeding GPT-4 on most benchmarks, as do Mistral Large 2 and Llama 3.1 405B.
* It’s not entirely open source, in the sense that the code and data needed to train a copy are not available. But that’s not how ‘open source’ is being used regarding model weights, although I am personally trying to use clearer language now.
Training and runtime efficiency improves by at least a factor of two, while hardware continues improvements on trend. This is because of a combination of—datasets improved by AI curation and generation, improved model architecture, and improvements in hyperparameter selection, including work similar to the optimisations gained from discovering Chinchilla scaling laws.
True | DeepSeek V3 stands out here: using only 37B active parameters (in a MoE architecture with 671B total), it achieves performance better than GPT-4’s, which is estimated to have more than 1700B parameters. DeepSeek V3 was also trained on only 2048 H800 GPUs for 2 months, compared with GPT-4’s estimated 15,000 A100 GPUs for 3 months, several times the compute.
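To illustrate the active-versus-total distinction, here is a toy sketch of top-k expert routing (not DeepSeek’s actual architecture): every expert’s weights exist, but each token only runs through the few experts the router picks for it, so the “active” parameter count per token is a small fraction of the total.

```python
# Toy sketch of top-k mixture-of-experts routing: all experts' weights exist,
# but each token only activates the top_k experts the router picks for it, so
# the active parameter count per token is a small fraction of the total.
# Sizes are tiny and arbitrary; this is not DeepSeek V3's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 16, 2

router = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router                    # router scores for this token
    chosen = np.argsort(scores)[-top_k:]   # indices of the top_k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
out = moe_layer(token)
active = top_k * d_model * d_model
total = n_experts * d_model * d_model
print(f"active params per token: {active} of {total} total")
```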
Gemini 1206 Exp has a 2 million token context window; even if that isn’t the effective context, it probably performs much better in that regard than GPT-4o and the like. I haven’t tested it yet because I don’t want to get rate-limited from AI Studio in case they monitor that.
Frankly, the “shorter” conversations I had, at a few tens of thousands of tokens, were already noticeably more consistent than before, e.g. it still referenced previous responses much later in the conversation.
Sources:
https://x.com/OpenAI/status/1870186518230511844
https://openai.com/index/learning-to-reason-with-llms/
https://www.anthropic.com/news/3-5-models-and-computer-use
https://arxiv.org/pdf/2303.08774v5
https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd.it%2Fw8kkutnp6mwd1.png%3Fwidth%3D580%26format%3Dpng%26auto%3Dwebp%26s%3Dfea8540a697f1d6f27e9a32f31eda4378fde611e
https://arcprize.org/guide
https://www.deepseek.com/
https://www.researchgate.net/figure/Averaged-performance-on-the-tasks-from-the-Big-Bench-Hard-benchmark-Here-AO-CoT-and-ZS_tbl1_371163052