ongoing improvements to RL environments are already priced into the existing trend
I agree with this especially for e.g. METR tasks, or proxies for how generally smart a model is.
A case for acceleration in enterprise revenue (rather than general smarts) could look like:
So far, RL has still been targeted mostly at coding, research/browsing, math, or being-okay-at-generic-tool-use (taking random MCP servers and turning them into environments, as Kimi K2 did, but with RL).
It takes SWE time to build custom interfaces for models to work with economically productive software like Excel, Salesforce, or Photoshop. We’re not there yet, at least with publicly released models. Once we are, this suddenly unlocks a massive amount of economic value.
Ultimately I don’t really buy this either, since we already have e.g. some Excel/Sheets integrations that are not great, but better than what there was a couple of months ago. And an increase in the breadth of RL environments is probably already factored into the trend somewhat.
ETA: this also matters less if you’re primarily tracking AI R&D capabilities (or it might matter, but only indirectly, through driving more investment etc.).
I’d frame the pace of RL environment progress with a simple 2×2.
Is the task bounded (Codeforces, IMO-style problems) or unbounded (financial analysis using Excel, executive communication using slides, coding in unstructured codebases, design work using Photoshop, etc.)?
Do we have in-house expertise (yes for coding, and easy to source for IMO) or not (OpenAI is hiring finance pros this week to help build evals for financial agents as I write this comment)? The presence of expertise helps companies build RL environments that better reflect the actual problem space.
That gives a rough order of progress:
Bounded problem + know-how: o3 preview crushed Codeforces in Dec 2024.
Unbounded problem + know-how: the Codex product line.
Unbounded problem + limited know-how: ChatGPT agents are still weak at spreadsheets and terrible at slides today, but I expect that to change in 6 to 12 months.
Not sure where bounded problems with little know-how (e.g. FrontierMath) fall in this, though…
Yeah, revenue trends generally seem less robust regardless. (It doesn’t look like there is a consistent longer-running trend, except maybe over the last 2 years. I’d also expect revenue to be less stable in general.)