It’s not yet known whether there is a way of turning R1-like training into RSI with any amount of compute. That path is currently gated by the quantity and quality of graders for the outcomes of answering questions, and such graders resist automated development.
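To make “graders” concrete, here is a minimal sketch of the two common shapes of verifiable-outcome grader, exact-match checking against a known answer and running a task’s unit tests; the function names and the use of pytest are my own illustrative choices, not anything from the R1 pipeline.

```python
# Sketch of the kind of verifiable-outcome grader R1-style RL training relies on.
# All names are illustrative; the point is that a reward is only emitted when an
# answer can be checked mechanically, and writing many such checkers is mostly
# manual work.

import os
import subprocess
import tempfile


def grade_math_answer(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 only when the final answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def grade_code_answer(model_code: str, test_code: str) -> float:
    """Reward 1.0 only when the model's code passes the task's unit tests."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(model_code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        result = subprocess.run(
            ["python", "-m", "pytest", "-q", "test_solution.py"],
            cwd=tmp,
            capture_output=True,
            timeout=60,
        )
        return 1.0 if result.returncode == 0 else 0.0
```

Each grader like this has to be written and validated for its own task family, which is the manual bottleneck referred to above.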
That’s one path to RSI: the one where the improvement happens to the (language) model itself.
The other kind, which feels more accessible to indie developers and is less explored, is an LLM (e.g. R1) running in a loop inside a codebase, where each pass improves the codebase itself. The LLM wouldn’t be changing, but the codebase that calls it would be gaining new APIs, memory, and capabilities as the LLM improves it.
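To make the shape of that loop concrete, here is a minimal sketch, assuming a frozen model behind some chat API and a harness with its own test suite; every identifier (call_llm, run_tests, the repo layout) is illustrative rather than anything implied by R1 or o3.

```python
# Hypothetical sketch of the second kind of loop: a frozen LLM repeatedly
# rewriting the harness that calls it. The model weights never change; only
# the codebase that calls the model does.

import pathlib
import subprocess


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a frozen model such as R1."""
    raise NotImplementedError("wire this up to a real LLM API")


def run_tests(repo: pathlib.Path) -> bool:
    """Verifiable outcome for each pass: the harness's own test suite."""
    return subprocess.run(["python", "-m", "pytest", "-q"], cwd=repo).returncode == 0


def improvement_loop(repo: pathlib.Path, target: str, passes: int = 10) -> None:
    """Ask the model to improve the file that orchestrates its own calls,
    keeping each change only if the tests still pass."""
    target_path = repo / target
    for _ in range(passes):
        original = target_path.read_text()
        proposal = call_llm(
            "Improve this module that orchestrates your own calls "
            "(add memory, tools, better prompts). Return the full file.\n\n" + original
        )
        target_path.write_text(proposal)
        if not run_tests(repo):
            target_path.write_text(original)  # revert anything that doesn't verify
```

The only notion of “improvement” available to this loop is whatever its own tests can verify, so it runs into the same grader bottleneck as the first path.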
Such a self-improving codebase… would it be reasonable to call this an agent?
Sufficiently competent code rewriting isn’t implied by R1/o3, and it’s unclear how much better future iterations of this technique will get, much as it’s unclear what scaling pretraining on $150bn training systems cashes out to in terms of capabilities. It remains possible that even after all these directions of scaling run their course, there still won’t be enough capability to self-improve in some other way.
Altman and Amodei are implying there is knowably more to be gained from some sort of test-time compute scaling, but that could mean several different things: scaling RL training, scaling manual creation of tasks with verifiable outcomes (graders), or scaling effective context length to enable longer reasoning traces. The o1 post and the R1 paper show graphs with lines that keep going up, but neither discusses how much compute even that much costs, what happens if more compute is poured in without adding more tasks with verifiable outcomes, or how many tasks are already being used.