My best guess is that this is because, right now in training, they never have to maintain code they wrote. I imagine there will be a period where their code becomes very clean once they are incentivized by having to work over their own code over longer time horizons, followed by ??? as they optimize for “whatever design patterns are optimal for a multi-agent system collaborating on some code”.
I expect it’ll actually be solved a bit before that, because minimally-scaffolded LLMs can already give pretty good code-review feedback that catches a lot of these issues, so existing RLAIF techniques should work fine. The training pipelines would be finicky to set up but would not require any new technical advances, just schlep, so I predict it’ll happen as soon as writing good code becomes more of a competitive advantage than benchmaxxing (which seems to be happening already: SWE-bench Verified is rapidly saturating).
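For concreteness, here is a minimal sketch of what that kind of reward signal could look like: a frozen judge model scores a diff for maintainability, and the score is normalized into a scalar reward for RL. Everything here (the rubric, the `JUDGE_PROMPT`, the `complete` stub) is a hypothetical illustration of the general RLAIF pattern, not any lab's actual pipeline.

```python
# Hypothetical sketch of an RLAIF-style code-quality reward.
# The judge prompt, rubric, and `complete` stub are assumptions
# for illustration only.

import re
from typing import Callable

JUDGE_PROMPT = """You are a strict code reviewer. Rate the following diff
for maintainability on a 1-10 scale (naming, duplication, error handling,
test coverage). Reply with only the number.

{diff}
"""

def review_reward(diff: str, complete: Callable[[str], str]) -> float:
    """Turn a judge model's review score into a scalar reward.

    `complete` is any prompt -> text function, e.g. a call to a judge
    model held fixed during training so the policy can't drift it.
    """
    reply = complete(JUDGE_PROMPT.format(diff=diff))
    match = re.search(r"\d+", reply)
    score = int(match.group()) if match else 1  # unparseable reply -> worst score
    score = max(1, min(score, 10))              # clamp to the rubric's range
    return (score - 1) / 9.0                    # normalize to [0, 1]

if __name__ == "__main__":
    # Stub judge so the sketch runs standalone; swap in a real model call.
    fake_judge = lambda prompt: "7"
    print(review_reward("def f(x):\n    return x + 1\n", fake_judge))
```

The finicky part isn't this scoring function; it's the surrounding plumbing (sampling diffs at scale, keeping the judge from being gamed, mixing this reward with task-success signals), which is the schlep referred to above.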