I still don’t think that a bunch of free-associating inner monologues talking to each other gives you AGI, and it still seems to be an open question whether adding RL on top just works.
The “hallucinations” of the latest reasoning models look more like capability failures than alignment failures to me, and I think this points towards “no.” But my credences are very unstable; if METR’s task-length projections hold up, or the next reasoning model easily zero-shots Pokémon, I will just about convert.
Investigating preliminary evaluations of o3 and o4-mini, I am more convinced that task length is scaling as projected.
Pokémon has fallen, but as far as I can tell, this relied on scaffolding improvements for Gemini 2.5 Pro customized during the run, NOT a new, smarter model.
Overall, I am already questioning my position one week later.
Pokémon is actually load-bearing for your models? I’m imagining a counterfactual world in which Sonnet 3.7’s initial report involved it beating Pokémon Red, and I don’t think my present-day position would’ve been any different in it.
Even aside from the tons of walkthrough information present in LLMs’ training sets, and iterative prompting making it possible to identify and patch holes in an LLM’s pretrained instinctive game knowledge, Pokémon is simply not a good test of open-ended agency. At the macro scale, the game state can only progress forward, and progressing it requires solving relatively closed-form combat/navigational challenges. Which means that as long as you’re reasonably likely to eventually blunder through each of those isolated challenges, you’re fated to “fail upwards”. The game-state topology doesn’t allow you to progress backward or get stuck in a dead end: you can’t lose a badge or un-win a boss battle. I.e., there’s basically an implicit “long-horizon agency scaffold” built into the game.
Which means what this mainly tests is the ability to solve somewhat-diverse isolated challenges in sequence, not the ability to autonomously decompose a long-term task into such isolated challenges such that completing the sequence implacably adds up to accomplishing the long-term task.
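To gesture at the “fail upwards” point with a toy model (entirely my own construction; the number of gates and the per-attempt success probability below are made-up, not anything measured from the actual Pokémon runs): if the game is just a forward-only sequence of gates with unlimited retries and no dead ends, any nonzero per-attempt success rate guarantees eventual completion, and the agent’s competence only shows up in how long it takes.

```python
import random

# Toy model: the game is a fixed sequence of isolated challenges ("gates").
# Each attempt at the current gate succeeds with probability p; a failure
# wastes a step but never regresses the macro-state ("you can't lose a badge").
def steps_to_finish(n_gates: int, p: float, rng: random.Random) -> int:
    steps = 0
    for _ in range(n_gates):
        while True:              # unlimited retries, no dead ends
            steps += 1
            if rng.random() < p:
                break            # gate cleared; macro-state only moves forward
    return steps

rng = random.Random(0)
trials = [steps_to_finish(n_gates=20, p=0.3, rng=rng) for _ in range(1000)]
# Every trial finishes; only the step count varies (roughly n_gates / p on average).
print(min(trials), sum(trials) / len(trials), max(trials))
```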
Hmm, maybe I’m suffering from having never played Pokémon… who would’ve thought that could be an important hole in my education?
I think the hallucinations/reward hacking actually are a real alignment failure, but an alignment failure that happens to degrade capabilities a lot. At least some of the misbehavior is probably due to context, but I have seen evidence that the alignment failures are more deliberate than regular capability failures.
That said, if this keeps happening, the likely explanation is that capabilities progress is to a significant degree bottlenecked on alignment progress, such that you need substantial progress on preventing specification gaming to get new capabilities. If that hypothesis is true (and I put some weight on it), this would definitely be a good world for misalignment issues.
(Also, it’s telling that the areas where RL has worked best are ones where you can basically create unhackable reward models, like many games/puzzles, and that once reward hacking is on the table, capabilities start to decrease.)
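To illustrate that distinction with a deliberately toy sketch (the sorting task and both reward functions below are hypothetical, chosen just to show the shape of the point, not anyone’s actual training setup): a reward that checks the answer exactly leaves nothing to game, while a proxy reward that only measures a surface feature is trivially hackable.

```python
# Hypothetical puzzle: "sort this list". The verifiable reward checks the
# output exactly, so the only way to score is to actually solve the task.
def verifiable_reward(inp: list[int], out: list[int]) -> float:
    return 1.0 if out == sorted(inp) else 0.0

# A proxy reward that only measures local orderedness is easy to hack:
# emitting a degenerate one-element output scores perfectly.
def proxy_reward(out: list[int]) -> float:
    if len(out) < 2:
        return 1.0
    ordered_pairs = sum(a <= b for a, b in zip(out, out[1:]))
    return ordered_pairs / (len(out) - 1)

inp = [3, 1, 2]
print(verifiable_reward(inp, [1, 2, 3]))  # 1.0 -- only the true answer scores
print(proxy_reward([7]))                  # 1.0 -- the proxy is gamed by a degenerate output
print(verifiable_reward(inp, [7]))        # 0.0 -- the exact check catches the hack
```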