Grinding slimes in the dungeon of AI alignment research
This post aims to describe some differing views on challenges in AI alignment research (as I understand them) using an analogy to a video game with multiple levels.
I then draw some conclusions about the potential upsides and downsides of doing and publishing different kinds of alignment research.
RPGs, puzzle games
AI alignment might be like an RPG, or it might be more like a puzzle game.
In an RPG, clearing early-game dungeons (aligning early AI systems) builds up the player’s in-game skills, abilities, and stat points (builds up tools needed to align future AI systems). Even the best RPG gamer in the world would die pretty quickly if dropped into a late-game dungeon as an unlevelled newbie.
In a puzzle game, solving earlier puzzles unlocks harder puzzles, but there are no in-game abilities or stat points to build up. If the hardest puzzle were suddenly unlocked, a good puzzle gamer could solve it without any experience solving earlier, easier puzzles.
In both scenarios, you eventually get to a dungeon or puzzle where you must play on hardcore mode: you get one shot to solve the puzzle, or one life to run the dungeon, with no do-overs.
Furthermore, there are lots of players playing the game, and if anyone attempts a dungeon or puzzle in hardcore mode and makes a mistake, the game is over for everyone. (And if you die in the game, you die in real life.)
Analogy to strategies in alignment research
Alignment strategies that are focused on studying current AI systems, or that plan to rely on using future AI systems to help with alignment, are playing in RPG mode.
Right now, researchers who take this approach are grinding in the slime dungeons, hoping to level up enough to be ready to run the hardcore dungeons as soon as others (read: capabilities researchers) unlock them.
Alignment strategies focused on speculating about future AI systems, and attempting to solve the theoretical and foundational problems that those systems are hypothesized to have, are playing in the puzzle game mode.
In this model, the hardcore puzzles are the only ones that matter, even if they’re not yet unlocked and we can only speculate on what they look like. Of course, this strategy is less likely to result in any visible progress through the game—it hardly looks like playing the game at all! And different people will have difficulty agreeing on what future puzzles will look like, let alone making progress on solving them. Still, if the first hardcore puzzle could be unlocked at any moment, this strategy is probably the best way to prepare.
What if you pick the wrong analogy?
What if AI alignment is more like a puzzle game, but researchers treat it as an RPG? Then we’ll end up unlocking a hardcore puzzle without any time to speculate on solutions in advance, potentially resulting in a fast game over.
What if AI alignment is more like an RPG, but researchers treat it as a puzzle? Then we’ll end up wasting our time, twiddling our thumbs speculating on future dungeons that we’ll never reach. Wasting time is not without costs, but it’s a more recoverable mistake.
(In reality, of course, there are researchers pursuing both strategies, and many others going full speed ahead on capabilities research, with no regard for either.)
Implications if we’re in the RPG world
Even if AI alignment is more like an RPG than a puzzle game, grinding slimes has a potential downside: it could result in unlocking harder dungeons, faster, for everyone.
I think this has implications for when researchers should publish alignment research, especially research focused on current AI systems.
Even if the research is purely alignment-focused and provides no capabilities gains, making current AI systems safer unlocks harder dungeons sooner: it makes near-future systems (look) safe enough to build and deploy, even if it doesn’t inspire any actual ideas for how to build them.
As a somewhat facetious example: OpenAI is apparently rolling out a plugin system for ChatGPT that will enable it to do things like order groceries from Instacart. Suppose that asking ChatGPT to order your groceries rudely resulted in the bot changing your order as revenge.
I think it would be better for safety-conscious researchers NOT to help publicly with this, or to publish any research that could be useful for preventing or even understanding this behavior, at least until capabilities researchers figure out how to prevent it on their own. And even then, if you understand what went wrong, it might be better not to publish, unless you’re sure the understanding won’t contribute to making the next system any safer.
In the game analogy, there is a tradeoff: holding back on publishing means less effective grinding. But grinding slimes (or boars) up to end-boss levels is usually possible, even if it’s a lot slower and less fun. (The problem of others racing ahead of you remains.)
Implications if we’re in the puzzle world
Speculating publicly about solutions to locked and unknown puzzles also has a downside: it could give other players ideas about how to solve easier puzzles that we would rather they not solve so soon.
This risk seems somewhat lower than the risks posed in the RPG case—I don’t think there’s any capabilities researcher who decides GPT-n+1 is safe to build because they believe MIRI has worked out the kinks of logical decision theory or embedded agency. But it’s plausible that this sort of research is still a source of general inspiration for some kinds of capabilities.
Another downside of the puzzle strategy is that it may be difficult or impossible to make real progress, or even to know whether you are making progress at all. (Unless you can play and unlock the harder puzzles in secret, by yourself.)
It seems at least plausible that publishing alignment research has risks, even if it doesn’t contribute to capabilities at all. I think this risk is most pronounced for research focused on understanding current systems, including research that makes them safer. I’m not sure that any of this has any implications in the world where capabilities researchers are racing to the brink regardless.
Personally, I think we’re more likely to be in the puzzle world, where future AI systems do not much help with alignment. But I didn’t discuss that question here, and the game analogy probably doesn’t have much to say about it.