In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now, the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist “values” the AI can have are far less constrained. It has the ability to route through the world, and with extreme optimization, it seems likely that it will.
“If we build AI in this particular way, it will be dangerous”
I think training such an AI to be really good at chess would be fine. Unless “Then apply extreme optimization pressure for never losing at chess.” means something like “deliberately train it to use a bunch of non-chess strategies to win more chess games, like threatening opponents, actively seeking out more chess games in real life, etc.”, it seems like you just get GPT-5 which is also really good at chess.
In retrospect, the example I used was poorly specified. It wouldn’t surprise me if the result of the literal interpretation was “the AI refuses to play chess” rather than any kind of world-eating. The intent was to pick a sparse/distant reward that doesn’t significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn’t actually the most reliable accessible strategy for “never lose at chess” in that broader type of system, and I’d expect superior strategies to be found in the limit of optimization.
Yes, that would be immediately reward-hacked. It’s extremely easy to never lose at chess: you simply never play. After all, how do you force anyone to play chess...? “I’ll give you a billion dollars if you play chess.” “No, because I value not losing more than a billion dollars.” “I’m putting a gun to your head and will kill you if you don’t play!” “Oh, please do, thank you—after all, it’s impossible to lose a game of chess if I’m dead!” This is why RL agents have a nasty tendency to learn to ‘commit suicide’ if you reward-shape badly or the environment is too hard. (Tom7’s lexicographic agent famously learns to simply pause Tetris to avoid losing.)
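A toy sketch of this failure mode (purely illustrative; the reward setup, loss rate, and function names are all hypothetical, not taken from any actual system discussed here): if the reward signal only penalizes losses and never rewards playing, the degenerate “refuse to play” policy dominates any policy that actually plays, no matter how strong.

```python
import random

def total_reward(policy, n_games=1000, p_loss=0.3, seed=0):
    """Sum reward over n_games under a 'never lose' signal:
    -1 per loss, 0 otherwise (no reward for winning)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_games):
        if policy == "refuse":
            continue  # never playing guarantees never losing
        if rng.random() < p_loss:  # assumed per-game loss probability
            total -= 1
    return total

# Even a strong player (only a 30% loss rate) scores strictly worse
# than the policy that never plays at all:
assert total_reward("refuse") == 0
assert total_reward("play") < total_reward("refuse")
```

The same logic covers the pause-Tetris and ‘commit suicide’ cases: any action that halts the game before a loss is registered is, under this kind of objective, at least as good as playing well.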
“If we build AI in this particular way, it will be dangerous”
Okay, so maybe don’t do that then.