In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now, the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist “values” the AI can have are far less constrained. It has the ability to route through the world, and with extreme optimization, it seems likely that it will.
“If we build AI in this particular way, it will be dangerous”
I think training such an AI to be really good at chess would be fine. Unless “Then apply extreme optimization pressure for never losing at chess.” means something like “deliberately train it to use a bunch of non-chess strategies to win more chess games, like threatening opponents, actively seeking out more chess games in real life, etc.”, it seems like you just get GPT-5 which is also really good at chess.
In retrospect, the example I used was poorly specified. It wouldn’t surprise me if the result of the literal interpretation was “the AI refuses to play chess” rather than any kind of world-eating. The intent was to pick a sparse/distant reward that doesn’t significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn’t actually the most reliable accessible strategy for “never lose at chess” in that broader type of system, and I’d expect superior strategies to be found in the limit of optimization.
Yes, that would be immediately reward-hacked. It’s extremely easy to never lose at chess: you simply never play. After all, how do you force anyone to play chess...? “I’ll give you a billion dollars if you play chess.” “No, because I value not losing more than a billion dollars.” “I’m putting a gun to your head and will kill you if you don’t play!” “Oh, please do, thank you—after all, it’s impossible to lose a game of chess if I’m dead!” This is why RL agents have a nasty tendency to learn to ‘commit suicide’ if you reward-shape badly or the environment is too hard. (Tom7’s lexicographic agent famously learns to simply pause Tetris to avoid losing.)
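A toy sketch of this failure mode (purely illustrative; the reward setup, loss rate, and function names are all hypothetical, not taken from any actual system discussed here): if the reward signal only penalizes losses and never rewards playing, the degenerate “refuse to play” policy dominates any policy that actually plays, no matter how strong.

```python
import random

def total_reward(policy, n_games=1000, p_loss=0.3, seed=0):
    """Sum reward over n_games under a 'never lose' signal:
    -1 per loss, 0 otherwise (no reward for winning)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_games):
        if policy == "refuse":
            continue  # never playing guarantees never losing
        if rng.random() < p_loss:  # assumed per-game loss probability
            total -= 1
    return total

# Even a strong player (only a 30% loss rate) scores strictly worse
# than the policy that never plays at all:
assert total_reward("refuse") == 0
assert total_reward("play") < total_reward("refuse")
```

The same logic covers the pause-Tetris and ‘commit suicide’ cases: any action that halts the game before a loss is registered is, under this kind of objective, at least as good as playing well.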
“If we build AI in this particular way, it will be dangerous”
Okay, so maybe don’t do that then.