From a conceptual perspective, I would argue that the reason the queen’s odds thing works is that Stockfish was trained in the world of normal chess and does not generalise well to the world of weird chess. The superintelligence was trained in the real world, which contains things like interpretability and black-box safeguards. It may not have been directly trained to interact with them, but it’ll be aware of them and capable of reasoning about how to deal with novel obstacles. This is in addition to the various ways the techniques could break without that being directly intended by the model.
Can you offer more explanation for: “the reason the queen’s odds thing works…”
My guess is that this would be true if Stockfish were mostly an LLM or similar (making something like ‘the most common move’ each time), but it seems less likely for the actual architecture of Stockfish (which leans heavily on tree search and, later in the game, searches from a list of solved positions and implements their solutions). Perhaps this is what you meant by beginning your reply with ‘conceptually’, but I’m not sure.
[I do basically just think this particular example is a total disanalogy, and literally mean this as a question about Stockfish.]
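To make the architectural contrast concrete, here is a minimal sketch (mine, not anything from Stockfish’s actual code) of what ‘leaning heavily on tree search’ means, written in Python on top of the python-chess library with a toy material evaluation and a placeholder depth: a search-based engine scores whatever position it is handed by looking ahead from it, rather than by reproducing something like ‘the most common move’ seen in training.

```python
# Toy negamax search over python-chess boards. The depth and the crude
# material-count evaluation are illustrative placeholders; real Stockfish
# uses far deeper search, pruning, and a learned evaluation.
import chess

PIECE_VALUES = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0,
}

def evaluate(board: chess.Board) -> int:
    """Material balance from the perspective of the side to move."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, board.turn))
        score -= value * len(board.pieces(piece_type, not board.turn))
    return score

def negamax(board: chess.Board, depth: int) -> int:
    """Value of a position = best value reachable by searching ahead."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -10**9
    for move in list(board.legal_moves):
        board.push(move)
        best = max(best, -negamax(board, depth - 1))
        board.pop()
    return best

def best_move(board: chess.Board, depth: int = 3) -> chess.Move:
    """Pick the highest-scoring move for the current position, however
    'weird' that position is relative to normal chess."""
    def score(move: chess.Move) -> int:
        board.push(move)
        value = -negamax(board, depth - 1)
        board.pop()
        return value
    return max(list(board.legal_moves), key=score)
```

The point of the sketch: nothing here consults a distribution over ‘normal’ games, which is why search-based play degrades differently from imitation-based play when the position is unusual.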
Fair! I’m not actually very familiar with the setting or exactly how Stockfish works. I just assumed that Stockfish performs much less well in that setting than a system optimised for it.
Though being a queen up is a major advantage, I would guess that’s not enough to beat a great chess AI? But I’m not confident.
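For anyone else unfamiliar with the setting, here is a rough sketch of what queen-odds play against an engine looks like in code, using the python-chess library; the engine path, time limit, and function name are placeholders, not anything from this thread. The only change from normal chess is the starting position: the engine’s side begins without its queen.

```python
# Queen-odds starting position: White (here, the engine's side) has no queen.
# "stockfish" below is a placeholder path to any UCI engine binary.
import chess
import chess.engine

QUEEN_ODDS_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNB1KBNR w KQkq - 0 1"

def engine_move_at_queen_odds(engine_path: str = "stockfish") -> chess.Move:
    """Ask a UCI engine for a move from the queen-odds start position."""
    board = chess.Board(QUEEN_ODDS_FEN)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        result = engine.play(board, chess.engine.Limit(time=1.0))
    finally:
        engine.quit()
    return result.move
```

Whether the engine copes well from that position is exactly the question at issue; the FEN just pins down what ‘queen odds’ means here.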
I agree that the analogy is not perfect. Can you elaborate on why you think this is a complete disanalogy?
There are a bunch of weird sub-points and side things here, but I think the big one is that narrow intelligence is not some bounded ‘slice’ of general intelligence. It’s a different kind of thing entirely. I wouldn’t model interactions with a narrow intelligence in a bounded environment as at all representative of superintelligence (except as a lower bound on the capabilities one should expect of superintelligence!). A superintelligence also isn’t an ensemble of individual narrow AIs (it may be an ensemble of fairly general systems à la MoE, but it won’t be “Stockfish for cooking plus Stockfish for navigation plus Stockfish for...”, because that would leave a lot out).
Stockfish cannot, for instance, change the rules of the game or edit the board state in a text file or grow arms, punch you out, and take your wallet. A superintelligence given the narrow task of beating you at queen’s odds chess could simply cheat in ways you wouldn’t expect (esp. if we put the superintelligence into a literally impossible situation that conflicts with some goal it has outside the terms of the defined game).
We lack precisely a robust mechanism of reliably bounding the output space of such an entity (an alignment solution!). Something like mech interp just isn’t really targeted at getting that thing, either; it’s foundational research with some (hopefully a lot of) safety-relevant implementations and insights.
I think this is the kind of thing Neel is pointing at with “how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.” You don’t know how Stockfish is going to beat you in a fair game; you just know (or we can say ‘strong prior’, but like… 99%+, right?) that it will. And that ‘fair game’ is the meta game, the objective in the broader world, not capturing the king.
(I give myself like a 4/10 for this explanation; feel even more affordance than usual to ask for clarification.)
I agree that defining what game we are playing is important. However, I’m not sure about the claim that Stockfish would win if it were trained to win at queen odds or other unusual chess variants. There are many unbalanced games we can invent where one side has a great advantage over the other. Actually, there is a version of Leela that was specifically trained with knight odds. It is an interesting system because it creates balanced games against human grandmasters. However, I would guess that even if you invested trillions of dollars to train a chess engine on queen odds, humans would still be able to reliably beat it.
I’m not sure where the analogy breaks down, but I do think we should focus more on the nature of the game that we play with superintelligence in the real world. The game could change, for example, when a superintelligence escapes the labs and all their mitigation tools, making the game more balanced for itself (for example, by running itself in a distributed manner on millions of personal computers it has hacked).
As long as we make sure that the game stays unbalanced, I do think we have a chance to mitigate the risks.
There’s a LeelaQueenOdds too, which they say performs at 2000-2700 Elo depending on time controls.