Can you offer more explanation for: “the reason the queen’s odds thing works…”
My guess is that this would be true if Stockfish were mostly an LLM or similar (making something like ‘the most common move’ each time), but it seems less likely given the actual architecture of Stockfish (which leans heavily on tree search and, later in the game, looks up solved positions in endgame tablebases and plays out their solutions). Perhaps this is what you meant by beginning your reply with ‘conceptually’, but I’m not sure.
[I do basically just think this particular example is a total disanalogy, and literally mean this as a question about Stockfish.]
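For anyone who wants to poke at the setting concretely, here is a minimal sketch of a queen’s-odds game against Stockfish. It assumes the python-chess library and a local Stockfish binary on your PATH (both assumptions on my part, not something from this thread); the human plays the side with the extra queen, and Stockfish searches each reply over UCI.

```python
import chess
import chess.engine

# Queen's-odds setup: the normal starting position with Black's queen removed,
# so the human (White) starts a full queen up and Stockfish plays Black.
board = chess.Board("rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")

# Assumes a Stockfish binary named "stockfish" is on the PATH; adjust as needed.
engine = chess.engine.SimpleEngine.popen_uci("stockfish")

while not board.is_game_over():
    if board.turn == chess.WHITE:
        # Human move in UCI notation, e.g. "e2e4".
        try:
            move = chess.Move.from_uci(input("Your move: "))
        except ValueError:
            print("Could not parse that move.")
            continue
        if move not in board.legal_moves:
            print("Illegal move in this position.")
            continue
        board.push(move)
    else:
        # Stockfish searches the current position (tree search, not move prediction).
        result = engine.play(board, chess.engine.Limit(time=1.0))
        board.push(result.move)
        print(f"Stockfish plays {result.move.uci()}")

print("Result:", board.result())
engine.quit()
```

Running something like this is the direct way to check the empirical question below (whether being a queen up is enough for a human to win); nothing in the sketch relies on Stockfish doing anything other than ordinary UCI search from whatever position it is handed.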
Fair! I’m not actually very familiar with the setting or exactly how Stockfish works. I just assumed that Stockfish performs much less well in that setting than a system optimised for it.
Though being a queen up is a major advantage, I would guess that’s not enough to beat a great chess AI? But I’m not confident.
I agree that the analogy is not perfect. Can you elaborate on why you think this is a complete disanalogy?
There are a bunch of weird sub-points and side things here, but I think the big one is that narrow intelligence is not some bounded ‘slice’ of general intelligence. It’s a different kind of thing entirely. I wouldn’t model interactions with a narrow intelligence in a bounded environment as at all representative of superintelligence (except as a lower bound on the capabilities one should expect of superintelligence!). A superintelligence also isn’t an ensemble of individual narrow AIs (it may be an ensemble of fairly general systems à la MoE, but it won’t be “Stockfish for cooking plus Stockfish for navigation plus Stockfish for…”, because that would leave a lot out).
Stockfish cannot, for instance, change the rules of the game, edit the board state in a text file, or grow arms, punch you out, and take your wallet. A superintelligence given the narrow task of beating you at queen’s odds chess could simply cheat in ways you wouldn’t expect (especially if we put the superintelligence into a literally impossible situation that conflicts with some goal it has outside the terms of the defined game).
What we lack is precisely a robust mechanism for reliably bounding the output space of such an entity (that would be an alignment solution!). Something like mech interp just isn’t really targeted at getting that thing, either; it’s foundational research with some (hopefully a lot of) safety-relevant applications and insights.
I think this is the kind of thing Neel is pointing at with “how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.” You don’t know how Stockfish is going to beat you in a fair game; you just know (or we can say ‘strong prior’, but like… 99%+, right?) that it will. And that ‘fair game’ is the meta-game, the objective in the broader world, not capturing the king.
(I give myself like a 4⁄10 for this explanation; feel even more affordance than usual to ask for clarification.)