Aharon Azulay comments on Interpretability Will Not Reliably Find Deceptive AI

Aharon Azulay 30 May 2025 16:41 UTC
1 point
0
I agree that the analogy is not perfect. Can you elaborate on why you think this is a complete disanalogy?
- yams 30 May 2025 18:19 UTC
  2 points
  1
  Parent
  There are a bunch of weird sub points and side things here, but I think the big one is that narrow intelligence is not some bounded ‘slice’ of general intelligence. It’s a different kind of thing entirely. I wouldn’t model interactions with a narrow intelligence in a bounded environment as at all representative of superintelligence (except as a lower bound on the capabilities one should expect of superintellignece!). A superintelligence also isn’t an ensemble of individual narrow AIs (it may be an ensemble of fairly general systems a la MoE, but it won’t be “stockfish for cooking plus stockfish for navigation plus stockfish for....”, because that would leave a lot out).
  Stockfish cannot, for instance, change the rules of the game or edit the board state in a text file or grow arms, punch you out, and take your wallet. A superintelligence given the narrow task of beating you at queen’s odds chess could simply cheat in ways you wouldn’t expect (esp. if we put the superintelligence into a literally impossible situation that conflicts with some goal it has outside the terms of the defined game).
  We lack precisely a robust mechanism of reliably bounding the output space of such an entity (an alignment solution!). Something like mech interp just isn’t really targeted at getting that thing, either; it’s foundational research with some (hopefully a lot of) safety-relevant implementations and insights.
  I think this is the kind of thing Neel is pointing at with “how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.” You don’t know how stockfish is going to beat you in a fair game; you just know (or we can say ‘strong prior’ but like… 99%+, right?) that it will. And that ‘fair game’ is the meta game, the objective in the broader world, not capturing the king.
  
  (I give myself like a ⁴⁄₁₀ for this explanation; feel even more affordance than usual to ask for clarification.)