However, I am pretty pessimistic in general about reliable safeguards against superintelligence with any methods, given how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.
To use an imperfect analogy, I could defeat the narrowly superintelligent Stockfish at ‘queen odds chess’ where Stockfish starts the game down a queen.
Can’t we think of interpretability and black-box safeguards as the extra pieces we can use to reliably win against rogue superintelligence?
From a conceptual perspective, I would argue that the reason the queen’s odds thing works is that Stockfish was trained in the world of normal chess and does not generalise well to the world of weird chess. A superintelligence, by contrast, will have been trained in the real world, which contains things like interpretability and black-box safeguards. It may not have been directly trained to interact with them, but it’ll be aware of them and capable of reasoning about how to deal with novel obstacles. This is in addition to the various ways the techniques could break without this being directly intended by the model.
Can you offer more explanation for: “the reason the queen’s odds thing works…”
My guess is that this would be true if Stockfish were mostly an LLM or similar (making something like ‘the most common move’ each time), but it seems less likely for the actual architecture of Stockfish (which leans heavily on tree search and, later in the game, looks up solved positions in endgame tablebases and plays out their solutions). Perhaps this is what you meant by opening your reply with ‘from a conceptual perspective’, but I’m not sure.
[I do basically just think this particular example is a total disanalogy, and literally mean this as a question about Stockfish.]
Fair! I’m not actually very familiar with the setting or exactly how Stockfish works. I just assumed that Stockfish performs much less well in that setting than a system optimised for it.
Though being a queen up is a major advantage, I would guess that’s not enough to beat a great chess AI? But I am not confident.
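One way to check, rather than guess: play Stockfish against a strength-capped copy of itself from the queen-odds position. A minimal sketch with python-chess, assuming a Stockfish binary called “stockfish” on your PATH (the 2000-Elo cap and the time limit are arbitrary choices of mine):

```python
import chess
import chess.engine

# Queen-odds start: White keeps full engine strength but loses its queen;
# Black is the same engine capped at a human-ish rating.
board = chess.Board()
board.remove_piece_at(chess.D1)

strong = chess.engine.SimpleEngine.popen_uci("stockfish")
capped = chess.engine.SimpleEngine.popen_uci("stockfish")
capped.configure({"UCI_LimitStrength": True, "UCI_Elo": 2000})

while not board.is_game_over():
    engine = strong if board.turn == chess.WHITE else capped
    board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)

print(board.result())
strong.quit()
capped.quit()
```

A strength-capped engine is only a crude stand-in for a human, of course, so any win rate this produces is suggestive at most.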
I agree that the analogy is not perfect. Can you elaborate on why you think this is a complete disanalogy?
There are a bunch of weird sub-points and side things here, but I think the big one is that narrow intelligence is not some bounded ‘slice’ of general intelligence. It’s a different kind of thing entirely. I wouldn’t model interactions with a narrow intelligence in a bounded environment as at all representative of superintelligence (except as a lower bound on the capabilities one should expect of superintelligence!). A superintelligence also isn’t an ensemble of individual narrow AIs (it may be an ensemble of fairly general systems à la MoE, but it won’t be “Stockfish for cooking plus Stockfish for navigation plus Stockfish for…”, because that would leave a lot out).
Stockfish cannot, for instance, change the rules of the game or edit the board state in a text file or grow arms, punch you out, and take your wallet. A superintelligence given the narrow task of beating you at queen’s odds chess could simply cheat in ways you wouldn’t expect (esp. if we put the superintelligence into a literally impossible situation that conflicts with some goal it has outside the terms of the defined game).
What we lack is precisely a robust mechanism for reliably bounding the output space of such an entity (that would be an alignment solution!). Something like mech interp just isn’t really targeted at getting that thing, either; it’s foundational research with some (hopefully a lot of) safety-relevant implementations and insights.
I think this is the kind of thing Neel is pointing at with “how exceptionally hard it is to reason about how a system far smarter than me could evade my plans.” You don’t know how Stockfish is going to beat you in a fair game; you just know (or we can say ‘strong prior’, but like… 99%+, right?) that it will. And that ‘fair game’ is the meta game, the objective in the broader world, not capturing the king.
(I give myself like a 4⁄10 for this explanation; feel even more affordance than usual to ask for clarification.)
I agree that defining what game we are playing is important. However, I’m not sure about the claim that Stockfish would win if it were trained to win at queen odds or other unusual chess variants. There are many unbalanced games we can invent where one side has a great advantage over the other. Actually, there is a version of Leela that was specifically trained with knight odds. It is an interesting system because it creates balanced games against human grandmasters. However, I would guess that even if you invested trillions of dollars to train a chess engine on queen odds, humans would still be able to reliably beat it.
I’m not sure where the analogy breaks down, but I do think we should focus more on the nature of the game that we play with superintelligence in the real world. The game could change, for example, when a superintelligence escapes the labs and all their mitigation tools, making the game more balanced for itself (for example, by running itself in a distributed manner on millions of personal computers it has hacked).
As long as we make sure that the game stays unbalanced, I do think we have a chance to mitigate the risks.
There’s a LeelaQueenOdds too, which they say performs at 2000-2700 Elo depending on time controls.
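For a rough sense of what that rating range means in win-probability terms, the standard Elo expected-score formula can be sketched directly (the 2200 human rating below is just an illustrative number):

```python
def expected_score(you: int, opponent: int) -> float:
    # Standard Elo expected score: 1 / (1 + 10^((R_opp - R_you) / 400)).
    return 1.0 / (1.0 + 10 ** ((opponent - you) / 400))

# Illustrative: a 2200-rated human against the quoted 2000-2700 range.
for bot_elo in (2000, 2350, 2700):
    print(bot_elo, round(expected_score(2200, bot_elo), 2))
# -> roughly 0.76, 0.30, and 0.05 respectively
```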
Can you beat this bot though?
Related question—people who have played against LeelaQueenOdds describe it as basically an adversarial attack against humans. Can humans in turn learn adversarial strategies against LeelaQueenOdds?
(bringing up here since it seems relevant and you seem unusually likely to have already looked into this)
I haven’t heard of any adversarial attacks, but I wouldn’t be surprised if they existed and were learnable. I’ve tried a variety of strategies, just for fun, and haven’t found anything that works except luck. I focused on various ways of forcing trades, and this often feels like it’s working but almost never does. As you can see, my record isn’t great.
I think I started playing it when I read simplegeometry’s comment you linked in your shortform.
It seems to be gaining a lot of ground by exploiting my poor openings. Maybe one strategy would be to memorise a specialised opening much deeper than usual? That could be enough. But it’d feel like cheating to me if I used an engine to find that opening. It’d also feel like cheating because it’s exploiting Leela’s lack of memory of past games. It’d be easy to modify it to deliberately play diverse games when playing against the same person.
Would you consider it cheating to observe a bunch of games between Leela and Stockfish, at every move predicting a probability distribution over what move you think Stockfish will play? That might give you an intuition for whether Leela is working by exploiting a few known blind spots (in which case you would generally make accurate predictions about what Stockfish would do, except for a few specific moves), or whether Leela is just out-executing you by a little bit per move (which would look like just being bad at predicting what Stockfish would do in the general case).
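If it helps, here’s a minimal sketch of how one could score those predictions with python-chess. Everything here is hypothetical: the PGN file name, the assumption that Stockfish has Black, and `predict_distribution`, which stands in for however you’d record your own guesses:

```python
import math
import chess
import chess.pgn

def predict_distribution(board):
    # Hypothetical stand-in for your own predictions: a probability for
    # each legal move. Uniform here, so the output is just a baseline.
    moves = list(board.legal_moves)
    return {m: 1.0 / len(moves) for m in moves}

top1_hits, log_loss, n = 0, 0.0, 0
with open("leela_vs_stockfish.pgn") as f:  # hypothetical file of Leela-Stockfish games
    while (game := chess.pgn.read_game(f)) is not None:
        board = game.board()
        for move in game.mainline_moves():
            if board.turn == chess.BLACK:  # assuming Stockfish plays Black
                dist = predict_distribution(board)
                top1_hits += max(dist, key=dist.get) == move
                log_loss -= math.log(max(dist.get(move, 0.0), 1e-12))
                n += 1
            board.push(move)

print(f"top-1 accuracy: {top1_hits / n:.2f}, mean log-loss: {log_loss / n:.2f}")
```

A few sharp blind spots would show up as high top-1 accuracy punctuated by a handful of badly mispredicted moves; general out-execution would show up as mediocre accuracy across the board.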
I don’t think that’d help a lot. I just looked back at several computer analyses, and the (Stockfish) evaluations of the games all look like this:
This makes me think that Leela is pushing me into a complex position and then letting me blunder. I’d guess that studying the optimal moves in these complex positions would be good training, but probably wouldn’t yield easy-to-learn patterns.
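For what it’s worth, pulling those blunders out automatically is straightforward; a rough sketch with python-chess and Stockfish, where the file name, search depth, and 150-centipawn threshold are all arbitrary choices of mine:

```python
import chess
import chess.pgn
import chess.engine

BLUNDER_CP = 150  # arbitrary "big eval drop" threshold, in centipawns

engine = chess.engine.SimpleEngine.popen_uci("stockfish")

def eval_cp(board):
    # White-relative evaluation in centipawns; mates mapped to a large score.
    info = engine.analyse(board, chess.engine.Limit(depth=16))
    return info["score"].white().score(mate_score=10000)

with open("my_leela_games.pgn") as f:  # hypothetical file of my games
    game = chess.pgn.read_game(f)

board = game.board()
prev = eval_cp(board)
for move in game.mainline_moves():
    mover_was_white = board.turn == chess.WHITE
    board.push(move)
    cur = eval_cp(board)
    drop = (prev - cur) if mover_was_white else (cur - prev)
    if drop >= BLUNDER_CP:
        side = "White" if mover_was_white else "Black"
        print(f"{side} blunder around move {board.fullmove_number}: {prev} -> {cur}")
    prev = cur

engine.quit()
```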
Oh, interesting! I didn’t expect to see a mix of games decided by many small blunders and games decided by a few big blunders.
I actually do suspect that there are learnable patterns in these complex positions, but I’m basing that off my experiences with a different game (hex, where my Elo is ~1800) where “the game is usually decided by a single blunder and recognizing blunder-prone situations is key to getting better” is perhaps more strongly true than of chess.
Yeah, I didn’t expect that either; I expected earlier losses (although in retrospect that wouldn’t make sense, because Stockfish is capable of recovering from bad starting positions if it’s up a queen).
Intuitively, over all the games I played, each loss felt different (except for the substantial fraction that were just silly blunders). I think if I learned to recognise blunders in the complex positions I would just become a better player in general, rather than just against LeelaQueenOdds.
Just tried hex, that’s fun.
Maybe I can’t :] but it is beatable by top humans. I bet I could win against a god with queen + knight odds.
My actual point was not about the specific configuration, but rather the general claim that what is important is how balanced the game you play is, and that you can beat an infinitely intelligent being in sufficiently unbalanced games.
Everyone agrees that sufficiently unbalanced games can allow a human to beat a god. This isn’t a very useful fact, since it’s difficult to intuit how unbalanced the game needs to be.
If you can win against a god with queen+knight odds you’ll have no trouble reliably beating Leela with the same odds. I’d bet you can’t win more than 6 out of 10? $20?
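(For calibration: ‘more than 6 out of 10’ is a demanding bar. A quick binomial sketch, ignoring draws and assuming independent games with a fixed per-game win probability p:)

```python
from math import comb

def p_at_least(p, n=10, need=7):
    # Chance of at least `need` wins in `n` independent games,
    # each won with probability p (draws ignored for simplicity).
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for p in (0.5, 0.7, 0.9):
    print(p, round(p_at_least(p), 2))
# -> ~0.17 at p=0.5, ~0.65 at p=0.7, ~0.99 at p=0.9
```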