In some sense I agree with this post, but I’m not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate “evidence which is just from fancy internals-based methods (and can’t be supported by human inspection of AI behavior)”, where that is the primary story for why it is impactful? I don’t think this is true of probing, SAEs, circuit analysis, debate, …
This is often important in my thinking: it comes up when thinking about various internals-based methods that could test for scheming (but won’t produce direct behavioral evidence). I wrote this doc after noticing that I wanted to reference this somewhere.
Also, I often hear people discuss getting non-behavioral evidence for scheming using internals/interp. (As an example, training probes for detecting deceptive cognition and then seeing if they fire more than expected on honeypots.) And understanding that this isn’t going to result in legible evidence is important for understanding the theory of change for this work: it’s important that you can iterate usefully against the method. I think people sometimes explicitly model iterating against these testing methods, but sometimes they don’t.
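(To make the kind of check being gestured at here concrete: below is a minimal illustrative sketch, not taken from the discussion itself, of training a simple probe on activations and comparing how often it fires on ordinary inputs vs. honeypots. The data, dimensions, and names are all made up; a real version would use activations extracted from an actual model on labeled examples.)

```python
# Minimal illustrative sketch (hypothetical data): train a linear probe on
# activations labeled "honest" vs. "deceptive", then compare how often it
# fires on ordinary inputs vs. honeypot inputs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # stand-in for the activation dimension

# Stand-ins for residual-stream activations; in practice these would be
# extracted from a model on labeled examples.
honest_acts = rng.normal(0.0, 1.0, size=(500, d))
deceptive_acts = rng.normal(0.3, 1.0, size=(500, d))

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 500 + [1] * 500)
probe = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical activations gathered on ordinary prompts and on honeypots.
ordinary_acts = rng.normal(0.0, 1.0, size=(200, d))
honeypot_acts = rng.normal(0.2, 1.0, size=(200, d))

ordinary_rate = probe.predict(ordinary_acts).mean()
honeypot_rate = probe.predict(honeypot_acts).mean()
print(f"probe firing rate: ordinary={ordinary_rate:.2f}, honeypot={honeypot_rate:.2f}")
# A firing rate on honeypots that is much higher than expected would be the
# kind of non-behavioral signal discussed above -- suggestive, but not
# legible evidence on its own.
```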
Maybe this doesn’t come up as much in your conversations with people, but I’ve seen internals-based testing methods which don’t clearly ground out in behavioral evidence discussed often. (E.g., it’s the application that the Anthropic interp team has most discussed, and it’s the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)
Notably, Dario seemingly thinks that circuit style interp analysis (which IMO would be unlikely to yield behavioral evidence on its own) is the main way we might get definitive (aka legible) evidence of scheming. So, I think Dario’s essay on interp is an example of someone disagreeing with this post! (The essay came out after this post was published; otherwise I might have referenced it.)
I wasn’t trying to trigger any particular research reprioritization with this post, but I have historically found that people hadn’t really thought through this (relatively obvious once noted) consideration, and I think people are sometimes interested in thinking through specific theories of impact for their work.
Here is the quote from Dario:

More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.

To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking. The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
IMO, this implies that interp would allow for rallying support that would otherwise be hard to get, which implies that the behavioral evidence isn’t key.
I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.
Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)
The Anthropic discussion seems to be about making a safety case, which seems different from generating evidence of scheming. I haven’t been imagining that if Anthropic fails to make a specific type of safety case, they then immediately start trying to convince the world that models are scheming (as opposed to e.g. making other mitigations more stringent).
I think if a probe for internal deceptive reasoning works well enough, then once it actually fires you could do some further work to turn it into legible evidence of scheming (or learn that it was a false positive), so I feel like the considerations in this post don’t apply.
Fair enough. I would be sad if people moved away from e.g. probing for deceptive reasoning or circuit analysis because they now think that these methods can’t help produce legible evidence of misalignment (which would seem incorrect to me), which seems like the most likely effect of a post like this. But I agree with the general norm of just saying true things that people are interested in without worrying too much about these kinds of effects.