shower thought: What if mech interp is already pretty good, and it turns out that the models themselves are just doing relatively uninterpretable things?
How would we know?
I don’t know! Seems hard
It’s hard because you need to disentangle ‘the interpretive power of the method’ from ‘whether the model has anything that can be interpreted’, without any ground-truth signal for the latter. Basically, you need to be very confident that the interp method is good in order to make this claim.
One way you might be able to demonstrate this is to train or design toy models that you know have some underlying interpretable structure, and show that your interpretation methods work there. But it seems hard to construct toy models in a realistic way while also ensuring they have the structure you want; if we could do this, we wouldn’t even need interpretability.
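For concreteness, here’s a very rough sketch of the kind of check I mean: plant known feature directions in synthetic activations, run some generic recovery method over them (plain dictionary learning here, purely as a stand-in for a real interp method), and score how well the planted structure is recovered. All the numbers and the recovery score are ad hoc choices for illustration, not any established benchmark.

```python
# Sketch only: synthetic "activations" with planted feature directions, and
# generic dictionary learning standing in for an interp method. The question
# is just whether the known structure gets recovered.
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_true_features, d_model, n_samples = 12, 8, 2000

# Ground-truth feature directions; more features than dimensions, so they
# have to be stored in superposition.
true_features = normalize(rng.normal(size=(n_true_features, d_model)))

# Sparse codes: each sample activates a small random subset of features.
codes = rng.exponential(size=(n_samples, n_true_features))
codes *= rng.random((n_samples, n_true_features)) < 0.1
activations = codes @ true_features

# "Interp method": plain dictionary learning on the activations.
dl = DictionaryLearning(n_components=n_true_features, alpha=0.5,
                        max_iter=50, random_state=0)
dl.fit(activations)
learned = normalize(dl.components_)

# Recovery score: for each planted feature, cosine similarity to its
# best-matching learned direction. Values near 1.0 across the board would
# mean the method "works" on this toy setup.
similarity = np.abs(true_features @ learned.T)
recovery = similarity.max(axis=1)
print("per-feature recovery:", np.round(recovery, 2))
print("mean recovery:", round(float(recovery.mean()), 3))
```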
Edit: Another method might be to show that models get more and more “uninterpretable” as you train them on more data. I.e., define some metric of interpretability, like “ratio of monosemantic to polysemantic MLP neurons”, and measure it over the course of training. This exact instantiation of the metric is probably bad, but something like it could work.
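As a sketch, here’s one crude fraction-of-monosemantic-neurons version of that metric (the purity rule below is something I’m making up on the spot, and `checkpoints` / `get_mlp_activations` in the commented-out loop are hypothetical stand-ins for however you’d load activations per checkpoint):

```python
# Sketch only: a crude "fraction of monosemantic MLP neurons" metric. A neuron
# counts as monosemantic here if its top-k activating inputs mostly share one
# coarse label; that purity rule is an arbitrary stand-in, not a real definition.
import numpy as np

def monosemantic_fraction(acts: np.ndarray, labels: np.ndarray,
                          top_k: int = 50, purity_threshold: float = 0.9) -> float:
    """acts: (n_samples, n_neurons) MLP activations; labels: (n_samples,)
    coarse input categories."""
    n_mono = 0
    for neuron in range(acts.shape[1]):
        top_idx = np.argsort(acts[:, neuron])[-top_k:]
        _, counts = np.unique(labels[top_idx], return_counts=True)
        if counts.max() / len(top_idx) >= purity_threshold:
            n_mono += 1
    return n_mono / acts.shape[1]

# Tracking it over training would then look something like this
# (checkpoints / get_mlp_activations are hypothetical loaders, not a real API):
# for step, ckpt in checkpoints:
#     acts = get_mlp_activations(ckpt, eval_inputs)
#     print(step, monosemantic_fraction(acts, eval_labels))
```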
I would ask what the end-goal of interpretability is. Specifically, what explanations of our model’s cognition do we want to get out of our interpretability methods? The mapping we want is from the model’s cognition to our idea of what makes a model safe. “Uninterpretable” could imply that the way the models are doing something is too alien for us to really understand. I think that could be fine (though not great!), as long as we have answers to questions we care about (e.g. does it have goals, what are they, is it trying to deceive its overseers)[1]. To questions like those, “uninterpretable” doesn’t seem as coherent to me.
The “why” or maybe “what”, instead of the “how”.
I agree, but my point was more “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet?” How can we make a method-agnostic argument that the model isn’t somehow interpretable?
It’s possible there’s no way to do this, which bears thinking about
Something like: “We have mapped out the space of possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively, and sampled from this space sufficiently comprehensively, to know that the probability that there’s a description of its behavior meaningfully shorter than the shortest one we’ve found is at most ϵ.”
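Spelled out a bit more formally (my notation, nothing standard: 𝒟 for the set of descriptions we’ve actually found, ℓ(d) for the length of a description d), the claim would be roughly:

$$\Pr\Big[\,\exists\, d \notin \mathcal{D}\ \text{describing the network's behavior with}\ \ell(d) \ll \min_{d' \in \mathcal{D}} \ell(d')\,\Big] \le \epsilon$$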
Yeah, seems hard
I’m not convinced that you can satisfy either of those “sufficiently comprehensively” conditions such that you’d be comfortable arguing your model is not somehow interpretable.
I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.