Jozdien comments on Daniel Tan’s Shortform

Jozdien 30 Dec 2024 19:42 UTC
2 points
0
I would ask what the end-goal of interpretability is. Specifically, what explanations of our model’s cognition do we want to get out of our interpretability methods? The mapping we want is from the model’s cognition to our idea of what makes a model safe. “Uninterpretable” could imply that the way the models are doing something is too alien for us to really understand. I think that could be fine (though not great!), as long as we have answers to questions we care about (e.g. does it have goals, what are they, is it trying to deceive its overseers)^[1]. To questions like those, “uninterpretable” doesn’t seem as coherent to me.
1. The “why” or maybe “what”, instead of the “how”.
- Daniel Tan 30 Dec 2024 19:46 UTC
  1 point
  0
  I agree, but my point was more of “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet”? How can we make a method-agnostic argument that the model is somehow interpretable?
  
  It’s possible there’s no way to do this, which bears thinking about.
  - Mateusz Bagiński 30 Dec 2024 21:01 UTC
    2 points
    0
    Something like “We have mapped out the possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively and sampled from this space sufficiently comprehensively to know that the probability that there’s a description of its behavior that is meaningfully shorter than the shortest one of the ones that we’ve found is at most $\epsilon$.”
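    
    As a rough sketch, the condition could be written as follows. The symbols are illustrative assumptions: $\mathcal{D}$ is the space of candidate descriptions, $|d|$ is the length of a description $d$, $\hat{d}$ is the shortest description found so far, and $\delta > 0$ is whatever margin counts as “meaningfully shorter”.
    
    $$\Pr\left[\,\exists\, d \in \mathcal{D} : |d| < |\hat{d}| - \delta\,\right] \le \epsilon$$
    
    The two “sufficiently comprehensively” requirements then correspond to how well $\mathcal{D}$ has been mapped out and how well it has been sampled, since both are needed before a bound like this could be trusted.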
    - Daniel Tan 31 Dec 2024 4:42 UTC
      1 point
      0
      Yeah, seems hard
      
      I’m not convinced that you can satisfy either of those “sufficiently comprehensively” such that you’d be comfortable arguing your model is not somehow interpretable
      - Mateusz Bagiński 31 Dec 2024 6:26 UTC
        2 points
        1
        I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.