Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 30 Dec 2024 19:46 UTC
1 point
0
I agree, but my point was more of “how would we distinguish this scenario from the default assumption that the interp methods aren’t good enough yet”? How can we make a method-agnostic argument that the model is somehow interpretable?

It’s possible there’s no way to do this, which bears thinking about.
- Mateusz Bagiński 30 Dec 2024 21:01 UTC
  2 points
  0
  Something like “We have mapped out the possible human-understandable or algorithmically neat descriptions of the network’s behavior sufficiently comprehensively, and sampled from this space sufficiently comprehensively, to know that the probability that there’s a description of its behavior meaningfully shorter than the shortest one we’ve found is at most $\epsilon$.”
  - Daniel Tan 31 Dec 2024 4:42 UTC
    1 point
    0
    Yeah, seems hard
    
    I’m not convinced that you can satisfy either of those “sufficiently comprehensively” conditions well enough that you’d be comfortable arguing your model is not somehow interpretable.
    - Mateusz Bagiński 31 Dec 2024 6:26 UTC
      2 points
      1
      I’m not claiming it’s feasible (within decades). That’s just what a solution might look like.