I find this post interesting as a jumping-off point; seems like the kind of thing which will inspire useful responses via people going “no that’s totally wrong!”.
In that spirit, some ways I think it’s totally wrong:
Best-case/worst-case are confusing names for the things they’re pointing to. I’d use “partial” and “comprehensive” interpretability.
I expect training process interpretability won't particularly help produce robust-to-training interpretability. Inspection transparency might help somewhat with robust-to-training interpretability, but mostly I expect that robust-to-training interpretability requires a fundamentally different approach/principles/mindset than the other two.
(I’m also somewhat skeptical about the usefulness of robust-to-training interpretability as an alignment strategy in general, but not sure that’s within-scope here.)
I expect partial/best-case interpretability could help somewhat with comprehensive/worst-case, but mostly I expect comprehensive/worst-case to require a fundamentally different approach/principles/mindset, such that partial/best-case progress mostly doesn't address the hard parts of comprehensive/worst-case.
On the flip side, here are the kinds of foundational "techs" which I think could provide a basis for comprehensive/worst-case interpretability, and maybe for robust-to-training interpretability:
For comprehensive/worst-case interpretability, I expect the first important prerequisite is the ability to identify natural Markov blankets/natural internal APIs. We want to recognize chunks of the net which interact with the rest of the net only via some limited information channel, such that we can rule out all other interactions between the chunk and the rest of the net. (This is basically the natural abstraction idea, applied to neural nets.)
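To make that concrete, here's a minimal toy sketch of what a test for such a channel might look like (the setup and all names are hypothetical; a real version would need to handle approximate channels and distribution shift): hypothesize a chunk and a channel, cut every chunk-to-rest connection that bypasses the channel, and check whether outputs are approximately unchanged.

```python
# Toy sketch (hypothetical setup): test whether a candidate chunk of a small
# feedforward net talks to the rest of the net only through a designated
# low-dimensional channel, by cutting all other cross-connections.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer net: x -> h = relu(W1 @ x) -> y = W2 @ h
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))

def forward(x, W2_used):
    h = np.maximum(W1 @ x, 0.0)
    return W2_used @ h

chunk = np.arange(8)         # hidden units hypothesized to form a chunk
channel = np.array([0, 1])   # the only chunk units allowed to talk to the rest

# Ablation: zero every chunk -> output connection that bypasses the channel.
W2_ablated = W2.copy()
W2_ablated[:, np.setdiff1d(chunk, channel)] = 0.0

# If the channel hypothesis holds, outputs barely move on sampled inputs;
# large divergence falsifies it (as it will here, with random weights).
xs = rng.normal(size=(1000, 8))
divergence = max(
    float(np.linalg.norm(forward(x, W2) - forward(x, W2_ablated))) for x in xs
)
print(f"max output divergence under ablation: {divergence:.4f}")
```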
I expect the next important prerequisite for both comprehensive/worst-case and robust-to-training interpretability is to robustly connect internal data structures in the net to real-world human-recognizable things which the internal data structures “represent”. That, in turn, will require progress on ontology identification/natural abstraction as applied to the world.
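For contrast, the standard tool here today is the linear probe, which only establishes correlation between activations and a labeled feature on some particular dataset; a sketch of that baseline (entirely hypothetical data) shows what the robust version would have to go beyond:

```python
# Toy linear-probe baseline (hypothetical data): correlates activations with
# a human-labeled feature, which is much weaker than robustly identifying
# what the internal data structures actually represent.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: `acts` are hidden activations over a dataset; `feature` is a
# human-recognizable real-world property of each corresponding input.
acts = rng.normal(size=(500, 32))
feature = acts @ rng.normal(size=32) > 0.0

# Fit a linear probe by least squares against {0, 1} labels.
w, *_ = np.linalg.lstsq(acts, feature.astype(float), rcond=None)
accuracy = np.mean((acts @ w > 0.5) == feature)
print(f"probe accuracy on its own training data: {accuracy:.2f}")
```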
Alternatively, one could just try to recognize/interpret the core "agency data structures" (e.g. goals, world models, search) in a robust-to-training way, without connecting all the internal data structures to their real-world referents. For that more limited problem, the prerequisite work is more mathematical than empirical; one needs definitions/operationalizations of goals/search/world-models/etc. which robustly characterize the potentially-dangerous sorts of goal-seeking behavior. (Though this strategy is another case where I'm somewhat skeptical that it's viable as an alignment strategy in general.)
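As one toy illustration of the shape such a definition might take (my example, not a worked-out proposal): call a policy goal-directed when it is near-optimal for some nontrivial utility function over trajectories.

```latex
% Toy operationalization (illustrative only): a policy \pi is
% (u, \epsilon)-goal-directed if it is \epsilon-close to optimal for u.
\[
  \mathbb{E}_{\tau \sim \pi}\left[u(\tau)\right]
  \;\ge\; \max_{\pi'} \, \mathbb{E}_{\tau \sim \pi'}\left[u(\tau)\right] - \epsilon
\]
```

The hard part, of course, is ruling out degenerate choices of u under which every policy counts as trivially goal-directed.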
> I find this post interesting as a jumping-off point; seems like the kind of thing which will inspire useful responses via people going "no that's totally wrong!".
Yeah, I’d be very happy if that were the result of this post—in fact, I’d encourage you and others to just try to build your own tech trees so that we have multiple visions of progress that we can compare.