William_S comments on A transparency and interpretability tech tree

William_S 17 Jun 2022 4:15 UTC
LW: 4 AF: 3
Do you think we could basically go 1->4 and 2->5 if we trained a helper network to behaviourally clone humans who are using transparency tools, and then ran the helper network over the entire network/training process? Or if we did critique-style training (using RL to reward a helper model that has access to the main model's weights whenever it produces evidence of a property we don’t want the main network to have)?
- RobertKirk 20 Jun 2022 13:22 UTC
  LW: 5 AF: 4
  The ability to go 1->4 or 2->5 via the behavioural-cloning approach would assume that the difficulty of interpretation is fairly similar across all parts of the model, and that the only bottleneck is the time it takes humans to interpret every part, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than understanding the best-case stuff (which seems likely to me), then I wouldn’t expect the behaviourally-cloned interpretation agent to generalise to correctly interpreting the worst-case stuff.