RobertKirk comments on A transparency and interpretability tech tree

RobertKirk 20 Jun 2022 13:22 UTC
LW: 5 AF: 4
Going 1->4 or 2->5 by the behavioural-cloning approach assumes that all parts of the model are about equally difficult to interpret, and that the only bottleneck is the time it takes humans to interpret every part, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me), then I wouldn’t expect the behaviourally-cloned interpretation agent to generalise to correctly interpreting the worst-case stuff.