MiguelDev comments on Against Almost Every Theory of Impact of Interpretability

MiguelDev 18 Aug 2023 8:19 UTC
2 points
1
My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build.
I agree with this perspective if we can afford the time to perform interpretability work on all model setups, but our head count is too low for that. Given the urgency of the alignment challenge, it’s better to encourage (or even prioritize) conceptually sound interpretability work rather than speculative approaches.