johnswentworth comments on How Do Selection Theorems Relate To Interpretability?

johnswentworth 28 Jun 2022 16:03 UTC
6 points
A useful heuristic: for alignment purposes, most of the value of an interpretability technique comes not from being able to find things, but from being able to guarantee that we didn’t miss anything. The ability to check only some of the work has very little impact on the difficulty of alignment. In particular, it means we cannot apply any optimization pressure at all to that interpretability method (including optimization pressure like “the humans try new designs until they find one which doesn’t raise any problems visible to the interpretability tools”). The main channel through which partial interpretability would be useful is by leading to a method for more comprehensive interpretability.