This is true, though to be clear, I'm specifically making the point that interpretability on its own will not be a highly reliable method for establishing the absence of deception. In my opinion that holds for all current approaches, but I think only mech interp people have ever seriously claimed that their approach might succeed this hard.