I think this really depends on what “good” means exactly. For instance, if humans believe our interp is good but we overestimate how good it actually is, and the AI system knows this, then it can exploit our misplaced trust in “good” mech interp to scheme more deceptively.
I’m guessing your notion of “good” must explicitly rule this scenario out. But that just raises the question: how could we ever know that our mech interp has reached that level of goodness?