I prefer to avoid terms such as “pretending” or “faking”, and instead to define these notions more precisely.
As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, “faking” would be the case where AIs follow these specifications reliably while we are testing them, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, which I agree is very important. As I write elsewhere, interpretability may be helpful here, but I don’t think it is a necessary condition.