Interesting. I think I agree with all of this.
However, it seems to me that the more fundamental issue is that it’s circular. “Suppose ELK was solved” already implies that we must be able to train heads that we’re sure are not themselves deceptively aligned.
In some sense this is a simpler problem, since it only requires that we avoid deception for one particular type of model. But I’d expect it to be of similar difficulty: I expect the insights required to robustly avoid deception in a reporter to be the same insights required to robustly avoid deception in a predictor.
Granted, it’d still be nice if we had a way to expose deception in a model someone else trained. But assuming we’ve solved it for reporters, I’d think the more reasonable approach would be to say “Hey—people training predictors, how about using [approach we used for reporters]?”.
Perhaps I’m wrong on the circularity—if so I’d be keen to know why.
E.g. if we knew deception would show up in predictors before reporters, we could trust things for a while, but we’d need reporters for our reporters soon enough. Even with a stack of n such reporters, we’d only have confidence in our ability to test a predictor for deception, not a means to train a non-deceptive predictor.
A nitpick:
Maybe tweak the introduction a bit to make clear that it’s not a definition of “A deceptively aligned model”. (As a definition, what you’ve written describes any form of proxy alignment; deceptive alignment is more specific: there’s no ‘Perhaps’.)