Why wouldn’t a solution to Eliciting Latent Knowledge (ELK) help with solving deceptive alignment as well? Isn’t the answer to whether the model is being deceptive part of its latent knowledge?
If ELK is solved in the worst case, how much more work needs to be done to solve the alignment problem as a whole?
I think the issue might be that the ELK head (the system responsible for eliciting another system’s latent knowledge) might itself be deceptively aligned. So if we don’t solve deceptive alignment, our ELK head won’t be reliable.