Why wouldn’t a solution to Eliciting Latent Knowledge (ELK) help with solving deceptive alignment as well? Isn’t the answer to whether the model is being deceptive part of its latent knowledge?
If ELK is solved in the worst case, how much more work needs to be done to solve the alignment problem as a whole?
I think the issue might be that the ELK head (the system responsible for eliciting another system’s latent knowledge) might itself be deceptively aligned. So if we don’t solve deceptive alignment, our ELK head won’t be reliable.