When you say “some case in which a human might make different judgments, but where it’s catastrophic for the AI not to make the correct judgment,” what I hear is “some case where humans would sometimes make catastrophic judgments.”
I think such cases exist, and they're a problem for the premise that agreement from some humans means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are reasons to expect optimization pressure pushing AI-proposed plans toward the limits of human judgment.