Unsure about the time controls at the moment; see my response to aphyer. The advisors would be able to give the A player justification for the move they’ve recommended.
The concern that A might not be able to understand the reasoning that the advisors give them is a valid one, and that’s the whole point of the experiment! If A can’t follow the reasoning well enough to determine whether it’s good advice, then (says the analogy) people who are asking AIs how to solve alignment can’t follow their reasoning well enough to determine whether it’s good advice.