The ELK document outlines a definition of “honesty” and argues that if you achieve that level of honesty, you can scale it up into a whole alignment solution.
I agree that it doesn’t necessarily matter that much whether an AI lies to us versus it just taking our resources. But insofar as you are trying to use AI systems to assist you in supervising other systems, or in providing an enhanced training signal, honesty seems like one of the most central attributes: it helps you get that kind of work out of the AI, and it also allows you to take countermeasures against things the AI is plotting.
If the AI always answers honestly to the questions “are you planning to disempower me?”, “what are your plans for disempowering me?”, and “how would you thwart your plans to disempower me?”, then that sure makes it pretty hard for the AI to disempower you.
“are you planning to disempower me?” and “what are your plans for disempowering me?” and “how would you thwart your plans to disempower me?”
These questions are only relevant if the AI “plans” to disempower me. Disempowerment could still be a side effect, perhaps even something the humans at that point would endorse. Though one could conceivably work around that by asking: “Could your interaction with me lead to my disempowerment in some wider sense?”