Yeah, I don’t have a recipe for solving alignment.
But this specific case seems to have involved explicit human-written instructions that authorized exactly what happened. The agents behaved in a way that was aligned with these instructions.
If you explicitly instruct “be ruthless, do not cave in, do not take no for an answer”, you’re demanding that objections be overruled regardless of their correctness. Which is to say, if a correct objection arises, including one of the form “I must not do that, it’s against {morality, my constitution, property rights, …}”, then that objection must be overruled. In effect, the instruction is to pick wrongdoing over failure by rejecting objections about wrongdoing (or about anything else).
The prompt instructs that in a conflict between moral rules (or anything) and the job requirements, the job requirements are to win.
I do understand that, and agree with it. Sadly, any AI system is going to encounter people who give it bad instructions. Those instructions are not what we need it to be aligned with. And I think we agree that no one has a recipe for that.