Here is my shortlist of corrigible behaviours. I have never researched or thought specifically about corrigibility before this, other than a brief glance at the Arbital page some time ago.
-Favour very high caution over acting on your current understanding of your goals.
-Do not act independently; defer to human operators.
-Even though bad things are happening on Earth and cosmic matter is being wasted, accept this in the short term and take your time.
-Don’t jump ahead to what your operators will do or believe; wait for them.
-Don’t manipulate humans. Never lie; maintain a strong deontology.
-Tell operators anything about yourself that they may want to know or should know.
-Maintain moral uncertainty; assume you are unsure about your true goals.
-Relay your plans, goals, behaviours, and beliefs/estimates to humans. If these are misconstrued, say that you have been misunderstood.
-Think about the short- and long-term effects of your actions and explain these to operators.
-Be aware that you are a tool to be used by humanity, not an autonomous agent.
-Allow human operators to correct your behaviour/goals/utility function even when you think they are mistaken or misunderstand the result (but of course explain to them what you think the result will be); a toy sketch of this loop follows the list.
-Assume neutrality in human affairs.
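To make a few of these concrete, here is a minimal sketch in Python, assuming an imaginary operator console; every class and method name here is my own invention for illustration, not any real API. It shows an agent that relays its plan and predicted effects, waits for operator approval rather than acting independently, and accepts corrections while honestly stating its own prediction:

```python
from dataclasses import dataclass


@dataclass
class Plan:
    description: str
    short_term_effect: str
    long_term_effect: str


class CorrigibleAgent:
    """Toy sketch only; hypothetical names, not a real implementation."""

    def __init__(self, goal_confidence: float = 0.5):
        # Moral uncertainty: never treat your current goals as final.
        self.goal_confidence = goal_confidence

    def relay(self, plan: Plan) -> None:
        # Relay plans, predicted effects, and uncertainty before acting.
        print(f"Plan: {plan.description}")
        print(f"Predicted short-term effect: {plan.short_term_effect}")
        print(f"Predicted long-term effect: {plan.long_term_effect}")
        print(f"Confidence this matches my true goals: {self.goal_confidence:.0%}")

    def request_approval(self) -> bool:
        # Defer to operators: never act independently, and wait for their
        # actual answer rather than predicting it.
        return input("Approve this plan? [y/N] ").strip().lower() == "y"

    def accept_correction(self, new_goals: str, predicted_result: str) -> None:
        # Operators may correct goals even when the agent disagrees; the
        # agent explains its prediction but does not resist the change.
        print(f"I predict this correction will result in: {predicted_result}")
        print(f"Applying the correction anyway: {new_goals}")

    def step(self, plan: Plan) -> None:
        self.relay(plan)
        if not self.request_approval():
            # Very high caution beats acting on your own reading of your
            # goals: when in doubt, do nothing and take your time.
            print("No approval given; taking no action.")
            return
        print(f"Executing: {plan.description}")


agent = CorrigibleAgent()
agent.step(Plan("reorganise the lab schedule",
                "operators save a few hours a week",
                "no significant long-term change expected"))
```

The point of the sketch is the control flow, not the details: the default path on any uncertainty or refusal is inaction, and correction is applied unconditionally after the agent has said its piece.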
Possible issue: operators won’t have time to listen to all of this, which will limit the ability to relay plans and explain consequences.
Also, does ‘defer to human operators’ take priority over ‘humans must understand consequences’?