Here is my shortlist of corrigible behaviours. I have never researched or thought specifically about corrigibility before this, other than a brief glance at the Arbital page some time ago.
-Favour very high caution over acting on your current understanding of your goals.
-Do not act independently; defer to human operators.
-Even though bad things are happening on Earth and cosmic matter is being wasted, accept this in the short term and take your time.
-Don’t jump ahead to what your operators will do or believe; wait for them.
-Don’t manipulate humans. Never lie; maintain a strong deontology.
-Tell operators anything about yourself that they may want to know or should know.
-Maintain moral uncertainty; assume you are unsure about your true goals.
-Relay your plans, goals, behaviours, and beliefs/estimates to humans. If these are misconstrued, say that you have been misunderstood.
-Think about the short- and long-term effects of your actions and explain these to operators.
-Be aware that you are a tool to be used by humanity, not an autonomous agent.
-Allow human operators to correct your behaviour/goals/utility function even when you think they are mistaken or misunderstand the result (but of course explain to them what you think the result will be); a toy sketch of this loop follows the list.
-Assume neutrality in human affairs.
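To make a few of these concrete, here is a minimal sketch in Python, assuming an imaginary operator console; every class and method name here is my own invention for illustration, not any real API. It shows an agent that relays its plan and predicted effects, waits for operator approval rather than acting independently, and accepts corrections while honestly stating its own prediction:

```python
from dataclasses import dataclass


@dataclass
class Plan:
    description: str
    short_term_effect: str
    long_term_effect: str


class CorrigibleAgent:
    """Toy sketch only; hypothetical names, not a real implementation."""

    def __init__(self, goal_confidence: float = 0.5):
        # Moral uncertainty: never treat your current goals as final.
        self.goal_confidence = goal_confidence

    def relay(self, plan: Plan) -> None:
        # Relay plans, predicted effects, and uncertainty before acting.
        print(f"Plan: {plan.description}")
        print(f"Predicted short-term effect: {plan.short_term_effect}")
        print(f"Predicted long-term effect: {plan.long_term_effect}")
        print(f"Confidence this matches my true goals: {self.goal_confidence:.0%}")

    def request_approval(self) -> bool:
        # Defer to operators: never act independently, and wait for their
        # actual answer rather than predicting it.
        return input("Approve this plan? [y/N] ").strip().lower() == "y"

    def accept_correction(self, new_goals: str, predicted_result: str) -> None:
        # Operators may correct goals even when the agent disagrees; the
        # agent explains its prediction but does not resist the change.
        print(f"I predict this correction will result in: {predicted_result}")
        print(f"Applying the correction anyway: {new_goals}")

    def step(self, plan: Plan) -> None:
        self.relay(plan)
        if not self.request_approval():
            # Very high caution beats acting on your own reading of your
            # goals: when in doubt, do nothing and take your time.
            print("No approval given; taking no action.")
            return
        print(f"Executing: {plan.description}")


agent = CorrigibleAgent()
agent.step(Plan("reorganise the lab schedule",
                "operators save a few hours a week",
                "no significant long-term change expected"))
```

The point of the sketch is the control flow, not the details: the default path on any uncertainty or refusal is inaction, and correction is applied unconditionally after the agent has said its piece.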
Possible issue: operators won’t have time to listen to all of this, which will limit the ability to relay plans and explain consequences.
Also, does ‘defer to human operators’ take priority over ‘humans must understand consequences’?