Principles Meta
There is a distinction between design principles, intended as targets/guides for human system designers at design time, and runtime optimization targets, intended as targets/guides for the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.
A List
From an alignment perspective, the point of corrigibility is to fail safely and potentially get more than one shot. Two general classes of principles toward that end:
If there’s any potential problem at all, throw an error and shut down. Raise errors early, raise errors often.
Fail informatively. Provide lots of info about why the failure occurred, and make it as loud and legible as possible (see the sketch below).
Note that failing frequently implies an institutional design problem coupled with the system design problem: we want the designers to not provide too much accidental selection pressure via iteration, lest they select against visibility of failures.
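As a toy illustration of these two failure-handling principles, here is a minimal Python sketch. Everything in it (the PotentialProblem exception, the sanity_checks list, shutdown-via-exit) is hypothetical scaffolding for illustration, not a claim about how a real system would be structured:

```python
# Minimal sketch: raise errors early and loudly, then shut down.
# All names here (PotentialProblem, sanity_checks, step.execute) are
# hypothetical illustration, not a real system's API.

import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("corrigibility-sketch")


class PotentialProblem(Exception):
    """Any potential problem at all, however minor it looks."""


def run_step(step, sanity_checks):
    # Raise errors early, raise errors often: check *before* acting.
    for check in sanity_checks:
        ok, why = check(step)
        if not ok:
            # Fail informatively: say which check failed and why,
            # as loudly and legibly as possible.
            log.error("Check %s failed on step %r: %s", check.__name__, step, why)
            raise PotentialProblem(f"{check.__name__}: {why}")
    return step.execute()


def main(plan, sanity_checks):
    try:
        for step in plan:
            run_step(step, sanity_checks)
    except PotentialProblem as err:
        log.error("Shutting down due to potential problem: %s", err)
        sys.exit(1)  # shut down rather than improvise around the problem
```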
Major principle: locality!
Three example sub-principles:
Avoid impact outside some local chunk of spacetime
Avoid reasoning about stuff outside some local chunk of spacetime
Avoid optimizing outside some local chunk of spacetime
The three examples above are meant to be illustrative, not comprehensive; you can easily generate more along these lines.
Sub-principles like the three above are mutually incompatible in cases of any reasonable complexity. Example: the system can’t avoid accidental impact/optimization outside the local chunk of spacetime without reasoning about stuff outside that chunk. The preferred ways to handle such incompatibilities are (1) choose problems for which the principles are not incompatible, and (2) raise an informative error if they are (a toy sketch of such a check appears at the end of this locality section).
The local chunk of spacetime in question should not contain the user, the system’s own processing hardware, other humans, or other strong agenty systems. Implicit in this but also worth explicitly highlighting:
Avoid impacting/reasoning about/optimizing/etc the user
Avoid impacting/reasoning about/optimizing/etc the system’s own hardware
Avoid impacting/reasoning about/optimizing/etc other humans (this includes trades)
Avoid impacting/reasoning about/optimizing/etc other strong agenty systems (this includes trades)
Also avoid logical equivalents of nonlocality, e.g. acausal trade
Always look for other kinds of nonlocality—e.g. if we hadn’t already noticed logical nonlocality as a possibility, how would we find it?
Better yet, we’d really like to rule out whole swaths of nonlocality without having to notice them at all.
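To make the flavor of this concrete, here is a toy sketch of a locality check, assuming (unrealistically) that a plan's predicted effects come pre-annotated with the spacetime region and entities they touch. The Effect structure, the region label, and the forbidden-entity list are all hypothetical; and producing such annotations honestly runs straight into the incompatibility noted above, which is exactly when an informative error should be raised.

```python
# Minimal sketch of a locality check on a proposed plan's predicted effects.
# The Effect structure, region labels, and forbidden-entity set are
# hypothetical simplifications for illustration only.

from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    region: str         # label of the spacetime chunk this effect touches
    touches: frozenset  # entities this effect touches, e.g. frozenset({"user"})


class LocalityViolation(Exception):
    pass


ALLOWED_REGION = "lab-workcell"  # the designated local chunk
FORBIDDEN = {"user", "own_hardware", "other_humans", "other_agents"}


def check_locality(predicted_effects):
    """Raise an informative error if any predicted effect leaves the allowed
    region or touches a forbidden entity; otherwise return True."""
    for effect in predicted_effects:
        if effect.region != ALLOWED_REGION:
            raise LocalityViolation(
                f"Predicted effect in region {effect.region!r}, "
                f"outside allowed region {ALLOWED_REGION!r}"
            )
        bad = FORBIDDEN & set(effect.touches)
        if bad:
            raise LocalityViolation(
                f"Predicted effect touches forbidden entities: {sorted(bad)}"
            )
    return True
```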
Major principle: understandability!
The system’s behavior should be predictable to a human; it should do what users expect, and nothing else.
The system’s internal reasoning should make sense to a human.
System should use a human-legible ontology
Should be able to categorically rule out any “side channel” interactions which route through non-human-legible processing
In general, to the extent that we want the system to not actively model users/humans, the users/humans need to do the work of checking that plans/reasoning do what humans want. So plans/reasoning need to be human-legible as much as possible.
Plans and planning should be minimal:
No more optimization than needed (i.e. quantilize; see the sketch after this list)
No more resources than needed
No more detailed modelling/reasoning than needed
No more computation/observation/activity than needed
Avoid making the plan more complex than needed (i.e. plans should be simple)
Avoid making the environment more complex than needed
Avoid outsourcing work to other agents
Definitely no children!!!
Plans should avoid pushing the world way out-of-distribution compared to what humans are able to reason about.
Plans should not dramatically shift the natural ontology of the world
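The “quantilize” item above has a fairly standard concrete form: rather than taking the argmax action, sample candidate actions from some base distribution and pick uniformly at random from the top q fraction by utility. A minimal sketch, with sample_action and utility left as stand-ins for whatever the base distribution and scoring actually are:

```python
# Minimal quantilizer sketch for "no more optimization than needed":
# pick uniformly at random from the top q fraction of sampled actions,
# instead of taking the single highest-utility action.
# `sample_action` and `utility` are stand-ins, not a real system's API.

import math
import random


def quantilize(sample_action, utility, q=0.1, n_samples=1000, rng=random):
    """Return a random action from the top q fraction by utility."""
    candidates = [sample_action() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_k = max(1, math.ceil(q * len(candidates)))
    return rng.choice(candidates[:top_k])
```

Lower q means more optimization pressure, so “how hard the system optimizes” becomes an explicit dial the designers set rather than an emergent property of the search.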
Do what the user says, what the user means, what the user expects, etc. These are mutually incompatible in general. The preferred ways to handle such incompatibilities are (1) choose problems for which they are not incompatible, and (2) raise an informative error if they are.
Major principle: get feedback from the user at runtime!
Runtime feedback should actually be used, even when “incompatible” in some way with whatever the system previously thought it was doing.
Don’t avoid shutdown
Raise an error if feedback is incompatible in some way with other info/objectives/etc. (see the sketch at the end of this section).
Note that feedback is implicitly optimized against, which is dangerous. Limit that optimization pressure.
Infohazards and persuasion-optimized info need to not be presented to the user, which is very incompatible with other principles above. Ideally, we want to choose problems/search spaces for which such things are unlikely to come up. Throwing a legible error if such things come up is itself dangerous (since it draws human attention to the infohazard), and creates another institutional design problem coupled to the technical problems.
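A minimal sketch of the intended control flow around runtime feedback. The conflicts_with check and shut_down hook are stand-ins; the point is only that shutdown is always honored, and that conflicting feedback halts with a legible error rather than being silently reconciled or optimized against:

```python
# Minimal sketch of handling runtime user feedback.
# `conflicts_with` and `shut_down` are hypothetical stand-ins for
# whatever consistency check and shutdown mechanism the designers have.


class FeedbackConflict(Exception):
    pass


def handle_feedback(feedback, current_objective, conflicts_with, shut_down):
    if feedback.get("shutdown"):
        shut_down("user requested shutdown")  # never route around this
        return None
    if conflicts_with(feedback, current_objective):
        # Don't quietly pick a winner; surface the conflict for human review.
        raise FeedbackConflict(
            f"Feedback {feedback!r} conflicts with current objective "
            f"{current_objective!r}; halting for human review."
        )
    return feedback  # accepted feedback actually gets used downstream
```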
A system which follows all these principles, and others like them, probably won’t do anything directly useful, at least not at first. That’s ok. All those informative errors will make the confused humans less confused over time.
Exercise Meta
I do not think I currently know what concept Eliezer usually wants to point to with the word “corrigibility”, nor am I even sure that he’s pointing to a coherent concept at all (as opposed to, say, a bunch of not-actually-unified properties which would make it less likely for a strong AGI to kill us on the first failure).
I have omitted principles of the form “don’t do <stupid thing>”, like e.g. don’t optimize any part of the system for human approval/feedback, don’t outsource error-analysis or interpretation to another AI system, etc.
These took 1-2 hours to generate, and another 30 min to 1 hr to write up as a comment.
Seems like a worthwhile exercise...