Very happy to see someone writing about this, as I’ve been thinking for a while that there should be more research into it. I guess my main doubt with this strategy is that if a system runs for long enough in a wide enough variety of circumstances, certain rare outcomes become virtually guaranteed?
Conjecture:
More usefully:
Or at least, theorems of the above sort would mean that assemblages are no less safe than their components, and are potentially much safer.
And assemblages naturally provide defense in depth (i.e. the Swiss cheese strategy).
The heuristic is “assemblage is safer than its primitives”.
Formally:
For every primitive p, all assemblages A1 and A2, and every wiring diagram D:
If D∘(A1⊗p) strongly dominates A1 then D∘(A2⊗p) weakly dominates A2.
Recall that D∘(A⊗p) is the wiring-together of A and p using the wiring diagram D.
In English, this says that p can’t be helpful in one assemblage and harmful in another.
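As a sanity check, here is a minimal toy model of the conjecture, under assumptions not in the original: safety is a single number, an assemblage is a list of independent failure probabilities (the Swiss-cheese picture), and D∘(A⊗p) just adds p as one more layer. All names are illustrative.

```python
from itertools import product as cartesian
from math import prod

def safety(layers):
    # Safety = 1 - P(every layer fails), assuming independent failures.
    # Each entry in `layers` is that layer's failure probability.
    return 1.0 - prod(layers)

def wire(assemblage, primitive):
    # Stand-in for D∘(A⊗p): the primitive becomes one more layer.
    return assemblage + [primitive]

def strongly_dominates(x, y):
    return safety(x) > safety(y)

def weakly_dominates(x, y):
    return safety(x) >= safety(y)

# Brute-force check of the conjecture on a grid of failure probabilities:
# if wiring p into A1 strictly helps, wiring p into A2 must not hurt.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
for p, a1, a2 in cartesian(grid, grid, grid):
    A1, A2 = [a1], [a2]
    if strongly_dominates(wire(A1, p), A1):
        assert weakly_dominates(wire(A2, p), A2)
print("conjecture holds on this toy model")
```

Of course, this model is far too weak to capture the interesting failure modes: adding a layer here can never reduce safety, whereas the counterexamples below hinge on a primitive actively hurting some assemblages.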
I expect counterexamples to this heuristic to look like this:
Many corrigibility primitives allow a human to influence certain properties of the internal state of the AI.
Many interpretability primitives allow a human to learn certain properties of the internal state of the AI.
These primitives might make an assemblage less safe, because the AI could use them on itself, leading to self-modification.