Very happy to see someone writing about this, as I’ve been thinking for a while that there should be more research into it. I guess my main doubt with this strategy is that if a system runs for long enough in a wide enough variety of circumstances, certain rare outcomes become virtually guaranteed?
Conjecture:
More usefully:
Or at least, theorems of the above sort would mean that assemblages are no less safe than their components, and are potentially much safer.
And assemblages naturally provide defense in depth (i.e. the Swiss cheese strategy).
The heuristic is “assemblage is safer than its primitives”.
Formally:
For every primitive p, all assemblages A1 and A2, and every wiring diagram D:
If D∘(A1⊗p) strongly dominates A1 then D∘(A2⊗p) weakly dominates A2.
Recall that D∘(A⊗p) is the wiring-together of A and p using the wiring diagram D.
In English, this says that p can’t be helpful in one assemblage and harmful in another.
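As a sanity check, here is a minimal toy model of the conjecture, under assumptions not in the original: safety is a single number, an assemblage is a list of independent failure probabilities (the Swiss-cheese picture), and D∘(A⊗p) just adds p as one more layer. All names are illustrative.

```python
from itertools import product as cartesian
from math import prod

def safety(layers):
    # Safety = 1 - P(every layer fails), assuming independent failures.
    # Each entry in `layers` is that layer's failure probability.
    return 1.0 - prod(layers)

def wire(assemblage, primitive):
    # Stand-in for D∘(A⊗p): the primitive becomes one more layer.
    return assemblage + [primitive]

def strongly_dominates(x, y):
    return safety(x) > safety(y)

def weakly_dominates(x, y):
    return safety(x) >= safety(y)

# Brute-force check of the conjecture on a grid of failure probabilities:
# if wiring p into A1 strictly helps, wiring p into A2 must not hurt.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
for p, a1, a2 in cartesian(grid, grid, grid):
    A1, A2 = [a1], [a2]
    if strongly_dominates(wire(A1, p), A1):
        assert weakly_dominates(wire(A2, p), A2)
print("conjecture holds on this toy model")
```

Of course, this model is far too weak to capture the interesting failure modes: adding a layer here can never reduce safety, whereas the counterexamples below hinge on a primitive actively hurting some assemblages.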
I expect counterexamples to this heuristic to look like this:
Many corrigibility primitives allow a human to influence certain properties of the internal state of the AI.
Many interpretability primitives allow a human to learn certain properties of the internal state of the AI.
These primitives might make an assemblage less safe, because the AI could use them on itself, leading to self-modification.