Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination will never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.
My whole point of ‘defense in depth’ was that each layer was highly fallible and could have errors. Your expectation only holds if you expect failure to be perfectly correlated or multiple layers actually reduce the strength of layers, otherwise the probability of the AI beating layers A and B necessarily is less than beating just A or B (A ^B < A v B).
Your expectation only holds if you expect failure to be perfectly correlated or multiple layers actually reduce the strength of layers, otherwise the probability of the AI beating layers A and B necessarily is less than beating just A or B (A ^B < A v B).
That’s true. However I would expect a transhuman to be able to find a single point of failure which does not even occur to our limited minds, so this perfect correlation is a virtual certainty.
Now you’re just ascribing magical powers to a potentially-transhuman AI. I’m sure there exists such a silver bullet, in fact by definition if security isn’t 100%, that’s just another way of saying there exists a strategy which will work; but that’s ignoring the point about layers of security not being completely redundant with proofs and utility functions and decision theories, and adding some amount of safety.
As I understand EY’s point, it’s that (a) the safety provided by any combination of defenses A, B, C, etc. around an unboundedly self-optimizing system with poorly architected goals will be less than the safety provided by such a system with well architected goals, and that (b) the safety provided by any combination of defenses A, B, C, etc. around such a system with poorly architected goals is too low to justify constructing such a system, but that (c) the safety provided by such a system with well architected goals is high enough to justify constructing such a system.
That the safety provided by a combination of defenses A, B, C is greater than that provided by A alone is certainly true, but seems entirely beside his point.
(For my own part, a and b seem pretty plausible to me, though I’m convinced of neither c nor that we can construct such a system in the first place.)
My whole point of ‘defense in depth’ was that each layer was highly fallible and could have errors. Your expectation only holds if you expect failure to be perfectly correlated or multiple layers actually reduce the strength of layers, otherwise the probability of the AI beating layers A and B necessarily is less than beating just A or B (A ^B < A v B).
That’s true. However I would expect a transhuman to be able to find a single point of failure which does not even occur to our limited minds, so this perfect correlation is a virtual certainty.
Now you’re just ascribing magical powers to a potentially-transhuman AI. I’m sure there exists such a silver bullet, in fact by definition if security isn’t 100%, that’s just another way of saying there exists a strategy which will work; but that’s ignoring the point about layers of security not being completely redundant with proofs and utility functions and decision theories, and adding some amount of safety.
Disengaging.
As I understand EY’s point, it’s that (a) the safety provided by any combination of defenses A, B, C, etc. around an unboundedly self-optimizing system with poorly architected goals will be less than the safety provided by such a system with well architected goals, and that (b) the safety provided by any combination of defenses A, B, C, etc. around such a system with poorly architected goals is too low to justify constructing such a system, but that (c) the safety provided by such a system with well architected goals is high enough to justify constructing such a system.
That the safety provided by a combination of defenses A, B, C is greater than that provided by A alone is certainly true, but seems entirely beside his point.
(For my own part, a and b seem pretty plausible to me, though I’m convinced of neither c nor that we can construct such a system in the first place.)