Relatedly, the Swiss cheese model of safety is underappreciated in the field. This model has worked across other industries and seems to be what works for the only general intelligence we know: humans. Humans use a mixture of strategies for safety that we could imitate for AI safety (see this draft). However, the agent foundations community seems to be almost completely neglecting this.
The big issue is that much of the Swiss cheese strategy assumes failures across layers are sufficiently uncorrelated that multiple defenses stack, but an AI can coordinate failures so that unlikely events become probable ones even through layers of defenses.
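To make the stacking intuition concrete, here is a minimal sketch with made-up numbers (the failure probability and layer count are illustrative assumptions, not claims from the post) showing how independence versus adversarial correlation changes the arithmetic:

```python
# Illustrative sketch: why correlated failures undermine layered defenses.
# p_layer and n_layers are assumed values for illustration only.
p_layer = 0.1   # assumed chance any single defense layer fails
n_layers = 4    # assumed number of independent-ish layers

# If failures are uncorrelated, the layers stack multiplicatively:
p_breach_independent = p_layer ** n_layers   # 0.1^4 = 0.0001

# Worst case for coordination: a capable adversary makes the layers fail
# together, so the whole stack is only as strong as one layer:
p_breach_correlated = p_layer                # 0.1

print(f"uncorrelated layers: {p_breach_independent:.6f}")
print(f"fully correlated:    {p_breach_correlated:.6f}")
```

Under these toy numbers, correlation turns a one-in-ten-thousand breach into a one-in-ten breach, which is the sense in which coordinated failures can make "unlikely" events probable.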
However, I think AI progress is slow and continuous enough that Swiss cheese models are still reasonably useful, and there is a band of capability where optimization doesn't totally obsolete the strategy.