The problem with the Swiss cheese model here illustrates why this is unpromising as stated. In the Swiss cheese model you start with some working system, and then the world throws unexpected accidents at you, and you need to protect the working system from being interrupted by an accident. This is not our position with respect to aligned AI—a misaligned AI is not well-modeled as an aligned AI plus some misaligning factors. That is living in the should-universe-plus-diff. If you prevent all “accidents,” the AI will not revert to some normal non-accident home state of human-friendliness.
Yes, combining multiple safety features is done all the time, e.g. if you’re designing a fusion reactor. But you don’t design a working fusion reactor by taking twenty non-working designs and summing all their features. Such an approach to fusion-reactor design wouldn’t work because:
- features of a fusion reactor only improve its function within a specific context that has to be taken into account during design
- probably some of the features you’re adding were bad ideas to begin with, and those can cancel out all the fusion you were trying to promote with all the other features, because fusion is a rare and special thing that takes work to make happen
- some of the other features are working at cross-purposes—e.g. one feature might involve outside resources, and another feature might involve isolating the system from the outside
- some of the features might have unexpected synergies, which will go unrealised because achieving synergy requires carefully setting parameters by thinking about what produces fusion, not just about combining features
I disagree with the core of this objection.
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [Quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.] (a toy sketch of one such safeguard follows this list)
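To make approach #2 concrete, here is a minimal toy sketch of quantilisation (my own illustration, not something from the post or this thread): instead of outputting the argmax of a possibly mis-specified utility proxy, a q-quantiliser samples actions from a base distribution and picks randomly among the top q fraction as ranked by the proxy, which bounds how hard the proxy gets optimised. All names and numbers below are made up for illustration.

```python
import random

def quantilise(actions, utility_proxy, q=0.1, n_samples=1000, rng=random):
    """Pick one action from the top-q fraction of samples, ranked by a
    (possibly mis-specified) utility proxy, instead of taking the argmax."""
    sampled = [rng.choice(actions) for _ in range(n_samples)]  # draw from the base distribution (uniform here)
    sampled.sort(key=utility_proxy, reverse=True)              # rank samples by the proxy
    cutoff = max(1, int(q * len(sampled)))                     # keep only the top q fraction
    return rng.choice(sampled[:cutoff])                        # randomise within that slice

# Hypothetical usage: the proxy wildly over-values action "c"; a q=0.5 quantiliser
# still picks "b" a substantial fraction of the time rather than always exploiting "c".
proxy = {"a": 1.0, "b": 2.0, "c": 100.0}.get
print(quantilise(["a", "b", "c"], proxy, q=0.5))
```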
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.
Your objection seems to assume that #1 is the only viable approach to AI existential safety, but that seems far from obvious. Corrigibility is itself an approach under #2, and if I’m not mistaken, Yudkowsky considers it one way of tackling alignment.
To Cleo Nardo: you should pick a better name for this strategy. “Hodge-podge alignment” is such terrible marketing.