There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.
This framing requires no commitment to the “hodge-podge hypothesis”, and it makes a more compelling case for the agenda.
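To make the “defense in depth” framing concrete, here’s a toy Python sketch of how #2-style safeguards might stack as independent modules around a possibly mistargeted base policy. Everything here (`utility_model`, `impact_measure`, the thresholds) is a hypothetical stand-in illustrating the composition pattern, not a real implementation of quantilisation or impact regularisation:

```python
import random

# Hypothetical stand-ins -- a real system would supply learned models here.
def utility_model(state, action):
    """A possibly mis-specified utility estimate (toy placeholder)."""
    return hash((state, action)) % 100

def impact_measure(state, action):
    """An estimated side-effect magnitude (toy placeholder)."""
    return hash((action, state)) % 10

def impact_filter(state, actions, budget=5):
    """Impact regularisation (sketch): discard actions whose estimated
    side effects exceed a fixed budget; fall back to all actions if the
    filter would leave nothing."""
    kept = [a for a in actions if impact_measure(state, a) <= budget]
    return kept or list(actions)

def quantilise(scores, q=0.25):
    """Quantilisation (sketch): sample uniformly from the top q-fraction
    of actions instead of argmaxing, limiting how hard we optimise a
    possibly flawed utility model."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

def safeguarded_policy(state, actions):
    """Defense in depth: each safeguard is an independent module, so a
    failure mode has to slip past every layer, not just one."""
    survivors = impact_filter(state, actions)
    scores = {a: utility_model(state, a) for a in survivors}
    return quantilise(scores)

print(safeguarded_policy("s0", ["a", "b", "c", "d"]))
```

The point is the shape, not the stubs: each safeguard needs only a narrow interface (actions in, actions or a choice out), which is what lets a hodge-podge of them compose cheaply.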
I think you could further strengthen the case for this agenda by emphasising its strategic implications:
- Companies routinely use cryptographic assemblages. This is IMO an insufficiently emphasised advantage: modular alignment primitives will probably be easier to incorporate into extant systems than bespoke solutions (or approaches that require designing systems for safety from scratch), and that should boost adoption (see the sketch after this list).
- If there’s a convenient/compelling alignment-as-a-service offering, then even organisations unwilling to pay the time/development cost of alignment may adopt it (no one genuinely wants to build misaligned systems).
- I.e. if we minimise or eliminate the alignment tax, organisations become much more willing to pay it, and modular assemblages seem like a compelling platform for such an offering.
- If successful, this research agenda could “solve” much of the coordination problem.
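On the cryptographic-assemblage analogy: application developers almost never design ciphers; they drop vetted primitives in behind a small interface. A minimal example with Python’s `cryptography` package, purely to illustrate the adoption dynamic that modular alignment primitives might inherit:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Vetted primitives (AES, GCM mode, key generation) composed behind a
# tiny interface -- the application developer designs none of them.
key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)
nonce = os.urandom(12)  # 96-bit nonce, as AES-GCM expects

ciphertext = aead.encrypt(nonce, b"customer record", b"header")
assert aead.decrypt(nonce, ciphertext, b"header") == b"customer record"
```

If alignment primitives exposed comparably small interfaces, the hope is they could be dropped into extant systems at comparably low integration cost.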
Suggestions for Improvement
(Some further improvements to the case for the proposed research agenda are drawn from my other comments.)

A. “Hodge-podge alignment” is a terrible name; you need better marketing.

B. Ditch the “hodge-podge hypothesis”: it may be implausible to some, and it’s unnecessary for your case. I think the argument at the top of this comment is stronger.