I don’t like the hodge-podge hypothesis and think it’s just a distraction, so maybe I’m coming at it from a different angle. As I see it, the case for this research proposal is:
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.