I don’t like the hodge-podge hypothesis and think it’s just a distraction, so maybe I’m coming at it from a different angle. As I see it, the case for this research proposal is:
There are two basic approaches to AI existential safety:
1. Targeting AI systems at values we’d be “happy” (were we fully informed) for them to optimise for [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that may not necessarily be ideally targeted [quantilisation, impact regularisation, boxing, myopia, designing non-agentic systems, corrigibility, etc.]
“Hodge-podge alignment” provides a highly scalable and viable “defense in depth” research agenda for #2.