I assign moderately high credence to the Hodge Podge hypothesis being false; I expect that alignment can be solved in a simple/straightforward manner.
(E.g. something like ambitious value learning may just work if the natural abstractions hypothesis is true and there exists a broad basin of attraction around human values or human (meta)ethics in concept space.)
But research diversification suggests that we should try the hodge podge strategy nonetheless. ;)
More concretely, this research proposal is a Pareto improvement to the research landscape, and it may well be a significant improvement (for the reasons you stated).
In particular:
I expect that many alignment primitives would be synergistic.
E.g. quantilisation and impact regularisation
Boxing and ~everything
A Swiss-cheese-style approach to safety seems intuitively sensible from an alignment-as-security mindset
I.e. assemblages naturally provide security in depth
It’s a highly scalable agenda in a way most others are not
It can exploit the theory overhang
It's easy to onboard new software engineers and have them produce useful alignment work
It’s robust to epistemic uncertainty about what the solution is
Modular alignment primitives are an ideal framework for alignment as a service (AaaS)
They are probably going to be easier to incorporate into extant systems
This will probably boost adoption compared to bespoke solutions or approaches that require the system to be built from scratch for safety
If there’s a convenient alignment as a service offering, then even organisations not willing to pay the time/development cost of alignment may adopt alignment offerings
No one genuinely wants to build misaligned systems, so if we eliminate/minimise the alignment tax, more people will build aligned systems
AaaS could solve/considerably mitigate the coordination problems confronting the field
I also like the schema you presented for getting robust arguments/guarantees for safety.
Enjoyed this post, strongly upvoted.