I think the overall goal in this proposal is to get a corrigible agent capable of bounded tasks (that maybe shuts down after task completion), rather than a sovereign?
One remaining problem (ontology identification) is making sure your goal specification keeps referring to the same thing as the world-model changes or learns.
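To make that worry concrete, here's a toy sketch (my own illustration, not anything from the proposal; the state dictionaries and the `bridge` function are made-up placeholders): a goal written against the concepts of an earlier world-model can stop referring to anything once the model re-describes the world at a different level, unless some bridging map is maintained.

```python
# Toy illustration (not from the proposal): a goal specified over the concepts
# of an earlier world-model stops referring to anything once the model refines
# its ontology, unless a bridging map is maintained.

# Old world-model: the situation is described with a coarse concept.
old_state = {"diamond_in_vault": True}

def goal_old(state):
    # Goal written directly against the old ontology.
    return state["diamond_in_vault"]

print(goal_old(old_state))  # True under the old ontology

# After learning, the world-model re-describes the same situation at a lower
# level; the old concept no longer appears as a primitive.
new_state = {"carbon_atoms_in_region_7": 10**22, "lattice_structure": "diamond_cubic"}

# Naively reusing the old goal fails: the concept it quantifies over is gone.
try:
    goal_old(new_state)
except KeyError:
    print("goal no longer refers to anything in the new ontology")

# Ontology identification is (roughly) the problem of constructing and
# maintaining this bridge so the goal keeps pointing at the same thing.
def bridge(state):
    return {"diamond_in_vault": state["lattice_structure"] == "diamond_cubic"
                                and state["carbon_atoms_in_region_7"] > 0}

print(goal_old(bridge(new_state)))  # True again, via the bridge
```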
Then the next remaining problem is the inner alignment problem of making sure that the planning algorithm/optimizer (whatever it is that generates actions given a goal, whether or not it's separable from other components) is actually pointed at the goal you've specified and doesn't have any other goals mixed into it (see Context Disaster, optimization daemons, and actual effectiveness for more detail on some of this). Part of this problem is making sure the system is stable under reflection.
Then you've got the outer alignment problem of making sure that your fusion-power-plant goal is safe to optimize (e.g. it won't kill people who get in the way, and it doesn't have any extreme effects if the world model doesn't exactly match reality or if you've forgotten some detail). (See Goodness estimate bias and unforeseen maximum.)
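As a toy picture of the unforeseen-maximum worry (again my own sketch, with entirely made-up numbers and action names): an objective can look fine over the actions the designer imagined while its actual maximum sits at an extreme action whose side effects were never scored.

```python
# Toy sketch (made-up numbers): a "maximize power output" objective looks fine
# over the actions the designer imagined, but its true argmax over the full
# action space is an extreme action with side effects the spec never mentions.

actions = {
    # action: (power_output, harm_to_bystanders)
    "run_reactor_normally":   (1.0, 0.0),
    "optimize_fuel_mix":      (1.2, 0.0),
    # ...an action the designer didn't consider when eyeballing the objective:
    "strip_mine_nearby_town": (5.0, 9.9),
}

def specified_objective(a):
    power, _harm = actions[a]
    return power  # harm never made it into the specification

designer_imagined = ["run_reactor_normally", "optimize_fuel_mix"]
print(max(designer_imagined, key=specified_objective))  # looks fine
print(max(actions, key=specified_objective))            # unforeseen maximum
```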
Ideally here you build in some form of corrigibility and other fail-safe mechanisms, so that you can iterate on the details.
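A minimal sketch of the kind of fail-safe I have in mind for a bounded task (illustrative only; `stop_requested` and `task_complete` are placeholder hooks, not a real design): the agent consults an external stop signal before every action and halts itself once the task predicate is satisfied, rather than continuing to optimize.

```python
# Minimal sketch of a bounded, shutdown-respecting task loop (illustrative only;
# the hooks are placeholders, not a proposed mechanism).

def run_bounded_task(propose_action, execute, task_complete, stop_requested,
                     max_steps=1000):
    for _ in range(max_steps):
        if stop_requested():   # corrigibility: defer to the operators
            return "shut down by operator"
        if task_complete():    # bounded task: stop after completion
            return "task complete, shutting down"
        execute(propose_action())
    return "step budget exhausted, shutting down"

# Example run with trivial placeholder hooks:
print(run_bounded_task(propose_action=lambda: "noop",
                       execute=lambda a: None,
                       task_complete=lambda: True,
                       stop_requested=lambda: False))
```

Of course, a check like this only helps if the planner isn't motivated to route around it, which is why it leans on the inner-alignment point above.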
Those are all the main ones, imo. Conditional on solving the above, and on actively trying to foresee other difficult-to-iterate-on problems, I think the remaining issues would be relatively easy to catch and fix.