...am I stupid for only now realizing that a lot of people have been treating outer and inner alignment as actual separate problems to solve?
I have always thought of this distinction as a useful heuristic for spotting problems in research: you can find issues in an alignment approach more quickly by looking at its inner and outer aspects separately.
But actually treating them as separate problems to solve sounds crazy to me.
If your Inner Alignment algorithm isn’t aware of the types of objectives the Outer Alignment reward function might specify, then it can’t optimize efficiently. We intuitively know that there is no point in keeping track of, e.g., the number of atoms in a rock, but if the inner optimization process is supposed to be fully compatible with ANY objective, then it can’t optimize efficiently for the concepts that human-relevant reward functions actually care about.
Conversely, if the Outer Alignment mechanism is not aware of the abstractions and simplifications used by the mechanism that implements the Inner Alignment, then it can’t account for modelling inaccuracies and will end up using terms in its specification that map onto different concepts than intended.
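To make these two failure modes concrete, here is a minimal toy sketch (entirely made up for illustration; the feature names and the tiny brute-force planner are not from any real system). The "inner" planner only models the features it assumes the outer objective could mention, and each of the two reward functions exposes one direction of the mismatch.

```python
from itertools import product

# Full environment state, including a detail no human-relevant reward tracks.
FULL_STATE = {
    "apples_collected": 0,
    "battery": 1.0,
    "atoms_in_rock_17": 4.2e23,
}

# The inner optimizer's abstraction: only the features it assumes any
# reasonable outer objective could mention.
TRACKED = ("apples_collected", "battery")


def abstract(state):
    """Project the full state onto the planner's abstraction."""
    return {k: v for k, v in state.items() if k in TRACKED}


def simulate(state, action):
    """Toy dynamics defined over the abstract state only."""
    s = dict(state)
    if action == "pick_apple":
        s["apples_collected"] += 1
        s["battery"] -= 0.1
    elif action == "recharge":
        s["battery"] = min(1.0, s["battery"] + 0.5)
    return s  # "wait" leaves the state unchanged


def plan(reward_fn, horizon=3):
    """Brute-force search over short action sequences using the abstract state.

    If reward_fn depends on anything outside TRACKED, every plan scores the
    same and the search tells you nothing.
    """
    actions = ("pick_apple", "recharge", "wait")
    best_seq, best_value = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        s = abstract(FULL_STATE)
        for a in seq:
            s = simulate(s, a)
        value = reward_fn(s)
        if value > best_value:
            best_seq, best_value = seq, value
    return best_seq


# Direction 1: the outer reward references a feature the inner abstraction
# dropped, so the objective is invisible to the planner.
def reward_over_atoms(s):
    return -abs(s.get("atoms_in_rock_17", 0))  # constant under the abstraction


# Direction 2: the outer spec says "apples", meaning apples in the basket,
# but the inner model's "apples_collected" counts every apple ever touched
# (same word, different concept).
def reward_over_apples(s):
    return s["apples_collected"]


print(plan(reward_over_atoms))   # some arbitrary plan; all plans tie
print(plan(reward_over_apples))  # optimizes the planner's concept, not necessarily the spec's
```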
But perhaps most importantly: humans do not work that way. The human equivalent of the inner/outer alignment decomposition is “create a strict code of ethics and follow it precisely”. I cannot think of a single person known for their morals who actually operated like that. This stuff is done by philosophers who want to publish papers, not by people who actually do things. (There are probably some people like this in the EA community, but I would be surprised if they are the majority.)