I probably should’ve titled this “the alignment stability problem in artificial neural network AI”. There’s plenty of work on algorithmic maximizers. But it’s a lot trickier if values/goals are encoded in a network’s distributed representations of the world.
I also should’ve cited Alex Turner’s “Understanding and avoiding value drift.” There he makes a strong case that dominant shards will act to prevent value drift, by blocking other shards from establishing stronger connections to reward. But that’s not quite good enough. Even if it prevents sudden value drift, at least for the central shard or central tendency in values, it doesn’t really address the stability of a multi-goal system. And it doesn’t address slow, subtle drift over time.
Those are important, because we may need a multi-goal system, and we definitely want alignment to stay stable over years, let alone centuries of learning and reflection.