Epistemic status: Story. I am assuming here that my current guesses for the answers to outstanding research questions are true, which I don’t actually think they are; they’re not entangled enough with actual data yet for that to be the case. This is just trying to motivate why I think these are the right kinds of things to research.
Figure out how to measure and plot information flows in ML systems. Develop an understanding of abstractions, natural or otherwise, and how they are embedded in ML systems as information processing modules.
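To make that first step a bit more concrete, here is a minimal toy sketch of the kind of measurement I have in mind, assuming a crude binned mutual-information estimate stands in for “information flow”; the model, the hook, and the choice of feature are all illustrative assumptions, not a worked-out method.

```python
# Toy sketch only: a crude, binned mutual-information estimate between one input
# feature and each hidden unit of a tiny MLP, as a stand-in for "information flow".
# Everything here (model, bin count, feature choice) is an illustrative assumption.
import numpy as np
import torch
import torch.nn as nn

def binned_mutual_info(x, y, bins=16):
    """Plug-in estimate of I(X;Y) in nats from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
hidden = {}
# Forward hook grabs the post-nonlinearity activations of the hidden layer.
model[1].register_forward_hook(lambda mod, inp, out: hidden.update(act=out.detach()))

x = torch.randn(5000, 4)
model(x)

# One crude "flow" number per (input feature 0 -> hidden unit j) edge.
flows = [binned_mutual_info(x[:, 0].numpy(), hidden["act"][:, j].numpy())
         for j in range(hidden["act"].shape[1])]
print(np.round(flows, 3))
```

Plotting numbers like these per edge, per layer, over the course of training is roughly what I mean by “plot information flows”, though a real version would need much better estimators than a histogram.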
Use these tools to find out how ML systems embed things like subagents, world models, and goals, how these interlink, and how they form during training. I’m still talking here about systems like current reinforcement learners and transformer models, or things not far removed from them.
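As a gesture at what “using these tools” might look like at toy scale, here is an illustrative linear-probing sketch with made-up details: train a probe on each layer’s activations to see where a simple hand-built concept becomes linearly decodable, as a stand-in for locating pieces of a world model.

```python
# Illustrative only: probe each layer of a random toy network for a hand-made
# "concept", as a stand-in for locating world-model features. All details assumed.
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

x = torch.randn(2000, 10)
concept = (x[:, 0] + x[:, 1] > 0).long().numpy()  # toy binary "concept" label

acts, h = [], x
for layer in layers:
    h = torch.tanh(layer(h))
    acts.append(h.detach().numpy())

# Fit on the first 1500 points, report held-out accuracy per depth.
for depth, a in enumerate(acts, start=1):
    probe = LogisticRegression(max_iter=1000).fit(a[:1500], concept[:1500])
    print(f"layer {depth}: held-out probe accuracy {probe.score(a[1500:], concept[1500:]):.2f}")
```

Running probes like this across checkpoints, rather than on one trained network, is the sort of thing I mean by asking how these structures form during training.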
With some better idea of what “goals” in ML systems even look like, formalise these concepts, and find selection theorems that tell you, rigorously, which goals a given loss function used by the outer optimiser will select for. I suspect that in dumb systems this is (or could be made) pretty predictable and robust to medium-sized changes in the loss function, architecture, or outer optimiser, because that seems to be the case in our brains. E.g., some people are born blind and never use their inborn primitive facial-recognition circuitry while forming social instincts, yet their values seem to turn out decidedly human-like. Humans need perturbations like sociopath-level breakage of the reward circuitry that is our loss function, or being raised by wolves, to fail to form human-like values.
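To be explicit about what I would want such a theorem to say, here is a purely hypothetical target shape; the symbols Train, Goal, and G(L) are placeholders I am making up, not established objects.

```latex
% Hypothetical target shape, not an established result: training under loss L,
% architecture A, and optimiser O should concentrate the learned goal
% representation near a set G(L) that moves little under medium-sized changes
% to L, A, or O.
\[
  \Pr_{\theta \sim \mathrm{Train}(L,\,A,\,O)}
    \Big[\, d\big(\mathrm{Goal}(\theta),\, G(L)\big) \le \epsilon \,\Big] \ge 1 - \delta,
  \qquad
  G(L') \approx G(L) \quad \text{whenever } \lVert L' - L \rVert \le \eta .
\]
```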
Now you know how to make a (dumb) AGI that wants particular things, with definitions tied to concepts in its world model that are free to evolve as it gets smarter. You also know which training situations and setups to avoid so that the outer optimiser doesn’t create new goals you don’t want.
Use this to train an AGI to have human-baby-like primitive goals/subagents/Shards, like “humans being sad is bad” or “cartoons are fun”. As capabilities increase, these primitive goals seem likely to be generalised by the AGI into human-adult-like preferences even under moderately big perturbations, because that sure seems to happen with humans.
Now train the non-goal parts of the AGI to superhuman capability. Since it wants its own goals preserved just as much as you do, it will gladly go along with this. Should a takeoff point be reached, it will use its knowledge of its own structure to self-modify while preserving its extrapolated, human-like values.