Neuroscience and Alignment

I’ve been in many conversations where I’ve mentioned the idea of using neuroscience for outer alignment, and the people who I’m talking to usually seem pretty confused about why I would want to do that. Well, I’m confused about why one wouldn’t want to do that, and in this post I explain why.

As I see it, there are three main strategies people have for dealing with AI alignment in worlds where AI alignment is hard.

  1. Value alignment

  2. Corrigibility

  3. Control/scalable alignment

In my opinion, these are all great efforts, but I personally like the idea of working on value alignment directly. Why? First, some negatives of the others:

  1. Corrigibility requires moderate to extreme levels of philosophical deconfusion, an effort worth pursuing for some people, but a very small set that doesn’t include me. Another negative of this approach is that, by default, the robust solutions to the problem won’t be easily implementable in deep learning.

  2. Control/scalable alignment requires understanding the capabilities & behaviors of inherently unpredictable systems. Sounds hard![1]

Why is value alignment different from these? Because we have a working example of a value-aligned system right in front of us: the human brain. This permits an entirely scientific approach, requiring minimal philosophical deconfusion. And in contrast to corrigibility solutions, biological and artificial neural networks are based on the same fundamental principles, so there’s a much greater chance that insights from one will carry over to the other.

In the most perfect world, we would never touch corrigibility or control with a ten-foot pole; instead, once we realized the vast benefits and potential pitfalls of AGI, we’d get to work on decoding human values (or, more likely, the generators of human values) directly from the source.

Indeed, in worlds where control or scalable alignment go well, I expect the research area our AI minions will most prioritize is neuroscience. The AIs will likely be too dumb or have the wrong inductive biases to hold an entire human morality in their heads, and even if they can, we won’t know that they do, so we need them to demonstrate that their values are the same as our values in a way that can’t be gamed by exploiting our many biases or philosophical inadequacies. The best way to do that is through empiricism: directly studying & making predictions about the thing you’re trying to explain.

The thing is, we don’t need to wait for potentially transformative AGI to start doing that research; we can do it now! And we can even use presently existing AIs to help!

I am hopeful that there are in fact clean values, or generators of values, in our brains, such that we could understand just those mechanisms and not the others. In worlds where this is not the case, I get more pessimistic about our chances of ever aligning AIs, because in those worlds all the computations in the brain are necessary to implement a “human morality”. That means that if you do, say, RLHF or DPO to your model and hope it ends up aligned afterwards, it will not be aligned, because it is not literally simulating an entire human brain: it’s doing less than that, so it must be missing some necessary computation.

Put another way, worlds where you need to understand the entire human brain to understand human morality are often also worlds where human morality is incredibly complex, so value learning approaches are less likely to succeed, and the only aligned AIs are those which are digital emulations of human brains. So again, neuroscience becomes even more necessary.

Thanks to @Jozdien for comments

  1. ^

    I usually see people say “we do control so we can do scalable alignment”, where scalable alignment means taking a small model and having it align a larger model, and figuring out procedures such that the larger model can only end up more aligned than the smaller model (a toy sketch of that loop is below). This IMO has very similar problems to control, so I lump the strategy & the criticism together.
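To make the footnoted procedure concrete, here is a minimal, purely illustrative sketch of the “small model supervises larger model” loop. Everything in it (the tiny MLPs, the random data, the imitation loss) is a hypothetical stand-in rather than anyone’s actual scalable-alignment method; it only shows the shape of the loop whose guarantees are the hard part.

```python
# Toy sketch: a weak "supervisor" model's judgments are used as training
# targets for a stronger student model. Purely illustrative, not a real
# scalable-oversight procedure.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for a weak (already "aligned") model and a stronger student.
weak_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
strong_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))

optimizer = torch.optim.Adam(strong_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 16)                         # unlabeled "questions"
    with torch.no_grad():
        weak_labels = weak_model(x).argmax(dim=-1)   # weak supervisor's judgments
    logits = strong_model(x)
    loss = loss_fn(logits, weak_labels)              # strong model imitates the weak one
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The open problem the footnote gestures at: guaranteeing that the strong
# model ends up *more* aligned than its weak supervisor, rather than merely
# copying (or exploiting) the supervisor's mistakes.
```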