How I think about alignment

This was written as part of the first Refine blog post day. Thanks to Chin Ze Shen, Tamsin Leake, Paul Bricman, and Adam Shimi for comments.

Magic agentic fluid/​force

Edit: “Magic agentic fluid” = “influence” (more or less). I forgot that this word existed, so I made up my own terminology. Oops! In my defence, before writing this post I did not have a word, just a concept, and a sort of visualisation to go with it in my head. My internal thoughts are mostly non-verbal. So when trying to translate my thoughts into words, I landed on the term that best described how this concept looked in my mind.

Somewhere in my brain there is some sort of physical encoding of my values. This encoding could be spread out over the entire brain, or it could be implicit somehow. I’m not making any claim about how values are implemented in a brain, just that the information is somehow in there.

Somewhere in the future a superintelligent AI is going to take some action.

If we solve alignment, then there will be some causal link between the values in my head (or some human head) and the action of that AI. In some way, whatever the AI does, it should do it because that is what we want.

This is not purely about information. Technically I have some non-zero causal influence over everything in my future lightcone, but most of this influence is too small to matter. More relevant, but still not the thing we want, is a deceptive AI. In this case the AI’s actions are guided by our values in a non-negligible way, but not in the way we want.

I have a placeholder concept which I call magic agentic fluid or magic agentic force. (I’m using “magic” in the traditional rationalist way of tagging that I don’t yet have a good model of how this works.)

MAF is a high-level abstraction. I expect it not to be there when you zoom in too far. Same as how solid objects only exist at the macro scale, and if you zoom in far enough there are just atoms. There is no essence, only shared properties. I think the same way about agents and agency.

MAF is also in some way like energy. In physics energy is a clearly defined concept, but you can’t have just energy. There is no energy without a medium. Same as how you can’t have speed without something that is moving. This is not a perfect analogy.

But I think that this concept does point to something real and/​or useful, and that it would be valuable to try to get a better grasp on what it is, and what it is made of.

Let’s say I want to eat an apple, and later I am eating an apple. There is some causal chain that we can follow from me wanting the apple to me eating the apple. There is a chain of events propagating through spacetime, carrying my will with it and enacting it. This chain is the MAF.

I would very much like to understand the full chain of how my values are causing my actions. If you have any relevant information, please let me know.

Trial and error as an incomplete example of MAF

I want an apple → I try to get an apple and if it doesn’t work, I try something else until I have an apple → I have an apple

This is an example of MAF, because wanting an apple caused me to have an apple through some causal chain. But also notice that there are steps missing. How was “I want an apple” transformed into “I try to get an apple and if it doesn’t work, I try something else until I have an apple”? Why not the alternative plan “I cry until my parents figure out what I want and provide me with an apple”? Also, the second step involves generating actual things to try. When humans execute trial and error, we don’t just do random things; we do something smarter. This is not just about efficiency. If I try to get an apple by trying random action sequences, with no learning other than “that exact sequence did not work”, then I’ll never get an apple, because I’ll die first.
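To make the gaps concrete, here is a minimal sketch of the loop above (the function names are mine and purely illustrative). The two placeholder functions are exactly the steps I don’t know how to fill in:

```python
# A minimal sketch of "trial and error" as a MAF carrier.
# The hard parts are hidden inside the two placeholders: how a goal
# becomes a candidate action, and how success is recognised.

def propose_action(goal, failed_attempts):
    """Placeholder: humans don't sample random action sequences here,
    they do something much smarter. This step is unexplained."""
    raise NotImplementedError

def goal_satisfied(world_state):
    """Placeholder: recognising that I now have the apple."""
    raise NotImplementedError

def trial_and_error(goal, world_state):
    failed_attempts = []
    while not goal_satisfied(world_state):
        action = propose_action(goal, failed_attempts)  # the missing step
        world_state = action(world_state)               # act in the world
        if not goal_satisfied(world_state):
            failed_attempts.append(action)              # only "that didn't work"
    return world_state
```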

I could generate a more complete example, involving neural nets or something, and that would be useful. But I’ll stop here for now.

Mapping the territory along the path

When we have solved alignment, there will be a causal chain carrying MAF from somewhere inside my brain all the way to the actions of the AI. The MAF has to survive passage through several distinct regions with minimal (preferably zero) information loss or other corruption.

  1. The first territory to be crossed is the path from my brain to my actions, where actions are anything that is externally observable.

  2. The next part is the human-AI interaction.

  3. The third part is the internal mind of the AI.

  4. The last part is how the AI’s actions interact with the world (including me).

Obviously all these paths have lots of back and forth feedback, both internally and with the neighbouring territories. However, for now I will not worry too much about that. I’m not yet ready to plan the journey of the MAF. I am mainly focusing on trying to understand and map out all the relevant territory.

My current focus

I’m currently prioritising understanding the brain (the first part of the MAF’s journey), for two reasons:

  1. It’s the part I feel the most confused about.

  2. It is tightly tied up with understanding what human values even are, which is a separate question that I think we need to solve.

The MAF needs to carry information containing my values. It seems like it would be easier to plan the journey if we knew things like “What is the type signature of human values?”
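To illustrate what I mean by a type signature, here are two made-up candidates (purely illustrative, not claims about what the answer is):

```python
from typing import Callable, Protocol

# Stand-in for whatever a "world state" turns out to be.
WorldState = object

# Candidate 1: values as a utility function over world states.
UtilityFunction = Callable[[WorldState], float]

# Candidate 2: values as a (possibly partial) preference ordering.
class Preferences(Protocol):
    def prefers(self, a: WorldState, b: WorldState) -> bool:
        """Return True if world state `a` is preferred over `b`."""
        ...
```

The point is just that these two candidates would call for quite different extraction methods, which is part of why the question matters for planning the journey.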

By getting a better map of the part of the brain where values are encoded (possibly the entire brain), we’ll get a better understanding both of what human values are and of how to get that information out in a non-corrupted way.

In addition, learning more about how brains work will probably give us useful ideas for Aligned AI design.

Appendix: Is there a possible alternative path around (not through) the brain?

Maybe we don’t need a causal link starting from the representation of my values in my brain. Maybe it is easier to reconstruct my values by looking at the causal inputs that formed my values, e.g. genetic information and life history. Or you can go further back and study how evolution shaped my current values. If human values are a very natural consequence of evolutionary pressures and game theory, maybe it would be easier to get the information this way, but I’m not very optimistic about it. I expect there to be too much happenstance encoded in my values. I think an understanding of evolution and game theory, and learning about my umwelt, could do a lot of work in creating better priors, but it is not a substitute for learning about my values directly from me.

One way to view this is that even if you can identify all the incentives that shaped me (both evolutionary and within my lifetime), it would be wrong to assume that my values will be identical to these incentives. See e.g. Reward is not the optimization target. Maybe it is possible to pinpoint my values from my history (evolutionary and personal), but it would be far from straightforward, and personally I don’t think you can get all the information that way. Some of the information relevant to my value formation (e.g. exactly how my brain grew, or some childhood event) will be lost in time, except for its consequences on my brain.

Learning about my history (evolutionary and personal) can be useful for inferring my values, but I don’t think it is enough.

Appendix: Aligned AI as a MAF amplifier

An aligned AI is a magic agentic force amplifier. Importantly, it’s supposed to amplify the force without corrupting the content.

One way to accomplish this is for the AI to learn my values in detail, and then act on those values. I visualise this as me having a fountain of MAF at the centre of my soul (I don’t actually believe in souls; it’s just a visualisation). The AI learns my values, through dialogue or something, and creates a copy of my fountain inside itself. Now technically the AI’s actions flow from the copy of my values, but that’s ok if the copying process is precise enough.

Another way to build an agentic force amplifier is to build an AI that reacts more directly to my actions. My central example for this is a servomotor, which literally amplifies the physical force applied by my arms. I’ve been thinking about how to generalise this, but I don’t know how to scale it to superintelligence (or even human-level intelligence). With the servomotor there is a tight feedback loop where I can notice the outcome of my amplified force on the steering wheel, which allows me to course correct. Is there a way to recreate something that plays the same role for more complicated actions? Or is this type of amplification doomed as soon as there is a significant time delay between my action and the outcome of the amplification?
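To make that last question concrete, here is a toy numerical sketch (my own framing, with made-up numbers) of the same correction rule with and without a delay between my action and the feedback I see:

```python
# Toy model of amplified action with feedback. I push on a system, an
# amplifier multiplies my push, and I adjust my push based on the outcome
# I can observe. This is only meant to illustrate the role of the delay,
# not to be a realistic control model.

def run(delay_steps, gain=10.0, target=1.0, steps=50):
    history = [0.0]  # amplified outcomes over time
    my_push = 0.0
    for _ in range(steps):
        # I can only observe the outcome from `delay_steps` ago.
        observed = history[max(0, len(history) - 1 - delay_steps)]
        my_push += 0.05 * (target - observed)  # my correction
        history.append(gain * my_push)         # the amplified action
    return history[-1]

print(run(delay_steps=0))   # tight loop: settles close to the target
print(run(delay_steps=10))  # delayed feedback: overshoots and oscillates
```

This is just the standard observation that high gain plus delayed feedback destabilises a loop; the open question is whether anything can play the role of the tight loop when the actions being amplified are plans rather than forces on a steering wheel.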