I’ve been following this discussion from Jan’s first post, and I’ve been enjoying it. I’ve put together some pictures to explain what I see in this discussion.
The original misalignment claim might be pictured something like this:
This is fair as a first take, and if we want to look at it through a utility function optimisation lens, we might say something like this:
Where cultural values are the local environment that we’re optimising for.
As Jacob mentions, humans are still quite effective general optimisers even when measured directly against evolution’s utility function. This calls for a new model.
Here’s what I think actually happens:
Which can be perceived as something like this in the environmental sense:
Based on this model, what is cultural (human) evolution telling us about misalignment?
We have adopted proxy values (Y1, Y2, ..., YN), i.e. culture, in order to optimise for X, inclusive genetic fitness (IGF). In other words, the shard of cultural values developed as a more efficient optimisation target in the new environment, where different tribes applied optimisation pressure on each other.
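The proxy relationship above can be sketched as a toy hill-climber: an agent that only ever evaluates the proxy values Y_i (culture) still ends up scoring well on the hidden target X (IGF), so long as the environment keeps the proxies correlated with the target. This is a minimal illustrative sketch, not anything from the discussion; all names and parameters are made up.

```python
import random

random.seed(0)

# Hidden target X (standing in for IGF). The agent never evaluates this
# directly; only evolution "sees" it.
def hidden_target(state):
    return sum(state)

# Proxy values Y_i (standing in for cultural values): noisy reads of the
# underlying traits, so they stay correlated with X in this environment.
def proxy_values(state):
    return [s + random.gauss(0, 0.1) for s in state]

state = [0.0, 0.0, 0.0]
for _ in range(200):
    i = random.randrange(len(state))
    candidate = state[:]
    candidate[i] += random.gauss(0, 0.5)
    # Hill-climb ONLY on the proxies, never on the hidden target.
    if sum(proxy_values(candidate)) > sum(proxy_values(state)):
        state = candidate

# Because the proxies remain coupled to X, optimising them lifts X as well.
print(hidden_target(state))
```

The point of the toy is the same as the model above: misalignment only shows up once the environment shifts so that the proxies and the target decouple; inside the environment that produced them, proxy optimisation looks like target optimisation.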
Also, I really enjoy the book The Secret of Our Success when thinking about these models, as it provides some very nice evidence about human evolution.
I agree with your general model of the proxy, but human brains are clearly more complex than just optimizing for cultural values. It’s more that culture/memes are a new layer of replicators evolving in tandem with (but much faster than) genes. The genes may determine the architectural prior for the brain, reward functions, etc., but the memes are the dataset that largely determines the resulting mind. Our reward circuitry hasn’t changed much recently, so the proxy is mostly still pre-cultural, but cultural software has evolved on top to exploit/cooperate with/control that circuitry.