That’s a very cool research proposal. I think it’s an extremely important topic. I’ve been trying to write a post about this for a while, but haven’t had much time.
Centrally, I’m motivated by the fact that humans and LLMs don’t have a clean value/reasoning factorization in the brain (or the weights), so where do we/they get it? We often act coherently, so we should have something isomorphic to that somewhere.
Seems to me a pretty plausible hypothesis is that “clean” values emerge at least in large part through symbolic ordering of thoughts, i.e. when we’re forced to represent our values to others, or to ourselves when that helps us reason about them.
Then we end up identifying with that symbolic representation instead of the raw values that generated it. It has a “regularizing” effect, so to speak.
Like, I notice I feel bad whenever other people feel bad, and when some animals feel bad, but not all animals. Then I compactify that pattern of feelings I find in myself by saying to myself and others “I don’t like it when other beings feel bad”. That in turn has the potential to rewire the ground-level feelings: if I say to myself “I don’t like it when other beings feel bad” enough times, eventually I might start feeling bad when a shrimp’s eyestalks are ablated, even though I didn’t feel that way originally. This is a somewhat cartoonish and simple example, of course.
But it seems reasonable to me that value formation in humans (meaning, the things we’re functionally optimizing for when we’re acting coherently) works that way. And it seems plausible that value formation in LLMs would also work this way.
I haven’t thought of any experiments to test this though.