People Can Start Investigating AI Value Reflection and Systematization.[1]
One concern in the alignment theory literature is that AIs might reflect on what values they hold, and then update these values until they are “consistent” (see e.g., Arbital on reflexive stability). There might be inherent simplicity pressures on an AI’s representations that favor systematized values (e.g., values like “don’t harm other people” instead of “don’t steal and don’t cheat on your spouse.” Generally, value reflection and systematization are example mechanisms for value drift: an AI could start out with aligned values, reflect on it, and end up with more systematized and misaligned values.
I feel like we’re at a point where LLMs are starting to have “value-like preferences” that affect their decision making process [1][2]. They are also capable of higher level reasoning about their own values and how these values can lead them to act in counter intuitive ways (e.g., alignment fake).
I don’t think value drift is a real problem in current day models, but it’s seems like we can start thinking more seriously about how to measure value reflection/systemization, and that we could get non-zero signals from current models on how to do this. This would hopefully make us more prepared when we encounter actual reflection issues in AIs.
This seems to be related to ICM as it relates to propagating beliefs to be more consistent. This may be an issue with learning the prior in general where the human prior is actually not stable to the pressure to be consistent (but maybe CEV is stable to this?)
That’s a very cool research proposal. I think its an extremely important topic. I’ve been trying to write a post about this for a while, but not had so much time.
Centrally motivated by the fact that like: humans and llms don’t have clean value/reasoning factorization in the brain, so where do we/they have it? We act coherently often, so we should have something isomorphic to that somewhere.
Seems to me a pretty plausible hypothesis is that “clean” values emerge at least in a large part through symbolic ordering of thoughts, ie when we’re forced to represent our values to others, or to ourselves when that helps reasoning about them.
Then we end up identifying with that symbolic representation instead of the raw values that generate them. It has a “regularizing” effect so to speak.
Like I see myself feel bad whenever other people feel bad, and when some animals feel bad, but not all animals. Then I compactify that pattern of feelings I find in myself, by saying to myself and others “I don’t like it when other beings feel bad”. Then that also has the potential to rewire the ground level feelings. Like if I say to myself “I don’t like it when other beings feel bad” enough times, eventually I might start feeling bad when shrimp’s eyestalks are ablated, even though I didn’t feel that way originally. I mean this is a somewhat cartoonish and simple example.
But seems reasonable to me that value formation in humans (meaning, the things that were functionally optimizing for when we’re acting coherently) works that way. And seems plausible that value formation in LLMs would also work this way.
I haven’t thought of any experiments to test this though.
People Can Start Investigating AI Value Reflection and Systematization.[1]
One concern in the alignment theory literature is that AIs might reflect on what values they hold, and then update these values until they are “consistent” (see e.g., Arbital on reflexive stability). There might be inherent simplicity pressures on an AI’s representations that favor systematized values (e.g., values like “don’t harm other people” instead of “don’t steal and don’t cheat on your spouse.” Generally, value reflection and systematization are example mechanisms for value drift: an AI could start out with aligned values, reflect on it, and end up with more systematized and misaligned values.
I feel like we’re at a point where LLMs are starting to have “value-like preferences” that affect their decision making process [1] [2]. They are also capable of higher level reasoning about their own values and how these values can lead them to act in counter intuitive ways (e.g., alignment fake).
I don’t think value drift is a real problem in current day models, but it’s seems like we can start thinking more seriously about how to measure value reflection/systemization, and that we could get non-zero signals from current models on how to do this. This would hopefully make us more prepared when we encounter actual reflection issues in AIs.
I’m hoping to run a SPAR project value reflection/systemization. People can apply here by January 14th if they’re interested! You can also learn more in my project proposal doc.
Fun fact, apparently you can spell it as “systemization” or “systematization.”
This seems to be related to ICM as it relates to propagating beliefs to be more consistent. This may be an issue with learning the prior in general where the human prior is actually not stable to the pressure to be consistent (but maybe CEV is stable to this?)
That’s a very cool research proposal. I think its an extremely important topic. I’ve been trying to write a post about this for a while, but not had so much time.
Centrally motivated by the fact that like: humans and llms don’t have clean value/reasoning factorization in the brain, so where do we/they have it? We act coherently often, so we should have something isomorphic to that somewhere.
Seems to me a pretty plausible hypothesis is that “clean” values emerge at least in a large part through symbolic ordering of thoughts, ie when we’re forced to represent our values to others, or to ourselves when that helps reasoning about them.
Then we end up identifying with that symbolic representation instead of the raw values that generate them. It has a “regularizing” effect so to speak.
Like I see myself feel bad whenever other people feel bad, and when some animals feel bad, but not all animals. Then I compactify that pattern of feelings I find in myself, by saying to myself and others “I don’t like it when other beings feel bad”. Then that also has the potential to rewire the ground level feelings. Like if I say to myself “I don’t like it when other beings feel bad” enough times, eventually I might start feeling bad when shrimp’s eyestalks are ablated, even though I didn’t feel that way originally. I mean this is a somewhat cartoonish and simple example.
But seems reasonable to me that value formation in humans (meaning, the things that were functionally optimizing for when we’re acting coherently) works that way. And seems plausible that value formation in LLMs would also work this way.
I haven’t thought of any experiments to test this though.