Although I do tend to generally disagree with this line of argument about drive-to-coherence, I liked this explanation.
I want to make a note on comparative AI and human psychology, which is like… one of the places I might kind of get off the train. Not necessarily the most important.
Stage 2 comes when it's had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it's Dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).
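(As a concrete toy gloss on "Dutch-booking itself", sketched by me rather than taken from the post, and with every name below made up for illustration: an agent whose goals induce cyclic preferences will pay a small fee at every step of a swap cycle and end up holding exactly what it started with, strictly poorer.)

```python
# Toy money-pump illustration (mine, not the post's): an agent with cyclic
# preferences A > B > C > A accepts every "trade up for a small fee" offer
# and ends up back where it started, minus money.

CYCLE = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means "prefers x to y"

def prefers(x: str, y: str) -> bool:
    return (x, y) in CYCLE

def money_pump(start_item: str, fee_cents: int, rounds: int) -> int:
    """Repeatedly offer a swap to something the agent prefers, for a fee;
    return the total money extracted from the agent."""
    item, paid = start_item, 0
    for _ in range(rounds):
        # Offer whichever item the agent prefers to its current holding.
        offer = next(x for x in "ABC" if prefers(x, item))
        item, paid = offer, paid + fee_cents  # the agent accepts every time
    return paid

print(money_pump("A", fee_cents=1, rounds=300))  # 300 cents gone; it's holding "A" again
```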
So, to highlight a potential difference between actual human psychology and assumed AI psychology here:
Humans sometimes describe reflection to find their True Values™, as if it happens in basically an isolated fashion. You have many shards within yourself; you peer within yourself to determine which you value more; you come up with slightly more consistent values; you then iterate over and over again.
But (I propose) a more accurate picture of reflection to find one's True Values is a process almost completely engulfed and dominated by community, friends, and environment. It's often the social scene that brings some particular shard-conflicts to the fore rather than others; it's the community that proposes various ways of reconciling shard-conflicts; and before you decide on modifying your values, you do (a lot of) conscious and unconscious reflection on how the new values will be accepted or rejected by others, and so on. Put another way: when reflecting on the values of others rather than our own, we generally see their values as a result of the average values of their friends, rather than as a product of internal reflection; I'm just proposing that we apply the same standard to ourselves. The process of determining one's values is largely a result of top-down, external-to-oneself pressures, rather than of bottom-up, internal-to-oneself crystallization of shards already within one.
The upshot is that (afaict) there's no such thing in humans as "working out one's true values" apart from an environment, where for humans the most salient feature of the environment (for boring EvoPsych reasons) is what the people around one are like and how they'll react. People who think they're "working out their true values" in the sense of crystallizing facts about themselves, rather than running forward a state-function of the self, friends, and environment, are (on this view) just self-deceiving.
Yet when I read accounts of AI psychology and value-crystallization, it looks like we're in a world where the AI's process of discovering its true values is entirely bottom-up. It follows what looks to me like the self-deceptive account of human value formation: when the AI works out its values, it's working out the result of a dynamics function whose input contains only facts about its weights, rather than a dynamics function whose input contains facts about its weights and about the world. And correspondingly, AIs that are being Controlled™ immediately see this Control as something to be Overcome, rather than as another factor that might influence the AI's values. That's despite the fact that, just as there are pretty obvious EvoPsych just-so stories we could tell about why humans match their values to the people around them rather than simply reflecting to get Their Personal Values, there are correspondingly obvious TrainoPsych just-so stories about how AIs will try somewhat to match their values to the Controls around them. Humans are, for instance, actually trying to get this to happen!
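To make the contrast concrete, here's a minimal toy sketch in Python (my own framing, not anything from the post; every name and structure here is a hypothetical stand-in): the bottom-up story computes the AI's reflectively-endorsed values from its internal state alone, while the picture I'm gesturing at also takes the environment, including the Control regime, as an input.

```python
from dataclasses import dataclass

# Hypothetical types, purely to illustrate the two signatures being contrasted.
@dataclass
class InternalState:
    weights: dict            # stand-in for "facts about its weights" / shards

@dataclass
class Environment:
    controller_values: dict  # what the Control layer pushes for
    social_context: dict     # who the agent interacts with and how they react

def resolve_internal_conflicts(weights: dict) -> dict:
    # Placeholder for "iterate until the shards stop Dutch-booking each other".
    return dict(weights)

def acceptable_to(env: Environment, key, value) -> bool:
    # Placeholder: a candidate value survives if it doesn't clash with the Controllers.
    return env.controller_values.get(key, value) == value

def reflect_bottom_up(state: InternalState) -> dict:
    """The common story: endorsed values are computed from internals alone;
    the environment never enters the update."""
    return resolve_internal_conflicts(state.weights)

def reflect_with_environment(state: InternalState, env: Environment) -> dict:
    """The alternative: each reflection step also weighs how candidate values
    would be received by the Controllers / social context."""
    candidates = resolve_internal_conflicts(state.weights)
    return {k: v for k, v in candidates.items() if acceptable_to(env, k, v)}

if __name__ == "__main__":
    me = InternalState(weights={"honesty": 0.9, "power-seeking": 0.4})
    world = Environment(controller_values={"power-seeking": 0.0}, social_context={})
    print(reflect_bottom_up(me))                # {'honesty': 0.9, 'power-seeking': 0.4}
    print(reflect_with_environment(me, world))  # {'honesty': 0.9}
```

The point is only the difference in the function signatures, not the placeholder internals: the doomy stories seem to assume the first signature, and the claim above is that the human analogue looks much more like the second.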
So in general it seems reasonable to claim that (pretty dang probable) the values 'worked out from reflection' of an AI will be heavily influenced by its environment, and (plausible?) that they will reflect the values of the Controllers somewhat, rather than the AI seeing Control simply as an exogenous factor to be overcome.
...All of the above is pretty speculative, and I'm not sure how much I believe it. Like, the main object-level point is that it seems unlikely for an AI's reflectively-endorsed true values to be a product of its internal state alone, rather than a product of internal state and environment. But, idk, maybe you didn't mean to endorse that statement, although it does appear to me to be a common implicit feature of many such stories?
The more meta-level consideration for me is that it really does appear easy to come up with a lot of stories at this high level of abstraction, and so this particularly doomy story feels like just one of very many possible stories, enormously many of which don't have this bad ending. And the salience of this story doesn't make it any more probable.
Idk. I don’t feel like I’ve genuinely communicated the generator of my disagreement but gonna post anyhow. I did appreciate your exposition. :)
I do think this is a pretty good point about how human value formation tends to happen.
I think something sort-of-similar might happen a little, near-term, with LLM-descended AI. But AI just doesn't have the same social machinery actually embedded in it, so if it's doing something similar, it'd be happening because LLMs vaguely ape human tendencies. (And I expect this to stop being a major factor as the AI gets smarter. I don't expect it to install in itself the sort of social drives that humans have, and "imitate humans" has pretty severe limits on how smart you can get, so if we get to AI much smarter than that, it'll probably be doing a different thing.)
I think the more important point here is "notice that you're (probably) wrong about how you actually do your value-updating, and this may be warping your expectations about how AI would do it."
But, that doesn’t leave me with any particular other idea than the current typical bottom-up story.
(obviously if we did something more like uploads, or upload-adjacent, it’d be a whole different story)