As you allude to by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant for intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is “How did human values evolve despite natural selection optimizing for something different and more crude?” For intent alignment, the relevance is “How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?”)
Studying the conditions for the evolution of genuinely prosocial motivations seems promising to me.
By “prosocial motivations,” I mean something like “trying to be helpful and cooperative” at least in situations where this is “low cost.” (In this sense, classical utilitarians with prosocial motivations are generally safe to be around even for those of us who don’t want to be replaced by hedonium.)
We can make some interesting observations on prosocial motivations in humans:
- Due to Elephant in the Brain issues, an aspiration to be prosocial isn’t always enough to generate prosociality as a virtue in the way that counts. Something like high metacognition + commitment to high integrity seems required as well.
- Not all people have genuinely prosocial motivations.
- People who differ from each other on prosocial motivations (and metacognition and integrity) seem to fall into “surprisingly” distinct clusters.
By the last bullet point, I mean that it seems plausible that we can learn a lot about someone’s character even in situations that are obviously “a test.” E.g., the best venture capitalists don’t often fall prey to charlatan founders. Paul Graham writes about his wife Jessica Livingston:
> I’m better at some things than Jessica, and she’s better at some things than me. One of the things she’s best at is judging people. She’s one of those rare individuals with x-ray vision for character. She can see through any kind of faker almost immediately. Her nickname within YC was the Social Radar, and this special power of hers was critical in making YC what it is. The earlier you pick startups, the more you’re picking the founders. Later stage investors get to try products and look at growth numbers. At the stage where YC invests, there is often neither a product nor any numbers.
If Graham is correct about his wife’s ability, this means that people with “shady character” sometimes fail in test situations specifically due to their character, which is strange because you’d expect that the rational strategy in these situations is “act as though you had good character.”
In humans, “perfect psychopaths” arguably don’t exist. That is, people without genuinely prosocial motivations, even when they’re highly intelligent, don’t behave the same as genuinely prosocial people in 99.9% of situations while saving their deceitful actions for the most high-stakes situations. Instead, it seems likely that they can’t help but behave in subtly suspicious ways even in situations where they’re able to guess that judges are trying to assess their character.
From the perspective of Shard Theory’s approach, it seems interesting to ask “Why is this?”
My take (inspired by a lot of armchair psychology and, even worse, armchair evolutionary psychology) is the following:
Asymmetric behavioral strategies: Even in “test situations” where the time and means for evaluation are limited (e.g., trial tasks followed by lengthy interviews), people can convey a lot of relevant information through speech. Honest strategies have some asymmetric benefits (“words aren’t cheap”). (The term “asymmetric behavioral strategies” is inspired by this comment on “asymmetric tools.”)
Pointing out others’ good qualities.
People who consistently praise others for their good qualities, even in situations where this isn’t socially advantageous, credibly signal that they don’t apply a zero-sum mindset to social situations.
Making oneself transparent (includes sharing unfavorable information).
People who consistently tell others why they behave in certain ways, make certain decisions, or hold specific views, present a clearer picture of themselves. Others can then check that picture for consistency. The more readily one shares information, the harder it is to keep lies consistent. The habit of proactive transparency also sets up a precedent: it makes it harder to suddenly shift to opacity later on, at one’s convenience.
Pointing out one’s hidden negative qualities.
One subcategory of “making oneself transparent” is when people disclose personal shortcomings even in situations where they would have been unlikely to otherwise come up. In doing so, they credibly signal that they don’t need to oversell themselves in order to gain others’ appreciation. The more openly someone discloses their imperfections, the more their honest intent and their genuine competencies will shine through.
Handling difficult interpersonal conversations on private, prosocial emotions.
People who don’t shy away from difficult interpersonal conversations (e.g., owning up to one’s mistakes and trying to resolve conflicts) can display emotional depth and maturity as well as an ability to be vulnerable. Difficult interpersonal conversations thereby serve as a fairly reliable signal of someone’s moral character (especially in real time, without practice or rehearsal) because vulnerability is hard to fake for people who aren’t in touch with emotions like guilt and shame, or are incapable of feeling them. For instance, pathological narcissists tend to lack insight into their negative emotions, whereas psychopaths lack certain prosocial emotions entirely. If people with those traits nonetheless attempt to have difficult interpersonal conversations, they risk being unmasked. (Analogy: someone who lacks a sense of smell will be unmasked when talking about the intricacies of perfumery, even if they’ve practiced faking it.)
Any individual signal can be faked. A skilled manipulator will definitely go out of their way to fake prosocial signals or cleverly spin up ambiguities in how to interpret past events. To tell whether a person is manipulative, I recommend giving relatively little weight to single examples of their behavior and focusing on the character qualities that show up the most consistently.
Developmental constraints: The way evolution works, mind designs “cannot go back to the drawing board” – single mutations cannot alter too many things at once without badly messing up the resulting design.
For instance, manipulators get better at manipulating if they have a psychology of the sort (e.g.) “high approach seeking, low sensitivity to punishment.” Developmental constraint: People cannot alter their dispositions at will.
People who self-deceive become more credible liars. Developmental tradeoff: Once you self-deceive, you can no longer go back and “unroll” what you’ve done.
Some people’s emotions might have evolved to be credible signals, making people “irrationally” interpersonally vulnerable (e.g., disposition to be fearful and anxious) or “irrationally” affected by others’ discomfort (e.g., high affective empathy). Developmental constraint: Faking emotions you don’t have is challenging even for skilled manipulators.
Different niches / life history strategies: Deceptive strategies seem to be optimized for different niches (at least in some cases). For instance, I’ve found that we can tell a lot about the character of men by looking at their romantic preferences. (E.g., if someone seeks out shallow relationship after shallow relationship and doesn’t seem to want “more depth,” that can be a yellow flag. It becomes a red flag if they’re not honest about their motivations for the relationship and if they prefer to keep the connection shallow even though the other person would want more depth.)
“No man’s land” in fitness gradients: In the ancestral environment, asymmetric tools + developmental constraints + intra-species selection pressure for character (neither too weak nor too strong) produced fitness gradients that steer towards attractors of either high honesty or high deceitfulness. From a fitness perspective, it sucks to “practice” both extremes of genuine honesty and dishonesty in the same phenotype because the strategies home in on different sides of various developmental tradeoffs. (And there are enough poor judges of character that dishonest phenotypes can mostly focus on niches where they attain high reward somewhat easily, so they don’t have to constantly expose themselves to the highest selection pressures for getting unmasked.)
Capabilities constraints (relative to the capabilities of competent judges): People who find themselves with the deceitful phenotype cannot bridge the gap and learn to act the exact same way a prosocial actor would act (but they can fool incompetent judges, or competent judges who face time constraints or information constraints). This is a limitation of capabilities: it would be different if people were more skilled learners and had better control over their psychology.
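To make the “no man’s land” idea a bit more concrete, here is a toy sketch in Python. It is only an illustration under made-up assumptions, not a model of actual human evolution: the trait `h` (share of developmental investment in the honest strategy), the payoff parameters, and the exponent `k` (returns to specialization) are all hypothetical. With `k > 1`, mixed strategies do worse than either specialist, so hill-climbing on fitness pushes a lineage to one extreme or the other depending on where it starts.

```python
def fitness(h, honest_payoff=1.0, deceit_payoff=1.0, k=2.0):
    """Toy fitness for trait h in [0, 1] (1 = fully honest phenotype).

    Assumption: k > 1 stands in for developmental tradeoffs, i.e.
    specializing in one strategy pays more than mixing both.
    """
    return honest_payoff * h**k + deceit_payoff * (1.0 - h) ** k


def fitness_gradient(h, honest_payoff=1.0, deceit_payoff=1.0, k=2.0):
    """Analytic derivative of fitness with respect to h."""
    return (honest_payoff * k * h ** (k - 1)
            - deceit_payoff * k * (1.0 - h) ** (k - 1))


def climb(h0, steps=2000, lr=0.01, **params):
    """Follow the fitness gradient from h0, keeping h inside [0, 1]."""
    h = h0
    for _ in range(steps):
        h = h + lr * fitness_gradient(h, **params)
        h = min(1.0, max(0.0, h))
    return h


if __name__ == "__main__":
    for start in (0.2, 0.4, 0.6, 0.8):
        print(f"start={start:.1f} -> end={climb(start):.1f}")
    print("fitness at the midpoint:", fitness(0.5))
    print("fitness at the extremes:", fitness(0.0), fitness(1.0))
```

In this sketch the midpoint is an unstable equilibrium: starting slightly on the honest side ends at full honesty, and starting slightly on the deceitful side ends at full deceitfulness. Changing the two payoff parameters shifts where that tipping point lies, which is one way to think about selection pressure for character being “neither too weak nor too strong.”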
In the context of training TAI systems, we could attempt to recreate these conditions and select for integrity and prosocial motivations. One difficulty here lies in recreating the right “developmental constraints” and in keeping the relative capabilities of judges and to-be-evaluated agents in balance. (Humans presumably went through an evolutionary arms race related to assessing each other’s competence and character, which means that people were always surrounded by judges of similar intelligence.)
Lastly, there’s a problem where, if you dial up capabilities too much, it becomes increasingly easy to “fake everything.” (For the reasons Ajeya explains in her account of deceptive alignment.)
(If anyone is interested in doing research on the evolution of prosociality vs antisocialness in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I haven’t fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
I encourage applicants to also read Quintin’s Evolution is a bad analogy for AGI (which I wish more people had read; I think it’s quite important). I think that evolution-based analogies can easily go astray, for reasons pointed out in the essay. (To be clear, it wasn’t obvious to me that you went astray in your comment; I’m noting this more for other readers.)