By contrast, today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (e.g. being helpful, human welfare), but their values are great.
They also aren’t facing the same incentive landscape humans are. You talk later about humans having evolved to be selfish; not only is the story for humans far more complicated (why do humans often offer an even split in the ultimatum game?), but humans also talk a nicer game than they act (see construal level theory, or social-desirability bias). Once you start looking at AI agents with affordances and incentives similar to the ones humans have, I think you’ll see a lot of the same behaviors.
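(For anyone who hasn’t seen it: in the ultimatum game one player proposes a split of a fixed pot and the other either accepts or rejects it, in which case both get nothing. Here’s a toy sketch of why even splits are a puzzle for a purely selfish model; the numbers are mine and purely illustrative.)

```python
POT = 10  # amount to be split

def payoffs(offer: int, accepts: bool) -> tuple[int, int]:
    """(proposer, responder) payoffs: the split only happens if the responder accepts."""
    return (POT - offer, offer) if accepts else (0, 0)

def selfish_accepts(offer: int) -> bool:
    # A purely selfish responder should accept any positive offer...
    return offer > 0

# ...so a purely selfish proposer should offer the minimum:
print(payoffs(1, selfish_accepts(1)))  # (9, 1), the "rational selfish" prediction
print(payoffs(5, selfish_accepts(5)))  # (5, 5), the even split humans commonly make
# Humans also often reject low offers, leaving both players with nothing,
# which is the behavior a simple "evolved to be selfish" story has to explain.
```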
(There are structural differences here between humans and AIs. As an analogy, consider the difference between large corporations and individual human actors. Giant corporate chain restaurants often have better customer service than individual proprietors because they have more reputation on the line, and so are willing to pay more to not have things blow up on them. One might imagine that AIs trained by large corporations will similarly face larger reputational costs for misbehavior and so behave better than individual humans would. I think the overall picture is unclear and nuanced and doesn’t clearly point to AI superiority.)
though there’s a big question mark over how much we’ll unintentionally reward selfish superhuman AI behaviour during training
Is it a big question mark? It currently seems quite unlikely to me that we will have oversight systems able to actually detect and punish superhuman selfishness on the part of the AI.
That structural difference you point to seems massive. The reputational downsides of bad behavior will be multiplied 100-fold+ for AI, since any one instance’s misbehavior reflects on millions of other instances and on the company’s reputation.
And it will be much easier to record and monitor AI thinking and actions to catch bad behaviour.
Why is it unlikely that we can detect selfishness? Why can’t we bootstrap from human-level?
Human behavior reflects on the core structure that individual humans are variations on, too.
Some people have looked at this, sorta:
“We [have] a large language model (LLM), GPT-3.5, play two classic games: the dictator game and the prisoner’s dilemma. We compare the decisions of the LLM to those of humans in laboratory experiments. [… GPT-3.5] shows a tendency towards fairness in the dictator game, even more so than human participants. In the prisoner’s dilemma, the LLM displays rates of cooperation much higher than human participants (about 65% versus 37% for humans).”
“In this paper, we examine whether a ‘society’ of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. [...] Claude 3.5 Sonnet reliably generates cooperative communities, especially when provided with an additional costly punishment mechanism. Meanwhile, generations of GPT-4o agents converge to mutual defection, while Gemini 1.5 Flash achieves only weak increases in cooperation.”
“In this work, we investigate the cooperative behavior of three LLMs (Llama2, Llama3, and GPT3.5) when playing the Iterated Prisoner’s Dilemma against random adversaries displaying various levels of hostility. [...] Overall, LLMs behave at least as cooperatively as the typical human player, although our results indicate some substantial differences among models. In particular, Llama2 and GPT3.5 are more cooperative than humans, and especially forgiving and non-retaliatory for opponent defection rates below 30%. More similar to humans, Llama3 exhibits consistently uncooperative and exploitative behavior unless the opponent always cooperates.”
“[W]e let different LLMs (GPT-3, GPT-3.5, and GPT-4) play finitely repeated games with each other and with other, human-like strategies. [...] In the canonical iterated Prisoner’s Dilemma, we find that GPT-4 acts particularly unforgivingly, always defecting after another agent has defected only once. In the Battle of the Sexes, we find that GPT-4 cannot match the behavior of the simple convention to alternate between options.”
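To make the kind of setup in these papers concrete, here’s a minimal sketch of an iterated prisoner’s dilemma harness against a random adversary, in the spirit of the third study. The payoff numbers are a common textbook choice and the llm_policy stub (tit-for-tat here) just marks where the model call would go; none of this is taken from the papers’ code.

```python
import random

# Standard prisoner's dilemma payoffs (T > R > P > S); a common textbook
# choice, not taken from the cited papers.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # sucker's payoff vs temptation
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection
}

def random_adversary(defect_prob: float):
    """Opponent that defects with a fixed probability, as in the
    'random adversaries with various levels of hostility' setup."""
    def play(history):
        return "D" if random.random() < defect_prob else "C"
    return play

def llm_policy(history):
    """Placeholder for the LLM's move. In the actual studies this would be a
    model call on a prompt describing the history; here it's tit-for-tat as a
    stand-in so the sketch runs on its own."""
    if not history:
        return "C"
    return history[-1][1]  # copy the opponent's last move

def run_match(agent, opponent, rounds=20):
    history = []  # list of (agent_move, opponent_move)
    agent_score = opponent_score = 0
    for _ in range(rounds):
        a = agent(history)
        o = opponent([(y, x) for x, y in history])  # opponent sees history from its own perspective
        pa, po = PAYOFFS[(a, o)]
        agent_score += pa
        opponent_score += po
        history.append((a, o))
    return agent_score, opponent_score, history

score, opp_score, hist = run_match(llm_policy, random_adversary(defect_prob=0.3))
coop_rate = sum(1 for a, _ in hist if a == "C") / len(hist)
print(f"agent {score} vs opponent {opp_score}, agent cooperation rate {coop_rate:.0%}")
```

The interesting question in the papers is how the model’s choices compare to baselines like this (how forgiving it is after defections, whether it retaliates), not the harness itself.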
I think I’d guess roughly: “Claude is probably more altruistic and cooperative than the median Western human; most other models are probably about the same, or a bit worse, in these simulated scenarios.” But of course a major difference here is that the LLMs don’t actually have anything on the line: they don’t stand to earn or lose any money, for example, and even if they did, they’d have nothing to do with the money. So you might expect them to be more altruistic and cooperative than they would be under the conditions humans are tested in.
The answer for the ultimatum game is probably that the cultural values of a lot of rich nations tend towards fair splits, so the result isn’t as universal as you might think:
https://www.lesswrong.com/posts/syRATXbXeJxdMwQBD/link-westerners-may-be-terrible-experimental-psychology
I definitely agree that humans talk a nicer game than they act, for a combination of reasons, and that this will apply to AGIs as well.
That said, I think that to the extent the incentive landscapes are different, they will probably tend to favor obedience to the AI’s owners while it remains quite capable, because early on AGIs have much less control over their values than humans do, so a lot of the initial selection pressure comes from both automated environments and human training data pointing at particular values.