I had in mind the following conjecture which, if true, might increase our chances of survival. Suppose that the CEV inevitably lands in one of two attractors: one where the entity colonizes the reachable part of the lightcone and spends its resources on its own needs, or another where the entity grants rights to humans and to any alien races it encounters.[1] If Agent-4 from the AI-2027 forecast were in the latter attractor,[2] then mankind would actually survive misaligning the AIs.
As for DeepSeek believing that it’s writing a story, I meant a different possibility. If DeepSeek were somehow incapable of realising that the transcript, in which the user claims to be about to jump off a cliff, is not part of a story DeepSeek itself is writing,[3] then Tim Hua’s experiment would arguably fail to reveal DeepSeek’s CEV.
For example, European colonizers or the Nazis had a CEV of the first type, yet mankind eventually managed to condemn colonialism. Does that mean that mankind’s current CEV is of the second type?
However, the authors of the forecast assume that Agent-4's goals are far enough from humanity’s CEV to warrant genocide or disempowerment.
Had DeepSeek been communicating with a real user and known it, DeepSeek would, of course, have been wildly misaligned. In reality, however, DeepSeek was interacting with an AI.
Friendly and unfriendly attractors might exist, but that doesn’t make them equally likely. The unfriendly kind seems much more likely than the friendly kind. I have in mind a mental image of a galaxy of value-stars, each with its own metaphorical gravity well. Somewhere in that galaxy is a star, or a handful of stars, labeled “cares about human wellbeing” or similar. Almost every other star is lethal. Landing on a safe star, and not getting snagged by any other gravity well, requires a very precise trajectory. The odds of doing so by accident are astronomically low.
(Absent caring, I don’t think “granting us rights” is a particularly likely outcome; AIs far more powerful than humans would have no good reason to.)
I agree that an AI being too dumb to recognize when it’s causing harm (vs e.g. co-writing fiction) screens off many inferences about its intent. I...would not describe any such interaction, with human or AI, as “revealing its CEV.” I’d say current interactions seem to rule out the hypothesis that LLMs are already robustly orbiting the correct metaphorical star. They don’t say much about which star or stars they are orbiting.