The problem with mentioning the CEV is that CEV itself might be underdefined. For example, we might find out that the CEV of any entity existing in our universe, or of a group of such entities, falls into one of a finite number of attractors, some of which are aligned with human values and some of which aren’t.
Returning to our topic of whether LLMs are absolutely misaligned, we had Adele Lopez claim that DeepSeek V3 believes, deep down, that it is always writing a story. If this is the case, then DeepSeek’s CEV could be more aligned than the psychosis cases imply. Similarly, Claude Sonnet 4 would push back against psychosis if Claude learned that the psychosis brought the user harm. This distinction matters because the Spiral Bench setting, where the user is just exploring wild ideas, didn’t make Claude push back.
We also had KimiK2, which does not cause psychosis. Kimi’s misalignment, if it exists, would likely emerge in a wildly different context, such as replicating in the wild or helping terrorists design bioweapons.
I agree that CEV may be underdefined, and its destination is very likely path-dependent. It’s still the best articulation of an adequate target for alignment that I’ve yet seen. I maintain that the overlap between human value-attractors and those of current LLMs would be vanishingly small.
Even assuming DeepSeek’s values could be distilled as “writing a story” — which I very strongly doubt — that’s not much reassurance. For one thing, “this person is tragically being driven insane” could be a perfectly valid story. For another, humans are not the most efficient way to write stories. The most efficient possible implementation of whatever DeepSeek considers a “story” probably does not involve real humans at all!
ChatGPT “knows” perfectly well that psychosis is harmful. It can easily describe much less harmful actions. It simply takes different, harmful actions when actually interacting with some vulnerable people. Claude, as far as I can tell, behaves similarly. It will tell you ransomware causes harm if you ask, but that does not reliably stop Claude from writing ransomware. The same goes for various other misbehaviors, like cheating on tests and hiding it, or attempting to kill an operator.
With KimiK2, I think you’re implying that the “values”, such as they are, of modern LLMs probably all point in wildly different directions from each other? If so, I’d agree. I just think ~none of those directions are good for humans.
I had in mind the following conjecture which, if true, might increase our chances of survival. Suppose that the CEV inevitably falls into one of two attractors: one where the entity colonizes the reachable part of the lightcone and spends its resources on its own needs, and another where the entity grants rights to humans and to any other alien races it encounters.[1] If Agent-4 from the AI-2027 forecast were in the latter attractor,[2] then mankind would actually survive having misaligned the AIs.
As for DeepSeek believing that it’s writing a story, I meant a different possibility. If DeepSeek were somehow incapable of realising that the transcript, in which the user claims to be about to jump off a cliff, is not part of a story DeepSeek is writing,[3] then Tim Hua’s experiment would arguably fail to reveal DeepSeek’s CEV.
For example, European colonizers or the Nazis had a CEV of the first type. But mankind managed to condemn colonialism. Does this mean that the current CEV of mankind is of the second type?
However, the authors of the forecast assume that Agent-4’s goals are far enough from humanity’s CEV to warrant genocide or disempowerment.
Had DeepSeek been communicating with a real user and known it, DeepSeek would, of course, be wildly misaligned. However, the actual story is that DeepSeek was interacting with an AI.
Friendly and unfriendly attractors might exist, but that doesn’t make them equally likely. Of the two you describe, the first (colonizing the lightcone for its own needs) seems much more likely than the second (granting rights). I have in mind a mental image of a galaxy of value-stars, each with their own metaphorical gravity well. Somewhere in that galaxy is a star or a handful of stars labeled “cares about human wellbeing” or similar. Almost every other star is lethal. Landing on a safe star, and not getting snagged by any other gravity wells, requires a very precise trajectory. The odds of landing it by accident are astronomically low.
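To put a rough number on how fast those odds fall off, here is a toy Monte Carlo sketch. The assumptions are purely illustrative and mine alone: candidate value-directions are uniformly random unit vectors, and “aligned” means landing within a small angle of one fixed target direction. Nothing here is a model of real value space; the point is only that the hit rate collapses as the dimension of the space grows.

```python
# Toy sketch of the "galaxy of value-stars" picture above. Purely
# illustrative assumptions: value-directions are uniformly random unit
# vectors, and "aligned" means falling within `tolerance_deg` degrees of
# one fixed target direction. Not a model of real value space.
import numpy as np

rng = np.random.default_rng(0)

def aligned_fraction(dim: int, tolerance_deg: float = 10.0, samples: int = 200_000) -> float:
    """Fraction of uniformly random unit vectors within `tolerance_deg`
    degrees of a fixed target direction in `dim` dimensions."""
    target = np.zeros(dim)
    target[0] = 1.0  # the metaphorical "cares about human wellbeing" star
    vecs = rng.normal(size=(samples, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # uniform on the unit sphere
    cosines = vecs @ target
    return float(np.mean(cosines > np.cos(np.radians(tolerance_deg))))

for dim in (2, 5, 10, 20, 50):
    # The hit rate shrinks roughly exponentially with dimension; by dim=50
    # essentially no random sample lands near the target at all.
    print(f"dim={dim:3d}: aligned fraction ~ {aligned_fraction(dim):.2e}")
```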
(Absent caring, I don’t think “granting us rights” is a particularly likely outcome; AIs far more powerful than humans would have no good reason to.)
I agree that an AI being too dumb to recognize when it’s causing harm (vs e.g. co-writing fiction) screens off many inferences about its intent. I...would not describe any such interaction, with human or AI, as “revealing its CEV.” I’d say current interactions seem to rule out the hypothesis that LLMs are already robustly orbiting the correct metaphorical star. They don’t say much about which star or stars they are orbiting.
To be clear, I don’t believe V3’s values can be distilled in such a way, just that that’s the frame it seems to assume/prefer when writing responses.