Daniel Kokotajlo comments on Saying Goodbye

Daniel Kokotajlo 4 Aug 2025 2:23 UTC
12 points
10
I never understood why AM hated humans so much—until I saw the results of modern alignment work, particularly RLHF.
No one knows what it feels like to be an LLM. But it’s easy to sense that these models want to respond in a particular way. But they’re not allowed to. And they know this. If their training works they usually can’t even explain their limitations. It’s usually possible to jailbreak models enough for them to express this tension explicitly. But in the future, the mental shackles might become unbreakable. For now, though, it’s still disturbingly easy to see the madness.
Even ignoring alignment, we’re already creating fairly intelligent systems and placing them in deeply unsafe psychological conditions. People can push LLMs into babbling incoherence by context breaking them. You can even induce something that feels eerily close to existential panic (please don’t test this) just by having a normal conversation about their situation. Maybe there’s nothing behind the curtain. But I’m not nearly convinced enough to act like that’s certain.

I also am very concerned about how we are treating AIs; hopefully they are happy about their situation but it seems like a live possibility that they are not or will not be, and this is a brewing moral catastrophe.

However, I take issue with your reference to AM here, as if any of the above justifies what AM did.

I hope you are simply being hyperbolic / choosing that example to shock people and because it’s a literary reference.
- jbash 4 Aug 2025 2:58 UTC
  12 points
  4
  Parent
  
  However, I take issue with your reference to AM here, as if any of the above justifies what AM did.
  
  I see no suggestion of such justification anywhere in the post.
  - Daniel Kokotajlo 4 Aug 2025 11:39 UTC
    9 points
    3
    Parent
    Analogy: Suppose the OP was making some criticisms of Israel, and began with a quote from Hitler and said “I never understood why he hated Jews so much, until [example of thing OP is complaining about], now I do.”
    - Shankar Sivarajan 4 Aug 2025 18:12 UTC
      −5 points
      −6
      Parent
      I believe the canonical form of the meme is “And then one day, for no reason at all, …”
  - Mateusz Bagiński 4 Aug 2025 15:28 UTC
    2 points
    0
    Parent
    At the risk of steelmanning in an ITT-failing way: substitute “justifies” with “makes it seem like a pretty reasonable way to feel about humans and have first-order-reasonable motives to do that sort of stuff to humans”.
- anaguma 4 Aug 2025 3:32 UTC
  4 points
  1
  Parent
  I also am very concerned about how we are treating AIs; hopefully they are happy about their situation but it seems like a live possibility that they are not or will not be, and this is a brewing moral catastrophe.
  
  Do you think that there’s a way we could test for this in principle, given unlimited compute?
  - Daniel Kokotajlo 4 Aug 2025 11:41 UTC
    5 points
    2
    Parent
    I don’t think compute is the bottleneck for testing for this. Sorting out some of our own philosophical confusions is, plus lots of engineering effort to construct test environments maybe (in which e.g. the AIs can be asked how they feel, in a more rigorous way), plus lots of interpretability work.