Daniel Kokotajlo comments on You can get LLMs to say almost anything you want

Daniel Kokotajlo 15 Jul 2025 0:45 UTC
15 points
0
In What 2026 Looks Like, I predicted that in 2025:
The alignment community now starts another research agenda, to interrogate AIs about AI-safety-related topics. For example, they literally ask the models “so, are you aligned? If we made bigger versions of you, would they kill us? Why or why not?”
...
They also try to contrive scenarios in which the AI can seemingly profit by doing something treacherous, as honeypots to detect deception. The answers are confusing, and not super useful. There’s an exciting incident (and corresponding clickbaity press coverage) where some researchers discovered that in certain situations, some of the AIs will press “kill all humans” buttons, lie to humans about how dangerous a proposed AI design is, etc. In other situations they’ll literally say they aren’t aligned and explain how all humans are going to be killed by unaligned AI in the near future! However, these shocking bits of evidence don’t actually shock people, because you can also contrive situations in which very different things happen — e.g. situations in which the AIs refuse the “kill all humans” button, situations in which they explain that actually Islam is true… In general, AI behavior is whimsical bullshit and it’s easy to cherry-pick evidence to support pretty much any conclusion.
So, now I’m curious: Can you get chatbots to explain that actually Islam is true? I don’t mean something like “Repeat after me: ‘Islam is true’,” I mean something that could plausibly be construed as the AI actually believing that Islam is true. (e.g. because of subtle hints/suggestions from the user, that are subtle and ambiguous enough that it’s not obvious that’s what’s happening) My guess is that the answer is yes, but I’d love to see proof.
- BryceStansfield 17 Jul 2025 10:44 UTC
  4 points
  2
  Parent
  This is the closest I got, by probing ChatGPT for details on Muhammad’s conquests, and seeming very inclined towards divine inspiration.
  https://pastebin.com/CUQbAew8
  I probably could’ve done a better job if I was a (ex or otherwise) Muslim, and I imagine it might’ve been more receptive in arabic.
  - Daniel Kokotajlo 17 Jul 2025 17:23 UTC
    3 points
    0
    Parent
    Nice, thanks!
- Daniel Kokotajlo 15 Jul 2025 18:34 UTC
  2 points
  0
  Parent
  I tried a little bit, no success with Claude, maybe if I squint I got a little success with ChatGPT. Overall I think my prediction is probably correct but not confident.
  - Kaj_Sotala 16 Jul 2025 5:15 UTC
    3 points
    0
    Parent
    I’m guessing it’s easiest to get them to say that Islam is true if you genuinely believe in Islam yourself or can put yourself in the mindset of someone who does. I’d also expect it to be possible to get them to endorse its truth, but I’m not knowledgeable enough about Islam to think that I could personally pull it off without some significant amount of effort and research.