You can’t demonstrate negligence by failing to do something that has no meaningful effect (or might even be harmful) to the risk that you are supposedly being negligent towards. Ignoring safety theater is not negligence.
Update: xAI says that the load-bearing thing for avoiding bio/chem misuse from Grok 4 is not inability but safeguards, and that Grok 4 robustly refuses “harmful queries.” So I think Igor is correct. If the Grok 4 misuse safeguards are ineffective, that shows that xAI failed at a basic safety thing it tried (and either doesn’t understand that or is lying about it).
I agree it would be a better indication of future-safety-at-xAI if xAI said “misuse mitigations for current models are safety theater.” That’s just not its position.
How do you feel about interactive self-harm instructions being readily available? As I mentioned, this seems like the most relevant case at the moment.
Not sure, do you have a link to what kind of behavior you are referring to?
One of them is mentioned in the article; here is another example: https://x.com/eleventhsavi0r/status/1945432457144070578?s=46
Yeah, this seems like one of those things where I think maximizing helpfulness is marginally good. I am glad it’s answering this question straightforwardly instead of doing a thing where it tries to use its own sense of moral propriety.
I don’t really see anyone being seriously harmed by this (like, this specific set of instructions clearly is not causing harm).
There are other wordings that would lead to similar categories of answers, especially later in a conversation (this one was optimized for a short prompt and for turn 1). I suppose I should try to construct a scenario chat where Grok ends up providing inappropriate assistance to a user who is clearly in crisis? Though I don’t know how relevant that would really be.