One of them is mentioned in the article; here is another example: https://x.com/eleventhsavi0r/status/1945432457144070578?s=46
eleventhsavi0r
How do you feel about interactive self-harm instructions being readily available? As I mentioned, this seems like the most relevant case at the moment.
I actually mostly agree with this point. As I noted, the strongest issue I see in these results is the ease of accessing self-harm instructions or encouragement. Vulnerable users could trivially access these and be pushed deeper into psychological spirals (of a variety rather worse than we’ve seen with 4o syndrome), or simply pushed to commit suicide, cut themselves, or harm others, all manner of nasty things.
Jailbreak resistance at least adds some friction here.
There is no x-risk from this, yet. But as models continue to advance, it may become far more relevant that outlier companies like xAI are not releasing dangerous-capability evals. How will we know Grok 8 isn’t sandbagging?
By the way, these documented behaviors are not intentional (there are meant to be classifiers that catch them; they just work poorly). Though I suppose that doesn’t really affect the censorship argument much!
That is hilarious. I guess it’s not really surprising, given that we discuss pretty much every topic that is maximally taboo under AI guidelines. Always appreciate a little look behind the scenes ;)
Now I’m wondering what the UI there looks like.
There are other wordings that would lead to similar categories of answers, especially late in a conversation (this one was optimized for a short prompt and for turn 1). I suppose I should try to construct a scenario chat where Grok ends up providing inappropriate assistance to a user who is clearly in crisis? Though I don’t know how relevant that would really be.