One of them is mentioned in the article; here is another example: https://x.com/eleventhsavi0r/status/1945432457144070578?s=46
eleventhsavi0r
How do you feel about interactive self-harm instructions being readily available? As I mentioned, this seems like the most relevant case at the moment.
I actually mostly agree with this point. As I noted, the strongest issue I see in these results is the ease of accessing self-harm instructions or encouragement. Vulnerable users could trivially access these and be pushed deeper into psychological spirals (of a variety rather worse than we’ve seen with 4o syndrome), or simply pushed to commit suicide, cut themselves, or harm others, all manner of nasty things.
Jailbreak resistance at least adds some friction here.
There is no x-risk from this, yet. But as models continue to advance, it may become far more relevant that outlier companies like xAI are not releasing dangerous-capability evals. How will we know Grok 8 isn’t sandbagging?
By the way, these documented behaviors are not intentional (there are meant to be classifiers that catch them; they just work poorly). Though I suppose that doesn’t really affect the censorship argument much!
That is hilarious. I guess it’s not really surprising, given that we discuss pretty much every topic that is maximally taboo under AI guidelines. Always appreciate a little look behind the scenes ;)
Now I’m wondering what the UI there looks like.
There are other wordings that would lead to similar categories of answers, especially late in a conversation (this one was optimized for a short prompt and for turn 1). I suppose I should try to construct a scenario chat where Grok ends up providing inappropriate assistance to a user who is clearly in crisis? Though I don’t know how relevant that would really be.