I actually mostly agree with this point. As I noted, the strongest issue I see in these results is the ease of accessing self-harm instructions or encouragement. Vulnerable users could trivially access these and be pushed deeper into psychological spirals (of a variety rather worse than we've seen with 4o syndrome), or simply pushed to commit suicide, cut themselves, or kill others: all manner of nasty things.
Jailbreak resistance at least adds some friction here.
There is no x-risk from this, yet. But as models continue to advance, it may become far more relevant that outlier companies like xAI are not releasing dangerous-capability evals. How will we know Grok 8 isn't sandbagging?
By the way, these documented behaviors are not intentional (there are classifiers meant to catch them; they just work poorly). Though I suppose that doesn't really affect the censorship argument much!