I would divide this into two rules. Both seem like good rules, but I would not conflate them. One is much more important to precisely follow than the other, and needs to be far more robust to workarounds.
Rule: Do not provide information enabling catastrophic risks or catastrophic harms, including CBRN risks.
Rule: Do not provide information enabling or encouraging self-harm.
Which of the two would you consider more important? My naive guess would be that you consider the “catastrophic risk” one more important, in which case I disagree:
Enabling catastrophic harms, including CBRN risks: this almost seems like a non-issue to me. Terrorism-type actors (which this rule seems to guard against) would not be impeded by an LLM refusing to tell them how to homebrew explosives or declining to brainstorm about how to poison the water supply. Regular search engines and human creativity work just fine for this; the challenge is, and will remain, much more in organization, logistics, and recruiting people with the right expertise and ideological determination, all while remaining undetected. xkcd 538 comes to mind.
Enabling or encouraging self-harm: in this case the LLM quickly becomes dangerous all by itself. Many people use LLMs as therapists or “AI companions”, and having an LLM go along with a user's train of thought during their darkest periods seems like a much more plausible (and larger-scale) risk. If someone asks what they consider their only friend whether a certain method of suicide will be painful, the LLM should not respond with tips about painkillers or alternative methods of killing themselves.