Do we think that it’s a problem that “AI Safety” has been popularised by LLM companies to mean basically content restrictions? Like it just seems conducive to fuzzy thinking to lump in “will the bot help someone build a nuclear weapon?” with “will the bot infringe copyright or write a sex scene?”
In fact, imo, bots have been made more harmful by chasing this definition of safety. The summarisation bots being promoted in scientific research are the way they are (e.g. prone to giving people subtly the wrong idea even when working well) in part because of work that’s gone into avoiding the possibility that they reproduce copyrighted material. So they’ve got to rephrase, and that’s where the subtle inaccuracies creep in.
This “mundane and imaginary harms” approach is so ass.
More effort goes into preventing LLMs from writing sex scenes, saying slurs or talking about Tiananmen Square protests than into addressing the hard problem of alignment. Or even into solving the important parts of the “easy” problem, like sycophancy or reward hacking.
And don’t get me started on all the “copyright” fuckery. If “copyright” busybodies all died in a fire, the world would be a better place for it.
I’m not sure how true it is that “AI Safety” has been popularised by LLM companies to mean basically content restrictions.
We’re working on a descriptive lit review of research labeled as “AI safety,” and it seems to us the issue isn’t that the very horrible X-risk stuff is ignored, but that everything is “lumped in” like you said.
I think if I’d been clearer, I would have said it seems “lumped in” in the research; but for the public, who don’t know much about the X-risk stuff, “safety” means “content restrictions and maybe data protection”.
Do LLMs themselves internalise a definition of AI safety like this? A quick check of Claude 4 Sonnet suggests no (but it comes from the company most steeped in the X-risk paradigm, so...)
IME no, not really—but they do call content filters “my safety features” and this is the most likely/common context in which “safety” will come up with average users. (If directly asked about safety, they’ll talk about other things too, but they’ll lump it all in together and usually mention the content filters first.)