Safety is actually more of a thing than you might guess if you read a lot from Zvi or Lesswrong. There’s a large number of people working to develop safety systems. Given the nature of OpenAI, I saw more focus on practical risks (hate speech, abuse, manipulating political biases, crafting bio-weapons, self-harm, prompt injection) than theoretical ones (intelligence explosion, power-seeking). That’s not to say that nobody is working on the latter, there’s definitely people focusing on the theoretical risks. But from my viewpoint, it’s not the focus. Most of the work which is done isn’t published, and OpenAI really should do more to get it out there.
This makes me wonder: what’s the main bottleneck that keeps them from publishing this safety research? Unlike capabilities research, it’s possible to publish most of this work without giving away model secrets, as Anthropic has shown. It would also have a positive impact on the public perception of OpenAI, at least in LW-adjacent communities. Is it nevertheless about a fear of leaking information to competitors? Is it about the time cost involved in writing a paper? Something else?
most of the x-risk relevant research done at openai is published? the stuff that’s not published is usually more on the practical risks side. there just isn’t that much xrisk stuff, period.
I don’t know or have any way to confirm my guesses, so I’m interested in evidence from the lab. But I’d guess >80% of the decision force is covered by the set of general patterns of:
what they consider to be safety work also produces capability improvements or even worsens dual use, eg by making models more obedient, and so they don’t want to give it to competitors.
the safety work they don’t publish contains things they’re trying to prevent the models from producing in the first place, so it’d be like asking a cybersecurity lab to share malware samples—they might do it, sometimes they might consider it a very high priority, but maybe not all their malware samples or sharing right when they get them. might depend on how bad the things are and whether the user is trying to get the model to do the thing they want to prevent, or if the model is spontaneously doing a thing.
they consider something to be safety that most people would disagree is safety, eg preventing the model from refusing when asked to help with some commonly-accepted ways of harming people, and admitting this would be harmful to PR.
they on net don’t want critique on their safety work, because it’s in some competence/caring-bottlenecked way lesser than they expect people expect of them, and so would put them at risk of PR attacks. I expect this is a major force that at least some in some labs org either don’t want to admit, or do want to admit but only if it doesn’t come with PR backlash.
it’s possible to make their safety work look good, but takes a bunch of work, and they don’t want to publish things that look sloppy even if insightful, eg because they have a view where most of the value of publishing is reputational.
openai explicitly encourages safety work that also is useful for capabilities. people at oai think of it as a positive attribute when safety work also helps with capabilities, and are generally confused when i express the view that not advancing capabilities is a desirable attribute of doing safety.
i think we as a community has a definition of the word safety that diverges more from the layperson definition than the openai definition does. i think our definition is more useful to focus on for making the future go well, but i wouldn’t say it’s the most accepted one.
i think openai deeply believes that doing things in the real world is more important than publishing academic things. so people get rewarded for putting interventions in the world than putting papers in the hands of academics.
I imagine that publishing any X-risk-related safety work draws attention to the whole X-risk thing, which is something OpenAI in particular (and the other labs as well to a degree) have been working hard to avoid doing. This doesn’t explain why they don’t publish mundane safety work though, and in fact it would predict more mundane publishing as part of their obfuscation strategy.
i have never experienced pushback when publishing research that draws attention to xrisk. it’s more that people are not incentivized to work on xrisk research in the first place. also, for mundane safety work, my guess is that modern openai just values shipping things into prod a lot more than writing papers.
(I did experience this at OpenAI in a few different projects and contexts unfortunately. I’m glad that Leo isn’t experiencing it and that he continues to be there)
I acknowledge that I probably have an unusual experience among people working on xrisk things at openai. From what I’ve heard from other people I trust, there probably have been a bunch of cases where someone was genuinely blocked from publishing something about xrisk, and I just happen to have gotten lucky so far.
it’s also worth noting that I am far in the tail ends of the distribution of people willing to ignore incentive gradients if I believe it’s correct not to follow them. (I’ve gotten somewhat more pragmatic about this over time, because sometimes not following the gradient is just dumb. and as a human being it’s impossible not to care a little bit about status and money and such. but I still have a very strong tendency to ignore local incentives if I believe something is right in the long run.) like I’m aware I’ll get promoed less and be viewed as less cool and not get as much respect and so on if I do the alignment work I think is genuinely important in the long run.
I’d guess for most people, the disincentives for working on xrisk alignment make openai a vastly less pleasant place. so whenever I say I don’t feel like I’m pressured not to do what I’m doing, this does not necessarily mean the average person at openai would agree if they tried to work on my stuff.
Publishing anything is a ton of work. People don’t do a ton of work unless they have a strong reason, and usually not even then.
I have lots of ideas for essays and blog posts, often on subjects where I’ve done dozens or hundreds of hours of research and have lots of thoughts. I’ll end up actually writing about 1⁄3 of these, because it takes a lot of time and energy. And this is for random substack essays. I don’t have to worry about hostile lawyers, or alienating potential employees, or a horde of Twitter engagement farmers trying to take my words out of context.
I have no specific knowledge, but I imagine this is probably a big part of it.
I think the extent to which it’s possible to publish without giving away commercially sensitive information depends a lot on exactly what kind of “safety work” it is. For example, if you figured out a way to stop models from reward hacking on unit tests, it’s probably to your advantage to not share that with competitors.
If we see less contents in these sections, one possibility is increased legal regulations that may make publication tricky (imagine an extreme case, companies have sincere intents to produce some numbers that is not indication of harm yet, these preliminary or signal numbers could be used in an “abused” way in legal dispute, may be for profitability reasons/”lawyers need to win cases” reasons). To remove sensitive information, it would comes down then to time cost involved in paper writing, and time cost in removing sensitive information. And interestingly, political landscape could steer companies away from being more safety focused. I do hope there could be a better way to resolve this, providing more incentives for companies to report and share mitigations and measurements.
As far as I know, safety tests usually are used for internal decision making at least for releases etc.
off the cuff take, it seems unclear whether publishing the alignment faking paper makes future models slightly likely to write down their true thoughts on the “hidden scratchpad,” seems likely that they’re smart enough to catch on. I imagine there are other similar projects like this.
Why do frontier labs keep a lot of their safety research unpublished?
In Reflections on OpenAI, Calvin French-Owen writes:
This makes me wonder: what’s the main bottleneck that keeps them from publishing this safety research? Unlike capabilities research, it’s possible to publish most of this work without giving away model secrets, as Anthropic has shown. It would also have a positive impact on the public perception of OpenAI, at least in LW-adjacent communities. Is it nevertheless about a fear of leaking information to competitors? Is it about the time cost involved in writing a paper? Something else?
most of the x-risk relevant research done at openai is published? the stuff that’s not published is usually more on the practical risks side. there just isn’t that much xrisk stuff, period.
Do you currently work at OpenAI?
i wouldn’t comment this confidently if i didn’t
I don’t know or have any way to confirm my guesses, so I’m interested in evidence from the lab. But I’d guess >80% of the decision force is covered by the set of general patterns of:
what they consider to be safety work also produces capability improvements or even worsens dual use, eg by making models more obedient, and so they don’t want to give it to competitors.
the safety work they don’t publish contains things they’re trying to prevent the models from producing in the first place, so it’d be like asking a cybersecurity lab to share malware samples—they might do it, sometimes they might consider it a very high priority, but maybe not all their malware samples or sharing right when they get them. might depend on how bad the things are and whether the user is trying to get the model to do the thing they want to prevent, or if the model is spontaneously doing a thing.
they consider something to be safety that most people would disagree is safety, eg preventing the model from refusing when asked to help with some commonly-accepted ways of harming people, and admitting this would be harmful to PR.
they on net don’t want critique on their safety work, because it’s in some competence/caring-bottlenecked way lesser than they expect people expect of them, and so would put them at risk of PR attacks. I expect this is a major force that at least some in some labs org either don’t want to admit, or do want to admit but only if it doesn’t come with PR backlash.
it’s possible to make their safety work look good, but takes a bunch of work, and they don’t want to publish things that look sloppy even if insightful, eg because they have a view where most of the value of publishing is reputational.
openai explicitly encourages safety work that also is useful for capabilities. people at oai think of it as a positive attribute when safety work also helps with capabilities, and are generally confused when i express the view that not advancing capabilities is a desirable attribute of doing safety.
i think we as a community has a definition of the word safety that diverges more from the layperson definition than the openai definition does. i think our definition is more useful to focus on for making the future go well, but i wouldn’t say it’s the most accepted one.
i think openai deeply believes that doing things in the real world is more important than publishing academic things. so people get rewarded for putting interventions in the world than putting papers in the hands of academics.
I imagine that publishing any X-risk-related safety work draws attention to the whole X-risk thing, which is something OpenAI in particular (and the other labs as well to a degree) have been working hard to avoid doing. This doesn’t explain why they don’t publish mundane safety work though, and in fact it would predict more mundane publishing as part of their obfuscation strategy.
i have never experienced pushback when publishing research that draws attention to xrisk. it’s more that people are not incentivized to work on xrisk research in the first place. also, for mundane safety work, my guess is that modern openai just values shipping things into prod a lot more than writing papers.
(I did experience this at OpenAI in a few different projects and contexts unfortunately. I’m glad that Leo isn’t experiencing it and that he continues to be there)
I acknowledge that I probably have an unusual experience among people working on xrisk things at openai. From what I’ve heard from other people I trust, there probably have been a bunch of cases where someone was genuinely blocked from publishing something about xrisk, and I just happen to have gotten lucky so far.
it’s also worth noting that I am far in the tail ends of the distribution of people willing to ignore incentive gradients if I believe it’s correct not to follow them. (I’ve gotten somewhat more pragmatic about this over time, because sometimes not following the gradient is just dumb. and as a human being it’s impossible not to care a little bit about status and money and such. but I still have a very strong tendency to ignore local incentives if I believe something is right in the long run.) like I’m aware I’ll get promoed less and be viewed as less cool and not get as much respect and so on if I do the alignment work I think is genuinely important in the long run.
I’d guess for most people, the disincentives for working on xrisk alignment make openai a vastly less pleasant place. so whenever I say I don’t feel like I’m pressured not to do what I’m doing, this does not necessarily mean the average person at openai would agree if they tried to work on my stuff.
Could you elaborate what do you mean by “mundane” safety work?
Publishing anything is a ton of work. People don’t do a ton of work unless they have a strong reason, and usually not even then.
I have lots of ideas for essays and blog posts, often on subjects where I’ve done dozens or hundreds of hours of research and have lots of thoughts. I’ll end up actually writing about 1⁄3 of these, because it takes a lot of time and energy. And this is for random substack essays. I don’t have to worry about hostile lawyers, or alienating potential employees, or a horde of Twitter engagement farmers trying to take my words out of context.
I have no specific knowledge, but I imagine this is probably a big part of it.
I think the extent to which it’s possible to publish without giving away commercially sensitive information depends a lot on exactly what kind of “safety work” it is. For example, if you figured out a way to stop models from reward hacking on unit tests, it’s probably to your advantage to not share that with competitors.
From my experience, there are usually sections (for at least the earlier papers historically) on safety for model releasing.
PaLM: https://arxiv.org/pdf/2204.02311
Llama 2: https://arxiv.org/pdf/2307.09288
Llama 3: https://arxiv.org/pdf/2407.21783
GPT 4: https://cdn.openai.com/papers/gpt-4-system-card.pdf
Labs release some general paper like (just examples):
https://arxiv.org/pdf/2202.07646, https://openreview.net/pdf?id=vjel3nWP2a etc (Nicholas Carlini has a lot of paper related to memorization and extractability)
https://arxiv.org/pdf/2311.18140
https://arxiv.org/pdf/2507.02735
https://arxiv.org/pdf/2404.10989v1
If we see less contents in these sections, one possibility is increased legal regulations that may make publication tricky (imagine an extreme case, companies have sincere intents to produce some numbers that is not indication of harm yet, these preliminary or signal numbers could be used in an “abused” way in legal dispute, may be for profitability reasons/”lawyers need to win cases” reasons). To remove sensitive information, it would comes down then to time cost involved in paper writing, and time cost in removing sensitive information. And interestingly, political landscape could steer companies away from being more safety focused. I do hope there could be a better way to resolve this, providing more incentives for companies to report and share mitigations and measurements.
As far as I know, safety tests usually are used for internal decision making at least for releases etc.
off the cuff take, it seems unclear whether publishing the alignment faking paper makes future models slightly likely to write down their true thoughts on the “hidden scratchpad,” seems likely that they’re smart enough to catch on. I imagine there are other similar projects like this.