I don’t know and have no way to confirm my guesses, so I’m interested in evidence from the lab. But I’d guess >80% of the decision force is covered by this set of general patterns:
What they consider to be safety work also produces capability improvements, or even worsens dual-use risk, e.g. by making models more obedient, and so they don’t want to give it to competitors.
The safety work they don’t publish contains things they’re trying to prevent the models from producing in the first place, so it’d be like asking a cybersecurity lab to share malware samples: they might do it, and sometimes they might even consider it a very high priority, but probably not for all their samples, and probably not the moment they get them. It might depend on how bad the things are, and on whether a user is deliberately trying to get the model to do the thing they want to prevent or the model is doing it spontaneously.
They consider something to be safety that most people would disagree is safety, e.g. preventing the model from refusing when asked to help with some commonly accepted ways of harming people, and admitting this would be harmful to PR.
They on net don’t want critique of their safety work, because, whether the bottleneck is competence or caring, it falls short of what they expect people expect of them, and so critique would put them at risk of PR attacks. I expect this is a major force that at least some people in some labs’ orgs either don’t want to admit, or would admit only if it didn’t come with PR backlash.
It’s possible to make their safety work look good, but it takes a bunch of work, and they don’t want to publish things that look sloppy even if insightful, e.g. because they hold the view that most of the value of publishing is reputational.
OpenAI explicitly encourages safety work that is also useful for capabilities. People at OpenAI think of it as a positive attribute when safety work also helps with capabilities, and are generally confused when I express the view that not advancing capabilities is a desirable attribute of safety work.
I think we as a community have a definition of the word “safety” that diverges more from the layperson definition than OpenAI’s definition does. I think our definition is the more useful one to focus on for making the future go well, but I wouldn’t say it’s the most widely accepted one.
I think OpenAI deeply believes that doing things in the real world is more important than publishing academic work, so people get rewarded more for putting interventions into the world than for putting papers into the hands of academics.