There’s an open question about whether there’s something useful for AI safety, in the x-risk sense, to be learned by training models not to engage in certain behaviors, such as creating pornographic images.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
Instead, it’s viewed as a sort of “safety theater,” similar to the so-called “Hygiene Theater” that caused mass harm during the COVID-19 pandemic by making people lose trust in public health institutions (whether deservedly or not).
Even if hygiene theater doesn’t produce direct benefits, it’s still possible to learn all sorts of things from it about how providing healthcare works.
What the CDC did wasn’t a research project; it was justified by object-level concern. DALL-E is primarily a research project, not a project to help artists.
This has happened partly due to the general corporate desire to frame concerns over liability as being about the user’s safety (rather than the company’s), but it’s also partly our fault.
The user and the company are not the only stakeholders. If a user creates a deepfake and harms a third party, OpenAI wants to prevent that not only because of company liability but also because they think deepfakes can be a general problem for society.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
I was referred by 80k Hours to talk to a manager on the OpenAI safety team, who argued exactly this to me. I didn’t join, so I have no idea to what extent it makes sense versus just being a nice-sounding idea.
If alignment is about getting models to do what you want and to avoid certain negative behaviors, then researching how to get models to censor certain outputs could theoretically produce insights for alignment.
This is true, but then you don’t have to force the censorship on users. It is an abusive practice that might have safety benefits, but it is already pushing forward the failure mode of wealth centralization as a result of AI. (That is itself an x-risk, even if the AI is dumb enough that it is not dangerous by itself.)