This is a linkpost (to the EA forum version of this post, which is) for a new preprint, entitled “Building a Culture of Safety for AI: Perspectives and Challenges,” and a brief explanation of the central points. Comments on the ideas in the post are welcome, but much of the content which clarifies the below is in the full manuscript.
Safety culture in AI is going to be critical for many of the other promising initiatives for AI safety.
If people don’t care about safety, most safety measures turn into box-ticking. Companies that don’t care avoid regulation, or render it useless. That’s what happens when fraudulent companies are audited, or when car companies cheat on emissions tests.
If people do care about safety, then audits, standards, and various risk-analysis tools can help get them there.
Culture can transform industries, and norms about trying to be safe can be really powerful as a way to notice and discourage bad actors.
However, there are lots of challenges to making such a culture.
Safety culture usually requires agreement about the risks. We don’t have that in AI generally.
Culture depends on the operational environment.
When people have risks reinforced by always being exposed to them, or personally being affected by failures, they pay more attention. In AI, most risks are rare, occur in the future, and/or affect others more than the people responsible.
Most safety cultures are built around routines such as checklists and exercises that deal with current risks. Most AI risks aren’t directly amenable to these approaches, so we can’t reinforce culture with routines.
Cultures are hard to change once they get started.
AI gets cultural norms from academia, where few consider risks from their work, and there are norms of openness, and from the startup world, where companies generally want to “move fast and break things.”
AI companies aren’t prioritizing safety over profits—unlike airlines, nuclear power operators, or hospitals, where there is a clear understanding that safety is a critical need, and everything will stop if there is a safety problem.
Companies aren’t hiring people who care about safety culture. But people build culture, and even if management wants to prioritize safety, lots of people who don’t care won’t add up to organizations that do care.
We need something other than routinization to reinforce safety culture.
Thankfully, there are some promising approaches, especially on the last point. These include identifying future risks proactively via various risk analysis methods, red-teaming, and audits. But as noted above, audits are most useful once safety culture is prioritized—though there is some promise in the near-term for audits to make lack of safety common knowledge.
Next steps include building the repertoire of tools that will reduce risks and can be used to routinize and inculcate safety culture in the industry, and getting real buy-in from industry leaders for prioritizing safety.
Thanks to Jonas Schuett, Shaun Ee, Simeon Campos, Tom David, Joseph Rogero, Sebastian Lodemann, and Yonaton Cale for helpful suggestions on the manuscript.
Cultural norms and egocentricity
I’ve been working fully remotely and have meaningfully contributed to global organizations without physical presence for over a decade. I see parallels with anti-remote and anti-safety arguments.
I’ve observed the robust debate regarding ‘return to work’ vs ‘remote work,’ with many traditional outlets proposing ‘return to work’ based on a series of common criteria. I’ve seen ‘return to work’ arguments assert remote employees are lazy, unreliable or unproductive when outside the controlled work environment. I would generalize the rationale as an assertion that ‘work quality cannot be assured if it cannot be directly measured.’ Given modern technology allows us to measure employee work product remotely, and given the distributed work of employees across different offices for many companies, this argument seems fundamentally flawed and perhaps even intentionally misleading. My belief in the arguments being misleading is compounded by my observations that these articles never mention related considerations like cost of rental/ownership of property and the handling of those costs, nor elements like cultural emphasis on predictable work targets or management control issues.
In my view, the reluctance to embrace remote work often distills to a failure to see beyond immediate, egocentric concerns. Along the same lines, I see failure to plan for or prioritize AI safety as stemming from a similar inability to perceive direct, observable consequences to the party promoting anti-safety mindsets.
Anecdotally, I came across an article that proposed a number of cultural goals for successful remote work. I shared the article with my company via our Slack. I emphasized that it wasn’t the goals themselves that were important, but rather adopting a culture that made those goals critical. I suggested that Goodhart’s Law applied here- once a measure becomes a target, it ceases to be a good measure. A culture that values and principals beyond the listed goals would succeed, not just a culture that blindly pursues the listed goals.
I believe the same can be said for AI Safety. Focusing on specific risks, or specific practices won’t create a culture of safety. Instead, as the post (above) suggests, a culture that does not value the principals behind a safety-first mentality will attempt to merely meet the goals, or work around the goals, or undermine the goals. Much as some advocates for “return to work” are egocentrically misrepresenting remote work, some anti-safety advocates are egocentrically misrepresenting safety. For this reason, I’ve been researching the history of adoption of a safety mentality, to see how I can promote a safety-first culture. Otherwise I think we (both my company, and the industry as a whole) risk prioritizing egocentric, short-term goals over societal benefit and long-term goals.
Observations on the history of adopting “Safety First” mentalities
I’ve been looking at the human history about adoption of safety culture, and invariably, it seems to me that safety mindsets are adopted only after loss, usually loss of human life. It is described anecdotally in the paper associated with this post.
Emphasis added by me.
NOTE: I could not find any indication of loss of human life attributed to Three Mile Island, but both Chernobyl and Fukushima happened after Three Mile Island, and both did result in loss of human life. It’s also important to note that both Chernobyl and Fukushima were both classed INES Level 7, compared to Three Mile Island which was classed INES Level 5. This evidence is contradictory to what was in the quoted part of the paper. (And, sadly, I think supports an argument that Goodhart’s Curse is in play… that safety regressed to the mean… that by establishing minimum safety criteria instead of a safety culture, certain disasters not only could not be avoided but were more pronounced than previous disasters.) So both of the worst reactor disasters in human history occurred after the safety cultures that were promoted following Three Mile Island.[1][2] The list of nuclear accidents is longer than this, but not all accidents result in loss.[3][2:1] (This is something that I’ve been looking at for a while, to inform my predictions about the probability of humans adopting AI safety practices with regards to pre- or post- AI disasters.)
Personal contribution and advocacy
In my personal capacity (read: area of employment) I’m advocating for adversarial testing of AI chatbots. I am highlighting the “accidents” that have already occurred: Microsoft Tay Tweets[4], SnapChat AI Chatbot[5], Tessa Wellness Chatbot[6], Chai Eliza Chatbot[7].
I am promoting the mindset that if we want to be successful with artificial intelligence, and do not want to become a news article, that we should test expressly for ways that the chatbot can be diverted from the chatbots primary function, and design (or train) fixes for those problems. It requires creativity, persistence and patience… but the alternative is that one day, we might be in the news if we fail to proactively address the challenges that obviously face anyone who is trying to use artificial intelligence.
And, like my advocacy about looking at what values a culture should have that wants to adopt a pro-remote culture and be successful at it, we should look at what values a culture should have that wants to adopt a pro-safety-first culture and be successful at it.
I’ll be cross posting the original paper to my work. Thank you for sharing.
DISCLAIMER: AI was used to quality check my post, assessing for consistency, logic and soundness in reasoning and presentation styles. No part of the writing was authored by AI.
https://www.processindustryforum.com/energy/five-worst-nuclear-disasters-history
https://en.wikipedia.org/wiki/Nuclear_and_radiation_accidents_and_incidents
https://ieer.org/resource/factsheets/table-nuclear-reactor-accidents/
https://en.wikipedia.org/wiki/Tay_(chatbot)
https://www.washingtonpost.com/technology/2023/03/14/snapchat-myai/
https://www.nytimes.com/2023/06/08/us/ai-chatbot-tessa-eating-disorders-association.html
https://www.complex.com/life/father-dies-by-suicide-conversing-with-ai-chatbot-wife-blames
Thanks, this is great commentary.
On your point about safety culture after 3MI, when it took hold, and regression to the mean, see this article: https://www.thenation.com/article/archive/after-three-mile-island-rise-and-fall-nuclear-safety-culture/ Also, for more background about post-3MI safety, see this report: https://inis.iaea.org/collection/NCLCollectionStore/_Public/34/007/34007188.pdf?r=1&r=1