Irresponsible Companies Can Be Made of Responsible Employees
tl;dr:
In terms of an AI company's financial interests, bankruptcy and the world ending are equally bad. If a company acted in line with its financial interests[1], it would happily accept a significant extinction risk in exchange for increased revenue.
There are plausible mechanisms which would allow a company to act like this even if virtually every employee would prefer the opposite. (For example, selectively hiring people with biased beliefs or exploiting collective action problems.)
In particular, you can hold that an AI company is completely untrustworthy even if you believe that all of its employees are fine people.
Epistemic status & disclaimers: The mechanisms I describe definitely play some role in real AI companies. But in practice, there are more things going on simultaneously and this post is not trying to give a full picture.[2][3]Also, none of this is meant to be novel, but rather just putting existing things together and applying to AI risk.
From a financial point of view, bankruptcy is no worse than destroying the world
Let’s leave aside the question of how real companies act. Instead, we start with a simple observation: if all a company cared about were its financial interests, bankruptcy and the world getting destroyed would be equivalent. Unsurprisingly, this translates into undesirable decisions in various situations.
For example, consider an over-simplified scenario where an AI company somehow has precisely these two options[4]:
Option A: 10% chance of destroying the world, 90% chance of nothing happening.
Option B: Certainty of losing 20% market share.
We could imagine that this corresponds to racing ahead (which risks causing the end of the world) versus taking things slowly (which leads to a loss of revenue). But to simplify the discussion, we assume that Option A has no upside (the 90% branch really is just “nothing happening”) and that everybody knows this. In this situation, if the company were following its financial interests (and knew the numbers), it should take Option A—deploy the AI and risk destroying the world.
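For concreteness, here is a minimal sketch of the arithmetic behind this comparison. It assumes, purely for illustration, that the company’s value is proportional to its market share and that both bankruptcy and “the world is destroyed” are worth zero to shareholders; the specific numbers are mine, not part of the toy example.

```python
# A minimal sketch of the toy comparison, under the illustrative assumptions above.
current_value = 1.0

# Option A: 10% chance the world (and the company with it) is gone,
# 90% chance nothing changes.
ev_option_a = 0.10 * 0.0 + 0.90 * current_value   # = 0.90

# Option B: the company certainly survives but loses 20% of its market share.
ev_option_b = 1.00 * (0.80 * current_value)       # = 0.80

print(f"Option A expected value: {ev_option_a:.2f}")
print(f"Option B expected value: {ev_option_b:.2f}")
# Financially, Option A comes out ahead: the extinction branch is "priced"
# no worse than bankruptcy.
```

The exact numbers do not matter; the point is only that, on a purely financial ledger, the extinction branch costs the company nothing beyond what bankruptcy already would.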
However, companies are made of people, and people might not be happy about risking the world. Shouldn’t we expect that they would decide to take Option B instead? I am going to argue that this is not necessarily the case: there are ways in which the company might end up taking Option A even if every employee would prefer Option B.
How to Not Act in Line with Employee Preferences
It shouldn’t come as a surprise that companies are good at getting people to act against their preferences. The basic example is paying people off: by offering a salary, we override their preference for staying home rather than working. Less benignly, an AI company might use a similar tactic to override people’s reluctance to gamble with the world—bribe them with obscene amounts of money, and if that is not enough, dangle the chance to shape the future of the universe. However, accepting these bribes is morally questionable to say the least, and might not work on everybody—and my claim was that AI companies might act irresponsibly even if all of their employees are good people. So in the rest of this text, we will go over a few other mechanisms.
To preempt a possible misunderstanding: getting a company to act like this does not require deliberate effort[5] by individuals inside the company. Sure, things might go more smoothly if a supervillain CEO can have a meeting with mustache-twirling HR personnel to figure out the best ways to get employees to go along with profit seeking. And to some extent, fiduciary duty might imply the CEO should be doing this. But I expect most of these things to happen organically. Many of the mechanisms will be part of the standard package for how to structure a modern business. Because companies compete and evolve over time, we should expect the most successful ones to have “adaptations” that help their bottom line.
So, what are some of the mechanisms that could help a company to pursue its financial interests even when they are at odds with what employees would naively prefer?
Fiduciary duty. This might not be the main driver of behaviour but being supposed to act in the interest of shareholders probably does make a difference.[6]
Selective hiring. (a) The company could converge on a hiring policy that selects for people whose beliefs are biased, such that they genuinely believe that the actual best option is the one that is best for the bottom line. In the toy example, Option A carries an X% chance of extinction. The true value of X is 10, but suppose that people’s beliefs about X are randomly distributed. If the company hires only people who think that X is low enough to be acceptable, all of its employees will, mistakenly but genuinely, believe that Option A is good for them. (A minimal simulation of this selection effect is sketched after this list.)
(b) Similarly, the company could hire people who are too shy to speak up.
(c) Or select for other traits or circumstances that make for compliant employees.
Firing dissenters as a coordination problem. Let’s assume that to act, the company needs all its employees to be on board. (Or more realistically, at least some of them.) Even if all employees were against some action, the company can still take it: Suppose that the company adopts a policy where if you dissent, you will be fired (or moved to a less relevant position, etc). But then, if you get replaced by somebody else who will comply, the bad thing happens anyway.[7] So unless you can coordinate with the other employees (and potential hires), compliance will seem rational.
Compartmentalising undesirable information (and other types of information control). In practice, the employees will have imperfect information about the risks. For example, imagine that nobody would find the 10% extinction risk acceptable—but recognising that the risk exists would require knowing facts A, B, and C. This is simple to deal with: make sure that different teams work on A, B, and C and that they don’t talk to each other much. And to be safe, fire anybody who seems to have recognised the risk—though in practice, they might even leave on their own.
(Uhm. You could also have a dedicated team that knows about the risk, as long as you keep them isolated and fire them[8] periodically, before they have time to make friends.)
Many other things. For example, setting up decision processes to favour certain views. Promoting the “right” people. Nominally taking safety seriously, but sabotaging these efforts (eg, allocating resources in proportion to how much the team helps with the bottom line). People self-selecting into teams based on beliefs, and teams with different beliefs not talking to each other much. And a gazillion other things that I didn’t think of.
(To reiterate, this list is meant as an existence proof rather than an accurate picture of the key dynamics responsible for the behaviour of AI companies.)
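As a toy illustration of the selective-hiring mechanism in point (a), here is a small simulation sketch. The normal distribution of beliefs, the 1% “acceptability” threshold, and the size of the candidate pool are all hypothetical choices of mine, not claims about any real hiring process.

```python
import random

random.seed(0)

TRUE_RISK = 10.0        # the "real" extinction risk of Option A, in percent
ACCEPTABLE_RISK = 1.0   # hypothetical threshold below which a candidate is fine with Option A

# Candidates' beliefs about the risk are noisy but roughly centred on the
# true value (clamped at zero percent so no belief is negative).
candidates = [max(0.0, random.gauss(TRUE_RISK, 5.0)) for _ in range(100_000)]

# A hiring process that (perhaps unintentionally) filters out anyone who voices
# concern ends up keeping only the optimistic tail of the belief distribution.
hired = [belief for belief in candidates if belief < ACCEPTABLE_RISK]

print(f"average belief in the candidate pool: {sum(candidates) / len(candidates):.1f}%")
print(f"average belief among those hired:     {sum(hired) / len(hired):.1f}%")
# Every hire sincerely believes the risk is acceptable, even though the
# candidate pool as a whole has a roughly accurate (~10%) average belief.
```

Nobody in this toy model is lying or being pressured; the bias comes entirely from who ends up inside the building.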
Well… and why does this matter?
I described some dynamics that could plausibly take place inside AI companies—that probably do take place there. But I would be curious to know which dynamics actually take place, how much each of them matters, and what the overall effect is. (For all I know, the net effect could be towards responsible behaviour.) Looking at the actions that the companies have taken so far gives some information, but it isn’t clear to me how, say, lobbying behaviour generalises to decisions about deploying superintelligence.
Why care about this? Partly, this just seems fascinating on its own. Partly, it seems important to understand if somebody wanted to make AI companies “more aligned” with society. Or it might be that AI companies are so fundamentally “misaligned” that gentle interventions are never going to be enough—but if that were the case, it would be important to make a clearer case that this is so. Either way, understanding this topic better seems like a good next step. (If you have any pointers, I would be curious!)
Finally, I get the impression that there is general reluctance to engage with the possibility that AI companies are basically “pure evil” and should be viewed as completely untrustworthy.[9] I am confused about why this is. But one guess is that it’s because some equate “the company is evil” with “the employees are bad people”. But this is incorrect: An AI company could be the most harmful entity in human history even if every single employee was a decent person. We should hesitate to accuse individual people, but this should not prevent us from recognising that the organisation might be untrustworthy.
- ^
When I mention following financial interests, I just mean the vague notion of seeking profits, revenue, shareholder value, influence, and things like that (and being somewhat decent at it). I don’t think the exact details matter for the point of this post. I definitely don’t mean to imply that the company acts as a perfectly rational agent or that it is free of internal inefficiencies such as those described in Immoral Mazes or Recursive Middle Manager Hell.
- ^
More precisely, there will be various dynamics in play. Some of these push in the direction of following profits, others towards things like doing good or following the company’s stated mission, and some just cause internal inefficiencies. I expect that the push towards profits will be stronger when there is stronger competition and higher financial stakes. But I don’t have a confident take on where the overall balance lies. Similarly, I don’t claim that the mechanisms I give here as examples (selective hiring and miscoordination) are the most important ones among those that push towards profit-following.
- ^
Related to footnotes 1 and 2: Richard Ngo made some good points about why the framing I adopt here is not the right one. (His post Power Lies Trembling is relevant and offers a good framing of dynamics inside countries—but probably applies to companies too.) Still, I think the dynamics I mention here are relevant too, and in the absence of a better pointer to a discussion of them, this post seemed cheap enough to write.
- ^
The scenario is over-simplified and unrealistic, but this shouldn’t matter too much. The same dynamics should show up in many other cases as well.
- ^
The post Unconscious Economics feels relevant here.
- ^
I am not sure how exactly this works, or how it interacts with “negative externalities” such as “unclear risk of extinction”. There definitely is a threat of getting sued over this, but I am not sure how much this really matters, as opposed to serving as a convenient excuse not to rock the boat.
- ^
This will be costly for the company as well. If nothing else, it causes delays, and they will be replacing you with somebody less skilled (otherwise they would have hired that person already). So I advocate not complying with things you are against. But despite this, the coordination problems are definitely real and difficult to solve.
- ^
To reiterate, this does not require deliberate plotting by anybody inside the company. You don’t need to actually fire those people; it should be enough to incidentally underfund them, or perhaps converge on a company culture where they leave on their own.
- ^
I am not saying they definitely are. But I do view this as plausible enough that acting on it would seem reasonable.
my guess:
selective hiring is very real. lots of people who are xrisk pilled just refuse to join oai. people who care a lot often end up very stressed and leave in large part because of the stress.
the vast majority of people at oai do not think of xrisk from agi as a serious thing. but then again probably a majority don’t really truly think of agi as a serious thing.
people absolutely do argue “well if i didn’t do it, someone else would. and even if oai stopped, some other company would do it” to justify their work.
compartmentalization is probably not a big part of the reason, at least not yet. historically things don’t get compartmentalized often, and even when they do, i don’t think it makes the difference between being worried and not being worried about xrisk for that many people
as companies get big, teams A B C not talking to each other is the default order of the world and it takes increasing effort to get them to talk to each other. and even getting them talking is not enough to change their courses of action, which often requires a lot of work from higher up. this hampers everything; this is in general why big companies have so many overlapping/redundant teams
people get promoted / allocated more resources if they do things that are obviously useful for the company, as opposed to less obviously useful for the company (i mean, as a company, you kind of understandably have to do this or else die of resource misallocation).
i think quite a few people, especially more senior people, are no longer driven by financial gain. these things are sometimes “i really want to accomplish something great in the field of ML” or “i like writing code” or “i like being part of something important / shaping the future”. my guess is anyone super competent who cares primarily about money quits after a few years and, depending on the concavity of their utility function, either retires on a beach, or founds a startup and raises a gazillion dollars from VCs
it’s pretty difficult to do weird abstract bullshit that doesn’t obviously tie into some kind of real world use case (or fit into the internally-accepted research roadmap to AGI). this has imo hampered both alignment and capabilities. it makes a lot of sense though, like, bell labs didn’t capture most of the value that bell labs created, and academia is the place where weird abstract bullshit is supposed to live, and we’re in some sense quite lucky that industry is willing to fund any of it at all
concretely this means anything alignmenty gets a huge boost if you can argue that it will (a) improve capabilities or (b) prevent some kind of embarrassing safety failure in the model we’re currently serving to gazillions of people. the kinds of things people choose to work on are strongly shaped by this as a result, which probably explains why so much work keeps taking alignment words and using them to mean aligning GPT-5 rather than AGI.
aside from the leopold situation, which had pretty complicated circumstances, people don’t really get fired for caring about xrisk. the few incidents are hard to interpret because of strong confounding factors and could be argued either way. but it’s not so far from base rates so i don’t feel like it’s a huge thing.
my guess is a lot of antipathy towards safety comes from broader antipathy against safetyism as a whole in society, which honestly i (and many people in alignment) have to admit some sympathy towards.
I am not surprised to hear this but also, this is insane.
All the lab heads are repeatedly publicly claiming they could cause human extinction, superintelligence is within reach, and a majority of people at their own labs don’t take them seriously on this.
I’m somewhat confused about what causes a group of people who talk to each other every day, work on the same projects, observe the same evidence, etc to come to such wildly different conclusions about the work they’re doing together and then be uninterested in resolving the disagreement.
Is there a taboo being enforced against discussing these disagreements inside the labs?
this is pretty normal? it’s really hard for leadership to make employees care about or believe specific things. do you really think the average Amazon employee or whatever has strong opinions on the future of delivery drones? does the average Waymo employee have extremely strong beliefs about the future of self driving?
for most people in the world, their job is just a job. people obviously avoid working on things they believe are completely doomed, and tend to work on cool trendy things. but generally most people do not really have strong beliefs about where the stuff they’re working on is going.
no specific taboo is required to ensure that people don’t really iron out deep philosophical disagreements with their coworkers. people care about all sorts of other things in life. they care about money, they care whether they’re enjoying the work, they care whether their coworkers are pleasant to be around, they care about their wife and kids and house.
once you have a company with more than 10 people, it requires constant effort to maintain culture. hiring is way harder if you can only hire people who are aligned, or if you insist on aligning people. if you grow very fast (and openai has grown very fast—it’s approximately doubled every single year I’ve been here), it’s inevitable that your culture will splinter. forget about having everyone on the same page; you’re going to have entire little googletowns and amazontowns and so on of people who bring Google or Amazon culture with them and agglomerate with other recent transplants from those companies.
I think it’s notably abnormal because it wasn’t the “default” equilibrium for OpenAI specifically.
Like earlier you mentioned:
and
One model of this is “its normal that people at any company don’t have strong opinions about their work”. Another model is “lots of people in various positions did in fact have strong opinions about this given the stakes and left”.
If you send such strong signals about safety that people preemptively filter themselves out of the hiring pipeline, and people who are already there with strong opinions on safety feel sidelined, then IMO the obvious interpretation is “you actively filtered against people with strong views on safety”.
Can you gesture at what you are basing this guess on?
Ignore this if you are busy, but I was also wondering: Do you have any takes on “do people get fired over acting on beliefs about xrisk”, as opposed to “people getting fired over caring about xrisk”?
And perhaps on “whether people would get fired over acting on xrisk beliefs, so they don’t act on them”? (Though this seems difficult to operationalise.)
as far as I’m aware, the only person who can be argued to have ever been fired for acting on beliefs about x risk is leopold, and the circumstances there are pretty complicated. since I don’t think he’s the only person to have ever acted on xrisk at oai to the extent he did, I don’t think this is just because other people don’t do anything about xrisk.
most cases of xrisk people leaving are just because people felt sidelined/unhappy and chose to leave. which is ofc also bad, but quite different.
Why is that? If someone went off and consistently worked on an agenda that was directly xrisk related (and that didn’t contribute to short-term capabilities or product safety), you’re saying they wouldn’t get sidelined / not allocated resources / fired?
Looks like you accidentally dropped a sentence there.
Another dynamic is that employees at AI companies often think that their AI company not going bankrupt will allow P(doom) reduction, at least via the classic mechanism of “having more power lets us do more good things” (e.g. advocating for good policies using the clout of a leading AI company, doing the cheap and important safety/security stuff that the counterfactual company would not even bother doing—and demonstrating that these cheap and important things are cheap and important, using AIs in net-good ways that other companies would not have bothered doing, not trying to spread dangerous ideologies, …).
And it’s unclear how incorrect this reasoning is. This seems to me like a sensible thing to do if you know the company will actually use its power for good—but it’s worrisome that so many people in so many different AI companies think they are in an AI company that will do something better with the power than the company that would replace them if they went bankrupt.
I think that “the company is doing sth first-order risky but employees think it’s actually good because going bankrupt would shift power towards some other more irresponsible actors” is at least as central as “the company is doing something risky to avoid going bankrupt and employees would prefer that the company would not do it, but [dynamic] prevents employees from stopping it”.
I would be interested to know more about those too! However, I don’t have any direct experience with the insides of AI companies, and I don’t have any friends who do either, so I’m hoping that other readers of this post might have insights that they are willing to share.
For those who have worked for AI companies or have reliable info from others who have worked for an AI company, these are a few things I am especially curious about, categorised by the mechanisms mentioned in the post:
Selective hiring
What kinds of personality traits are application reviewers and interviewers paying attention to? To what extent are the following traits advantages or disadvantages for candidates getting hired?
bias to action?
agreeableness?
outspokenness?
conscientiousness?
Ignoring any explicit criteria that are supposed to factor into the decision to hire someone, what was the typical distribution of personality traits in people who were actually hired?
What about the distribution of traits in people who continued to work there for longer than two years?
To what extent are potential hires expected to have thought about the wider impacts of the work they will be doing?
Does an applicant’s level of concern about catastrophic risk typically come up during the hiring process? Is that a thing that factors into the decision to hire them? (For now, ignoring the question of whether it should be a factor.)
Firing dissenters as a coordination problem
How often have employees had consequential disagreements with the team or company’s direction and voiced it?
How often have employees had consequential disagreements with the team or company’s direction and not voiced it? Why not?
How often do employees in the company see their colleagues express consequential disagreements with the work they are doing?
How often have you seen employees express consequential disagreements and then have them addressed to their satisfaction?
Among the employees who have expressed consequential disagreements at some point, how many of them work on the same team two years later?
How many of them still work at the company at all two years later?
Compartmentalising undesirable information (and other types of information control)
How long has the average Risk Team member worked for that team?
How often do people on the teams addressing risks talk to people outside their teams about the implications of their work?
How often do people working on the teams addressing risks find themselves softening or otherwise watering down their risk assessments/recommendations when communicating with people outside their teams?
Yep, these all seem relevant, and I would be interested in answers. (And my thanks to leogao for their takes.)
I would additionally highlight[1] the complication where even if a tendency is present—say, selective hiring for a particular belief—it might not work as explicitly as people knowingly paying attention to it. It could be implicitly present in something else (“culture fit”), or purely correlational, etc. I am not sure how to best deal with this.
FWIW, I think this concern is important, but we need to be very cautious about it, since one could always go “yeah, your results say nothing like this is going on, but the real mechanism is even more indirect”. I imagine that fields like fairness & bias have to encounter this a lot, so there might be some insights there.
To be clear, your comment seems aware of this issue. I just wanted to emphasise it.
Yes, that sounds plausible to me as well. I did not mention those because I found it much harder to think of ways to tell when those dynamics are actually in play.
If I understand this correctly:
I think you are gesturing at the same issue?
It makes sense to me that the implicit pathways for these dynamics would be an area of interest to the fields of fairness and bias. But I would not expect them to have any better tools for identifying causes and mechanisms than anyone else[1]. What kinds of insights would you expect those fields to offer?
To be clear, I have only a superficial awareness of the fields of fairness and bias.