Analogies between scaling labs and misaligned superintelligent AI

TL;DR: Scaling labs have an alignment problem of their own, analogous to that of AI systems, and there are notable similarities between the labs and misaligned/​unsafe AI.

Introduction

Major AI scaling labs (OpenAI/​Microsoft, Anthropic, Google/​DeepMind, and Meta) are very influential in the AI safety and alignment community. They put out cutting-edge research because of their talent, money, and institutional knowledge. A significant subset of the community works for one of these labs. This level of influence is beneficial in some respects. In many ways, these labs have strong safety cultures, and those values show up in their high-level approaches to developing AI – it’s easy to imagine a world in which things are much worse. But the amount of influence that these labs have is also something to be cautious about.

The alignment community is defined by a concern that subtle misalignment between the incentives we give AI systems and what we actually want from them might cause those systems to dangerously pursue the wrong goals. This post considers an analogous and somewhat ironic alignment problem: the one between human interests and the scaling labs.

These labs have intelligence, resources, and speed well beyond those of any single human. Their money, compute, talent, and know-how make them extremely capable. Given this, it’s important that they are aligned with the interests of humanity. However, there are several analogies between scaling labs and misaligned AI.

It is important not to draw false equivalences between different labs. For example, it seems that by almost every standard, Anthropic prioritizes safety and responsibility much more than other labs. But in this post, I will generally be lumping them together except to point out a few lab-specific observations.

Misaligned Incentives

Much as AI systems may have perverse incentives, so do the labs. They are companies. They need to make money, court investors, build products, and attract users. Microsoft and Google even just ran Super Bowl ads. This kind of accountability to commercial interests is not perfectly in line with serving human interests. Moreover, the labs are full of technocrats whose values and demographics do not represent humanity particularly well. Optimizing for the goals that the labs have is not the same as optimizing for human welfare. Goodhart’s Law applies.

Power Seeking

One major risk factor of misaligned superintelligent AI systems is that they may pursue power and influence. But the same is true of the scaling labs. Each is valued in the billions of dollars on the strength of its assets and investments, and they compete with each other for technical primacy. The labs also pursue instrumental goals, including political influence through lobbying and strategic secrecy to reduce the risk of lawsuits over data and fair use. Recent news that Sam Altman may be pursuing trillions of dollars in funding for hardware suggests that this type of power-seeking could reach very large scales in the near future. To stay competitive, labs need to keep scaling, and when one lab scales, the others are driven to do so as well in an arms race.

Lack of Transparency

Trust without transparency is misguided. We want AI systems that are honest white boxes, easy to interpret and understand. The scaling labs do not meet this standard. They tend to be highly selective in what they publicize, have employees sign non-disclosure agreements, and generally lack transparency and accountability to the public. Instead of white boxes, the labs are more like dark grey boxes that rarely choose to reveal things that would make them look bad. The lack of explanations or voluntary accountability following incidents like Microsoft’s Bing Chat threatening users, OpenAI’s mistreatment of knowledge workers, and the events behind OpenAI’s November 2023 leadership crisis is not encouraging. An important technique for verifying that AI systems are safe is being able to look inside, but aside from model cards, the labs do not currently allow this for their models, data, methods, or institutional decision-making.

Deception

Deceptive alignment – when an AI system pretends to be aligned only for the sake of passing evaluations – is a particularly worrisome AI risk. The labs constantly assure the public that safety is their overriding priority, but given the problems this post is dedicated to, this probably isn’t 100% true 100% of the time. While the scaling labs seem far more likely to mislead by omission than by outright deception, one distinct example from OpenAI comes to mind. OpenAI’s claim, in response to the NYT lawsuit, that “‘Regurgitation’ is a rare bug that we are working to drive to zero” seems less than honest: today’s transformers are pretrained on the objective of predicting web text token by token, which directly rewards reproducing (‘regurgitating’) training documents verbatim, and that tendency will never realistically be driven to zero.
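To make the point concrete, here is the standard autoregressive pretraining objective for language models, stated generically – this is a sketch of the textbook objective, not a claim about the details of any particular lab’s training setup. For a training document with tokens $x_1, \dots, x_T$, the model is trained to minimize

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1}),$$

and driving this loss toward zero on a document is the same thing as being able to reproduce that document verbatim, token by token. Memorization of training text is a consequence of the objective itself, not a stray bug.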

Incorrigibility

We want AI systems that are corrigible, meaning that they are indifferent to having their policies and goals modified by humans, up to and including being shut off. However, the events at OpenAI in November 2023 are a case study in institutional incorrigibility. Reportedly, after the board observed what it regarded as power-seeking behavior from Sam Altman and voted to oust him, OpenAI employees and investors applied enough pressure to restore Altman and restructure the board. Independent of whether anyone was right or wrong, it is clear that the role of OpenAI CEO was designed to be under the control of the board, yet an attempt by the board to intervene was successfully subverted. This is the definition of incorrigibility.

Subverting Constraints

We want AI systems to obey the constraints given to them and not to circumvent those constraints by self-modification. However, OpenAI exhibited exactly this kind of constraint subversion when it recently and quietly removed a ban on military applications that was previously part of its usage policies.

Concluding Thoughts

Imagine you trained and deployed a highly capable agentic AI system, and it did the things listed above. Surely, it would be a good idea to shut it down and start questioning all of the design choices that led you to this point.

The scaling labs are companies composed of humans, and they are very different from AI systems. In addition to the analogies discussed above, there are many points of disanalogy as well. I do not intend to argue in this post that it would be pragmatic or helpful to push for shutting down the scaling labs.

Instead, my hope in writing this is to encourage more open criticism of the labs and of the influence they have over the alignment community. Just like AI systems, the labs have their own very challenging meta-alignment problem. I hope that thinking in these terms can become a bit more within the Overton window than I often perceive it to be. Certainly, everyone should be able to agree that it is a bit worrying that the scaling labs’ plan for AI safety also involves them gaining very large amounts of money and power.


Thanks to Holly Elmore and Peter Hase for a few comments, but all takes here are my own; don’t assume that just because I said something, either of them necessarily agrees.