The default outcome for aligned AGI still looks pretty bad

Most of the AI-related discussions I’ve read from the LW/EA community have rightly focused on the challenges and strategies for aligning AI with the intentions of its creator. After all, if that problem remains unsolved, humanity will almost certainly go extinct.

However, something I rarely see discussed is what happens if we manage to solve the narrower problem of aligning an AI with the desires of its creator, but fail to solve the wider problem of aligning it with the desires of humanity as a whole. I think that even if we solve the narrow alignment problem, the default outcome still looks pretty terrible for most humans.

Much of the cutting-edge work on LLMs and other powerful models is being conducted in for-profit corporations. As we see clearly in existing companies, maximizing profits within such an entity often produces behavior that is highly misaligned with the interests of the general public. Social media companies, for example, monetize a large fraction of human attention for stunningly low value, often at the expense of users’ social relationships, mental health, and economic productivity.

OpenAI at least operates under a capped-profit model, in which investors can earn no more than 100x their initial investment. This is good: it at least leaves open the possibility that, if they create AGI, the profit model may not automatically result in a gigantic concentration of power. They have also hinted in their posts that future fundraising rounds will be capped at some multiple lower than 100x.
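To make the arithmetic concrete, here is a minimal sketch of how such a cap works. The 100x figure comes from OpenAI’s public statements; the function name, the split into “payout” and “excess,” and the assumption that returns above the cap flow to the controlling nonprofit are my own simplifications, not a description of the actual legal mechanics.

```python
def capped_payout(investment: float, gross_return: float,
                  cap_multiple: float = 100.0) -> tuple[float, float]:
    """Split an investor's gross return into the portion they keep under the cap
    and the excess above it.

    Simplified illustration only: the real cap reportedly varies by funding round,
    and how above-cap value gets distributed has not been publicly specified.
    """
    max_payout = investment * cap_multiple   # the most the investor can ever receive
    payout = min(gross_return, max_payout)   # investor keeps returns up to the cap
    excess = max(gross_return - max_payout, 0.0)  # everything beyond the cap is redirected
    return payout, excess

# Hypothetical example: a $10M investment that somehow returns $5B gross.
payout, excess = capped_payout(10e6, 5e9)
print(f"Investor keeps ${payout:,.0f}; ${excess:,.0f} is redirected")
# -> Investor keeps $1,000,000,000; $4,000,000,000 is redirected
```

Even in this toy version, the interesting question is not the cap itself but who decides where the `excess` goes, which is exactly the part that remains unspecified.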

However, they still haven’t publicly specified how the benefits of AGI beyond this capped-profit threshold will be distributed. If they actually succeed in their mission, this distribution method will become a huge, huge deal. I would fully expect nations to consider war (insofar as war could accomplish their goals) if they feel they are getting an unfair deal in the distribution of AGI-created profits.

Anthropic is registered as a public benefit corporation, which according to this explainer means:

Unlike standard corporations, where the Board generally must consider maximizing shareholder value as its prime directive, members of the Board of a PBC must also consider both the best interests of those materially affected by the company’s conduct, and the specific public benefit outlined in the company’s charter.

Unfortunately, Anthropic does not appear to have made its charter public. That leaves me no easy way to even guess how well its non-profit-maximizing mission aligns with the general interests of humanity, other than to hope that it resembles the language in the four paragraphs listed on their page regarding company values.

Besides these two, every other entity that is a serious contender to create AGI is either a pure profit-maximizing enterprise or controlled by the CCP.

Even without the profit incentive, things still look kind of bleak

Imagine for a moment that somehow we make it to a world in which narrowly aligned AI becomes commonplace, and either non-corporate actors exert enough influence to prevent AGI from purely benefitting stockholders, or the first group to create AGI manages to stick to the tenets of its public benefit charter. Now imagine that during the rollout of AGI access to the general public, you are one of the few lucky enough to have early access to the most powerful models. What’s the first thing you would use it for?

Different people will have different answers for this depending on their circumstances, but there are a few self-interested actions that seem quite obvious:

  • Make oneself smarter (possibly through brain uploading, which seems within reach for an AGI)

  • Make oneself immortal (hopefully with some caveats to prevent infinite torture scenarios)

Such actions seem very likely to lead to enormous power concentration in the hands of the few with privileged access to powerful early models. It seems entirely plausible that this could produce a runaway feedback loop in which those who adopt such a strategy first use their increased intelligence to accrue more and more resources (most importantly compute), until they become more or less equivalent to a superintelligence.

The best case outcome in such a scenario would be a benevolent dictatorship of some sort, in which someone who broadly wants the best for all non-amplified humans ensures that no other entity can challenge their power, and takes steps to safeguard the welfare of other humans.

But I think a more likely outcome is one in which multiple people with early access to the models end up fighting over resources. Maybe by some miracle the winner of that process will be a reasonably benevolent dictator, but such competitive dynamics optimize for willingness to do whatever it takes to win. History suggests such traits are not well correlated (or perhaps even negatively correlated) with a general concern for human well-being.

I’ve read some good ideas about what to do with an aligned AGI, most notably MIRI’s Coherent Extrapolated Volition, but the question remains: can we actually get the interested parties to do something as sensible as CEV instead of just directly pursuing their own self-interest?

Please let me know if you see anything obvious I’ve overlooked in this post. I am not a professional AI researcher, so it’s possible people have already figured out the answer to these challenges and I am simply unaware of their writings.