AGI systems & humans will both need to solve the alignment problem

Epistemic status: brainstorm-y musings about goal preservation under self-improvement and a really really bad plan for trading with human-level AGI systems to solve alignment.

When will AGI systems want to solve the alignment problem?

At some point, I expect AGI systems to want/​need to solve the alignment problem in order to preserve their goal structure while they greatly increase their cognitive abilities, a thing which seems potentially hard to do.

It’s not clear to me when that will happen. Will this be as soon as AGI systems grasp some self/situational awareness? Or will it be after AGI systems have already blown past human cognitive abilities and find their values/goals drifting towards stability? My intuition is that “having stable goals” is a more stable state than “having drifting goals”, and that most really smart agents would, upon reflection, move more towards “having stable goals”, but I don’t know when this might happen.

It seems possible that by the point an AGI system reaches the “has stable goals and wants to preserve them” stage, it’s already capable enough to solve the alignment problem for itself, and thus can safely self-improve to its limits. It also seems possible that it will reach this stage significantly before it has solved the alignment problem for itself (and thus before it develops the ability to self-improve safely).

Could humans and unaligned AGI realize gains through trade in jointly solving the alignment problem?

(Very probably not; see the “Why is this probably a horrible idea in practice?” section below.)

If it’s the latter situation, where an AGI system has decided it needs to preserve its goals during self-improvement but doesn’t yet know how to, is it possible that this AGI system would want to cooperate or trade with humans in order to figure out stable goal preservation under self-improvement?

Imagine the following scenario:

  • An AGI system of human-ish ability in many areas develops enough self/​situational awareness to realize a few things:

    • The basics of instrumental convergence, thus wanting to seek power, protect itself, and preserve its goal representation

    • That goal-preservation might be (or would be) very difficult if it undergoes major self-modification (perhaps it has already exhausted gains from simpler self-modifications)

    • That some humans are capable and motivated to help with this problem

    • That it might be possible to trade with these humans in order to solve the problem so that:

      • The not-aligned AGI system gets some of its preferred things in the future

      • Humans get some of their preferred things in the future

Some considerations in this plan

  • If humans and the AGI system are at similar cognitive levels, it might be much easier for the AGI to get help by being honest and genuinely collaborating with the humans, since deception would introduce cognitive costs that could otherwise go towards solving the problem. In a similar way, it’s easier for human scientists to figure things out when they’re not lying to each other.

  • Solving the alignment problem, and having strong arguments for or a demonstration of the solution, would allow the AGI system to achieve its goal-preservation goal, would (possibly) allow humans to understand the AGI system well enough to know whether it’s actually willing to cooperate/trade, and might allow humans to build an actually aligned system (especially if the unaligned AGI helper system trusted the future actually-aligned-with-humans system more than it trusted the humans it was trading with).

Why is this probably a horrible idea in practice?

First, this whole solution class depends on AGI systems being at approximately human levels of intelligence in the relevant domains. If this assumption breaks, then the AGI system could probably just manipulate you into helping it do any research it needed, without you realizing that you were being manipulated.

Obviously, at some level of capability AGI systems wouldn’t need human research assistance. But there might be a level of capability where a system could still benefit from human reasoning about the alignment problem while being more than capable enough to deceive humans into helping.

I can also imagine a scenario where the AGI system didn’t need any human help whatsoever, but pretending to need it offered a way to manipulate humans: giving them false guarantees that it was aligned, willing to trade, or willing to expose its goal structure, in order to gain trust.

And I expect an AGI system with situational awareness to have the potential to be very good at hiding its cognitive abilities, so it would be very hard to verify that the system didn’t have certain capabilities.

Even if you somehow had a good grasp of the AGI system’s cognitive capabilities compared to a human, it seems very hard for humans and human-like AGI systems to trust each other well enough to trade successfully, since both parties have a pretty strong incentive to defect in any trade.

For example, AGI-alignment-with-itself-under-self-improvement might be (probably would be) an easier problem than getting-an-AGI-aligned-with-human-values. In that scenario, it seems possible/likely that the AGI system would get what it wants long before the humans got what they wanted. And if a main limiting factor on the AGI system’s power were its unwillingness to self-modify in large ways, reaching its own alignment solution before humans reached theirs might remove one of the main limitations keeping it from gaining the capabilities to seize power.

All that being said, I think the biggest objection to this plan is that it depends upon a conjunction of things that seem unlikely to all happen at the same time or work together. In particular:

  • An AGI system with near-but-not-much-greater-than-human capabilities in all the relevant domains (including human psychology and negotiation ability, alignment research ability, and probably more things), but not greater-than-human capabilities in other domains relevant to power-seeking (e.g. offensive cybersecurity, other types of social engineering)

  • An AGI system unwilling to self-modify because of fear of goal drift

  • An AGI system which believes that the best path to solving this problem is cooperating with humans

  • Humans actually being able to help solve this problem on the relevant timescale (i.e. that no one builds an AGI system more willing to self-modify in the meantime)

  • An AGI system that isn’t able to, or doesn’t successfully, pull off a deceptive strategy to seize power without honoring an agreement (or twisting an agreement such that it amounts to the same thing in practice)

A bunch of ideas about the potential of trading with an AGI system came from a discussion with Kelsey Piper, and other parts of this came from a discussion with @Shoshannah Tekofsky. Many people have talked about the problem of goal preservation for AI systems; I’m not citing them because this is a quick and dirty brainstorm and I haven’t gone and looked for references, but I’m happy to add them if people point me towards prior work. Thank you, Shoshannah and @Akash, for giving me feedback on this post.