How to solve deception and still fail.

A mostly finished post I’m kicking out the door. You’ll get the gist.

I

There’s a tempting picture of alignment that centers on the feeling of “As long as humans stay in control, it will be okay.” Humans staying in control, in this picture, is something like humans giving lots of detailed feedback to powerful AI, staying honestly apprised of the consequences of its plans, and having the final say on how plans made by an AI get implemented.[1]

Of course, this requires the AI to be generating promising plans to begin with, or else the humans are just stuck rejecting bad plans all day. But conveniently, in this picture we don’t have to solve the alignment problem the hard way. We could train the AI on human approval to get it to generate in-some-sense-good plans. As long as humans stay in control, it will be okay.

Normally, training even indirectly for human approval teaches AIs to deceive humans in order to more reliably maximize approval. Which is why, in order for humans to stay in control, we need to be able to solve deception—not just detect it after the fact, but produce AIs that actually don’t try to deceive the humans.

An AI deceiving the human (from Christiano et al. 2017). The sort of thing we’d like to understand how to categorically avoid.

The hope goes that humans’ un-deceived approval is a sufficient quality check, and that we’ll pick good AI-generated plans and make the future go well.

This post is about how that picture is flawed—how an AI can generate a plan that humans approve of, based on lots of human feedback, that isn’t deceptive in the narrow sense, but that’s still bad.

Put a certain way, my thesis sounds nuts. What’s supposed to count as a valid human preference if human-guided plans, honestly approved of, aren’t sufficient? Is this just paranoid essentialism, insisting that there’s some essence of rightness that I magically know AIs risk missing, no matter their relationship with humans?

II

I’m not being essentialist. Humans are just too easily gamed. An AI that convinces humans to do bad stuff doesn’t have to be dishonest—true statements presented in the right order can convince humans of too many things. The AI doesn’t even need to be thinking manipulative thoughts—it can just be the result of an optimization process that rewards AIs for being convincing to humans.
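To make that concrete, here is a minimal toy sketch of what “an optimization process that rewards AIs for being convincing to humans” can look like. Every name below is a hypothetical stand-in (the plan generator, the learned approval model), not anyone’s real training setup; the point is only that selecting the highest-scoring output is enough to produce this dynamic, with no manipulative cognition anywhere in the system.

```python
# Toy sketch (all names hypothetical): best-of-N selection against a learned
# proxy for human approval. No individual candidate is "trying" to manipulate
# anyone; the pressure toward being convincing comes from the selection step.

import random


def generate_plans(n: int) -> list[str]:
    """Stand-in plan generator: returns n candidate plan descriptions."""
    return [f"plan-{i}" for i in range(n)]


def predicted_human_approval(plan: str) -> float:
    """Stand-in approval model; imagine it was trained on human ratings."""
    return random.random()


def select_plan(n_candidates: int = 1000) -> str:
    """Return the candidate that scores highest on predicted approval.

    The returned plan is, by construction, the one most optimized for
    looking good to humans -- the property this post argues can come
    apart from actually being good.
    """
    candidates = generate_plans(n_candidates)
    return max(candidates, key=predicted_human_approval)


if __name__ == "__main__":
    print(select_plan())
```

The only optimization toward “what humans find convincing” lives in that outer selection step, which is the sense in which the AI doesn’t even need to be thinking manipulative thoughts.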

The question remains, what standard other than human approval am I using to judge whether a human-approved plan is “actually” good or bad? Briefly: Human values are a collection of abstractions useful for describing human actions, their decision-making, their interpersonal bargaining, and their self-reflective cognition. Judgments about the “actual goodness” of some plan come from how that plan coheres or conflicts with various perspectives on what I value.

Some AI-generated plans might be selected in a way that de-correlates present approval from other data relevant to human values. A plan might be approved but still be bad from perspectives that emphasize implied preferences in behavior, or neural correlates of liking, or healthy self-reflection, or community-based perspectives on value, or even future approval.

III

To gradually get more specific, there are two ways I expect human-approved plans to go wrong: object-level and meta-level.

Object-level wrongness is when you endorse a plan, but you’re wrong in the straightforward way that King Midas was wrong for endorsing turning everything he touched into gold. Midas did want more gold, but there were things missing from his wish that he didn’t think about until it was too late. In the myth, these were super-obvious things like “being able to eat food,” because Midas is a fictional character meant to warn you to be careful what you wish for.

It’s true that if it’s sufficiently non-obvious that something is wrong, then it’s not wrong at all. The plausibility of object-level wrongness relies on human approval being fallible enough that we can expect future people to sometimes make decisions that look fairly obviously wrong to us.

Do people ever make such mistakes? Yes! Regularly! People could fail to consider an impact that they’d immediately see as important if it were pointed out. People could feel pressure to evaluate plans on legible or socially respectable metrics that leave out important parts of the human experience. People could get stuck in motivated reasoning, where they get personally invested in a particular plan and find reasons to discount problems with it.

For each example one can give, it’s easy to respond “oh, I’d never pick that,” especially because giving it as an example requires showing a framing where it seems obviously bad. But at the same time, we can acknowledge that we’re fallible, and often we think something is a good idea (especially if its presentation has been optimized to appeal to us) and only later realize it’s not.

Meta-level wrongness can only occur when the plan you approve is indirect—when the plan contains steps like “learn more about the world and use some procedure to decide what to do next.” This indirect plan can be bad by meta-level standards—e.g. standards for what processes of self-reflection are good processes.

I often think about the Atlantic Ocean. Self-modification is like water flowing downhill—as an AI takes actions that modify itself, it flows around the landscape of possible desires, eventually pooling into more stable configurations. But if self-modification were water flowing downhill, we wouldn’t want our AI to end up in the Atlantic Ocean. Not because we know what object-level positions the Atlantic Ocean corresponds to, and we disagree, but because too many different rivers flow into the ocean—ending there means you’ve lost most of the information about where you came from. Indirect plans that would self-modify down to the ocean are bad plans.

Would an AI have to be deceptive to get us to approve of a plan that’s a river to the ocean? No. If there’s selection pressure to find an ocean-going river that sounds appealing to humans, such a river will be found. See perhaps Wirehead gods on lotus thrones. The important question, the question we can’t sidestep even if we solve deception, is where your selection pressure is coming from.

IV

A way to solve deception and still fail is to find your AI-generated plans via an optimization process that selects a goal you approve of without solving the rest of the alignment problem. Such a selection process will swiftly find us an AI that’s honestly and persuasively trying to drive what we approve of apart from every other facet of goodness.

How all this shakes out depends on what kind of communication we’re having with the AI, what counts as “deception,” etc.[2] But broadly, if we solve deception, I don’t think the danger is the AI wanting something hidden and alien and manipulating humans to get it, I think it’s that we inadvertently select for an AI that honestly wants some Midas-like bad wish.[3]

You can think of this as still a kind of deception, but perpetrated by the combined training-and-selection-and-deployment process. To the extent that deception involves an agent thinking about how humans will react when it should be thinking about the object-level matter at hand, such a system would be implicitly “thinking” about human reactions during the training and selection processes, when we’re e.g. training an AI on human approval, or selecting a soft prompt based on human ratings of the results.
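For the soft-prompt case, here is a minimal sketch of where that implicit “thinking” lives. The model, the rating function, and the search procedure are all hypothetical stand-ins, not a real API: the frozen model never considers its raters, but the outer loop that picks the winning soft prompt is optimizing against their reactions all the same.

```python
# Toy sketch (hypothetical stand-ins throughout): selecting a soft prompt by
# human ratings. The frozen model below never models its raters; the loop
# that chooses among soft prompts is what implicitly optimizes against them.

import numpy as np

rng = np.random.default_rng(0)


def model_output(soft_prompt: np.ndarray) -> str:
    """Stand-in for running a frozen model conditioned on a soft prompt."""
    return f"output conditioned on a prompt with norm {np.linalg.norm(soft_prompt):.2f}"


def average_human_rating(output: str) -> float:
    """Stand-in for averaging ratings collected from human evaluators."""
    return rng.random()


def select_soft_prompt(n_candidates: int = 200, dim: int = 16) -> np.ndarray:
    """Random search over soft prompts, keeping the one humans rate highest."""
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        candidate = rng.normal(size=dim)  # a candidate soft prompt vector
        score = average_human_rating(model_output(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Whether the search is random, gradient-based, or something smarter doesn’t change where the pressure is coming from.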

To solve deception and not fail, you still have to solve most of the alignment problem. To get a powerful AI to do good and impactful things, you need a robust standard of goodness involved in the process that leads to your AI-generated plans, which requires the AI to use its information about humans more cleverly than our current designs do.[4]

V

What would it be like to get fooled? I don’t want you to think that this is something that only happens to those other people and would never actually happen to us. I would absolutely get fooled.

Imagine you’re the designated AI auditor, and you’re trying to give the go-ahead only to AIs that produce plans that are good according to a broad variety of ways of viewing human preferences, amalgamated using meta-level rules that are likewise good according to a variety of ways of viewing human meta-level reasoning, and that aren’t adversarially selected in some weird way.

And then you get faced with an AI that’s optimizing for some buggy, not-actually-very-good model of human values, but information about auditing was used in its training and it’s learned to be really, really good at managing PR about itself.[5]

During the inspection, it feels like everything goes smoothly. Maybe some details of its plans are extra-intriguing, and pondering them causes you to reconsider your very standards for judging plans and to uncover new hidden depths in what you’re willing to approve of.[6] Or maybe its plans frame the world to flatter your politics, and lead you to the conclusion that if you don’t approve of the AI you’re letting the other side win. Or maybe it’s some intermediate point between changing your mind and exploiting your biases.

Then it’s unleashed on the world, and you get negative utilitarian utopia, or a boot stamping on the face of the hated outgroup forever, or entities that rate as “happy” according to a mistaken model of cognition, or some similarly silly thing that you don’t actually want.

VI

This post has been, in some sense, too should-universe. It assumes that people are trying to use the AI to transform the world to suit their values. But lots of real people wouldn’t actually do that. Given honest AI, many people would just try to use it to modestly improve their own lives without doing anything too dangerous. And other people would recklessly use it in transformative ways without even a token attempt to leverage honesty into safety.[7]

This post has also been a strawman. No faction in AI safety has a master plan of “Solve deception and then, lacking other ideas about how to get the AI to generate good plans, just start optimizing for human approval.”

But I still think there’s too much faith, in many corners, in human oversight. In plans like building an AI that is very plausibly dangerous, but auditing it before deployment. In the feeling that as long as humans stay in control, it will be okay. This is not necessarily true (or perhaps not necessarily possible), so long as some selection pressure on human approval is sneaking into the AI. It’s not true even if we suppose that we’ve solved deception.

Don’t throw optimization pressure at a problem first and think second.

We still have to solve the alignment problem.

A lot of this post was written at MAIA. Thanks MAIA!

  1. ^

    A practical picture of “what control looks like” is necessary because the notion of “control in general” gets very fuzzy when dealing with an intelligent AI that predicts your reactions as it makes plans.

  2. ^

    There’s a heavy overlap with the Eliciting Latent Knowledge research program here. It might be possible to “solve deception” so broadly that you also let humans give good feedback about the generalization behavior of AIs, which basically requires tackling the whole alignment problem under the banner of “deception.”

  3. ^

    Still bad. Midas starved because his food turned to gold, after all (in some versions).

  4. ^

    Asking a really big language model what “goodness” is doesn’t really sidestep the problem—it won’t work if it’s just predicting human text because humans in the dataset tend to write nonsense, nor will it work if it’s just predicting human-approved text. It might work if you had some cleverer-than-SOTA way to leverage data about humans to elicit abstractions from the model.

    I may be too harsh when I say “humans in the dataset tend to write nonsense”—plausibly you can get human-level good behavior with near-future training and prompting, and human level is enough to do quite a lot. But I’m mostly focusing here on superintelligent AI, which will need to safely generalize beyond human-level good behavior.

  5. ^

    Real life isn’t either/or; the degree of information leakage about the evaluation procedure into the optimization process can lie on a spectrum, ranging from “we trained directly on the evaluation procedure” to “what kind of evaluations we would do was implicit in the training data in a way that influenced the generalization behavior.” We likely have to accept some degree of the latter.

  6. ^

    Why yes, this is a lot like Stuart Armstrong’s Siren Worlds.

  7. ^

    I guess technically that would be a way to solve deception and still fail.