How to solve deception and still fail.

A mostly finished post I’m kicking out the door. You’ll get the gist.

I

There’s a tempting picture of alignment that centers on the feeling of “As long as humans stay in control, it will be okay.” Humans staying in control, in this picture, is something like humans giving lots of detailed feedback to powerful AI, staying honestly apprised of the consequences of its plans, and having the final say on how plans made by an AI get implemented.[1]

Of course, this requires the AI to be generating promising plans to begin with, or else the humans are just stuck rejecting bad plans all day. But conveniently, in this picture we don’t have to solve the alignment problem the hard way. We could train the AI on human approval to get it to generate in-some-sense-good plans. As long as humans stay in control, it will be okay.

Normally, training even indirectly for human approval teaches AIs to deceive humans in order to more reliably maximize approval. Which is why, in order for humans to stay in control, we need to be able to solve deception—not just detect it after the fact, but produce AIs that actually don’t try to deceive the humans.

An AI deceiving the human (from Christiano et al. 2017). The sort of thing we’d like to understand how to categorically avoid.

The hope goes that humans’ un-deceived approval is a sufficient quality check, and that we’ll pick good AI-generated plans and make the future go well.

This post is about how that picture is flawed—how an AI can generate a plan that humans approve of, based on lots of human feedback, that isn’t deceptive in the narrow sense, but that’s still bad.

Put a certain way, my thesis sounds nuts. What’s supposed to count as a valid human preference if human-guided plans, honestly approved of, aren’t sufficient? Is this just paranoid essentialism, insisting that there’s some essence of rightness that I magically know AIs risk missing, no matter their relationship with humans?

II

I’m not being essentialist. Humans are just too easily gamed. An AI that convinces humans to do bad stuff doesn’t have to be dishonest—true statements presented in the right order can convince humans of too many things. The AI doesn’t even need to be thinking manipulative thoughts—it can just be the result of an optimization process that rewards AIs for being convincing to humans.
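To make that concrete, here is a minimal toy sketch of what “an optimization process that rewards AIs for being convincing to humans” can look like. Every name below is a hypothetical stand-in (the plan generator, the learned approval model), not anyone’s real training setup; the point is only that selecting the highest-scoring output is enough to produce this dynamic, with no manipulative cognition anywhere in the system.

```python
# Toy sketch (all names hypothetical): best-of-N selection against a learned
# proxy for human approval. No individual candidate is "trying" to manipulate
# anyone; the pressure toward being convincing comes from the selection step.

import random


def generate_plans(n: int) -> list[str]:
    """Stand-in plan generator: returns n candidate plan descriptions."""
    return [f"plan-{i}" for i in range(n)]


def predicted_human_approval(plan: str) -> float:
    """Stand-in approval model; imagine it was trained on human ratings."""
    return random.random()


def select_plan(n_candidates: int = 1000) -> str:
    """Return the candidate that scores highest on predicted approval.

    The returned plan is, by construction, the one most optimized for
    looking good to humans -- the property this post argues can come
    apart from actually being good.
    """
    candidates = generate_plans(n_candidates)
    return max(candidates, key=predicted_human_approval)


if __name__ == "__main__":
    print(select_plan())
```

The only optimization toward “what humans find convincing” lives in that outer selection step, which is the sense in which the AI doesn’t even need to be thinking manipulative thoughts.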

The question remains, what standard other than human approval am I using to judge whether a human-approved plan is “actually” good or bad? Briefly: Human values are a collection of abstractions useful for describing human actions, their decision-making, their interpersonal bargaining, and their self-reflective cognition. Judgments about the “actual goodness” of some plan come from how that plan coheres or conflicts with various perspectives on what I value.

Some AI-generated plans might be selected in a way that de-correlates present approval from other data relevant to human values. A plan might be approved but still be bad from perspectives that emphasize implied preferences in behavior, or neural correlates of liking, or healthy self-reflection, or community-based perspectives on value, or even future approval.

III

To gradually get more specific, there are two ways I expect human-approved plans to go wrong: object-level and meta-level.

Object-level wrongness is when you endorse a plan, but you’re wrong in the straightforward way that King Midas was wrong for endorsing turning everything he touched into gold. Midas did want more gold, but there were things missing from his wish that he didn’t think about until it was too late. In the myth, these were super-obvious things like “being able to eat food,” because Midas is a fictional character meant to warn you to be careful what you wish for.

It’s true that if it’s sufficiently non-obvious that something is wrong, then it’s not wrong at all. The plausibility of object-level wrongness relies on human approval being fallible enough that we can expect future people to sometimes make decisions that look fairly obviously wrong to us.

Do people ever make such mistakes? Yes! Regularly! People could fail to consider an impact that they’d immediately see as important if it were pointed out. People could feel pressure to evaluate plans on legible or socially respectable metrics that leave out important parts of the human experience. People could get stuck in motivated reasoning, where they get personally invested in a particular plan and find reasons to discount problems with it.

For each example one can give, it’s easy to respond “oh, I’d never pick that,” especially because giving it as an example requires showing a framing where it seems obviously bad. But at the same time, we can acknowledge that we’re fallible, and often we think something is a good idea (especially if its presentation has been optimized to appeal to us) and only later realize it’s not.

Meta-level wrongness can only occur when the plan you approve is indirect—when the plan contains steps like “learn more about the world and use some procedure to decide what to do next.” This indirect plan can be bad by meta-level standards—e.g. standards for what processes of self-reflection are good processes.

I often think about the Atlantic Ocean. Self-modification is like water flowing downhill—as an AI takes actions that modify itself, it flows around the landscape of possible desires, eventually pooling into more stable configurations. But if self-modification were water flowing downhill, we wouldn’t want our AI to end up in the Atlantic Ocean. Not because we know what object-level positions the Atlantic Ocean corresponds to, and we disagree, but because too many different rivers flow into the ocean—ending there means you’ve lost most of the information about where you came from. Indirect plans that would self-modify down to the ocean are bad plans.

Would an AI have to be deceptive to get us to approve of a plan that’s a river to the ocean? No. If there’s selection pressure to find an ocean-going river that sounds appealing to humans, such a river will be found. See perhaps Wirehead gods on lotus thrones. The important question, the question we can’t sidestep even if we solve deception, is where your selection pressure is coming from.

IV

A way to solve deception and still fail is to find your AI-generated plans via an optimization process that selects a goal you approve of without solving the rest of the alignment problem. Such a selection process will swiftly find us an AI that’s honestly and persuasively trying to drive what we approve of apart from every other facet of goodness.

How all this shakes out depends on what kind of communication we’re having with the AI, what counts as “deception,” etc.[2] But broadly, if we solve deception, I don’t think the danger is the AI wanting something hidden and alien and manipulating humans to get it, I think it’s that we inadvertently select for an AI that honestly wants some Midas-like bad wish.[3]

You can think of this as still a kind of deception, but perpetrated by the combined training-and-selection-and-deployment process. To the extent that deception involves an agent thinking about how humans will react when it should be thinking about the object-level matter at hand, such a system would be implicitly “thinking” about human reactions during the training and selection processes, when we’re e.g. training an AI on human approval, or selecting a soft prompt based on human ratings of the results.
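For the soft-prompt case, here is a minimal sketch of where that implicit “thinking” lives. The model, the rating function, and the search procedure are all hypothetical stand-ins, not a real API: the frozen model never considers its raters, but the outer loop that picks the winning soft prompt is optimizing against their reactions all the same.

```python
# Toy sketch (hypothetical stand-ins throughout): selecting a soft prompt by
# human ratings. The frozen model below never models its raters; the loop
# that chooses among soft prompts is what implicitly optimizes against them.

import numpy as np

rng = np.random.default_rng(0)


def model_output(soft_prompt: np.ndarray) -> str:
    """Stand-in for running a frozen model conditioned on a soft prompt."""
    return f"output conditioned on a prompt with norm {np.linalg.norm(soft_prompt):.2f}"


def average_human_rating(output: str) -> float:
    """Stand-in for averaging ratings collected from human evaluators."""
    return rng.random()


def select_soft_prompt(n_candidates: int = 200, dim: int = 16) -> np.ndarray:
    """Random search over soft prompts, keeping the one humans rate highest."""
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        candidate = rng.normal(size=dim)  # a candidate soft prompt vector
        score = average_human_rating(model_output(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Whether the search is random, gradient-based, or something smarter doesn’t change where the pressure is coming from.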

To solve deception and not fail, you still have to solve most of the alignment problem. To get a powerful AI to do good and impactful things, you need a robust standard of goodness involved in the process that leads to your AI-generated plans, which requires the AI to use its information about humans more cleverly than our current designs do.[4]

V

What would it be like to get fooled? I don’t want you to think that this is something that only happens to those other people and would never actually happen to us. I would absolutely get fooled.

Imagine you’re the designated AI auditor, and you’re trying to give the go-ahead only to AIs that produce plans that are good according to a broad variety of ways of viewing human preferences, amalgamated using meta-level rules that are likewise good according to a variety of ways of viewing human meta-level reasoning, and that aren’t adversarially selected in some weird way.

And then you get faced with an AI that’s optimizing for some buggy, not-actually-very-good model of human values, but information about auditing was used in its training and it’s learned to be really, really good at managing PR about itself.[5]

During the inspection, it feels like everything goes smoothly. Maybe some details of its plans are extra-intriguing, and pondering them causes you to reconsider your very standards for judging plans and to uncover new hidden depths in what you’re willing to approve of.[6] Or maybe its plans frame the world to flatter your politics, and lead you to the conclusion that if you don’t approve of the AI you’re letting the other side win. Or maybe it’s some intermediate point between changing your mind and exploiting your biases.

Then it’s unleashed on the world, and you get negative utilitarian utopia, or a boot stamping on the face of the hated outgroup forever, or entities that rate as “happy” according to a mistaken model of cognition, or some similarly silly thing that you don’t actually want.

VI

This post has been, in some sense, too should-universe. It assumes that people are trying to use the AI to transform the world to suit their values. But lots of real people wouldn’t actually do that. Given honest AI, many people would just try to use it to modestly improve their own lives without doing anything too dangerous. And other people would recklessly use it in transformative ways without even a token attempt to leverage honesty into safety.[7]

This post has also been a strawman. No faction in AI safety has a master plan of “Solve deception and then, lacking other ideas about how to get the AI to generate good plans, just start optimizing for human approval.”

But I still think there’s too much faith, in many corners, in human oversight. In plans like building an AI that is very plausibly dangerous, but auditing it before deployment. In the feeling that as long as humans stay in control, it will be okay. This is not necessarily true (or perhaps not necessarily possible), so long as some selection pressure on human approval is sneaking into the AI. It’s not true even if we suppose that we’ve solved deception.

Don’t throw optimization pressure at a problem first and think second.

We still have to solve the alignment problem.

A lot of this post was written at MAIA. Thanks MAIA!

  1. ^

    A practical picture of “what control looks like” is necessary because the notion of “control in general” gets very fuzzy when dealing with an intelligent AI that predicts your reactions as it makes plans.

  2. ^

    There’s a heavy overlap with the Eliciting Latent Knowledge research program here. It might be possible to “solve deception” so broadly that you also let humans give good feedback about the generalization behavior of AIs, which basically requires tackling the whole alignment problem under the banner of “deception.”

  3. ^

    Still bad. Midas starved because his food turned to gold, after all (in some versions).

  4. ^

    Asking a really big language model what “goodness” is doesn’t really sidestep the problem—it won’t work if it’s just predicting human text because humans in the dataset tend to write nonsense, nor will it work if it’s just predicting human-approved text. It might work if you had some cleverer-than-SOTA way to leverage data about humans to elicit abstractions from the model.

    I may be too harsh when I say “humans in the dataset tend to write nonsense”—plausibly you can get human-level good behavior with near-future training and prompting, and human level is enough to do quite a lot. But I’m mostly focusing here on superintelligent AI, which will need to safely generalize beyond human-level good behavior.

  5. ^

    Real life isn’t either/or; the degree of information leakage about the evaluation procedure into the optimization process can lie on a spectrum, ranging from “we trained directly on the evaluation procedure” to “what kind of evaluations we would do was implicit in the training data in a way that influenced the generalization behavior.” We likely have to accept some degree of the latter.

  6. ^

    Why yes, this is a lot like Stuart Armstrong’s Siren Worlds.

  7. ^

    I guess technically that would be a way to solve deception and still fail.