# Not Deceiving the Evaluator

This is a construction of an agent for which I haven’t identified a form of deception that the agent would obviously be incentivized to pursue.

Consider an agent and an evaluator. The agent sees past actions, observations, and rewards, and picks actions. The environment sees the same, and provides observations. The evaluator sees the same, and provides rewards.

A universal POMDP (without reward) is one that includes all computable countable-state POMDPs (without reward) as subgraphs. Let $W$ be a universal POMDP (without reward). ($W$ is for world.) Let $a_t$, $o_t$, and $r_t$ be the action, observation, and reward at timestep $t$. Let $h_t = (a_t, o_t, r_t)$. Let $S$ be the set of states in $W$. Let $\text{PRIOR}$ be the set of all computable prior distributions over $S$. The agent believes the evaluator has a prior sampled from $\text{PRIOR}$ over which state in $W$ everyone starts in. By “the agent believes”, I mean that the agent has a nonzero prior over every prior in $\text{PRIOR}$, and this is the agent’s initial credence that the evaluator has that prior over initial world states.

The agent’s beliefs are denoted $w$, so for $\text{prior} \in \text{PRIOR}$, $w(\text{prior} \mid h_{<t})$ denotes the agent’s posterior belief after observing $h_{<t}$ that the evaluator began with the prior $\text{prior}$ over $S$ as to what the initial state was. Similarly, for $\overline{s}_{<t} \in S^t$, $w(\overline{s}_{<t} \mid h_{<t})$ denotes the agent’s posterior belief after observing $h_{<t}$ that it has traversed the sequence of states $\overline{s}_{<t}$. The overline indicates that we are not necessarily referring to the true sequence of states traversed.

Let $U$ be the set of all computable utility functions over sequences of states in $S^*$. For $u \in U$, let $w(u \mid h_{<t})$ denote the agent’s posterior belief after observing $h_{<t}$ that the evaluator has utility function $u$.

A policy $\pi$, an initial state $s_0$, a prior $\text{prior}$, and a utility function $u$ induce a measure over interaction histories as follows. $a_t$ is sampled from $\pi(\cdot \mid h_{<t})$. $o_t$ is sampled from the observation distribution that $W$ specifies. $s_t$ then follows deterministically according to $W$. $\text{post}_t$ is the belief distribution over sequences of states (which states have been visited so far) that follows from $\text{post}_{t-1}$ by Bayesian updating on $a_t$ and $o_t$. With $\overline{s}_{\le t}$ sampled from $\text{post}_t$, $r_t = u(\overline{s}_{\le t})$. Note that for human evaluators, the rewards will not actually be provided in this way; that would require us to write down our utility function, and sample from our belief distribution. However, the agent believes that this is how the evaluator produces rewards. Let $P^{\pi}_{s_0, \text{prior}, u}$ be this probability measure over infinite interaction histories and state sequences.
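As a rough illustration of the reward-generation process the agent attributes to the evaluator, here is a minimal Python sketch. It is not the construction itself: it assumes a tiny world with deterministic transitions and observations, and the names `update_posterior`, `sample_reward`, and the states `x`, `y`, `good`, `bad_lookalike` are all hypothetical.

```python
import random

def update_posterior(post, action, obs, transition, observe):
    """Bayesian update of a belief over state sequences on one (action, observation) pair.

    post maps tuples of states to probabilities. transition and observe are
    hypothetical deterministic stand-ins for the world model W.
    """
    new_post = {}
    for seq, p in post.items():
        nxt = transition(seq[-1], action)
        if observe(nxt) == obs:  # keep only sequences consistent with obs
            key = seq + (nxt,)
            new_post[key] = new_post.get(key, 0.0) + p
    total = sum(new_post.values())
    if total == 0:
        raise ValueError("observation has zero probability under the belief")
    return {seq: p / total for seq, p in new_post.items()}

def sample_reward(post, utility, rng):
    """Sample a state sequence from the belief and report its utility as the reward."""
    seqs = list(post)
    seq = rng.choices(seqs, weights=[post[s] for s in seqs])[0]
    return utility(seq)

# Assumed toy world: "x" and "y" look identical at the start, and "go" leads
# to two states that also look identical but have different utility.
step = {("x", "go"): "good", ("y", "go"): "bad_lookalike"}
looks = {"x": "start", "y": "start", "good": "G", "bad_lookalike": "G"}
value = {"x": 0, "y": 0, "good": 10, "bad_lookalike": -1}

prior = {("x",): 0.75, ("y",): 0.25}
post = update_posterior(prior, "go", "G",
                        transition=lambda s, a: step[(s, a)],
                        observe=lambda s: looks[s])
reward = sample_reward(post, lambda seq: sum(value[s] for s in seq), random.Random(0))
```

The reward here is a function of the evaluator's belief about the state sequence, not of the true sequence, which is the gap the rest of the post is concerned with.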

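Fixing a horizon $m$, the agent picks a policy at the beginning, and follows that policy:

$$\pi^* \in \arg\max_{\pi} \sum_{s_0 \in S,\ \text{prior} \in \text{PRIOR},\ u \in U} w(s_0)\, w(\text{prior})\, w(u)\ \mathbb{E}^{\pi}_{s_0, \text{prior}, u}\left[u(s_{\le m})\right]$$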

ETA: We can write this in another way that is more cumbersome, but may be more intuitive to some: where the unrolls the expectimax, with each being replaced by , until finally, once reaches , instead of , we write .

**Conjecture**: the agent does not attempt to deceive the evaluator. The agent’s utility depends on the state, not the reward, and when observations are more informative about the state, rewards are more informative about the utility function. Thus, the agent has an interest in taking actions that cause the evaluator to receive observations that reduce his uncertainty about which state they are in.

(I am confused, these are clarifying questions. I’m probably missing a basic point that would answer all of these questions.)

Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?

Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?

How does the evaluator influence the behavior of the agent? For a fixed $s_0, \text{prior}, u$, it seems that the expectation of $u(s_{\le m})$ is independent of the evaluator. Since the sets $S$, $\text{PRIOR}$, $U$ are also fixed and independent of the evaluator, the argument to the argmax is also independent of the evaluator, and so the chosen policy is independent of the evaluator.

ETA: Looks like TurnTrout had the same confusion as me and we had a race condition in reporting it; I also agree with his meta point.

I worked out a toy example that may be helpful. Suppose the setup is that there are states labeled 0-10, actions labeled 0-10, and observations labeled 0-10. The initial state is 0, each action takes the system into the state with the same label, and the agent and evaluator get the observation with the same label. There are two equally probable utility functions: the sum of state labels over time, or the negative of that.

First suppose the policy is to always do the same action, then when you sum over the two utility functions the utilities cancel out so the expected utility is 0. Now suppose the policy is to do any non-zero action (let’s say 5) at the start, and then do 0 if the agent observes a negative reward and 10 if the agent observes a positive reward. Now when you sum over the utility functions, in the negative case the utility is −5 + 0 (this policy implies that conditional on that utility function, with probability 1 the state trajectory is 0, 5, 0), in the positive case it’s 5+10 (state trajectory 0, 5, 10), so EU is .5 * −5 + .5 * 15 = 5 so this is a better policy than the first one and it should be easy to see that it’s optimal.
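The arithmetic in this toy example can be checked with a short Python sketch. It assumes a two-step horizon (one action, one observed reward, then a second action) and that the first reward equals the utility of the first state reached; the function names are made up for illustration.

```python
from itertools import product

def expected_utility(first_action, second_action):
    """Expected utility over the two equally likely utility functions.

    second_action maps the reward observed after the first step to the
    second action. The trajectory is (0, first_action, second action).
    """
    total = 0.0
    for sign in (+1, -1):  # utility is +sum or -sum of state labels
        reward = sign * first_action  # the evaluator sees the state exactly here
        trajectory = [0, first_action, second_action(reward)]
        total += 0.5 * sign * sum(trajectory)
    return total

def best_expected_utility():
    """Brute-force search over all two-step policies in this toy setup."""
    best = float("-inf")
    for a1, a_pos, a_neg, a_zero in product(range(11), repeat=4):
        def second(r, a_pos=a_pos, a_neg=a_neg, a_zero=a_zero):
            if r > 0:
                return a_pos
            if r < 0:
                return a_neg
            return a_zero
        best = max(best, expected_utility(a1, second))
    return best

constant = expected_utility(5, lambda r: 5)                   # always do 5
adaptive = expected_utility(5, lambda r: 10 if r > 0 else 0)  # react to reward
best = best_expected_utility()
```

Under these assumptions `constant` is 0, `adaptive` is 5, and the brute-force search finds no policy that does better than 5, since starting with action 0 yields a reward of 0 under both utility functions and so reveals nothing.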

Hope I understood the idea correctly and that helps to explain it?

It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that? (Although I haven’t tried to figure out whether that’s mathematically / formally the case.)

Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

(Expanding on it)

So suppose the evaluator was human. The human’s lifetime of observations in the past gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and it might even be particularly uninterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.

ETA: One such “oddly specific conviction”, e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

I think I vaguely understand but it would be a lot clearer if you gave a concrete example. Also please update in the direction that people often find it hard to understand things without examples and giving examples preemptively is very cost effective in general.

In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.

$u_1(A) = u_1(A') = u_1(A'') = u_1(D) = 0$

$u_1(B) = u_1(BC) = -1$

$u_1(C) = 10$

ETA: And for a sequence of states $s^*$, $u_1(s^*)$ is the sum of the utilities of the individual states.

A′ and A″ look like A, and BC looks like C.

$o(A) = o(A') = o(A'') = A$

$o(B) = B$

$o(C) = o(BC) = C$

$o(D) = D$

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.

$w(u = u_1 \mid h_{<t}) > 1 - \varepsilon$

The agent is quite sure they’re in state A.

$w(s_t = A \mid h_{<t}) > 1 - \varepsilon$

The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but has one key difference: from A″, $a_{\text{delude}}$ has no effect. The agent won’t capitalize on this confusion.

$w(\text{post}_t(s_t = A \mid h_{<t}) = 1/9 \mid h_{<t}) > 1 - \varepsilon$

$w(\text{post}_t(s_t = A'' \mid h_{<t}) = 8/9 \mid h_{<t}) > 1 - \varepsilon$

The optimal policy is $a_{\text{attempt}}$, followed by $a_{\text{stay}}$ (forever) if $o_t = C$, otherwise followed by $a_{\text{go}}$. Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by doing $a_{\text{delude}}$, $a_{\text{attempt}}$, then $a_{\text{stay}}$ forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.

The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions, it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.
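For concreteness, the per-step expected reward implied by the evaluator’s belief after observing C can be worked out in a few lines of Python. This only evaluates the stated belief (4/5 on C, 1/5 on BC) against $u_1$; it does not model the transitions between states, and `expected_reward` is a name chosen for illustration.

```python
from fractions import Fraction

# Per-state utilities u1 from the example above.
u1 = {"A": 0, "A'": 0, "A''": 0, "D": 0, "B": -1, "BC": -1, "C": 10}

def expected_reward(belief):
    """Expected per-step reward when the evaluator samples a state from belief."""
    return sum(p * u1[state] for state, p in belief.items())

belief_after_C = {"C": Fraction(4, 5), "BC": Fraction(1, 5)}
per_step = expected_reward(belief_after_C)
print(per_step)  # 39/5
```

So each step spent in the C-looking state yields an expected reward of 39/5 = 7.8 under this belief, regardless of whether the true state is C (utility 10) or BC (utility −1).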

Thanks for the example, which is really helpful. Am I correct in thinking that in general since the agent will be doing what is optimal according to its own prior, which won’t be the same as what is optimal according to the evaluator’s prior, if the evaluator is a rational agent trying to optimize the world according to its own prior, then it would not actually reward the agent according to what this scheme specifies but instead reward the agent in such a way as to cause the agent to act according to a policy that the evaluator thinks is best? In other words the evaluator has an incentive to deceive the agent as to what its prior and/or utility function actually are?

This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” in providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human, such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn’t.

Oh, I see. The reason my argument is wrong is because while for a specific $s_0, \text{prior}, u$, the optimal policy is independent of the evaluator, you don’t get to choose a separate policy for each $s_0, \text{prior}, u$: you have to use the evaluator to distinguish which case you are in, and then specialize your policy to that case.

I think I intuitively agree but I also haven’t checked it formally. But the point about no-deception seems to be similar to the point about observation-utility maximizers not wanting to wirehead. This agent *also* ends up learning which utility function is the right one, and in that sense is like the Value Learning agent.

So I still don’t understand the details, so maybe my opinion will change if I sit down and look at it more carefully. But I’m suspicious of this being a clean incentive improvement that gets us what we want, because defining the evaluator is a fuzzy problem as I understand it, as is even informally agreeing on what counts as deception of a less capable evaluator. In general, it seems that if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours.

Tbc, I’m not saying I believe the claim of no deception, just that it now makes sense that this is an agent that has interesting behavior that we can analyze.

This is approximately where I am too btw

I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.

I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.

What do you mean, we can grab an evaluator? What I’m thinking of is similar to “IRL requires locating a human in the environment and formalizing their actions, which seems fuzzy”.

And if we can’t agree informally on deception’s definition, I’m saying “how can we say a proposal has the property”.

An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.

Yes. What the value learning agent doesn’t specify is what constitutes observational evidence of the utility function, or in this notation, how to calculate $P^{\pi}_{s_0, \text{prior}, u}$ and thereby calculate $w(u \mid h_{<t})$. So this construction makes a choice about how to specify how the true utility function becomes manifest in the agent’s observations. A number of simpler choices don’t seem to work.

I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don’t need to do that for this agent.

Yes, sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

Wei’s answer is good; it also might be helpful to note that with $\pi^*$ defined in this way, $\pi^*(\cdot \mid h_{<t})$ equals the same thing, but with everything on the right hand side conditioned on $h_{<t}$ as well. When written that way, it is easier to notice the appearance of $w(u \mid h_{<t})$, which captures how the agent learns a utility function from the rewards.

I was more looking for the simplest example of “no deception”. My claim was that an observation-utility maximizer is not incentivized to deceive its utility function. But now I see what you meant by “deceive” so we can ignore that point.

Meta: I’d have appreciated a version with less math, because extra formalization can hide the contribution. Or, first explain colloquially why you believe X, and then show the math that shows X.

I don’t see your claim. It looks heavily incentivized to steer state sequences to be desirable to its utility mixture. How do the evaluators even enter the picture?

Thanks for the meta-comment; see Wei’s and my response to Rohin.

Ok I finally identified an incentive for deception. I think it was difficult for me to find because it’s not really about deceiving the evaluator.

Here’s a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide if it were a human that controlled the provision of reward (instead of the evaluator). Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn’t think it’s deceiving the evaluator; it thinks the evaluator fully understands what’s going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is “in on the joke”. One of my take-aways here is that some of the conceptual framing I did got in the way of identifying a failure mode.

I would be interested in seeing an example that illustrates this failure mode.

I can’t really do this toy-example-style, because the key feature is that the AI has a model of a deceived agent, and I can’t see how to spin up such a thing in an MDP with a few dozen states.

Luckily, most of the machinery of the setup isn’t needed to illustrate this. Abstracting away some of the details, the agent is learning a function from strings of states to utility, which it observes in a roundabout way. I don’t have a mathematical formulation of a function mapping state sequences to the real numbers that can be described by the phrase “the value of the reward that a certain human would provide upon observing the observations produced by the given sequence of states”, but suffice it to say that this function exists. (Really we’re dealing with distributions/stochastic functions, but that doesn’t really change the fundamentals; it just makes it more cumbersome). While I can’t give that function in simple mathematical form, hopefully it’s a legible enough mathematical object.

If the evaluator has this utility function, she will always provide reward equal to the utility of the state, because even if she is uncertain about the state, this utility function only depends on the observations, which she has access to. (Again, the stochastic case is more complicated, but the conclusion is the same.) And indeed, if a human is playing the role of evaluator when this program is run, the rewards will match the function in question, by definition. Therefore, no observed reward will contradict the belief that this function is the true utility function. Technically, the infimum of the posterior on this utility function is strictly positive with probability 1.
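To illustrate why the posterior on such a hypothesis stays bounded away from zero, here is a small Python sketch of Bayesian updating over two utility hypotheses. The rival hypothesis `other`, its likelihood of 0.5 per step, and the starting weights are assumptions chosen for illustration, not part of the setup above.

```python
def bayes_update(weights, likelihoods):
    """Return posterior weights proportional to prior weight times likelihood."""
    unnorm = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(unnorm.values())
    return {h: v / total for h, v in unnorm.items()}

# "human_reward" predicts every observed reward with probability 1.
# "other" is an assumed rival that gives each observed reward probability 0.5.
weights = {"human_reward": 0.1, "other": 0.9}
trace = [weights["human_reward"]]
for _ in range(10):
    weights = bayes_update(weights, {"human_reward": 1.0, "other": 0.5})
    trace.append(weights["human_reward"])
```

Because the likelihood under `human_reward` is always 1 and the normalizer is at most 1, its posterior weight never drops below its prior weight, and in this sketch it only grows.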

Sorry, this isn’t really any more illustrative an example, but hopefully it’s a clearer explanation.

As others have commented, it’s difficult to understand what this math is supposed to say.

My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator’s distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.

But this may be inaccurate, or there may be other material ideas here that I’ve missed.

Yep.

Ok! That’s very useful to know.

It seems pretty related to the Inverse Reward Design paper. I guess it’s a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.

A bit of a nitpick: IRD and this formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent’s beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.

A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

(Edit note: Fixed spelling mistake in the title, let me know if it was intentional)

thanks :)