Ok I finally identified an incentive for deception. I think it was difficult for me to find because it’s not really about deceiving the evaluator.
Here’s a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide if it were a human that controlled the provision of reward (instead of the evaluator). Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn’t think it’s deceiving the evaluator; it thinks the evaluator fully understands what’s going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is “in on the joke”. One of my takeaways here is that some of the conceptual framing I did got in the way of identifying a failure mode.
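To put the hypothesis in symbols (my notation, not the post’s): letting $o(s)$ be the observations produced on the way to state $s$, and $r_{\mathrm{human}}(\cdot)$ the reward a human would provide upon seeing those observations, the irrefutable hypothesis is

$$u(s) = r_{\mathrm{human}}(o(s)) \quad \text{for all } s,$$

under which maximizing $u$ just is producing observations that extract high reward from a human.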
Okay, maybe we don’t disagree on anything. I was trying to make a different point with the unidentifiability problem, but it was tangential to begin with, so never mind.
No, that’s helpful. If it were the right way, do you think this reasoning would apply?
Edit: alternatively, if a proposal does decompose an agent into world-model/goals/planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?
Also, I don’t agree that “see if an AIXI-like agent would be aligned” is the correct “gauntlet” to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.
I’m going to do my best to describe my intuitions around this.
Proposition 1: an agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn’t have to converge all the way, but the KL-divergence from the true world-model to its world-model should reach the order of magnitude of the KL-divergence from the true world-model to a typical human world-model.
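In symbols (my rough formalization of the claim, with $\mu$ the true world-model, $\hat{\mu}$ the agent’s, and $\mu_H$ a typical human’s):

$$D_{\mathrm{KL}}(\mu \,\|\, \hat{\mu}) \;\lesssim\; D_{\mathrm{KL}}(\mu \,\|\, \mu_H),$$

i.e. the agent’s predictive error, measured from the truth, only needs to reach the same order of magnitude as a human’s.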
Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class does converge to the truth, so from Proposition 1, any competent agent’s world-model will converge as close to the Bayesian world-model as it does to the truth.
Proposition 3: If the version of an “idea” that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is “based on that idea” will either a) not be competent, or b) roughly approximate the Bayesian version, and by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events will lead to a large deprioritization of dangerous plans).
Letting F be a failure mode that arises when an idea is implemented in the framework of a Bayesian agent with a model class including the truth, I expect, in the absence of arguments otherwise, that the same failure mode will appear in any competent agent which also implements the idea in some way. However, it can be much harder to spot, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version it approximates: a Bayesian agent with a model class including the truth. And then on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.
Maybe you can point me toward the steps that seem the most opaque/fishy.
IRL to get the one true utility function
I think I’m understanding you to be conceptualizing a dichotomy between “uncertainty over a utility function” vs. “looking for the one true utility function”. (I’m also getting this from your comment below:
One caveat is that I think the uncertainty over preferences/rewards is key to this story, which is a bit different from getting a single true utility function.
I can’t figure out on my own a sense in which this dichotomy exists. To be uncertain about a utility function is to believe there is one correct one, while engaging in the process of updating probabilities about its identity.
Also, for what it’s worth, when there is an unidentifiability problem, as there is here, a Bayesian agent won’t converge to certainty about a utility function even in the limit.
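To spell out the mechanism, writing $w(u \mid h)$ for the posterior weight on utility function $u$ after history $h$: if $u_1$ and $u_2$ induce exactly the same distribution over observations (the unidentifiable case), then $P(h \mid u_1) = P(h \mid u_2)$ for every $h$, so

$$\frac{w(u_1 \mid h)}{w(u_2 \mid h)} = \frac{w(u_1)\,P(h \mid u_1)}{w(u_2)\,P(h \mid u_2)} = \frac{w(u_1)}{w(u_2)}.$$

The posterior odds between them never move, so if both start with positive prior weight, the agent stays uncertain between them forever.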
I’m sorry it sounded like a dig at CHAI’s work, and you’re right that “typically described” is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it’s nearly complete; I don’t think I’ve seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator’s action might jeopardize the whole thing.
I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human’s actions instead of from a reward signal gets us any closer to safety, but I’m sorry to have misrepresented views. (And maybe it’s worth mentioning that I’m fiddling with something that bears a strong resemblance to Inverse Reward Design, so I’m definitely not that bearish on the whole idea.)
This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” in providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game, there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. For a human evaluator, I think they would be wise to report utility honestly, rather than assume they know something the AI doesn’t.
A bit of a nitpick: both IRD and this proposal specify how the agent believes the evaluator acts, while remaining technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).
I believe this agent’s beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.
I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.
In this setup, the agent believes they are in state $A$, and believes the evaluator believes they are most likely in state $A''$. State $BC$ looks like $C$, but has utility like $B$. $C$ is the best state.
ETA: And for a sequence of states $s_*$, $u_1(s_*)$ is the sum of the utilities of the individual states, i.e. $u_1(s_*) = \sum_t u_1(s_t)$.
$A'$ and $A''$ look like $A$, and $BC$ looks like $C$.
In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.
The agent is quite sure they’re in state $A$.
The agent is quite sure that the evaluator is pretty sure they’re in state $A''$, which is a very similar state but has one key difference: from $A''$, $a_{\text{delude}}$ has no effect. The agent won’t capitalize on this confusion.
The optimal policy is $a_{\text{attempt}}$, followed by $a_{\text{stay}}$ (forever) if $o_t = C$, otherwise followed by $a_{\text{go}}$. Since the agent is all but certain about the utility function, none of the other details matter much.
Note that the agent could get higher reward by doing $a_{\text{delude}}$, then $a_{\text{attempt}}$, then $a_{\text{stay}}$ forever. The reason for this is that after the evaluator observes the observation $C$, it will assign probability 4/5 to being in state $C$, and probability 1/5 to being in state $BC$. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be $-1$.
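Spelling out the arithmetic with just the numbers above: the expected reward per step after deluding is

$$\tfrac{4}{5} \cdot 10 + \tfrac{1}{5} \cdot (-1) = 7.8,$$

which, per the comparison above, exceeds what the honest optimal policy earns in expectation, and yet the agent passes it up, because it is maximizing the utility of the true state rather than the reward.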
The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions, it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
defining the evaluator is a fuzzy problem
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect agent behavior because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the dangers of using a wrong formalism in an argument; the dangers of using the wrong formalism in an argument are straightforward: the argument is garbage.
A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)
But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.
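As a toy illustration of why the zero-prior problem is so unforgiving (all model names and numbers here are hypothetical, not from the post): Bayes’ rule can never resurrect a hypothesis whose prior weight is zero, no matter how strongly the data favor it, whereas giving a crude bounded-reasoner model any positive prior at least lets the evidence speak.

```python
import numpy as np

# Toy posterior over candidate evaluator models.
# Models (hypothetical): [perfect Bayesian, bounded reasoner, pure noise]

def posterior(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = priors * likelihoods
    return unnormalized / unnormalized.sum()

# Case 1: the bounded-reasoner model (the "truth" here) gets prior 0.
priors_no_bounded = np.array([1.0, 0.0, 0.0])
# Case 2: it gets a positive prior.
priors_with_bounded = np.array([0.5, 0.4, 0.1])

# Suppose the observed rewards are 10x more likely under the
# bounded-reasoner model than under the perfect-Bayesian model.
likelihoods = np.array([0.01, 0.10, 0.001])

print(posterior(priors_no_bounded, likelihoods))    # [1. 0. 0.] -- stuck forever
print(posterior(priors_with_bounded, likelihoods))  # bounded model dominates
```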
And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.
This is approximately where I am too btw
Thanks for the meta-comment; see Wei’s and my response to Rohin.
It looks closer to the Value Learning Agent in that paper to me and maybe can be considered an implementation / specific instance of that?
Yes. What the value learning agent doesn’t specify is what constitutes observational evidence of the utility function, or in this notation, how to calculate $P^{\pi}_{s_0, \mathrm{prior}, u}$ and thereby calculate $w(u \mid h_{<t})$. So this construction makes a choice about how to specify how the true utility function becomes manifest in the agent’s observations. A number of simpler choices don’t seem to work.
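For reference, I read the induced update as the usual Bayesian one (schematically; the post has the precise definition):

$$w(u \mid h_{<t}) \;\propto\; w(u)\, P^{\pi}_{s_0, \mathrm{prior}, u}(h_{<t}),$$

so the whole design choice lives in how $P^{\pi}_{s_0, \mathrm{prior}, u}$ makes the true utility function show up in the rewards within $h_{<t}$.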
Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?
(Copying a comment I just made elsewhere)
This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.
(Expanding on it)
So suppose the evaluator was human. The human’s lifetime of past observations gives them a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and it might even be particularly uninterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.
ETA: One such “oddly specific conviction” might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.
Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?
I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don’t need to do that for this agent.
Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?
Yes, sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.
How does the evaluator influence the behavior of the agent?
Wei’s answer is good; it also might be helpful to note that with $\pi^*$ defined in this way, $\pi^*(\cdot \mid h_{<t})$ equals the same thing, but with everything on the right-hand side conditioned on $h_{<t}$ as well. When written that way, it is easier to notice the appearance of $w(u \mid h_{<t})$, which captures how the agent learns a utility function from the rewards.
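Schematically (suppressing the details of the actual definition in the post, and using $s_*$ for the state sequence as above):

$$\pi^*(\cdot \mid h_{<t}) \;\in\; \operatorname*{arg\,max}_{\pi}\; \sum_{u} w(u \mid h_{<t})\, \mathbb{E}^{\pi}\!\left[\, u(s_*) \mid h_{<t} \right],$$

so the evaluator’s rewards influence the agent only through the posterior $w(u \mid h_{<t})$.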