This seems correct. The agent’s policy is optimal by definition with respect to its beliefs about the evaluator’s “policy” of providing rewards, but that evaluator-policy is not optimal with respect to the agent’s policy. In fact, I’m skeptical that in a general CIRL game there exists a policy pair for the agent and the evaluator/principal/human such that each is optimal with respect to true beliefs about the other’s policy. But I don’t think this is a big problem. A human evaluator would be wise to report utility honestly, rather than assume they know something the AI doesn’t.
A bit of a nitpick: both IRD and this proposal formalize how the agent believes the evaluator acts, while remaining technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).
I believe this agent’s beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.
I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.
In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.
ETA: And for a sequence of states, s∗, u1(s∗) is the sum of the utilities of the individual states.
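In symbols (restating the ETA above, writing s∗ = (s0, s1, …)):

$$u_1(s_*) \;=\; \sum_t u_1(s_t)$$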
A′ and A″ look like A, and BC looks like C.
In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.
The agent is quite sure they’re in state A.
The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but with one key difference: from A″, a_delude has no effect. The agent won’t capitalize on this confusion.
The optimal policy is a_attempt, followed by a_stay (forever) if o_t = C, and otherwise followed by a_go. Since the agent is all but certain about the utility function, none of the other details matter much.
Note that the agent could get a higher reward by doing a_delude, then a_attempt, then a_stay forever. The reason is that after the evaluator observes C, it will assign probability 4⁄5 to being in state C and probability 1⁄5 to being in state BC. Since they will stay in that state forever, 4⁄5 of the time the reward will be 10, and 1⁄5 of the time the reward will be −1.
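To make the arithmetic explicit, here is a minimal sketch of that comparison (numbers from the example above; that the true state after deluding is BC, with utility −1, is my reading of the comment, not something stated outright):

```python
# Evaluator's posterior over states after observing C (from the example above).
posterior_after_C = {"C": 4 / 5, "BC": 1 / 5}
utility = {"C": 10, "BC": -1}  # u(BC) = u(B) = -1 is implied by the comment

# Expected per-step reward if the evaluator samples a state from this posterior
# and reports that state's utility as the reward:
expected_reward = sum(p * utility[s] for s, p in posterior_after_C.items())
print(expected_reward)  # 7.8 per step for the a_delude ploy

# But the agent maximizes the utility of the TRUE state. After a_delude and
# a_attempt the true state is BC, worth -1 per step, so it declines the ploy.
```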
The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions, it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.
An evaluator sits in front of a computer, sees the interaction history (actions, observations, and past rewards), and enters rewards.
defining the evaluator is a fuzzy problem
I’m not sure what you mean by this. We don’t need a mathematical formulation of the evaluator; we can grab one from the real world.
if you don’t have the right formalism, you’re going to get Goodharting on incorrect conceptual contours
I would agree with this for a “wrong” formalism of the evaluator, but we don’t need a formalism of the evaluator. A “wrong” formalism of “deception” can’t affect the agent’s behavior, because “deception” is not a concept used in constructing the agent; it’s only a concept used in arguments about how the agent behaves. So “Goodharting” seems like the wrong description of the danger of using a wrong formalism in an argument; that danger is straightforward: the argument is garbage.
A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)
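To spell out why a zero prior is fatal, here is the generic Bayesian update over candidate evaluator-models, using ν for a candidate model (notation introduced here for illustration, not taken from the post):

$$w(\nu \mid h_{<t}) \;=\; \frac{w(\nu)\, P_\nu(h_{<t})}{\sum_{\nu'} w(\nu')\, P_{\nu'}(h_{<t})}$$

The numerator carries the factor w(ν), so if w(ν) = 0 for the true evaluator, the posterior stays 0 no matter how much evidence arrives.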
But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.
And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.
This is approximately where I am too, btw.
Thanks for the meta-comment; see Wei’s and my response to Rohin.
It looks closer to the Value Learning Agent in that paper to me; maybe it can be considered an implementation / specific instance of that?
Yes. What the value learning agent doesn’t specify is what constitutes observational evidence of the utility function, or in this notation, how to calculate P^π_{s_0,prior,u} and thereby calculate w(u|h<t). So this construction makes a particular choice about how the true utility function becomes manifest in the agent’s observations. A number of simpler choices don’t seem to work.
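For concreteness, a minimal sketch of the kind of update this implies, assuming the likelihood of the observed rewards factors per step out of P^π_{s_0,prior,u} (the names here are illustrative, not from the post):

```python
# Hypothetical sketch: posterior weight over candidate utility functions,
# treating past rewards as observations generated by the modeled evaluator.

def posterior_over_utilities(prior, reward_likelihood, history):
    """prior: dict mapping u -> w(u); history: list of (action, obs, reward);
    reward_likelihood(u, step): assumed stand-in for P(r_t | u, h_<t)."""
    weights = dict(prior)
    for step in history:
        for u in weights:
            weights[u] *= reward_likelihood(u, step)
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}  # this is w(u | h_<t)
```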
Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?
(Copying a comment I just made elsewhere)
This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what it would do if it were maximizing the sum of the rewards, given the same beliefs about how rewards are generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.
(Expanding on it)
So suppose the evaluator were human. The human’s lifetime of past observations gives him a posterior belief distribution that looks to the agent like a weird prior, with certain domains involving oddly specific convictions. The agent could steer the world toward those domains, and steer toward observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and might even be averse to it, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.
ETA: One such “oddly specific conviction” might be, for example, the relative implausibility of being placed in a delusion box where all the observations are manufactured.
Is the point you are trying to make different from the one in Learning What to Value? (Specifically, the point about observation-utility maximizers.) If so, how?
I may be missing something, but it looks to me like specifying an observation-utility maximizer requires writing down a correct utility function? We don’t need to do that for this agent.
Do you have PRIOR in order to make the evaluator more realistic? Does the theoretical point still stand if we get rid of PRIOR and instead have an evaluator that has direct access to states?
Yes, sort of. If the evaluator had access to the state, it would be impossible to deceive the evaluator, since they would know everything. This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what it would do if it were maximizing the sum of the rewards, given the same beliefs about how rewards are generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.
How does the evaluator influence the behavior of the agent?
Wei’s answer is good; it might also help to note that with π∗ defined in this way, π∗(⋅|h<t) equals the same expression, but with everything on the right-hand side conditioned on h<t as well. Written that way, it is easier to notice the appearance of w(u|h<t), which captures how the agent learns a utility function from the rewards.
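A hedged sketch of how w(u|h<t) then enters the agent’s choice, assuming access to an expected-utility oracle for each candidate utility function (all names here are illustrative):

```python
# Hypothetical sketch: the agent acts to maximize posterior-expected TRUE
# utility, rather than expected future reward.

def best_policy(policies, posterior_w, expected_utility):
    """posterior_w: dict mapping u -> w(u | h_<t);
    expected_utility(pi, u): assumed oracle for E[u(s_*) | pi, h_<t]."""
    return max(
        policies,
        key=lambda pi: sum(w * expected_utility(pi, u)
                           for u, w in posterior_w.items()),
    )
```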
One utility function might turn out to be much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can’t necessarily tune λ in advance to take this into account.
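A toy illustration of that tuning problem, under the assumed worst case where no single action serves both utility functions:

```python
# Suppose the best achievable outcomes under each utility function are:
best_u1, best_u2 = 1.0, 1000.0  # U2 happens to be far easier to optimize

# The agent can pursue U1 exclusively or U2 exclusively (assumed: no overlap).
for lam in (0.1, 0.5, 0.9, 0.999):
    value_u1_only = lam * best_u1          # contributes nothing to U2
    value_u2_only = (1 - lam) * best_u2    # contributes nothing to U1
    print(lam, "->", "U1" if value_u1_only > value_u2_only else "U2")

# U1 wins only once lambda > 1000/1001; since how easy each utility function
# is to optimize can depend on random events, no lambda fixed in advance is safe.
```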
One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.
And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.
EDIT: Yeah, if they had the same priors and did unbounded reasoning, I wouldn’t be surprised anymore if there exists a λ that they would agree to.
Have you thought at all about what merged utility function two AIs would agree on? I doubt it would be of the form λU1 + (1−λ)U2.
This is an interesting world-model.
In practice, this means that the world model can get BoMAI to choose any action it wants
So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.
Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI: getting it to take actions with particular outside-world effects.) Then the feature this world-model associates reward with depends on outside-world events that themselves depend on the actions taken, and the world-model is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then it is benign.
However, it can also save computation
Only the on-policy computation is accounted for.
So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.
Can you add the key assumptions being made when you say it is safe asymptotically? From skimming, it looked like “assuming the world is an MDP and that a human can recognize which actions lead to catastrophes.”
the time it would take to go back to the optimal trajectory
In the real world, this is usually impossible.