Research Scientist at DeepMind
tom4everitt (Tom Everitt)
Sequential Extensions of Causal and Evidential Decision Theory
“The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps).”
This sounds quite nice. But how is it possible to achieve this if the advisor is a soft-maximiser? Doesn’t that mean that there is a positive probability that the advisor falls into the trap?
That is a good question. I don’t think it is essential that the agent can move from to , only that the agent is able to force a stay in if it wants to.
The transition from to could instead happen randomly with some probability.
The important thing is that the human’s action in does not reveal any information about .
CIRL Wireheading
Adversarial examples for neural networks make it seem plausible that there are situations where the agent misinterprets the human’s action.
But it is true that the situation where the human acts irrationally in some state (e.g. because of drugs, propaganda) could be modeled in much the same way.
I preferred the sensory error since it doesn’t require an irrational human. Perhaps I should have been clearer that I’m interested in the agent wireheading itself (in some sense) rather than wireheading of the human.
(Sorry for being slow to reply—I didn’t get notified about the comments.)
Hi Vanessa!
So basically the advisor will be increasingly careful as the cost of falling into the trap goes to infinity? Makes sense I guess.
What is the incentive for the agent not to always let the advisor choose? Is there always some probability that the advisor saves the agent from infinite loss, or only in certain situations that can be detected by the agent?
So this requires the agent’s prior to incorporate information about which states are potentially risky?
Because if there is always some probability of there being a risky action (with infinitely negative value), then regardless of how small the probability is and how large the penalty is for asking, the agent will always be better off asking.
(Did you see Owain Evans’ recent paper about trying to teach the agent to detect risky states?)
My confusion is the following:
Premises (*) and inferences (=>):
- (*) The primary way for the agent to avoid traps is to delegate to a soft-maximiser.
- (*) A soft-maximiser takes any action with boundedly negative utility with positive probability.
- (*) Actions leading to traps do not have infinitely negative utility.
- => The agent will fall into traps with positive probability.
- (*) If the agent falls into a trap with positive probability, then it will have linear regret.
- => The agent will have linear regret.
So when you say in the beginning of the post “a Bayesian DIRL agent is guaranteed to attain most of the value”, you must mean that in a different sense than a regret sense?
Nice writeup. Is one-boxing in Newcomb an equilibrium?
Thanks for your comments, much appreciated! I’m currently in the middle of moving continents, will update the draft within a few weeks.
Thank you! Really inspiring to win this prize. As John Maxwell stated in the previous round, the recognition is more important than the money. Very happy to receive further comments and criticism by email tom4everitt@gmail.com. Through debate we grow :)
Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.
However, I’m less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:
Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[4] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.
According to Dennett, many systems behave as if they are optimizing some objective. For example, a tree may behave as if it optimizes the amount of sun that it can soak up with its leaves. This is a useful description of the tree, offering real predictive power. Whether there is some actual search process going on in the tree is not that important; the intentional stance is useful in either case.
Similarly, a fully trained DQN algorithm will behave as if it optimizes the score of the game, even though there is no active search process going on at a given time step (especially not if the network parameters are frozen). In neither of these examples is it necessary to distinguish between mesa and behavioral objectives.
At this point, you may object that the mesa objective will be more predictive “off training distribution”. Perhaps, but I’m not so sure.
First, the behavioral objective may be predictive “off training distribution”: For example, the DQN agent will strive to optimize reward as long as the Q-function generalizes.
Second, the mesa-objective may easily fail to be predictive off distribution. Consider a model-based RL agent with a learned model of the environment that uses MCTS to estimate the return of different policies. The mesa-objective is then the expected return. However, this objective may not be particularly predictive outside the training distribution, because the learned model may only make sense on that distribution.
So the behavioral objective may easily be predictive outside the training distribution, and the mesa-objective may easily fail to be.
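To make the model-based case concrete, here is a minimal sketch (my own toy code, not from the post; names like learned_model and reward_model are placeholders): the quantity the planner maximises is expected return under the learned model, so it only tracks the true return where that model is accurate.

```python
import numpy as np

def plan(state, actions, learned_model, reward_model,
         horizon=10, n_rollouts=50, rng=None):
    """Pick the first action whose simulated return under the *learned* model is highest.

    The maximised quantity (mean predicted return) is the agent's internal
    objective; nothing here consults the real environment, so off the
    training distribution it can come apart from the true return.
    """
    rng = rng or np.random.default_rng()
    best_action, best_value = None, -np.inf
    for first_action in actions:
        returns = []
        for _ in range(n_rollouts):
            s, a, total = state, first_action, 0.0
            for _ in range(horizon):
                total += reward_model(s, a)   # predicted reward for (s, a)
                s = learned_model(s, a)       # predicted next state
                a = rng.choice(actions)       # random rollout policy
            returns.append(total)
        value = float(np.mean(returns))
        if value > best_value:
            best_action, best_value = first_action, value
    return best_action
```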
While I haven’t read the follow-up posts yet, I would guess that most of your further analysis would go through without the distinction between mesa and behavioral objectives. One possible difference is that you may need to be even more paranoid about the emergence of behavioral objectives, since they can emerge even in systems that are not mesa-optimizing.
I would also like to emphasize that I really welcome this type of analysis of the emergence of objectives, not least because it nicely complements my own research on how incentives emerge from a given objective.
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?
Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.
For example, does it make sense to say that a tree is *trying to* soak up sun, even though it doesn’t itself have any mental representations? Many biologists would hesitate to use such language other than metaphorically.
In contrast, Dennett’s answer is yes: Basically, it doesn’t matter if the computation is done by the tree, or by the evolution that produced the tree. In either case, it is right to think of the tree as an agent. (Same goes for DQN, I’d say.)
There are other situations where the location of the computation matters, such as for consciousness, and for some “self-reflective” skills that may be hard to pre-compute.
Basically, I would recommend looking closer at Dennett to:
- avoid reinventing the wheel (more than necessary), and
- connect to his terminology (since he’s so influential).
He’s a very lucid writer, so quite a joy to read him really. His most recent book Bacteria to Bach summarizes and references a lot of his earlier work.
I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.
Yes, starting with more assumptions is often a good strategy, because it makes the questions more concrete. As you say, the results may potentially generalize.
But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function.
I see, maybe PPO would have been a better example.
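To make the contrast concrete (a minimal sketch of my own, not taken from any particular codebase): a DQN-style agent does an explicit argmax over learned Q-values at decision time, whereas a PPO-style agent simply samples from its learned policy, with no argmax or search at that point.

```python
import torch

def dqn_act(q_network, obs):
    # DQN-style action selection: a rudimentary optimisation step,
    # an argmax over the learned Q-values for the current observation.
    with torch.no_grad():
        q_values = q_network(obs)            # tensor of shape [n_actions]
    return int(torch.argmax(q_values))

def ppo_act(policy_network, obs):
    # PPO-style action selection: no explicit argmax or search,
    # just a sample from the learned policy distribution.
    with torch.no_grad():
        logits = policy_network(obs)         # tensor of shape [n_actions]
    return int(torch.distributions.Categorical(logits=logits).sample())
```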
Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).
Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there’s a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don’t have any concrete ideas at the moment—I can be in touch if I think of something suitable for collaboration!
Hey Charlie,
Thanks for your comment! Some replies:
sometimes one makes different choices in how to chop an AI’s operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms)
There is definitely a modeling choice involved in choosing how much “to pack” in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on the key dynamics of each framework.
As for the CIRL and IDA difference, this is a direct effect of the different levels the frameworks are specified at. CIRL is a high-level framework, roughly saying “somehow you infer the human preferences from their actions”. IDA, in contrast, provides a reasonably detailed supervised learning criterion. So I think the frameworks themselves are already like apples and oranges; it’s not just the diagrams. (And this is something you notice when drawing the diagrams.)
But I am skeptical that there’s a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it’s intended to make.
We don’t want to claim the CIDs are the one-and-only diagram to always use, but as you mentioned above, they do allow for quite some flexibility in what aspects to highlight.
I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow.
Interesting. A while back I was looking at information flow diagrams myself, and was surprised to discover how hard it was to make them formally precise (there seems to be no formal semantics for them). In contrast, causal graphs and CIDs have formal semantics, which is quite useful.
For hierarchical representations, there are networks of influence diagrams: https://arxiv.org/abs/1401.3426
I really like this layout, this idea, and the diagrams. Great work.
Glad to hear it :)
I don’t agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like “how is the automated system not vulnerable to manipulation” and “why do we think the system correctly formally measures the quantity in question?” (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don’t see how to break (and probably not safety measures that don’t break).
Yes, the argument is only valid under the assumptions that you mention. Thanks for pointing to the discussion post about the assumptions.
Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al point out, this isn’t usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).
Fair point, we should probably weaken this claim somewhat.
Hey Charlie,
Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone’s liking, let me just give a little intro / context for it here.
The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.
As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CIDs) provided an even better foundation. Any problem corresponds to an ‘undesired path’ in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.
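As a toy illustration of the ‘undesired path’ idea (my own sketch, with made-up node names, and not the paper’s formal incentive criteria): represent the diagram as a directed graph and check whether the agent’s action can influence its reward via the reward-function node.

```python
import networkx as nx

# Toy causal influence diagram for a reward-function tampering setup.
cid = nx.DiGraph()
cid.add_edges_from([
    ("Action", "State"),
    ("State", "Reward"),
    ("Action", "RewardFunction"),      # the problematic edge
    ("RewardFunction", "Reward"),
])

def has_undesired_path(graph, decision="Action",
                       target="RewardFunction", utility="Reward"):
    """Is there a directed path decision -> ... -> target -> ... -> utility?

    Such a path means the agent's choice can affect its reward *via* the
    reward function, i.e. a (toy) reward-function tampering incentive.
    """
    return nx.has_path(graph, decision, target) and nx.has_path(graph, target, utility)

print(has_undesired_path(cid))                   # True: tampering path exists
cid.remove_edge("Action", "RewardFunction")      # a "solution" removes the path
print(has_undesired_path(cid))                   # False
```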
Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.
Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.
Tom
Hey Steve,
Thanks for linking to Abram’s excellent blog post.
We should have pointed this out in the paper, but there is a simple correspondence between Abram’s terminology and ours:
Easy wireheading problem = reward function tampering
Hard wireheading problem = feedback tampering.
Our current-RF optimization corresponds to Abram’s observation-utility agent.
We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which doesn’t fit neatly into Abram’s distinction.
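In symbols (a schematic rendering of my own, not a quote of the paper’s notation): writing theta_k for the reward-function parameters at time k and s_k for the state, the contrast behind current-RF optimization / observation-utility agents is roughly:

```latex
% Schematic notation, not the paper's exact formulation.
\underbrace{\mathbb{E}\Big[\textstyle\sum_{k \ge t} R_{\theta_k}(s_k)\Big]}_{\text{reward as judged by the future (possibly tampered-with) function}}
\quad\text{vs.}\quad
\underbrace{\mathbb{E}\Big[\textstyle\sum_{k \ge t} R_{\theta_t}(s_k)\Big]}_{\text{current-RF / observation-utility: reward as judged by the current function}}
```

The left objective rewards tampering with theta; the right one evaluates every future state with the reward function the agent has now, which is why it removes the incentive to tamper with the reward function.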
Yes, that is partly what we are trying to do here. By summarizing some of the “folklore” in the community, we’ll hopefully be able to get new members up to speed quicker.
Two questions about the paper:
Would it be possible to specify FairBot directly as a formula (rather than a program, which is now the case)?
Also, I’m somewhat confused about the usage of FB(FB)=C, as FB(FB) should be a formula/sentence and not a term. Is it correct to read “FB(FB)=C” as “□ FB(FB)” and “FB(FB)=D” as “□ ¬FB(FB)”? Or should it be without boxes?
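For reference, here is how I’m parsing it (a sketch assuming the usual provability-based FairBot that cooperates iff it can prove its opponent cooperates against it; apologies if the paper’s definition differs):

```latex
% FairBot in its usual program form: cooperate iff PA proves the
% opponent cooperates against FairBot.
FB(X) = C \;\iff\; PA \vdash \ulcorner X(FB) = C \urcorner
% The readings I am asking about:
\text{``}FB(FB)=C\text{''} \;\text{as}\; \Box\,[FB(FB) = C],
\qquad
\text{``}FB(FB)=D\text{''} \;\text{as}\; \Box\,\neg[FB(FB) = C].
```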