Tom Everitt (tom4everitt), Research Scientist at DeepMind
What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions for calling something an agent with an objective in such a way as to minimise the leaks?
Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.
For example, does it make sense to say that a tree is *trying to* soak up sun, even though it doesn’t have any mental representation itself? Many biologists would hesitate to use such language other than metaphorically.
In contrast, Dennett’s answer is yes: Basically, it doesn’t matter if the computation is done by the tree, or by the evolution that produced the tree. In either case, it is right to think of the tree as an agent. (Same goes for DQN, I’d say.)
There are other situations where the location of the computation matters, such as for consciousness, and for some “self-reflective” skills that may be hard to pre-compute.
Basically, I would recommend looking closer at Dennett, both to avoid reinventing the wheel (more than necessary) and to connect to his terminology (since he’s so influential).
He’s a very lucid writer, so it’s quite a joy to read him. His most recent book, Bacteria to Bach, summarizes and references a lot of his earlier work.
I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.
Yes, starting with more assumptions is often a good strategy, because it makes the questions more concrete. As you say, the results may potentially generalize.
But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function.
I see, maybe PPO would have been a better example.
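For concreteness, the “rudimentary optimisation” in question is just the greedy argmax a trained DQN-style agent performs over its Q-values at every step. A minimal sketch (the `q_network` interface is hypothetical, standing in for any learned state-action value function):

```python
import numpy as np

def greedy_action(q_network, observation):
    """Greedy (argmax) action selection over learned Q-values: the per-step
    'rudimentary optimisation' a trained DQN-style agent performs."""
    q_values = q_network(observation)   # vector of Q-values, one per discrete action
    return int(np.argmax(q_values))     # pick the action with the highest estimated value

# Toy usage with a fixed, made-up "Q-network" over four actions.
toy_q_network = lambda obs: np.array([0.1, 0.7, 0.3, 0.2])
assert greedy_action(toy_q_network, observation=None) == 1
```

(A trained PPO policy, by contrast, just samples an action from its policy network at each step, with no explicit per-step maximisation, which is presumably why it makes the cleaner example here.)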
Hey Charlie,
Thanks for bringing up these points. The intended audience is researchers more familiar with RL than the safety literature. Rather than try to modify the paper to everyone’s liking, let me just give a little intro / context for it here.
The paper is the culmination of a few years of work (previously described in e.g. my thesis and alignment paper). One of the main goals has been to understand whether it is possible to redeem RL from a safety viewpoint, or whether some rather different framework would be necessary to build safe AGI.
As a first step along this path, I tried to categorize problems with RL, and see which solutions applied to which categories. For this purpose, I found causal graphs valuable (thesis), and I later realized that causal influence diagrams (CIDs) provided an even better foundation. Any problem corresponds to an ‘undesired path’ in a CID, and basically all the solutions correspond to ways of getting rid of that path. As highlighted in the introduction of the paper, I now view this insight as one of the most useful ones.
Another important contribution of the paper is pinpointing which solution idea solves which type of reward tampering problem, and a discussion of how the solutions might fit together. I see this as a kind of stepping stone towards more empirical RL work in this area.
Third, the paper puts a fair bit of emphasis on giving brief but precise summaries of previous ideas in the safety literature, and may therefore serve as a kind of literature review. You are absolutely right that solutions to reward function tampering (often more loosely referred to as wireheading) have been around for quite some time. However, the explanations of these methods have been scattered across a number of papers, using a number of different frameworks and formalisms.
Tom
Thanks Marius and David, really interesting post, and super glad to see interest in causality picking up!
I very much share your “hunch that causality might play a role in transformative AI and feel like it is currently underrepresented in the AI safety landscape.”
Most relevant, I’ve been working with Mary Phuong on a project which seems quite related to what you are describing here. I don’t want to share too many details publicly without checking with Mary first, but if you’re interested perhaps we could set up a call sometime?
I also think causality is relevant to AGI safety in several ways beyond those you mention here. In particular, we’ve been exploring how to use causality to describe agent incentives for things like corrigibility and tampering (summarized in this post), formalizing ethical concepts like intent, and understanding agency.
So really curious to see where your work is going and potentially interested in collaborating!
Hey Steve,
Thanks for linking to Abram’s excellent blog post.
We should have pointed this out in the paper, but there is a simple correspondence between Abram’s terminology and ours:
Easy wireheading problem = reward function tampering
Hard wireheading problem = feedback tampering.
Our current-RF optimization corresponds to Abram’s observation-utility agent.
We also discuss the RF-input tampering problem and solutions (sometimes called the delusion box problem), which I don’t think fits into Abram’s distinction.
We didn’t expect this to be surprising to the LessWrong community. Many RL researchers tend to be surprised, however.
I really like this layout, this idea, and the diagrams. Great work.
Glad to hear it :)
I don’t agree that counterfactual oracles fix the incentive. There are black boxes in that proposal, like “how is the automated system not vulnerable to manipulation” and “why do we think the system correctly formally measures the quantity in question?” (see more potential problems). I think relying only on this kind of engineering cleverness is generally dangerous, because it produces safety measures we don’t see how to break (and probably not safety measures that don’t break).
Yes, the argument is only valid under the assumptions that you mention. Thanks for pointing to the discussion post about the assumptions.
Also, on page 10 you write that during deployment, agents appear as if they are optimizing the training reward function. As evhub et al point out, this isn’t usually true: the objective recoverable from perfect IRL on a trained RL agent is often different (behavioral objective != training objective).
Fair point, we should probably weaken this claim somewhat.
Thanks Ilya for those links; in particular, the second one looks quite relevant to something we’ve been working on in a rather different context (that’s the benefit of speaking the same language!)
We would also be curious to see a draft of the MDP-generalization once you have something ready to share!
There is a paper which I believe is trying to do something similar to what you are attempting here:
Are you aware of it? How do you think their ideas relate to yours?
Thanks for the Dewey reference, we’ll add it.
Thanks for a nice post about causal diagrams!
Because our universe is causal, any computation performed in our universe must eventually bottom out in a causal DAG.
Totally agree. This is a big part of the reason why I’m excited about these kinds of diagrams.
This raises the issue of abstraction—the core problem of embedded agency. … how can one causal diagram (possibly with symmetry) represent another in a way which makes counterfactual queries on the map correspond to some kind of counterfactual on the territory?
Great question, I really think someone should look more carefully into this. A few potentially related papers:
https://arxiv.org/abs/1105.0158
https://arxiv.org/abs/1812.03789
In general, though, how to learn causal DAGs with symmetry is still an open question. We’d like something like Solomonoff Induction, but which can account for partial information about the internal structure of the causal DAG, rather than just overall input-output behavior.
Again, agreed. It would be great if we could find a way to make progress on this question.
Hey Charlie,
Thanks for your comment! Some replies:
sometimes one makes different choices in how to chop an AI’s operation up into causally linked boxes, which can lead to an apples-and-oranges problem when comparing diagrams (for example, the diagrams you use for CIRL and IDA are very different choppings-up of the algorithms)
There is definitely a modeling choice involved in choosing how much “to pack” in each node. Indeed, most of the diagrams have been through a few iterations of splitting and combining nodes. The aim has been to focus on the key dynamics of each framework.
As for the CIRL and IDA difference, this is a direct effect of the different levels the frameworks are specified at. CIRL is a high-level framework, roughly saying “somehow you infer the human preferences from their actions”. IDA, in contrast, provides a reasonably detailed supervised learning criterion. So I think the frameworks themselves are already like apples and oranges; it’s not just the diagrams. (And this is something you notice when drawing the diagrams.)
But I am skeptical that there’s a one-size-fits-all solution, and instead think that diagram usage should be tailored to the particular point it’s intended to make.
We don’t want to claim the CIDs are the one-and-only diagram to always use, but as you mentioned above, they do allow for quite some flexibility in what aspects to highlight.
I actually have a draft sitting around of how one might represent value learning schemes with a hierarchical diagram of information flow.
Interesting. A while back I was looking at information flow diagrams myself, and was surprised to discover how hard it was to make them formally precise (there seems to be no formal semantics for them). In contrast, causal graphs and CIDs have formal semantics, which is quite useful.
For hierarchical representations, there are networks of influence diagrams: https://arxiv.org/abs/1401.3426
Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).
Yes, it would be interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there’s a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don’t have any concrete ideas at the moment—I can be in touch if I think of something suitable for collaboration!
I really like this articulation of the problem!
To me, a way to point to something similar is to say that preservation (and enhancement) of human agency is important (value change being one important way that human agency can be reduced). https://www.alignmentforum.org/s/pcdHisDEGLbxrbSHD/p/Qi77Tu3ehdacAbBBe
One thing I’ve been trying to argue for is that we might try to pivot agent foundations research to focus more on human agency instead of artificial agency. For example, I think value change is an example of self-modification, which has been studied a fair bit for artificial agents.
Thanks Stuart, nice post.
I’ve moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent’s observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
One advantage of this terminology is that it makes it clearer what we’re talking about. For example, it’s pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think your post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like “wireheading = tampering with goal measurement”.
Yes, that is partly what we are trying to do here. By summarizing some of the “folklore” in the community, we’ll hopefully be able to get new members up to speed quicker.
The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:
Does not want to manipulate the shutdown button
Does respond to the shutdown button
Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)
If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.
From a quick read, your proposal seems closely related to Jessica Taylor’s causal-counterfactual utility indifference. Ryan Carey and I also recently had a paper formalising some similar ideas, with some further literature review: https://arxiv.org/abs/2305.19861
Sure, I think we’re saying the same thing: causality is frame dependent, and the variables define the frame (in your example, you and the sensor have different measurement procedures for detecting the purple cube, so you don’t actually talk about the same random variable).
How big a problem is it? In practice it seems usually fine, if we’re careful to test our sensor / double check we’re using language in the same way. In theory, scaled up to super intelligence, it’s not impossible it would be a problem.
But I would also like to emphasize that the problem you’re pointing to isn’t restricted to causality, it goes for all kinds of linguistic reference. So to the extent we like to talk about AI systems doing things at all, causality is no worse than natural language, or other formal languages.
I think people sometimes hold it to a higher bar than natural language, because it feels like a formal language could somehow naturally intersect with a programmed AI. But of course causality doesn’t solve the reference problem in general. Partly for this reason, we’re mostly using causality as a descriptive language to talk clearly and precisely (relative to human terms) about AI systems and their properties.
I had intended to be using the program’s output as a time series of bits, where we are considering the bits to be “sampling” from A and B. Let’s say it’s a program that outputs the binary digits of pi. I have no idea what the bits are (after the first few) but there is a sense in which P(A) = 0.5 for either A = 0 or A = 1, and at any timestep. The same is true for P(B). So P(A)P(B) = 0.25. But clearly P(A = 0, B = 0) = 0.5, and P(A = 0, B = 1) = 0, et cetera. So in that case, they’re not probabilistically independent, and therefore there is a correlation not due to a causal influence.
Just to chip in on this: in the case you’re describing, the numbers are not statistically correlated, because they are not random in the statistical sense. They are only random given logical uncertainty.
When considering logical “random” variables, there might well be a common logical “cause” behind any correlation. But I don’t think we know how to properly formalise or talk about that yet. Perhaps one day we can articulate a logical version of Reichenbach’s principle :)
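To make the arithmetic in the example above concrete, here is a toy sketch (the hard-coded bits are made up, standing in for a deterministic stream like the binary digits of pi; the “probabilities” are just empirical frequencies over that stream, which is exactly the sense in which they are not statistically random):

```python
from collections import Counter

# Made-up stand-in for a deterministic bit stream (e.g. binary digits of pi).
bits = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]

# A and B are both read off the *same* stream at each timestep.
pairs = [(b, b) for b in bits]
n = len(pairs)

p_a0 = sum(b == 0 for b in bits) / n      # "P(A = 0)" as an empirical frequency
p_b0 = p_a0                               # same stream, so same marginal
p_joint_00 = Counter(pairs)[(0, 0)] / n   # "P(A = 0, B = 0)"

print(p_joint_00, p_a0 * p_b0)            # 0.5 vs 0.25: joint != product of marginals
```

Treated as empirical frequencies, the joint differs from the product of the marginals, yet nothing random in the statistical sense is happening anywhere in the stream.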
For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?
Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won’t be systematically deceiving humans to pursue some particular agenda of its own.
As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.
Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.
However, I’m less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:
According to Dennett, many systems behave as if they are optimizing some objective. For example, a tree may behave as if it optimizes the amount of sun that it can soak up with its leaves. This is a useful description of the tree, offering real predictive power. Whether there is some actual search process going on in the tree is not that important; the intentional stance is useful in either case.
Similarly, a fully trained DQN algorithm will behave as if it optimizes the score of the game, even though there is no active search process going on at a given time step (especially not if the network parameters are frozen). In neither of these examples is it necessary to distinguish between mesa and behavioral objectives.
At this point, you may object that the mesa objective will be more predictive “off training distribution”. Perhaps, but I’m not so sure.
First, the behavioral objective may be predictive “off training distribution”: For example, the DQN agent will strive to optimize reward as long as the Q-function generalizes.
Second, the mesa-objective may easily fail to be predictive off distribution. Consider a model-based RL agent that uses MCTS over a learned model of the environment to predict the return of different policies. The mesa-objective is then the expected return. However, this objective may not be particularly predictive outside the training distribution, because the learned model may only make sense on that distribution.
So the behavioral objective may easily be predictive outside the training distribution, and the mesa-objective may easily fail to be predictive.
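As a toy illustration of that second failure mode (all values made up): a planner whose mesa-objective is the expected return under a learned model only tracks true task performance where that model is accurate.

```python
# Toy sketch: the planner's "mesa-objective" is the return predicted by its
# *learned* model, which matches the true environment only on states it was
# trained on (all numbers are made up for illustration).

true_return    = {"train_state": 1.0, "novel_state": -1.0}
learned_return = {"train_state": 1.0, "novel_state": +1.0}   # wrong off-distribution

def planner_objective(state):
    # What the agent actually optimises: return predicted by its learned model.
    return learned_return[state]

for state in ("train_state", "novel_state"):
    print(state, planner_objective(state), true_return[state])
# On "novel_state" the mesa-objective no longer tracks the true return, so it
# fails to predict how well the agent's behaviour serves the original objective.
```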
While I haven’t read the follow-up posts yet, I would guess that most of your further analysis would go through without the distinction between mesa and behavior objective. One possible difference is that you may need to be even more paranoid about the emergence of behavior objectives, since they can emerge even in systems that are not mesa-optimizing.
I would also like to emphasize that I really welcome this type of analysis of the emergence of objectives, not least because it nicely complements my own research on how incentives emerge from a given objective.