I don’t understand this proposal so far. I’m particularly confused by the last paragraph in the “to get away” section:
What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following π0 forever, thus not following π∗k for any k at any point.
If A follows π0 for τ turns and then follows π∗k, how are τ and k chosen?
It’s not clear to me that SA can act to ensure the baseline value of V′k for all values of k and τ unless it does nothing.
I think it might help to illustrate this proposal in your original gridworld example to make it clearer what’s going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.
I don’t think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don’t need to tell apart other states that are not the starting state.
I agree that we need to compare to the penalty if the subagent is not created—I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).
As you mention in your other comment, creating a subagent is effectively switching from a stepwise inaction baseline to an inaction baseline for the rest of the episode. This can be beneficial for the agent because of the ‘winding road’ problem, where the stepwise baseline with inaction rollouts can repeatedly penalize actions (e.g. turning the wheel to stay on the road and avoid crashing) that are not penalized by the inaction baseline. This is a general issue with inaction rollouts that needs to be fixed.
I think this problem is about capturing delayed effects of the agent’s actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don’t think it’s useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).
Here, the inaction rollouts don’t capture the effects of the subagent, and thus the agent doesn’t receive a penalty for creating the subagent. There are several factors that contribute to this:
1. The auxiliary reward R only rewards the agent itself pressing the red button, not the red button being pressed in general, so it would not be triggered by the subagent pressing it.
2. The subagent never goes to the red button—it only goes to the blue button.
The agent would be penalized for creating the subagent if both 1 and 2 were false. If 1 was false, the auxiliary reward would be given whenever the red button was pressed. If 2 was also false, the inaction rollout after subagent creation would include a state where the subagent presses the red button, and so the agent would be penalized.
In the general case, the agent can only get away with creating subagents that don’t achieve any auxiliary rewards, which I think limits the magnitude of this problem. The designer can choose auxiliary rewards (that are not achieved by the agent doing nothing) such that a subagent will achieve at least one of them. For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.
Thanks Stuart for your thought-provoking post! I think your point about the effects of the baseline choice on the subagent problem is very interesting, and it would be helpful to separate it more clearly from the effects of the deviation measure (which are currently a bit conflated in the table). I expect that AU with the inaction baseline would also avoid this issue, similarly to RR with an inaction baseline. I suspect that the twenty billion questions measure with the stepwise baseline would have the subagent issue too.
I’m wondering whether this issue is entirely caused by the stepwise baseline (which is indexed on the agent, as you point out), or whether the optionality-based deviation measures (RR and AU) contribute to it as well. So far I’m adding this to my mental list of issues with the stepwise baseline (along with the “car on a winding road” scenario) that need to be fixed.
I’ve been pleasantly surprised by how much this resource has caught on in terms of people using it and referring to it (definitely more than I expected when I made it). There were 30 examples on the list when was posted in April 2018, and 20 new examples have been contributed through the form since then. I think the list has several properties that contributed to wide adoption: it’s fun, standardized, up-to-date, comprehensive, and collaborative.
Some of the appeal is that it’s fun to read about AI cheating at tasks in unexpected ways (I’ve seen a lot of people post on Twitter about their favorite examples from the list). The standardized spreadsheet format seems easier to refer to as well. I think the crowdsourcing aspect is also helpful—this helps keep it current and comprehensive, and people can feel some ownership of the list since can personally contribute to it. My overall takeaway from this is that safety outreach tools are more likely to be impactful if they are fun and easy for people to engage with.
This list had a surprising amount of impact relative to how little work it took me to put it together and maintain it. The hard work of finding and summarizing the examples was done by the people putting together the lists that the master list draws on (Gwern, Lehman, Olsson, Irpan, and others), as well as the people who submit examples through the form. What I do is put them together in a common format and clarify and/or shorten some of the summaries. I also curate the examples to determine whether they fit the definition of specification gaming (as opposed to simply a surprising behavior or solution). Overall, I’ve probably spent around 10 hours so far on creating and maintaining the list, which is not very much. This makes me wonder if there is other low hanging fruit in the safety resources space that we haven’t picked yet.
I have been using it both as an outreach and research tool. On the outreach side, the resource has been helpful for making the argument that safety problems are hard and need general solutions, by making it salient just in how many ways things could go wrong. When presented with an individual example of specification gaming, people often have a default reaction of “well, you can just close the loophole like this”. It’s easier to see that this approach does not scale when presented with 50 examples of gaming behaviors. Any given loophole can seem obvious in hindsight, but 50 loopholes are much less so. I’ve found this useful for communicating a sense of the difficulty and importance of Goodhart’s Law.
On the research side, the examples have been helpful for trying to clarify the distinction between reward gaming and tampering problems. Reward gaming happens when the reward function is designed incorrectly (so the agent is gaming the design specification), while reward tampering happens when the reward function is implemented incorrectly or embedded in the environment (and so can be thought of as gaming the implementation specification). The boat race example is reward gaming, since the score function was defined incorrectly, while the Qbert agent finding a bug that makes the platforms blink and gives the agent millions of points is reward tampering. We don’t currently have any real examples of the agent gaining control of the reward channel (probably because the action spaces of present-day agents are too limited), which seems qualitatively different from the numerous examples of agents exploiting implementation bugs.
I’m curious what people find the list useful for—as a safety outreach tool, a research tool or intuition pump, or something else? I’d also be interested in suggestions for improving the list (formatting, categorizing, etc). Thanks everyone who has contributed to the resource so far!
Thanks Ben! I’m happy that the list has been a useful resource. A lot of credit goes to Gwern, who collected many examples that went into the specification gaming list: https://www.gwern.net/Tanks#alternative-examples.
Yes, decoupling seems to address a broad class of incentive problems in safety, which includes the shutdown problem and various forms of tampering / wireheading. Other examples of decoupling include causal counterfactual agents and counterfactual reward modeling.
Thanks Evan, glad you found this useful! The connection with the inner/outer alignment distinction seems interesting. I agree that the inner alignment problem falls in the design-emergent gap. Not sure about the outer alignment problem matching the ideal-design gap though, since I would classify tampering problems as outer alignment problems, caused by flaws in the implementation of the base objective.
I think the discussion of reversibility and molecules is a distraction from the core of Stuart’s objection. I think he is saying that a value-agnostic impact measure cannot distinguish between the cases where the water in the bucket is or isn’t valuable (e.g. whether it has sentimental value to someone).
If AUP is not value-agnostic, it is using human preference information to fill in the “what we want” part of your definition of impact, i.e. define the auxiliary utility functions. In this case I would expect you and Stuart to be in agreement.
If AUP is value-agnostic, it is not using human preference information. Then I don’t see how the state representation/ontology invariance property helps to distinguish between the two cases. As discussed in this comment, state representation invariance holds over all representations that are consistent with the true human reward function. Thus, you can distinguish the two cases as long as you are using one of these reward-consistent representations. However, since a value-agnostic impact measure does not have access to the true reward function, you cannot guarantee that the state representation you are using to compute AUP is in the reward-consistent set. Then, you could fail to distinguish between the two cases, giving the same penalty for kicking a more or less valuable bucket.
Thanks Stuart for the example. There are two ways to distinguish the cases where the agent should and shouldn’t kick the bucket:
Relative value of the bucket contents compared to the goal is represented by the weight on the impact penalty relative to the reward. For example, if the agent’s goal is to put out a fire on the other end of the pool, you would set a low weight on the impact penalty, which enables the agent to take irreversible actions in order to achieve the goal. This is why impact measures use a reward-penalty tradeoff rather than a constraint on irreversible actions.
Absolute value of the bucket contents can be represented by adding weights on the reachable states or attainable utility functions. This doesn’t necessarily require defining human preferences or providing human input, since human preferences can be inferred from the starting state. I generally think that impact measures don’t have to be value-agnostic, as long as they require less input about human preferences than the general value learning problem.
Thanks Abram for this sequence—for some reason I wasn’t aware of it until someone linked to it recently.
Would you consider the observation tampering (delusion box) problem as part of the easy problem, the hard problem, or a different problem altogether? I think it must be a different problem, since it is not addressed by observation-utility or approval-direction.
Definitely agree that the AI community is not biased towards short timelines. Long timelines are the dominant view, while the short timelines view is associated with hype. Many researchers are concerned about the field losing credibility (and funding) if the hype bubble bursts, and this is especially true for those who experienced the AI winters. They see the long timelines view as appropriately skeptical and more scientifically respectable.
Some examples of statements that AGI is far away from high-profile AI researchers:
Geoffrey Hinton: https://venturebeat.com/2018/12/17/geoffrey-hinton-and-demis-hassabis-agi-is-nowhere-close-to-being-a-reality/
Yann LeCun: https://www.facebook.com/yann.lecun/posts/10153426023477143 https://futurism.com/conscious-ai-decades-away https://www.facebook.com/yann.lecun/posts/10153368458167143
Yoshua Bengio: https://www.lesswrong.com/posts/4qPy8jwRxLg9qWLiG/yoshua-bengio-on-ai-progress-hype-and-risks
Rodney Brooks: https://rodneybrooks.com/the-seven-deadly-sins-of-predicting-the-future-of-ai/ https://rodneybrooks.com/agi-has-been-delayed/
Janos and I are coming for the weekend part of the unconference
I’m confused about the difference between a mesa-optimizer and an emergent subagent. A “particular type of algorithm that the base optimizer might find to solve its task” or a “neural network that is implementing some optimization process” inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?
Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.
AU is a function of the world state, but intends to capture some general measure of the agent’s influence over the environment that does not depend on the state representation.
Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) → observations (e.g. pixels) → state representation / coarse-graining (which defines macrostates as equivalence classes over observations) → featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.
Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representation is different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance was defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.
An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
There are various parts of your explanation that I find vague and could use a clarification on:
“AUP is not about state”—what does it mean for a method to be “about state”? Same goes for “the direct focus should not be on the state”—what does “direct focus” mean here?
“Overfitting the environment”—I know what it means to overfit a training set, but I don’t know what it means to overfit an environment.
“The long arms of opportunity cost and instrumental convergence”—what do “long arms” mean?
“Wirehead a utility function”—is this the same as optimizing a utility function?
“Cut out the middleman”—what are you referring to here?
I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.
I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.
Thanks for the detailed explanation—I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn’t understand the state representation invariance claim in the AUP proposal, though I didn’t realize that it is as central to the proposal as you describe here.
I am still confused about what you means by penalizing ‘power’ and what exactly it is a function of. The way you describe it here sounds like it’s a measure of the agent’s optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I’m not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case.
I agree with Daniel’s comment that if AUP is not penalizing effects on the world, then it is confusing to call it an ‘impact measure’, and something like ‘optimization regularization’ would be better.
Since I still have lingering confusions after your latest explanation, I would really appreciate if someone else who understands this could explain it to me.
Are you thinking of an action observation formalism, or some kind of reward function over inferred state?
I don’t quite understand what you’re asking here, could you clarify?
If you had to pose the problem of impact measurement as a question, what would it be?
Something along the lines of: “How can we measure to what extent the agent is changing the world in ways that we care about?”. Why?
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?
I meant that for attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.
What makes you think that?
This is an intuitive claim based on a general observation of how people attribute responsibility. For example, if I walk into a busy street and get hit by a car, I will be considered responsible for this because it’s easy to predict. On the other hand, if I am walking down the street and a brick falls on my head from the nearby building, then I will not be considered responsible, because this event would be hard to predict. There are probably other reasons that humans don’t consider themselves responsible for butterfly effects.