Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.
AU is a function of the world state, but intends to capture some general measure of the agent’s influence over the environment that does not depend on the state representation.
Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) → observations (e.g. pixels) → state representation / coarse-graining (which defines macrostates as equivalence classes over observations) → featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.
Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then the representation assigns them to different macrostates). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance were defined over the set of all possible state representations, this set would include the most coarse-grained representation with all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.
An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
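To make the invariance argument concrete, here is a toy sketch (hypothetical names and values, not any published implementation): a reward function defined over the coarsest reward-compatible representation lifts to a finer representation by giving equal reward to all macrostates that share a coarse macrostate, so the resulting measure is unchanged.

```python
# Toy sketch (hypothetical): an AU-style measure defined over the coarsest
# reward-compatible representation is unchanged when the representation is refined.

# Microstates: (vase_present, vase_color); color is reward-irrelevant.
microstates = [("vase", "green"), ("vase", "blue"), ("no_vase", None)]

# Coarsest reward-compatible representation: only vase presence matters.
coarse = lambda s: s[0]
# Finer representation: also distinguishes vase color.
fine = lambda s: (s[0], s[1])

# A reward function in the attainable set, defined over coarse macrostates.
reward_coarse = {"vase": 1.0, "no_vase": 0.0}

# Lifting it to the finer representation assigns equal reward to macrostates
# that share a coarse macrostate, so the measure computed from it is unchanged.
reward_fine = {fine(s): reward_coarse[coarse(s)] for s in microstates}

assert reward_fine[("vase", "green")] == reward_fine[("vase", "blue")]
```

By contrast, a random reward function defined directly over the finer representation could assign different rewards to the green-vase and blue-vase macrostates, which is why the measure must be defined over R to be invariant.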
There are various parts of your explanation that I find vague and could use a clarification on:
“AUP is not about state”—what does it mean for a method to be “about state”? Same goes for “the direct focus should not be on the state”—what does “direct focus” mean here?
“Overfitting the environment”—I know what it means to overfit a training set, but I don’t know what it means to overfit an environment.
“The long arms of opportunity cost and instrumental convergence”—what do “long arms” mean?
“Wirehead a utility function”—is this the same as optimizing a utility function?
“Cut out the middleman”—what are you referring to here?
I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.
I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.
Thanks for the detailed explanation—I feel a bit less confused now. I was not intending to express confidence about my prediction of what AU does. I was aware that I didn’t understand the state representation invariance claim in the AUP proposal, though I didn’t realize that it is as central to the proposal as you describe here.
I am still confused about what you mean by penalizing ‘power’ and what exactly it is a function of. The way you describe it here sounds like it’s a measure of the agent’s optimization ability that does not depend on the state at all. Did you mean that in the real world the agent always receives the same AUP penalty no matter which state it is in? If that is what you meant, then I’m not sure how to reconcile your description of AUP in the real world (where the penalty is not a function of the state) and AUP in an MDP (where it is a function of the state). I would find it helpful to see a definition of AUP in a POMDP as an intermediate case.
I agree with Daniel’s comment that if AUP is not penalizing effects on the world, then it is confusing to call it an ‘impact measure’, and something like ‘optimization regularization’ would be better.
Since I still have lingering confusions after your latest explanation, I would really appreciate if someone else who understands this could explain it to me.
Are you thinking of an action observation formalism, or some kind of reward function over inferred state?
I don’t quite understand what you’re asking here, could you clarify?
If you had to pose the problem of impact measurement as a question, what would it be?
Something along the lines of: “How can we measure to what extent the agent is changing the world in ways that we care about?”. Why?
What does this mean, concretely? And what happens with the survival utility function being the sole member of the attainable set? Does this run into that problem, in your model?
I meant that for an attainable set consisting of random utility functions, I would expect most of the variation in utility to be based on irrelevant factors like the positions of air molecules. This does not apply to the attainable set consisting of the survival utility function, since that is not a random utility function.
What makes you think that?
This is an intuitive claim based on a general observation of how people attribute responsibility. For example, if I walk into a busy street and get hit by a car, I will be considered responsible for this because it’s easy to predict. On the other hand, if I am walking down the street and a brick falls on my head from the nearby building, then I will not be considered responsible, because this event would be hard to predict. There are probably other reasons that humans don’t consider themselves responsible for butterfly effects.
Thanks Alex for starting this discussion and thanks everyone for the thought-provoking answers. Here is my current set of concerns about the usefulness of impact measures, sorted in decreasing order of concern:
Irrelevant factors. When applied to the real world, impact measures are likely to be dominated by things humans don’t care about (heat dissipation, convection currents, positions of air molecules, etc). This seems likely to happen to value-agnostic impact measures, e.g. AU with random utility functions, which would mostly end up rewarding specific configurations of air molecules.
This may be mitigated by the agent’s inability to perceive the irrelevant factors, which results in a more coarse-grained state representation: if the agent can’t see air molecules, all the states with different air molecule positions will look the same, as they do to humans. Some human-relevant factors can also be difficult to perceive, e.g. the presence of poisonous gas in the room, so we may not want to limit the agent’s perception ability to human level. Automatically filtering out irrelevant factors does seem difficult, and I think this might imply that it is impossible to design an impact measure that is both useful and truly value-agnostic.
However, the value-agnostic criterion does not seem very important in itself. I think the relevant criterion is that designing impact measures should be easier than the general value learning problem. We already have a non-value-agnostic impact measure that plausibly satisfies this criterion: RLSP learns what is effectively an impact measure (the human theta parameter) using zero human input just by examining the starting state. This could also potentially be achieved by choosing an attainable utility set that rewards a broad enough sample of things humans care about, and leaves the rest to generalization. Choosing a good attainable utility set may not be easy but it seems unlikely to be as hard as the general value learning problem.
Butterfly effects. Every action is likely to have large effects that are difficult to predict, e.g. taking a different route to work may result in different people being born. Taken literally, this means that there is no such thing as a low-impact action. Humans get around this by only counting easily predictable effects as impact that they are considered responsible for. If we follow a similar strategy of not penalizing butterfly effects, we might incentivize the agent to deliberately cause butterfly effects. The easiest way around this that I can currently see is restricting the agent’s capability to model the effects of its actions, though this has obvious usefulness costs as well.
Chaotic world. Every action, including inaction, is irreversible, and each branch contains different states. While preserving reversibility is impossible in this world, preserving optionality (attainable utility, reachability, etc) seems possible. For example, if the attainable set contains a function that rewards the presence of vases, the action of breaking a vase will make this reward function more difficult to satisfy (even if the states with/without vases are different in every branch). If we solve the problem of designing/learning a good utility set that is not dominated by irrelevant factors, I expect chaotic effects will not be an issue.
If any of the above-mentioned concerns are not overcome, impact measures will fail to distinguish between what humans would consider low-impact and high-impact. Thus, penalizing high-impact actions would come with penalizing low-impact actions as well, which would result in a strong safety-capability tradeoff. I think the most informative direction of research to figure out whether these concerns are a deal-breaker is to scale up impact measures to apply beyond gridworlds, e.g. to Atari games.
I don’t see how representation invariance addresses this concern. As far as I understand, the concern is about any actions in the real world causing large butterfly effects. This includes effects that would be captured by any reasonable representation, e.g. different people existing in the action and inaction branches of the world. The state representations used by humans also distinguish between these world branches, but humans have limited models of the future that don’t capture butterfly effects (e.g. person X can distinguish between the world state where person Y exists and the world state where person Z exists, but can’t predict that choosing a different route to work will cause person Z to exist instead of person Y).
I agree with Daniel that this is a major problem with impact measures. I think that to get around this problem we would either need to figure out how to distinguish butterfly effects from other effects (and then include all the butterfly effects in the inaction branch) or use a weak world model that does not capture butterfly effects (similarly to humans) for measuring impact. Even if we know how to do this, it’s not entirely clear whether we should avoid penalizing butterfly effects. Unlike humans, AI systems would be able to cause butterfly effects on purpose, and could channel their impact through butterfly effects if they are not penalized.
As a result of the recent attention, the specification gaming list has received a number of new submissions, so this is a good time to check out the latest version :).
Awesome, thanks Oliver!
Thanks, glad you liked the breakdown!
The agent would have an incentive to stop anyone from doing anything new in response to what the agent did
I think that the stepwise counterfactual is sufficient to address this kind of clinginess: the agent will not have an incentive to take further actions to stop humans from doing anything new in response to its original action, since after the original action happens, the human reactions are part of the stepwise inaction baseline.
The penalty for the original action will take into account human reactions in the inaction rollout after this action, so the agent will prefer actions that result in humans changing fewer things in response. I’m not sure whether to consider this clinginess—if so, it might be useful to call it “ex ante clinginess” to distinguish from “ex post clinginess” (similar to your corresponding distinction for offsetting). The “ex ante” kind of clinginess is the same property that causes the agent to avoid scapegoating butterfly effects, so I think it’s a desirable property overall. Do you disagree?
Thanks Rohin for a great summary as always!
I think the property of handling shutdown depends on the choice of absolute value or truncation at 0 in the deviation measure, not the choice of the core part of the deviation measure. RR doesn’t handle shutdown because by default it is set to only penalize reductions in reachability (using truncation at 0). I would expect that replacing the truncation with absolute value (thus penalizing increases in reachability as well) would result in handling shutdown (but break the asymmetry property from the RR paper). Similarly, AUP could be modified to only penalize reductions in goal-achieving ability by replacing the absolute value with truncation, which I think would make it satisfy the asymmetry property but not handle shutdown.
More thoughts on independent design choices here.
There are several independent design choices made by AUP, RR, and other impact measures, which could potentially be used in any combination. Here is a breakdown of design choices and what I think they achieve:
Starting state: used by reversibility methods. Results in interference with other agents. Avoids ex post offsetting.
Inaction (initial branch): default setting in Low Impact AI and RR. Avoids interfering with other agents’ actions, but interferes with their reactions. Does not avoid ex post offsetting if the penalty for preventing events is nonzero.
Inaction (stepwise branch) with environment model rollouts: default setting in AUP, model rollouts are necessary for penalizing delayed effects. Avoids interference with other agents and ex post offsetting.
Core part of deviation measure
AUP: difference in attainable utilities between baseline and current state
RR: difference in state reachability between baseline and current state
Low impact AI: distance between baseline and current state
Function applied to core part of deviation measure
Absolute value: default setting in AUP and Low Impact AI. Results in penalizing both increase and reduction relative to baseline. This results in avoiding the survival incentive (satisfying the Corrigibility property given in AUP post) and in equal penalties for preventing and causing the same event (violating the Asymmetry property given in RR paper).
Truncation at 0: default setting in RR, results in penalizing only reduction relative to baseline. This results in unequal penalties for preventing and causing the same event (satisfying the Asymmetry property) and in not avoiding the survival incentive (violating the Corrigibility property).
Scaling of the penalty
Hand-tuned: default setting in RR (somewhat provisional)
ImpactUnit: used by AUP
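The point that these choices are independent can be made explicit with a sketch of how a penalty factors into parts (a hypothetical interface, not any published implementation): a baseline state, a core deviation measure per auxiliary value function, and a function applied to that deviation.

```python
# Sketch of how the design choices factor (hypothetical interface).
# A penalty is built from independent parts: a baseline state, a core
# deviation measure, and a function applied to the deviation.

def impact_penalty(value_fns, current_state, baseline_state, value, f):
    """value(v, s): attainable utility / reachability of v from state s.
    f: function applied to the core deviation (abs, or truncation at 0)."""
    total = 0.0
    for v in value_fns:
        d = value(v, current_state) - value(v, baseline_state)  # core part
        total += f(d)
    return total / len(value_fns)

# Example instantiations of the "function applied" choice:
aup_style = lambda d: abs(d)        # penalize both increases and decreases
rr_style = lambda d: max(0.0, -d)   # penalize only decreases
```

Any combination of baseline (`baseline_state`), core measure (`value`), and deviation function (`f`) can be plugged in here, which is what makes an ablation study over these choices natural.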
I think an ablation study is needed to try out different combinations of these design choices and investigate which of them contribute to which desiderata / experimental test cases. I intend to do this at some point (hopefully soon).
Another issue with equally penalizing decreases and increases in power (as AUP does) is that for any event A, it equally penalizes the agent for causing event A and for preventing event A (violating property 3 in the RR paper). I originally thought that satisfying Property 3 is necessary for avoiding ex post offsetting, which is actually not the case (ex post offsetting is caused by penalizing the given action on future time steps, which the stepwise inaction baseline avoids). However, I still think it’s bad for an impact measure to not distinguish between causation and prevention, especially for irreversible events.
This comes up in the car driving example already mentioned in other comments on this post. The reason the action of keeping the car on the highway is considered “high-impact” is that you are penalizing prevention as much as causation. Your suggested solution of using a single action to activate a self-driving car for the whole highway ride is clever, but has some problems:
This greatly reduces the granularity of the penalty, making credit assignment more difficult.
This effectively uses the initial-branch inaction baseline (branching off when the self-driving car is launched) instead of the stepwise inaction baseline, which means getting clinginess issues back, in the sense of the agent being penalized for human reactions to the self-driving car.
You may not be able to predict in advance when the agent will encounter situations where the default action is irreversible or otherwise undesirable.
In such situations, the penalty will produce bad incentives. Namely, the penalty for staying on the road is proportional to how bad a crash would be, so the tradeoff with goal achievement resolves in an undesirable way. If we keep the reward for the car arriving at its destination constant, then as we increase the badness of a crash (e.g. the number of people on the side of the road who would be run over if the agent took a noop action), eventually the penalty wins in the tradeoff with the reward, and the agent chooses the noop. I think it’s very important to avoid this failure mode.
Actually, I think it was incorrect of me to frame this issue as a tradeoff between avoiding the survival incentive and not crippling the agent’s capability. What I was trying to point at is that the way you are counteracting the survival incentive is by penalizing the agent for increasing its power, and that interferes with the agent’s capability. I think there may be other ways to counteract the survival incentive without crippling the agent, and we should look for those first before agreeing to pay such a high price for interruptibility. I generally believe that ‘low impact’ is not the right thing to aim for, because ultimately the goal of building AGI is to have high impact—high beneficial impact. This is why I focus on the opportunity-cost-incurring aspect of the problem, i.e. avoiding side effects.
Note that AUP could easily be converted to a side-effects-only measure by replacing the |difference| with a max(0, difference). Similarly, RR could be converted to a measure that penalizes increases in power by doing the opposite (replacing max(0, difference) with |difference|). (I would expect that variant of RR to counteract the survival incentive, though I haven’t tested it yet.) Thus, it may not be necessary to resolve the disagreement about whether it’s good to penalize increases in power, since the same methods can be adapted to both cases.
If the agent isn’t overcoming obstacles, we can just increase N.
Wouldn’t increasing N potentially increase the shutdown incentive, given the tradeoff between shutdown incentive and overcoming obstacles?
I think eliminating this survival incentive is extremely important for this kind of agent, and arguably leads to behaviors that are drastically easier to handle.
I think we have a disagreement here about which desiderata are more important. Currently I think it’s more important for the impact measure not to cripple the agent’s capability, and the shutdown incentive might be easier to counteract using some more specialized interruptibility technique rather than an impact measure. Not certain about this though—I think we might need more experiments on more complex environments to get some idea of how bad this tradeoff is in practice.
And why is this, given that the inputs are histories? Why can’t we simply measure power?
Your measurement of “power” (I assume you mean Q_u?) needs to be grounded in the real world in some way. The observations will be raw pixels or something similar, while the utilities and the environment model will be computed in terms of some sort of higher-level features or representations. I would expect the way these higher-level features are chosen or learned to affect the outcome of that computation.
I discussed in “Utility Selection” and “AUP Unbound” why I think this actually isn’t the case, surprisingly. What are your disagreements with my arguments there?
I found those sections vague and unclear (after rereading a few times), and didn’t understand why you claim that a random set of utility functions would work. E.g. what do you mean by “long arms of opportunity cost and instrumental convergence”? What does the last paragraph of “AUP Unbound” mean and how does it imply the claim?
Oops, noted. I had a distinct feeling of “if I’m going to make claims this strong in a venue this critical about a topic this important, I better provide strong support”.
Providing strong support is certainly important, but I think it’s more about clarity and precision than quantity. Better to give one clear supporting statement than many unclear ones :).
Great work! I like the extensive set of desiderata and test cases addressed by this method.
The biggest difference from relative reachability, as I see it, is that you penalize increasing the ability to achieve goals, as well as decreasing it. I’m not currently sure whether this is a good idea: while it indeed counteracts instrumental incentives, it could also “cripple” the agent by incentivizing it to settle for more suboptimal solutions than necessary for safety.
For example, the shutdown button in the “survival incentive” gridworld could be interpreted as a supervisor signal (in which case the agent should not disable it) or as an obstacle in the environment (in which case the agent should disable it). Simply penalizing the agent for increasing its ability to achieve goals leads to incorrect behavior in the second case. To behave correctly in both cases, the agent needs more information about the source of the obstacle, which is not provided in this gridworld (the Safe Interruptibility gridworld has the same problem).
Another important difference is that you are using a stepwise inaction baseline (branching off at each time step rather than the initial time step) and predicting future effects using an environment model. I think this is an improvement on the initial-branch inaction baseline, which avoids clinginess towards independent human actions, but not towards human reactions to the agent’s actions. The environment model helps to avoid the issue with the stepwise inaction baseline failing to penalize delayed effects, though this will only penalize delayed effects if they are accurately predicted by the environment model (e.g. a delayed effect that takes place beyond the model’s planning horizon will not be penalized). I think the stepwise baseline + environment model could similarly be used in conjunction with relative reachability.
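A rough sketch of the stepwise baseline with rollouts, under simplifying assumptions (a hypothetical one-step `model.step` interface; comparing auxiliary values of end-of-rollout states rather than full attainable Q-values, which simplifies AUP's actual penalty):

```python
# Sketch (hypothetical model interface) of the stepwise inaction baseline
# with environment-model rollouts: branch off from the current state, then
# compare noop rollouts from the actual successor against noop rollouts
# from the inaction successor.

def stepwise_baseline_penalty(model, state, action, value_fns, horizon, noop):
    s_action = model.step(state, action)
    s_inaction = model.step(state, noop)
    # Roll both branches forward under inaction, so that delayed effects the
    # model can predict within the horizon contribute to the penalty.
    for _ in range(horizon):
        s_action = model.step(s_action, noop)
        s_inaction = model.step(s_inaction, noop)
    return sum(abs(v(s_action) - v(s_inaction)) for v in value_fns)
```

As noted above, a delayed effect that materializes beyond `horizon` (or that the model mispredicts) contributes nothing to the penalty, and the same construction could be used with a reachability-based core measure instead of attainable values.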
I agree with Charlie that you are giving out checkmarks for the desiderata a bit too easily :). For example, I’m not convinced that your approach is representation-agnostic. It strongly depends on your choice of the set of utility functions and environment model, and those have to be expressed in terms of the state of the world. (Note that the utility functions in your examples, such as u_closet and u_left, are defined in terms of reaching a specific state.) I don’t think your method can really get away from making a choice of state representation.
Your approach might have the same problem as other value-agnostic approaches (including relative reachability) with mostly penalizing irrelevant impacts. The AUP measure seems likely to give most of its weight to utility functions that are irrelevant to humans, while the RR measure could give most of its weight to preserving reachability of irrelevant states. I don’t currently know a way around this that’s not value-laden.
Meta point: I think it would be valuable to have a more concise version of this post that introduces the key insight earlier on, since I found it a bit verbose and difficult to follow. The current writeup seems to be structured according to the order in which you generated the ideas, rather than an order that would be more intuitive to readers. FWIW, I had the same difficulty when writing up the relative reachability paper, so I think it’s generally challenging to clearly present ideas about this problem.