There are several parts of your explanation that I find vague and could use clarification on:

“AUP is not about state”—what does it mean for a method to be “about state”? Same goes for “the direct focus should not be on the state”—what does “direct focus” mean here?

“Overfitting the environment”—I know what it means to overfit a training set, but I don’t know what it means to overfit an environment.

“The long arms of opportunity cost and instrumental convergence”—what does “long arms” mean here?

“Wirehead a utility function”—is this the same as optimizing a utility function?

“Cut out the middleman”—what are you referring to here?

I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.

I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but is intended to capture some general measure of the agent’s influence over the environment that does not depend on the state representation.

Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) → observations (e.g. pixels) → state representation / coarse-graining (which defines macrostates as equivalence classes over observations) → featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.
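To make the hierarchy concrete, here is a toy sketch in Python; all the object names and representations below are hypothetical, chosen only to illustrate how each level is a (many-to-one) function of the previous one:

```python
def observe(microstate):
    """World state / microstate -> observation (e.g. quark configuration -> pixels)."""
    return microstate["pixels"]

def coarse_grain(observation):
    """Observation -> macrostate: an equivalence class over observations.
    Here two observations are equivalent iff they agree on which of these
    objects are present."""
    return frozenset(obj for obj in observation if obj in {"vase", "agent"})

def featurize(macrostate):
    """Macrostate -> features: a coarse-graining that factorizes into features."""
    return {"has_vase": "vase" in macrostate, "agent_present": "agent" in macrostate}

micro = {"pixels": ["vase", "agent", "dust"]}
macro = coarse_grain(observe(micro))
print(featurize(macro))  # {'has_vase': True, 'agent_present': True}
```

The impact measure would then be computed from `macro` (or sets of such macrostates), not from `micro` directly.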

Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representations are different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance were defined over the set of all possible state representations, this set would include the most coarse-grained representation that puts all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.
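The reward-compatibility condition can be checked mechanically. This is a minimal sketch with a hypothetical vase example: the true reward only cares whether a vase is present, so both the coarse representation R and a finer color-distinguishing representation are consistent with it, while the trivial one-macrostate representation is not:

```python
from itertools import combinations

def consistent_with_reward(representation, observations, true_reward):
    """A representation is consistent with the true reward iff observations
    with different true rewards land in different macrostates."""
    for o1, o2 in combinations(observations, 2):
        if true_reward(o1) != true_reward(o2) and representation(o1) == representation(o2):
            return False
    return True

observations = [("vase", "blue"), ("vase", "green"), ("no_vase", None)]
true_reward = lambda obs: 1.0 if obs[0] == "vase" else 0.0

coarse = lambda obs: obs[0]         # most coarse-grained consistent representation R
fine = lambda obs: obs              # also distinguishes vase colors
trivial = lambda obs: "everything"  # all observations in one macrostate

print(consistent_with_reward(coarse, observations, true_reward))   # True
print(consistent_with_reward(fine, observations, true_reward))     # True
print(consistent_with_reward(trivial, observations, true_reward))  # False
```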

An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
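Continuing the (hypothetical) vase example in code: a reward function in the attainable set defined over R lifts to the finer representation by composing with the coarse-graining, so refined macrostates that differ only in vase color get the same reward; a reward function drawn over the finer representation need not respect this:

```python
# Reward defined over the coarse representation R: "there is a vase".
r_vase_coarse = lambda macro: 1.0 if macro == "vase" else 0.0

# Lift to the finer representation by composing with the coarse-graining
# (drop the color component), so refinement does not change its values.
refine_to_coarse = lambda fine_macro: fine_macro[0]
r_vase_lifted = lambda fine_macro: r_vase_coarse(refine_to_coarse(fine_macro))

print(r_vase_lifted(("vase", "blue")), r_vase_lifted(("vase", "green")))  # 1.0 1.0

# A random reward over the finer representation can split the two colors,
# so an AU measure over the finer representation differs from one over R.
r_fine = lambda fine_macro: 1.0 if fine_macro == ("vase", "blue") else 0.0
print(r_fine(("vase", "blue")), r_fine(("vase", "green")))  # 1.0 0.0
```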

An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
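To make the set-based version concrete, here is a minimal sketch (all state names hypothetical): in the finer representation that distinguishes vase colors, the R-level goal "reach a macrostate with a vase" becomes reachability of the set {blue-vase, green-vase}, which is exactly why allowing reachability functions of sets restores representation invariance:

```python
# Map from each macrostate to the set of macrostates reachable from it,
# in the finer (color-distinguishing) representation.
reachable_fine = {
    "start": {"start", ("vase", "blue"), ("vase", "green")},
    "smashed": {"smashed"},
}

def set_reachability(from_state, target_set, reachable):
    """1 if some macrostate in target_set is reachable from from_state, else 0."""
    return 1.0 if reachable[from_state] & target_set else 0.0

# The R-level "vase" macrostate corresponds to this set of finer macrostates.
vase_set = {("vase", "blue"), ("vase", "green")}
print(set_reachability("start", vase_set, reachable_fine))    # 1.0
print(set_reachability("smashed", vase_set, reachable_fine))  # 0.0
```

A single-macrostate reachability function would instead have to pick one of the color-refined macrostates, and its value would then depend on the coarse-graining.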