One way to think about this might be to cast it in the language of conditional probability. Perhaps we are modeling our agent as it makes choices between two world states, A and B, based on their predicted levels of X and Y. If P(A) is the probability that the agent chooses state A, and P(A|X) and P(A|Y) are the probabilities of choosing A given knowledge of predictions about the level of X and Y respectively in state A vs. state B, then it seems obvious to me that “cares about X only because it leads to Y” can be expressed as P(A|XY) = P(A|Y) -- that is, the choice is conditionally independent of X once Y is known. Once we know the agent’s predictions about Y, X tells us nothing more about its likelihood of choosing state A. Likewise, “cares about Y only because it leads to X” could be expressed as P(A|XY) = P(A|X). The statement “the agent cares about X only because it leads to Y, and it cares about Y only because it leads to X” then seems to amount to P(A|XY) = P(A|Y) ∧ P(A|XY) = P(A|X), which implies P(A|Y) = P(A|X) -- X and Y carry exactly the same information about which state the agent will choose.
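To make that reading concrete, here is a minimal numerical sketch (the toy distribution is my own construction, not anything from the original question): X and Y are binary and deterministically tied together, and the agent’s probability of choosing A depends on Y alone. Enumerating the joint distribution shows P(A|XY) = P(A|Y) = P(A|X), i.e. once either variable is known, the other adds nothing.

```python
# Toy sketch (my own construction): binary X and Y, with Y deterministically
# tied to X, and an agent whose probability of choosing state A depends on Y
# alone. We enumerate the joint distribution P(X, Y, choice) and check that
# P(A|X,Y) = P(A|Y) = P(A|X).
from collections import defaultdict

p_x = {0: 0.5, 1: 0.5}          # prior over the predicted level of X
y_given_x = {0: 0, 1: 1}        # Y is a deterministic function of X
p_a_given_y = {0: 0.2, 1: 0.9}  # the choice depends only on Y

joint = defaultdict(float)      # keys are (x, y, a); a = 1 means "chooses A"
for x in (0, 1):
    y = y_given_x[x]
    for a in (0, 1):
        p_a = p_a_given_y[y] if a == 1 else 1 - p_a_given_y[y]
        joint[(x, y, a)] += p_x[x] * p_a

IDX = {"x": 0, "y": 1}          # position of each variable in the key tuple

def cond_p_a(**given):
    """P(agent chooses A | given), computed directly from the joint table."""
    rows = [(k, v) for k, v in joint.items()
            if all(k[IDX[name]] == val for name, val in given.items())]
    total = sum(v for _, v in rows)
    chose_a = sum(v for k, v in rows if k[2] == 1)
    return chose_a / total

for x in (0, 1):
    y = y_given_x[x]
    print(f"P(A|X={x},Y={y}) = {cond_p_a(x=x, y=y):.2f}  "
          f"P(A|Y={y}) = {cond_p_a(y=y):.2f}  "
          f"P(A|X={x}) = {cond_p_a(x=x):.2f}")
# All three conditionals agree for each value, as expected when both
# P(A|XY) = P(A|Y) and P(A|XY) = P(A|X) hold.
```

The deterministic tie between X and Y is just the simplest way I could think of to make both conditions hold at once; any distribution satisfying the conjunction would show the same pattern of all three conditionals agreeing.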
However, I don’t think that this quite captures the spirit of the question, since the idea that the agent “cares about X and Y” isn’t the same thing as X and Y being predictive of which state the agent will choose. It seems like what’s wanted is a formal way to say “the only things that ‘matter’ in this world are X and Y,” which is not the same thing as saying “X and Y are the only dimensions on which world states are mapped.” We could imagine a function that takes the levels of X and Y in two world states, A and B, and returns a preference order {A > B, B > A, A = B, incomparable}. But who’s to say this function isn’t just capturing an empirical regularity, rather than expressing some fundamental truth about why X and Y control the agent’s preference for A or B? Then again, I think that’s an issue even in the absence of any sort of circularity.
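Purely to pin down the shape of that function, here is a sketch of what it might look like; the WorldState fields and the Pareto-style comparison rule are illustrative choices of mine, not anything the argument commits to.

```python
# A sketch of a preference function over two world states, A and B, that looks
# only at their levels of X and Y. The dataclass fields and the Pareto-style
# rule are illustrative assumptions, not part of the original argument.
from dataclasses import dataclass
from enum import Enum

class Preference(Enum):
    A_OVER_B = "A > B"
    B_OVER_A = "B > A"
    INDIFFERENT = "A = B"
    INCOMPARABLE = "incomparable"

@dataclass
class WorldState:
    x: float  # predicted level of X in this state
    y: float  # predicted level of Y in this state

def prefer(a: WorldState, b: WorldState) -> Preference:
    """Order two states using only their levels of X and Y.
    Pareto-style: prefer a state only if it is at least as good on both
    dimensions; if each state wins on one dimension, call them incomparable."""
    if a.x == b.x and a.y == b.y:
        return Preference.INDIFFERENT
    if a.x >= b.x and a.y >= b.y:
        return Preference.A_OVER_B
    if b.x >= a.x and b.y >= a.y:
        return Preference.B_OVER_A
    return Preference.INCOMPARABLE

# State A is better on X, state B is better on Y: the rule returns INCOMPARABLE.
print(prefer(WorldState(x=3, y=1), WorldState(x=1, y=2)))
```

Nothing in this function distinguishes “the agent really cares about X and Y” from “X and Y happen to predict its choices,” which is exactly the worry above: it could be nothing more than a lookup table of empirical regularities and behave identically.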
A machine learning model’s training process is effectively just a way to generate a function that consistently maps an input vector to an output that drives the loss close to zero. The model doesn’t “really” value reward or avoidance of loss any more than our brains “really” value dopamine, and as far as I know, nobody has a mathematical definition of what it means to “really” value something, as opposed to behaving in a way that consistently tends to optimize for a target. From that point of view, maybe saying that P(A|XY) = P(A|Y) really is the best we can do to mathematically express “he only cares about Y,” and P(A|XY) = P(A|X) = P(A|Y) is the best we can do to express “he only cares about Y to get X and only cares about X to get Y.”