Relative Abstracted Agency
Note: This post was pasted without much editing or work put into formatting. I may come back and make it more presentable at a later date, but the concepts should still hold.
Relative abstracted agency is a framework for considering the extent to which a modeler models a target as an agent, what factors lead a modeler to model a target as an agent, and what sort of models have the nature of being agent-models. The relative abstracted agency of a target relative to a reasonably efficient modeler is based on the most effective strategies that the modeler uses to model the target, which exist on a spectrum from terminalizing strategies to simulating strategies.
Terminalizing strategies rely most heavily on models of the target’s goals or utility function, and weight future outcomes by how they might score under the target’s utility function. The modeler might ask, “what outcomes rank highly in the target’s utility function?” and use its approximations of the answer to predict future outcomes without predicting much in the way of particular actions, strategies, or paths that the target might take, except possibly to come up with lower bounds on how highly outcomes rank in the target’s preferences.
Examples: the efficient market hypothesis, a savvy amateur predicting the outcome of a chess game against Stockfish or Magnus Carlsen, humans predicting what the world might look like 1 year after a superintelligent paperclip maximizer is created
Projecting strategies combine models of the target’s goals or utility function with the modeler’s own ability to find actions or strategies to achieve goals. The modeler might ask, “what would be the best actions or strategies to take if I had the target’s goals and resources?” and use its approximations of the answer to predict and model the target.
Examples: a competent chess player playing chess against a level of Stockfish that they have a small but non-negligible chance of defeating, large portions of game theory and decision theory, AlphaZero training itself at chess using self-play
Psychologizing strategies combine models of more specific aspects of the target’s processes, tendencies, flaws, weaknesses, or strengths with the modeler’s ability to find actions or strategies to achieve goals. The modeler might ask, “what would be the best actions or strategies to take if I had the target’s goals or motivations?”, and also, “in what ways are the target’s strategies and tendencies likely to differ from mine and from those of an ideal agent, and how can these differences help me predict the target?” It is at this level that concepts like yomi (reading an opponent’s reads) are most in play.
Examples: rhetorical persuasion, Garry Kasparov offering material to steer Deep Blue into a positional game based on a model of Deep Blue as good at calculation and tactics but weak at strategy and positioning, humans modeling LLMs.
Mechanizing strategies rely most heavily on models or approximations of processes specific to the target, rather than on the modeler’s own ability to find actions or strategies to achieve goals. They do not particularly model the target as an agent, but as a process akin to an unagentic machine, treating prediction of the target as an engineering problem rather than a psychological, strategic, or game-theoretic one.
Examples: humans modeling the behavior of single-celled organisms, humans modeling the behavior of jellyfish, humans setting up a fungus to solve a maze
Simulating strategies simulate the target with high fidelity rather than approximating it or using heuristics, which is what Omega, Solomonoff Induction, and AIXI do to everything.
Examples: Omega, Solomonoff Induction, AIXI
Factors that affect relative abstracted agency of a target:
Complexity of the target and quantity of available information about the target. AIXI can get away with simulating everything even with a mere single bit of information about a target because it has infinite computing power. In practice, any modeler that fits in the universe likely needs a significant fraction of the bits of complexity of a nontrivial target to model it well using simulating strategies. Having less information about a target tends to, but doesn’t always, make more agent-abstracted strategies more effective than less agent-abstracted ones. For example, a modeler may best predict a target using mechanizing strategies until it has information suggesting that the target acts agentically enough to be better modeled using psychologizing or projecting strategies.
Predictive or strategic ability of the modeler relative to the target. Targets with overwhelming predictive superiority over a modeler are usually best modeled using terminalizing strategies, whereas targets that a modeler has an overwhelming predictive advantage over are usually best modeled using mechanizing or simulating strategies.
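The two factors above can be sketched as a toy decision rule. Everything here is an illustrative assumption: the scalar inputs, the threshold values, and the function name are hypothetical, chosen only to make the qualitative claims concrete.

```python
# Toy heuristic mapping the two factors above to a modeling strategy.
# The thresholds and scalar encodings are illustrative assumptions only.

def best_strategy(relative_ability: float, info_fraction: float) -> str:
    """relative_ability: modeler's predictive ability / target's ability.
    info_fraction: fraction of the target's complexity bits the modeler holds."""
    if relative_ability < 0.1:
        return "terminalizing"   # target is overwhelmingly superior
    if relative_ability > 10 and info_fraction > 0.9:
        return "simulating"      # overwhelming advantage plus near-full information
    if relative_ability > 10:
        return "mechanizing"     # overwhelming advantage, partial information
    if relative_ability >= 1:
        return "psychologizing"  # roughly matched: exploit specific tendencies
    return "projecting"          # somewhat weaker: ask what I would do in its place

print(best_strategy(0.01, 0.0))   # humans vs. a paperclip maximizer
print(best_strategy(100, 0.99))   # humans vs. a fungus in a maze
```

The rule captures the post’s claim that both relative ability and available information, not either alone, determine where on the spectrum the best strategy lies.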
Relevance of this framework to AI alignment:
We would prefer that AI agents not model humans using high-fidelity simulating or mechanizing strategies, both because such computations could create moral patients, and because an AI using simulating or mechanizing strategies to model humans has potential to “hack” us or manipulate us with overwhelming efficacy.
Regarding inner alignment, subcomponents of an agent that the agent can model well using simulating or mechanizing strategies are unlikely to become dangerously inner-misaligned in a way that the agent cannot prevent or fix. It may be possible to construct an agent structured such that each superagent has some kind of verifiable “RAA superiority” over its subagents, making it impossible or unlikely for subagents to become dangerously misaligned with respect to their superagents.
Regarding embedded agency, an obstacle to amending theoretical agents like AIXI to act more like properly embedded agents is that they are heavily reliant on simulating strategies, but cannot use these strategies to simulate themselves. If we can formalize strategies beyond simulating, this could provide an angle for better formalizations of self-modeling. Human self-modeling tends to occur around the psychologizing level.
(draws on parts of https://carado.moe/predca.html, particularly Kosoy’s model of agenthood)
Suppose there is a correct hypothesis for the world in the form of a non-halting Turing machine program. Hereafter I’ll simply refer to this as “the world.”
Consider a set of bits of the program at one point in its execution, which I will call the target. This set of bits can also be interpreted as a cartesian boundary around an agent executing some policy in Vanessa Kosoy’s framework. We would like to evaluate the degree to which the target is usefully approximated as an agent, relative to some agent that (instrumentally or terminally) attempts to make accurate predictions under computational constraints using partial information about the world, which we will call the modeler.
Vanessa Kosoy’s framework outlines a way of evaluating the probability that an agent G has a utility function U, taking into account the agent’s efficacy at satisfying U as well as the complexity of U. Consider some utility function which the target is most Kosoy-agentic with respect to. Hereafter I’ll simply refer to this as the target’s utility function.
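The shape of this evaluation can be sketched as a toy score: weight each candidate utility function by a simplicity prior (lower complexity scores higher) and by how well the target’s behavior satisfies it. This is a hedged caricature of Kosoy’s formalism, not the real thing; the candidate names, complexities, and efficacies below are invented for illustration.

```python
# Toy Kosoy-style agency score: simplicity prior times efficacy.
# All candidates and numbers are illustrative assumptions.

def agency_score(complexity_bits: float, efficacy: float) -> float:
    """2^(-complexity) * efficacy: simple, well-satisfied utility
    functions score highest."""
    return 2.0 ** (-complexity_bits) * efficacy

# Candidate utility functions for some target: (complexity in bits, efficacy).
candidates = {
    "maximize_paperclips": (8.0, 0.9),    # simple and well-satisfied
    "exact_world_history": (500.0, 1.0),  # perfectly satisfied but absurdly complex
}

best = max(candidates, key=lambda u: agency_score(*candidates[u]))
print(best)  # the simple, well-satisfied candidate wins
```

The second candidate illustrates why the complexity penalty is needed: any target trivially “satisfies” the utility function that rewards exactly what it does, so without the penalty everything would look maximally agentic.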
Suppose the modeler can choose between gaining 1 bit of information of its choice about the target’s physical state in the world, and gaining 1 bit of information of its choice about the target’s utility function. (Effectively, the modeler can choose between obtaining an accurate answer to a binary question about the target’s physical state, and obtaining an accurate answer to a binary question about the target’s utility function). The modeler, as an agent, should assign some positive amount of utility to each option relative to a null option of gaining no additional information. Let’s call the amount of utility it assigns to the former option SIM and the amount it assigns to the latter option TERM.
A measure of the relative abstracted agency of the target, relative to the modeler, is given by TERM/SIM. Small values indicate that the target has little relative abstracted agency, while large values indicate that it has significant relative abstracted agency. The RAA of a rock relative to myself should be less than one, as I expect information about its physical state to be more useful to me than information about its most likely utility function. On the other hand, the RAA of an artificial superintelligence relative to myself should be greater than one, as I expect information about its utility function to be more useful to me than information about its physical state.
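The ratio above is simple enough to state directly in code. The utility values assigned to each kind of information bit are hypothetical stand-ins for the rock and superintelligence examples; only the TERM/SIM definition comes from the text.

```python
# The RAA measure TERM/SIM from the definition above.
# The utility values below are hypothetical illustrations.

def relative_abstracted_agency(term: float, sim: float) -> float:
    """TERM / SIM: utility of a bit about the target's utility function,
    divided by utility of a bit about its physical state."""
    if term <= 0 or sim <= 0:
        raise ValueError("both utilities must be positive")
    return term / sim

# A rock: physical-state bits are far more useful than utility-function bits.
rock_raa = relative_abstracted_agency(term=0.001, sim=1.0)

# A superintelligence: utility-function bits dominate.
asi_raa = relative_abstracted_agency(term=1.0, sim=0.001)

print(rock_raa < 1 < asi_raa)  # low RAA for the rock, high RAA for the ASI
```

Note that the measure is relative by construction: the same target can have different RAA for different modelers, since TERM and SIM are utilities assigned by the modeler.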