Select Agent Specifications as Natural Abstractions

Introduction

The Preference Fulfillment Hypothesis (PFH) asserts that “Humans have an innate motivation (‘preference fulfillment’, PF) to fulfill the preferences of those they care about”. For the rest of this post, I model PF as a relationship between:

  • Some model of an external utility function (exists independently of the primary agent)

  • An external agent

  • An action that the primary agent interprets as granting positive utility in accordance with the external agent’s utility function

  • The primary agent

  • Some shared environment

To give an example: due to PF, Agent A assigns positive utility to the outcome of Agent B experiencing positive utility, and so comes to associate actions that yield positive utility for Agent B with positive utility for itself. Agent A engaging in PF implies that it has access to some model of Agent B’s utility function (a model that isn’t necessarily correct), which in practice is both learned and internal (i.e. Agent A is simulating Agent B). Humans converge on abstractions of the utility functions of the agents whose preferences they fulfill (as opposed to perfect low-level models of them), which suggests that some agent specifications might abstract well. As stated in the original PFH post, this has interesting implications for corrigibility.[1]
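To make this concrete, here is a minimal sketch of the relationship in Python. The additive form, the weight, and all of the names are my own illustrative assumptions rather than anything claimed by the PFH; the point is only that Agent A’s choice of action is routed through its internal (and possibly wrong) model of Agent B’s utility function.

```python
from dataclasses import dataclass
from typing import Callable

Action = str  # stand-in for whatever the shared environment's action type is

@dataclass
class PFRelation:
    """Illustrative decomposition of the PF relationship described above."""
    primary_utility: Callable[[Action], float]            # Agent A's own utility over actions
    modeled_external_utility: Callable[[Action], float]   # A's learned (possibly wrong) model of B's utility
    pf_weight: float                                       # how strongly A cares about fulfilling B's preferences

    def effective_utility(self, action: Action) -> float:
        # A assigns extra utility to actions it predicts B will value
        return self.primary_utility(action) + self.pf_weight * self.modeled_external_utility(action)

# Toy usage: A mildly prefers "read", but models B as strongly preferring "cook together".
pf = PFRelation(
    primary_utility=lambda a: {"read": 1.0, "cook together": 0.2}.get(a, 0.0),
    modeled_external_utility=lambda a: {"read": 0.0, "cook together": 2.0}.get(a, 0.0),
    pf_weight=1.0,
)
print(max(["read", "cook together"], key=pf.effective_utility))  # "cook together"
```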


Select Agent Specifications as Natural Abstractions

Why Might Some Agent Specifications be Natural Abstractions?

Some agent specifications might be natural abstractions because:

  • All humans appear to converge upon abstractions of the people they are trying to simulate

  • Cooperation requires simulation, and might be a convergently useful capability

Neither of these is a standalone argument that some agent specifications abstract well; they are merely seed evidence. For example, all agent specifications could abstract incredibly poorly, yet simulating others could be such an essential skill in the human context that we learn it regardless. To counter this: any complex game in which an agent is optimizing for some outcome likely involves both an interaction with another agent and an outcome affected by that agent’s behavior. Since simulating other agents enhances an agent’s ability to forecast outcomes, it is a natural capability to develop under selection pressures like natural selection.

Regardless of whether all agent specifications abstract well, humans clearly simulate other humans, and this is why we do not need to express our moral preferences formally. For example, in most societies death is considered a net negative, and yet we do not pursue confining every member of society to a small box fitted with feeding tubes so that their lives are maximally extended. The caveat of not wanting to live in a feeding-tube box is implied, and so from a notion as abstract as “dying is bad” we deduce a wealth of useful information. Importantly, human specifications abstracting well increases the viability of this as an alignment strategy, but leveraging the simulation of other agents as an alignment technique isn’t dependent on it.

Which Agent Specifications Could be Natural Abstractions?

Another aspect to consider is that just because some agent specifications might abstract well does not mean that all do. For example, whilst human specifications may abstract well, the same does not necessarily go for alien specifications. In keeping with the Universality Hypothesis, most minds should form similar abstractions of the same specifications, but not every specification needs to be naturally abstractable by most minds.

This does not necessarily affect the agenda stated in the introduction: regardless of whether all agent specifications abstract well, the primary concern from a safety researcher’s perspective is whether human specifications do. I am interested in questions like “What if agents are more likely to form abstractions of specifications similar to their own?”, and I believe this hypothesis could again be tested with existing technology. I’m unsure how a result like “agents are more likely to form abstractions of specifications similar to their own” would fit with the Universality Hypothesis, as it is unclear to me whether an object of abstraction like “specifications similar to the agent’s own” can be treated as an object in the same way “tree” is. This confusion stems from the fact that the former (although a static concept) varies from agent to agent in its implementation.

Testing for Select Agent Specifications as Natural Abstractions

A successful method should attempt to answer one or more of the following questions:

  • Do non-human minds engage in the simulation of other agents?

  • Do some agent specifications abstract well?

  • Can these abstractions be modified? If so, how?

  • How can we interpret these abstractions?

  • How can we estimate these abstractions? (What information about the external agent might we need to make these predictions? What about information about the primary agent?)

My first idea:

  1. Approximate a low-level model of some external agent as well as that model’s complexity

  2. Obtain probability distributions over actions for the same external agent specifications from non-human sources (e.g. LLMs, RL agents)

  3. Forecast the behavior of the external agent using the produced probability distribution

  4. Calculate the divergence between the non-human and low-level distributions using some distance measure, as well as the difference in their complexity approximations

  5. Using the complexity measures, make inferences regarding the degree of abstraction applied to the external agent specification (e.g. if some agent produces a model with significantly lower estimated complexity than another but maintains a similar forecasting proficiency, it can be assumed that its abstraction contains less redundant information)

  6. Based on the complexity and distance discrepancies between the non-human and low-level abstractions, produce information that could help answer “Do non-human minds engage in the simulation of other agents?” and “Do some agent specifications abstract well?” (e.g. most agents converged on a similar abstraction of x but not of y; they might do this because…). A minimal sketch of steps 4–6 follows below.
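Here is that sketch. It assumes we already have action distributions over a small shared action set from the low-level model and from each non-human source, plus some rough complexity proxy (e.g. description length in bits); the helper, the numbers, and the proxy are all hypothetical placeholders rather than a recommendation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same action set."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical data: distributions over four actions for the same external agent.
low_level = {"dist": [0.70, 0.20, 0.05, 0.05], "complexity_bits": 5_000}
candidates = {
    "llm":      {"dist": [0.65, 0.25, 0.05, 0.05], "complexity_bits": 300},
    "rl_agent": {"dist": [0.40, 0.30, 0.20, 0.10], "complexity_bits": 250},
}

for name, model in candidates.items():
    divergence = kl_divergence(low_level["dist"], model["dist"])
    compression = low_level["complexity_bits"] / model["complexity_bits"]
    # A candidate that keeps divergence low while being far simpler than the
    # low-level model is, on this crude measure, abstracting the specification well.
    print(f"{name}: KL from low-level = {divergence:.4f}, compression = {compression:.0f}x")
```

The choice of KL divergence and a bit-count complexity proxy is arbitrary; any distance measure and complexity estimate with the same shape would slot in here.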

Cons for this approach:

  • It seems difficult to construct an agent with specifications simple enough to analyze, but complicated enough to remain distinct from examples seen during training

    • If this fails, the experimental data becomes useless, because it tells us nothing about the architecture of the language model or the abstractability of the concept we are testing

    • Imagine applying this method to an LLM trained on tens of thousands of discussions of, and pieces of literature about, game theory: if your scenario resembles one mentioned there, your experimental data is invalid

  • The game in which the external agent exists needs to be very simple for developing a low-level approximation of the agent to be plausible

Some pros:

  • An easy (though not necessarily accurate) means of quantifying the degree of abstraction, as well as the similarity between those abstractions and some lower-level model of the external agent

  • Easily extended to answer questions like “Can these abstractions be modified?”, since you would already have comparative data for a baseline unmodified abstraction, as well as a framework for quantifying the distance between the modified and unmodified abstractions (see the sketch after this list)

  • Provides useful information regarding the questions “Do non-human intelligences engage in the simulation of other agents?” and “Do some agent specifications abstract well?”
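As an illustration of that extension, the same hypothetical divergence helper can compare an abstraction before and after some intervention (e.g. fine-tuning on corrected preference data); the numbers and the intervention are invented for the example.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical: the non-human model's abstraction of the external agent,
# before and after the intervention, against the low-level reference distribution.
reference = [0.70, 0.20, 0.05, 0.05]
baseline  = [0.65, 0.25, 0.05, 0.05]
modified  = [0.50, 0.30, 0.15, 0.05]

print("shift caused by intervention:", kl_divergence(baseline, modified))
print("distance to reference before:", kl_divergence(reference, baseline))
print("distance to reference after: ", kl_divergence(reference, modified))
```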

A robust proof that human specifications abstract well probably looks like a selection theorem, and might describe a phenomenon like “under the selection pressure of PF, a system will simulate external agents”. I would then try to generalize this to describe agent simulation under less niche selection pressures, which I assume is possible because humans appear to simulate external agents, a capability that emerged from the selection pressures we are under.
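To gesture at what such a statement might look like, here is one schematic formalization of “under the selection pressure of PF, a system will simulate external agents”. The notation and the additive form are my own assumptions, and any real theorem would need conditions on the environment and the agent’s architecture:

$$\pi_A^{*} \in \arg\max_{\pi}\ \mathbb{E}_{a \sim \pi}\big[\,u_A(a) + \lambda\, u_B(a)\,\big] \;\Longrightarrow\; \pi_A^{*}\ \text{maintains an internal model}\ \hat{u}_B \approx u_B$$

where $u_A$ and $u_B$ are the primary and external agents’ utility functions, $\lambda > 0$ encodes the strength of the PF pressure, and the conclusion says that any policy selected under this pressure ends up simulating Agent B.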


Implications for Corrigibility

If future models naturally converge upon useful abstractions of human specifications, and these abstractions can be expressed in an interpretable manner, perhaps shaping these abstractions is possible. RLHF could already be doing this, and if we could estimate these abstractions we might be able to apply it in a significantly more targeted manner, increasing its scalability. I am skeptical that RLHF is the only way to shape abstractions of human specifications, and if a PF-esque selection pressure holds for non-human intelligence, more advanced applied alignment strategies likely become much easier to devise.

Distribution Shift

“The hardest part of alignment is getting the AGI to generalize the values we give it to new and different environments.” Aligning an AGI to abstractions of those values could result in better generalization. If those values can be estimated via simulating other agents, and the subject of that simulation has specifications that abstract well, distribution shift could become a somewhat smaller hurdle. This seems intuitive because it appears to be how humans handle distribution shift: in place of complex formalizations of morality, we use convenient abstractions and apply them to make decisions under pressures like PF.


Conclusion

It seems almost certain that humans simulate other humans, e.g. through PF. It is probable that alien intelligences do the same (although not necessarily through PF), and that this would entail an abstraction of the specifications of the agent being simulated. If we could estimate alien abstractions of some general human preferences, we could make inferences about whether those abstractions are robust or safe.

  1. ^

    Note that the term “positive utility” in this hypothetical refers to positive utility in a relative sense. If Agent B were about to pick lottery numbers that Agent A knew would net Agent B $100, and Agent A instead suggested numbers that would net Agent B $10 (assuming Agent B is optimizing for fiscal reward), this would not conform to this definition of PF, as Agent A is lowering the utility of the outcome for Agent B even though the outcome still bears positive utility relative to not picking lottery numbers at all.