## Project proposal: Testing the IBP definition of agent

## Context

Our team in SERI MATS needs to choose a project to work on for the next month. We spent the first two weeks discussing the alignment problem and what makes it difficult, and proposing (lots of) projects in search of one that we think would directly address the hard parts of the alignment problem.

We’re writing this post to get feedback and criticism of this project proposal. Please let us know if you think this is a suboptimal project in any way.

## Project

Disclaimer: We’ve probably misunderstood some things; don’t assume anything in this post accurately represents Vanessa’s ideas.

Our project is motivated by Vanessa Kosoy’s PreDCA proposal. We want to understand this proposal in enough detail that we can simplify it, as well as find and patch any holes. IBP gives us several key tools:

- A “Bridge Transform” that takes in a hypothesis about the universe and tells us which programs are running in the universe.

- An “Agentometer”[1] that takes in a program and tells us how agentic it is, operationalized as how well the agent does according to a fixed loss function relative to a random policy.

- A “Utiliscope”[1] that, given an agent, outputs a distribution over the utility functions of the agent.

Together, these tools could give a solution to the pointers problem, which we believe is a core problem in alignment. We will start by understanding and testing Vanessa’s definition of agency.

## Definition of Agency

The following is Vanessa’s definition of the intelligence of an agent, where an agent is a program, denoted by G, that outputs policies (as described in Evaluating Agents in IBP). This can be used to identify agents in a world model.

Definition 1.6: Denote $G^* : \mathcal{H} \to \mathcal{A}$ the policy actually implemented by $G$. Fix $\xi \in \Delta(\mathcal{A}^{\mathcal{H}})$. The physicalist intelligence of $G$ relative to the baseline policy mixture $\xi$, prior $\zeta$ and loss function $L$ is defined by:

$$g(G \mid \xi; \zeta, L) := -\log \Pr_{\pi \sim \xi}\left[L_{\mathrm{pol}}(\ulcorner G \urcorner, \pi, \zeta) \le L_{\mathrm{pol}}(\ulcorner G \urcorner, G^*, \zeta)\right]$$

In words, this says that the intelligence of the agent G, given a loss function L, is the negative log of the probability that a random policy π, drawn from the baseline mixture ξ, does at least as well as the policy the agent actually implements, denoted G∗.
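To make this concrete, here is a minimal Monte Carlo sketch — our own toy construction, not part of Vanessa’s formalism — that estimates g in a one-shot three-armed bandit. The agent’s policy is a single arm choice, the baseline mixture ξ is uniform over arms, and g is the negative log-probability that a baseline draw scores at least as well as the agent:

```python
import math
import random

# Toy estimate of physicalist intelligence (our construction, not Vanessa's
# formalism): g = -log2 Pr_{pi ~ xi}[ loss(pi) <= loss(G*) ], where the
# baseline mixture xi picks an arm uniformly at random.

random.seed(0)

ARM_MEANS = [0.1, 0.5, 0.9]  # expected reward of each arm, one-shot bandit

def loss(policy):
    """Loss of a one-shot policy = 1 - expected reward of the arm it picks."""
    return 1.0 - ARM_MEANS[policy]

def sample_baseline_policy():
    """Baseline mixture xi: pick an arm uniformly at random."""
    return random.randrange(len(ARM_MEANS))

def physicalist_intelligence(agent_policy, n_samples=10_000):
    """Monte Carlo estimate of g for the given agent policy."""
    agent_loss = loss(agent_policy)
    hits = sum(loss(sample_baseline_policy()) <= agent_loss
               for _ in range(n_samples))
    # Clamp to avoid log(0) when no sampled policy matches the agent.
    p = max(hits, 1) / n_samples
    return -math.log2(p)

g_optimal = physicalist_intelligence(agent_policy=2)  # always picks best arm
g_worst = physicalist_intelligence(agent_policy=0)    # always picks worst arm
print(f"g(optimal) = {g_optimal:.2f} bits, g(worst) = {g_worst:.2f} bits")
```

With a uniform baseline over three arms, the optimizing agent comes out near log₂ 3 ≈ 1.58 bits (only one arm in three does at least as well), while the worst agent scores 0 bits (every baseline draw does at least as well as it).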

The next part is how to extract (a distribution over) the utility function of a given agent (from Vanessa’s video on PreDCA). Taking the loss function $L_{\mathrm{pol}}$ to be the negative of a candidate utility function $U$, the definition of intelligence above gives:

$$P(U) \propto 2^{-K(U) + g(G \mid \xi; \zeta, U)}$$

In words: the probability that agent G has utility function U increases exponentially in the intelligence of G implied by U, and decreases exponentially in the Kolmogorov complexity of U.
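Continuing the toy bandit example, the sketch below (again our own construction; the values K(U) are hand-picked stand-ins for Kolmogorov complexity, and the candidate set is illustrative) scores a few candidate utility functions for an observed policy by 2^(−K(U)+g), then normalizes to get a posterior:

```python
import math
import random

# Toy "utiliscope" sketch (our construction): score candidate utility
# functions U for an observed policy by 2^(-K(U) + g(U)), where K(U) is a
# hand-assigned complexity proxy and g(U) is the physicalist intelligence
# the observed policy would have *if* it were optimizing U.

random.seed(0)

N_ARMS = 3
OBSERVED_POLICY = 2  # the agent we watch always picks arm 2

# Candidate utilities over arms, paired with an illustrative complexity K
# (the constant utility gets the lowest K because it is simplest to describe).
CANDIDATES = {
    "prefers_arm_2": ([0.0, 0.0, 1.0], 3.0),
    "prefers_arm_0": ([1.0, 0.0, 0.0], 3.0),
    "indifferent":   ([0.5, 0.5, 0.5], 1.0),
}

def intelligence_given_utility(policy, utility, n_samples=10_000):
    """g = -log2 Pr_{pi ~ uniform}[ -U(pi) <= -U(policy) ]."""
    hits = sum(-utility[random.randrange(N_ARMS)] <= -utility[policy]
               for _ in range(n_samples))
    return -math.log2(max(hits, 1) / n_samples)

scores = {}
for name, (utility, K) in CANDIDATES.items():
    g = intelligence_given_utility(OBSERVED_POLICY, utility)
    scores[name] = 2.0 ** (-K + g)

total = sum(scores.values())
posterior = {name: s / total for name, s in scores.items()}
for name, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({name}) ~= {p:.3f}")
```

The intuitively correct utility (“prefers arm 2”) beats the reversed one, but with these illustrative complexity values the maximally simple constant utility still competes strongly — a small instance of the tension between the simplicity prior and the intelligence term that we would want to probe when varying the priors.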

## Path to Impact

We want a good definition of agency, and methods of identifying agents and inferring their preferences.

If we have these tools, and if they work well even in various limits (of training data, compute, and model size, and under distribution shift), then this solves the hardest part of the alignment problem, by pointing precisely to human values via a generalized version of Inverse Reinforcement Learning.

These tools also have the potential to be useful for identifying mesa-optimizers, which would help us to avoid inner alignment problems.

## How we plan to do it

### Theoretically

- Constructing prototypical examples and simple edge cases, e.g. weird almost-agents that don’t really have a utility function, and checking that the utility function ascribed to various agents matches our intuitions. Confirming that the maximum of the utility function corresponds to a world that the agent intuitively does want.

- Examining what happens when we vary the priors over policies and the priors over utility functions.

- Exploring simplifications of, and modifications to, the assumptions and definitions used in IBP, to see if this lends itself to a more implementable theory.
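As one concrete instance of varying the prior over policies, the following sketch (our toy construction, reusing the bandit setup) shows that the measured intelligence g of a fixed agent depends on the baseline policy mixture ξ: against a baseline that already favors good arms, the same agent looks less intelligent.

```python
import math
import random

# Toy demonstration (our construction): the same agent's measured
# intelligence g changes with the baseline policy mixture xi.

random.seed(0)

ARM_MEANS = [0.1, 0.5, 0.9]
AGENT_POLICY = 2  # the agent always picks the best arm

def loss(policy):
    return 1.0 - ARM_MEANS[policy]

def g(sample_baseline, n_samples=20_000):
    """g = -log2 Pr_{pi ~ xi}[ loss(pi) <= loss(agent) ], Monte Carlo."""
    hits = sum(loss(sample_baseline()) <= loss(AGENT_POLICY)
               for _ in range(n_samples))
    return -math.log2(max(hits, 1) / n_samples)

def uniform_baseline():
    return random.randrange(3)

def smart_baseline():
    # A baseline that already picks the best arm 90% of the time.
    return 2 if random.random() < 0.9 else random.randrange(2)

g_uniform = g(uniform_baseline)
g_smart = g(smart_baseline)
print(f"g vs uniform baseline: {g_uniform:.2f} bits")
print(f"g vs smart baseline:   {g_smart:.2f} bits")
```

The optimizing agent scores about 1.58 bits against a uniform baseline but only about 0.15 bits against the smarter baseline, since intelligence here is always relative to ξ.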

### Experimentally

- Working out ways of approximating the algorithm for identifying an agent and extracting its utility function, to make it practical and implementable.

- Working out priors that are easy to use.

- Constructing empirical demonstrations of identifying an agent’s utility function, to test whether a reasonable approximation is found.

- Doing the same for identifying agents in an environment.

## Distillation

In order to do this properly, we will need to understand and distill large sections of Infra-Bayesian Physicalism. Part of the project will be publishing our understanding, and we hope that other people looking to understand and build on IBP will benefit from this distillation.

## Conclusion

That’s where we are right now—let us know what you think!


[1] “Agentometer” and “Utiliscope” are not Vanessa’s terminology.