Theory of Ideal Agents, or of Existing Agents?

Within the context of AI safety, why do we want a theory of agency? I see two main reasons:

  • We expect AI to have agenty properties, so we want to know how to design agenty systems which e.g. perform well in a maximally wide array of environments, and which reason about themselves and self-modify while maintaining their goals. The main use-case is to design an artificial agent.

  • We want AI to model humans as agenty, so we need a theory of e.g. how things made of atoms can “want” things, model the world, or have ontologies, and how to reliably back out those wants/models/ontologies from physical observables. The main use-case is to describe existing agenty systems.

The ideas involved overlap to a large extent, but I’ve noticed some major differences in what kinds of questions researchers ask, depending on which of these two goals they’re usually thinking about.

One type of difference seems particularly central: results which identify one tractable design within some class, vs characterizing all designs within a class.

  • If the goal is to design an artificial agenty system with ideal properties, then the focus is on existence-type proofs: given some properties, find a design which satisfies them.

  • If the goal is to describe agenty systems in the wild, then the focus is on characterizing all such systems: given some properties, show that any system which satisfies them must have some specific form or additional properties.
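One rough way to write down the contrast between these two kinds of result, offered as a sketch rather than anything taken from the formal literature: let $\mathcal{S}$ be a class of candidate systems, $P$ the properties we start from, and $Q$ some specific form or additional property. All three symbols are placeholders introduced here purely for illustration.

```latex
\begin{align*}
  \text{Design (existence-type):} \quad
    & \exists\, s \in \mathcal{S} \;:\; P(s) \\
  \text{Description (characterization-type):} \quad
    & \forall\, s \in \mathcal{S} \;:\; P(s) \implies Q(s)
\end{align*}
```

The first statement is settled by exhibiting a single witness. The second has to hold for every system satisfying $P$, including ones nobody designed.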

This difference suggests different trade-offs:

  • If the goal is to design one system with the best performance we can achieve in the widest variety of environments possible, then we’ll want the strongest properties we can get. On the other hand, it’s fine if there are lots of possible agenty things which don’t satisfy our strong properties.

  • If the goal is to describe existing agents, then we’ll want to describe the widest variety of possible agenty things we can. On the other hand, it’s ok if the agents we describe don’t have very strong properties, as long as the properties they do have are realistic.

As an example, consider logical induction. Logical induction was a major step forward in designing agenty systems with strong properties, e.g. eventually having sane beliefs over logical statements despite finite resources. On the other hand, for the most part it doesn’t help us describe existing agenty systems much: bacteria or cats or (more debatably) humans probably don’t have embedded logical inductors.

Diving further into the different questions/subgoals:

  • Logical counterfactuals, Löbian reasoning, and the like are much more central to designing artificial agents than to describing existing agents (although still relevant).

  • Detecting agents in the environment, and backing out their models/​goals, is much more central to describing existing agents than to designing artificial agents (although still relevant).

  • The entire class of robust delegation problems is mainly relevant to designing ideal agents, and only tangentially relevant to describing existing agents.

  • Questions about the extent to which agent-like behavior requires agent-like architecture are mainly relevant to describing existing agents, and only tangentially relevant to designing artificial agents.

I’ve been pointing out differences, but of course there’s a huge amount of overlap between the theoretical problems of these two use-cases. Most of the problems of embedded world-models are central to both use-cases, as is the lack of a Cartesian boundary and all the problems which stem from that.

My general impression is that most MIRI folks (at least within the embedded agents group) are more focused on the AI design angle. Personally, my interest in embedded agents originally came from wanting to describe biological organisms, neural networks, markets and other real-world systems as agents, so I’m mostly focused on describing existing agents. I suspect that a lot of the disagreements I have with e.g. Abram stem from these differing goals.

In terms of how the two use-cases “should” be prioritized, I certainly see both as necessary for best-case AI outcomes. Description of existing agents seems more directly relevant to human alignment: in order to reliably point to human values, we need a theory of how things made of atoms can coherently “want” things in a world they can’t fully model. AI design problems seem more related to “scaling up” a human-aligned AI, i.e. having it self-modify and copy itself and whatnot without wireheading or value drift.

I’d be interested to hear from some agency researchers who focus more on the design use-case if all this sounds like an accurate representation of how you’re thinking, or if I’m totally missing the target here. Also, if you think that design-type use-cases really are more central to AI safety than description-type use-cases, or that the distinction isn’t useful at all, I’d be interested to hear why.