What’s next for the field of Agent Foundations?


Alexander, Matt and I want to chat about the field of Agent Foundations (AF), where it’s at and how to strengthen and grow it going forward.

We will kick off by each of us making a first message outlining some of our key beliefs and open questions at the moment. Rather than giving a comprehensive take, the idea is to pick out 1-3 things we each care about/​think are important, and/​or that we are confused about/​would like to discuss. We may respond to some subset of the following prompts:

Where is the field of AF at in your view? How do you see the role of AF in the larger alignment landscape/with respect to making AI futures go well? Where would you like to see it go? What do you see as some of the key bottlenecks for getting there? What are some ideas you have about how we might overcome them?

Before we launch in properly, just a few things that seem worth clarifying:

  • By Agent Foundations, we mean, roughly speaking, conceptual and formal work towards understanding the foundations of agency, intelligent behavior and alignment. In particular, we mean something broader than what one might call “old-school MIRI-type Agent Foundations”, which is typically informed by fields such as decision theory and logic.

  • We will not specifically be discussing the value or theory of change behind Agent Foundations research in general. We think these are important conversations to have, but in this specific dialogue, our goal is a different one, namely: assuming AF is valuable, how can we strengthen the field?

Should it look more like a normal research field?


The main question I’m interested in about agent foundations at the moment is whether it should continue in its idiosyncratic current form, or whether it should start to look more like an ordinary academic field.

I’m also interested in discussing theories of change, to the extent it has bearing on the other question.

Why agent foundations?

My own reasoning for foundational work on agency being a potentially fruitful direction for alignment research is:

  • Most misalignment threat models are about agents pursuing goals that we’d prefer they didn’t pursue (I think this is not controversial)

  • Existing formalisms about agency don’t seem all that useful for understanding or avoiding those threats (again probably not that controversial)

  • Developing new and more useful ones seems tractable (this is probably more controversial)

The main reason I think it might be tractable is that so far not that many person-hours have gone into trying to do it. A priori it seems like the sort of thing you can get a nice mathematical formalism for, and so far I don’t think that we’ve collected much evidence that you can’t.

So I think I’d like to get a large number of people with various different areas of expertise thinking about it, and I’d hope that some small fraction of them discovered something fundamentally important. And a key question is whether the way the field currently works is conducive to that.

Does it need a new name?

Alexander Gietelink Oldenziel

Does Agent Foundations-in-the-broad-sense need a new name?

Is the name ‘Agent Foundations’ cursed?

Suggestions I’ve heard are

‘What are minds’, ‘What are agents’, ‘Mathematical alignment’, ‘Agent Mechanics’.

Epistemic Pluralism and Path to Impact


Some thought snippets:

(1) Clarifying and creating common knowledge about the scope of Agent Foundations and strengthening epistemic pluralism

  • I think it’s important for the endeavor of meaningfully improving our understanding of such fundamental phenomena as agency, intelligent behavior, etc. that one has a relatively pluralistic portfolio of angles on them. The world is very detailed, and phenomena like agency/intelligent behavior/etc. seem like maybe particularly “messy”/detailed phenomena. Insofar as every scientific approach necessarily abstracts away a bunch of detail, and we don’t a priori know which bits of reality are fine to abstract away and which aren’t in which contexts, having plural perspectives on the same phenomena is a productive approach to coming to “triangulate” the desired phenomena.

  • This is why I am pretty keen on having a scope of AF that includes but is not limited to “old-school MIRI-type AF”. As I see it, the field has already started producing a larger plurality of perspectives, which is exciting to me. I am further in favour of

    • creating more common knowledge about the scope of AF—I want relative breadth in terms of methodologies, bodies of knowledge, epistemic practices and underlying assumptions, and relative narrowness in terms of the leading questions/epistemic aims of the field.

    • increasing the pluralism further—I think there are some fairly obviously interesting angles, fields, and knowledge bases to bring to bear on the questions of AF, and to integrate into the current conversations in AF and alignment.

    • working on creating and maintaining surface area between these plural approaches—“triangulation” as described above can only really happen when different perspectives interface and communicate, and as such we need places & interfaces where/through which this can happen.

(2) Where does AF sit on the “path to impact”

  • At a high level, I think it’s useful to ask: what are the (epistemic) inputs that need to feed into AF? What are the epistemic outputs we want to come out of AF, and where do we want them to feed into, such that at the end of this chain we get to something like “safe and aligned AI systems” or similar?

  • With respect to this, I’m particularly excited for AF to have tight interfaces/iteration loops with more applied aspects of AI alignment work (e.g. interpretability, evals, alignment proposals).

(3) possible prompt: if you had 2 capable FTE and 500′000 USD for AF field building, what would you do?

…suffering from a lack of time, so I will stop here for now.

Pockets of Deep Expertise

Alexander Gietelink Oldenziel

One of my favorite blogposts is Schubert’s “Against cluelessness: Pockets of predictability”, introducing ‘Pockets of Predictability’:

(...) intuitions about low-variance predictability long held back scientific and technological progress. Much of the world was once unknowable to humans, and people may have generalised from that, thinking that systematic study wouldn’t pay off. But in fact knowability varied widely: there were pockets of knowability or predictability that people could understand even with the tools of the day (e.g. naturally simple systems like the planetary movements, or artificially simple systems like low-friction planes). Via these pockets of knowability, we could gradually expand our knowledge—and thus the world was more knowable than it seemed. As Ernest Gellner points out, the Scientific and Industrial Revolutions largely consisted in the realisation that the world is surprisingly knowable:

“the generic or second-order discovery that successful systematic investigation of Nature, and the application of the findings for the purpose of increased output, are feasible, and, once initiated, not too difficult.”

I really like this way of thinking about the possibility of knowledge and development of science. I see a very similar ‘predictability skepticism’ across the field of Alignment.

This predictability skepticism is reflected in the Indefinite Optimism of lab-based alignment groups and the Indefinite Pessimism of doomers.

I want to introduce the idea of ‘Pockets of Deep Expertise’. That is—I think much of scientific progress is made by small groups of people, mostly opaque from the outside (‘pockets’), building up highly specific knowledge over fairly long time stretches (‘deep expertise’).

A few observations about these pockets:

  • they are often highly opaque & illegible from the outside.

  • progress is often partial & illegible. The Pocket has solved subproblems A, B, and C of Question X, but for some reason their methods have not yet been able to solve D. This prevents them from completely answering Question X or building technology Y.

  • progress is made over long time periods.

  • there are many False Prophets. Not everybody claiming (deep) expertise is actually doing valuable things. Some are outright frauds, others are simply barking up the wrong tree.

  • As a conservative estimate, 90-95% of (STEM) academia is doing work that is ‘predictably irrelevant’, p-hacked, and/or bad in various other ways. So most of academia is indeed not doing useful work. But some pockets are.

  • The variance of pockets is huge.

For the purpose of technical alignment, we need to think like a VC:

bet on a broad range of highly specific bets

To my mind we are currently only employing a tiny fraction of the world’s scientific talent.

Although Alignment now attracts a very large group of promising young people, much of their energy and talent is being wasted on reinventing the wheel.

How to Get a Range of Bets


Everyone has mentioned something along the lines of wanting to get a broad range of specific bets or types of people. We could take that as read and discuss how to do it?

(Although if we are going to talk about how we want the field to look, that probably most naturally comes first)


Ok, great. Let’s take stock quickly.

I think we are all interested in some version of “bet on a broad/​plural range of highly specific bets”. Maybe we should talk about that more at some point.

To help with the flow of this, it might be useful, however, to go a bit more concrete first. I suggest we take the following prompt:

if you had 2 capable FTE and 500′000 USD for AF field building, what would you do?

Reverse MATS


I’ll give the idea I was chatting about with Alexander yesterday as my first answer.

There are probably a large number of academics with expertise in a particular area which seems potentially useful for alignment, and who might be interested in doing alignment research. But they might not know that there’s a connection, or know anything about alignment. And unlike with junior researchers they’re not gonna attend some MATS-type programme to pick it up.

So the idea is “instead of senior alignment researchers helping onboard junior people to alignment research, how about junior alignment people help onboard senior researchers from other areas?” Anti-MATS.

EDIT: Renamed to Reverse MATS because people glancing at the sidebar thought someone in the dialogue was anti MATS. We are pro MATS!

We have a large pool of junior people who’ve read plenty about alignment, but don’t have mentorship. And there’s a large pool of experienced researchers in potentially relevant subjects who don’t know anything about alignment. So we send a junior alignment person to work as a research assistant or something with an experienced researcher in complexity science or active inference or information theory or somewhere else we think there might be a connection, and they look for one together and if they find it perhaps a new research agenda develops.


Yeah, I like this direction. I agree with the problem statement. I think “junior person helping senior person” is maybe helpful, but I’m unsure it’s the crux to getting this thing right. Here is what I think might be some cruxes/bottlenecks:

  • “Getting self-selection right”: how do ‘senior scholars’ find the ‘anti-MATS’ program, and what makes them decide to do it?

    • One thing I think you need here is to create a surface area to the sorts of questions that agent foundations for alignment is interested in, such that people with relevant expertise can grok those problems and see how their expertise is relevant to them.

    • For identifying more senior people, I think you need things like workshops, conferences and networks, rather than being able to rely on open applications.


I think you’d have to approach researchers individually to see if they’d like to be involved.

The most straightforward examples would be people who work in a pretty obviously related area or who are known to have some interest in alignment already (I think both were true in the case of Dan Murfet and SLT?) or who know some alignment people personally. My guess is this category is reasonably large.

Beyond that, if you have to make a cold pitch to someone about the relevance of alignment (in general and as a research problem for them) I think it’s a lot more difficult.

I don’t think, for example, there’s a good intro resource you can send somebody that makes a common-sense case for “basic research into agency could be useful for avoiding risks from powerful AI”, especially not one that has whatever hallmarks of legitimacy make it easy for an academic to justify a research project based on it.


Yeah cool. I guess another question is: once you’ve identified them, what do they need to succeed?

I’ve definitely also seen the failure mode where someone is only or too focused on “the puzzles of agency” without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more or less relevant to alignment, so I think it’s important that there is a clear/strong enough “signal” from the target domain (here: AI risk/alignment) to guide the search/research directions.


Yes, I agree with this.

I wonder whether focusing on agency is not even the right angle for this, and ‘alignment theory’ is more relevant. Probably what would be most useful for those researchers would be to have the basic problems of alignment made clear to them, and if they think that focusing on agency is a good way to attack those problems given their expertise then they can do that, but if they don’t see that as a good angle they can pursue a different one.

I do think that having somebody who’s well-versed in the alignment literature around (i.e. the proposed mentee) is potentially very impactful. There’s a bunch of ideas that are very obvious to people in the alignment community because they’re talked about so often (e.g. the training signal is not necessarily the goal of the trained model) that might not be obvious to someone thinking from first principles. A busy person coming in from another area could just miss something, and end up creating a whole research vision which is brought down by a snag that would have been obvious to an inexperienced researcher who’s read a lot of LW.

Alexander Gietelink Oldenziel

seniorMATS—a care home for AI safety researchers in the twilight of their career


Yes, good surface area to the problem is important. I think there is a good deal of know-how around on this by now: from introductory materials, to people with experience running the sort of research retreats that provide good initial contact with the space, to (as you describe) individuals who could help/assist/facilitate along the way. Also worth asking what the role of a peer environment should/could be (e.g. an AF Discord type thing, and/or something a bit more high-bandwidth).

Also, finding good general “lines of attack” might be pretty useful here. For example, I have found Evan’s “model organisms” to be a pretty good/generative frame for getting AF-type work to be more productively oriented towards concrete/applied alignment work.


Alignment Noob Training—Inexperienced Mentees Actually Teach Seniors (ANTIMATS)


My model here puts less emphasis on “junior researchers mentoring up”, and more on “creating the right surface area for people with the relevant expertise” more generally; one way to do this may be junior researchers with more alignment exposure, but I don’t think that should be the central pillar.

Alexander Gietelink Oldenziel

the three things I am looking for in an academic (or nonacademic) researcher with scientific potential are

  1. alignmentpilled—important.

You don’t want them to run off doing capability work. There is an almost just as pernicious failure where people say they care about ‘alignment’ but they don’t really. Often this is variants where alignment and safety becomes a vague buzzword that gets co-opted for whatever their hobbyhorse was.

  2. belief in ‘theory’—they think alignment is a deep technical problem and believe that we will need scientific & conceptual progress. Experiments are important, but pure empirics is not sufficient to guarantee safety. Many people conclude (perhaps rightly so!) that technical alignment is too difficult and governance is the answer.

  3. swallowed the bitter lesson—unfortunately, there are still researchers who do not accept that LLMs are here. These are especially common, surprisingly perhaps, in AI and ML departments. Gary Marcus adherents in various guises. More generally, there is a failure mode of disinterest in deep learning practice.


“creating the right surface area for people wiht the relevanat expertise”

That seems right. Creating a peer network for more senior people coming into the field from other areas seems like it could be similarly impactful.

Appealing to Researchers

Alexander Gietelink Oldenziel

You don’t convince academics with money. You convince them with ideas. Academics are mental specialists. They have honed very specific mental skills over many years. To convince them to work on something you have to convince them that 1. the problem is tractable, 2. it is fruitful & interesting, and most importantly 3. it is vulnerable to the specific methods that this academic researcher has in their toolkit.

Another idea that Matt suggested was a BlueDot-style “Agent Foundations-in-the-broad-sense” course.

Euclidean Geometry rant

The impact of Euclidean Geometry on Western intellectual thought has been immense. But it is slightly surprising: Euclid’s geometry has approximately no application. Here I mean Euclid’s geometry as in the proof-based informal formal system of Euclidean geometry as put forward in Euclid’s Elements.

It is quite interesting how the impact actually worked. Many thinkers cite Euclidean geometry as decisive for their thinking—Descartes, Newton, Benjamin Franklin, Kant, to name just a few. I think the reason is that it formed the ‘model organism’ of what conceptual, theoretical progress could look like: the notion of proof (which is interestingly unique to the Western mathematical tradition, despite e.g. 15th-century Kerala, India discovering Taylor series before Newton), the notion of true certainty, the notion of modelling and idealization, the idea of stacking many lemmas, etc.

I think this kind of ‘successful conceptual/theoretical progress’ is highly important in inspiring people, both historically and currently.

I think the purpose of such an AF course would be to show academic researchers that there is real intellectual substance to conceptual Alignment work.


[at this point we ran out of our time box and decided to stop]