Agent Meta-Foundations and the Rocket Alignment Problem

Many people are quite skeptical of the value of agent foundations. The kinds of problems that MIRI worries about, such as accounting for perfect predictors, co-operation with clones, and being easily predictable, are a world away from the kinds of problems being faced in machine learning. Many people think that proper research in this area would involve code. They may also think that this kind of research consists purely of extremely rare edge cases of no practical importance, won't be integrable into the kinds of AI systems that are likely to be produced, or is just much, much less pressing than solving the kinds of safety challenges that we can already see arising in our current AI systems.

In order to convey his intuitions about why Agent Foundations is important, Eliezer wrote the Rocket Alignment Problem. The argument is roughly that any attempt to define what an AI should do is built upon shaky premises, which makes it practically impossible to provide guarantees that things will go well. The hope is that by exploring highly simplified, theoretical problems we may learn something important that would deconfuse us. However, it is also noted that this might not pan out, as it is hard to see how useful improved theoretical understanding is before you've obtained it. Further, it is argued that AI safety is an unusually difficult area where things that sound like pretty good solutions could result in disastrous outcomes. For example, powerful utility maximisers are very good at finding the most obscure loopholes in their utility function to achieve a higher score. Powerful approval-based agents are likely to try to find a way to manipulate us. Powerful boxed agents are likely to find a way to escape that box.

Most of my current work on Less Wrong is best described as Agent Foundations Foundations. It involves working through the various claims or results in Agent Foundations, finding aspects that confuse me, and then digging deeper to determine whether I'm confused, whether something needs to be patched, or whether there's a deeper problem.

Agent Foundations Foundations research looks very different from most Agent Foundations research. Agent Foundations is primarily about producing mathematical formulations, while Agent Foundations Foundations is primarily about questioning philosophical assumptions. Agent Foundations Foundations is intended to eventually lead to mathematical formalisations, but the focus is more on figuring out exactly what we want from our maths. There is less focus on producing formalisations, as rushing ahead and producing incorrect formalisations is seen as a waste of time.

Agent Foundations research relies on a significant number of philosophical assumptions, and this is also an area where the default is disaster. The best philosophers are extremely careful with every step of the argument, yet they often come to entirely different conclusions. Given this, attempting to rush over this terrain by handwaving philosophical arguments is likely to end badly.

One challenge is that mathematicians can be blinded by an elegant formalisation, to the point where they can't objectively assess the merits of the assumptions it is built upon. Another key issue is that when someone is able to use a formalisation to produce a correct result, they will often assume that the formalisation must be correct. Agent Foundations Foundations attempts to fight against these biases.

Agent Foundations Foundations focuses on what often appear to be weird, niche issues from the perspective of Agent Foundations. This includes questions such as:

Of course, lots of other people have done work in this vein too. I didn’t want to spend a lot of time browsing the archive, but some examples include:

I don't want to pretend that the separation is clean at all. But in Agent Foundations work, the maths comes first and the philosophical assumptions come second; for Agent Foundations Foundations, it is the other way round. Obviously, this distinction is somewhat subjective and messy. However, I think this model is useful, as it opens up discussions about whether the current balance of research is right and suggests areas for further research. It also clarifies why some of these problems might turn out to be more important than they first appear.

Update: One issue is that I almost want to use the term in two different ways. One way to think about Meta-Foundations is in an absolute sense, where it focuses on philosophical assumptions, while Foundations focuses more on formalisations and ML focuses on writing programs. Another is in a relative sense, where you have a body of work termed Agent Foundations and I want to encourage a body of work that responds to it and probes its assumptions further. These senses are different because, when Agent Foundations work is pursued, there'll usually be some investigation into the philosophy, but it'll often be the minimal amount needed to get a theory up and running.