Toward a Kantian refutation of Agent Foundations
This post is a Cunningham’s law draft, less than 50% finished, in some parts mere notes. Consider a) waiting until this notice has disappeared to read a more coherent post, or b) criticizing it with a focus on what would be right, not just what is wrong. I haven’t firmly made up my mind yet, so at this stage I’m particularly interested in fundamental criticisms of the goal and framing (but of course I also welcome minor corrections).
TL;DR: The goals of Agent Foundations seem so far-fetched that all the progress made in the field doesn’t seem to have decreased our distance to them that much. One might conclude that the goal is unachievable, the question ill-posed. But even the available rejections of AF seem unprincipled: instead of proving that the task is impossible,[1] we simply fail, and do something else instead. Working toward a principled refutation of Agent Foundations might either indeed refute AF or point to unexplored directions, and both outcomes would be helpful information.
Prelude: Kant’s refutation of all arguments for God’s existence
Kant argued that there are three (and only three) related concepts of God, definable on three different levels:
Defined independently of any universe: God as set of all logical predicates.
Defined in connection to the existence of a universe, without assuming any of its properties: God as cause of the universe.
Defined in connection to our universe: God as cause of the apparent purposefulness of the universe.
He argued furthermore that none of the three definitions has enough “meat” to allow for a valid proof of God’s existence, and concluded that everyone should stop wasting their time trying to prove it.
I’m not interested in the specifics of Kant’s argument; only in the structure:
1. We are dealing with concepts that are to some extent[2] definable without reference to the specific facts of our universe
2. There are different levels of “abstraction from the facts of our universe” in which different definitions can be made
3. These different formulations potentially have connections to each other (e.g. one might be a sub-specification of another)
4. An attempt can be made at showing that all levels and all possible definitions have been listed
5. An attempt can be made at showing, for each level and definition, that there isn’t enough for the kind of proof we are looking for
Computability theory as comparison
Consider:
Computation can be defined at the level of what a Turing machine can or cannot produce, at the level of what belongs to different complexity classes, and at the level of what can be computed in our universe.
The concept of computation at the topmost level has enough content to allow e.g. for a proof that Turing-machine-computable and lambda-calculus-computable refer to the same thing.
At the second level, everything computable belongs to some complexity class. But you can still work on the first level and ignore this extra specification.
At the third level, the Cobham-Edmonds thesis claims that only polynomial complexity is tractable in our universe. This is a fact about our universe (other universes with different properties are conceivable), and at the same time a very general fact that abstracts from the specifics of our universe: other universes that fulfill this property while being very different from ours are also conceivable.
At the bottom-most level, you need a substrate for your computations, and this adds a lot of specifications to what computing means.
Across the different levels, computability is, in some sense, the same object, and in some sense three different objects.[3]
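As a small illustration of the topmost level, here is a sketch that encodes arithmetic in lambda-calculus style (Church numerals) and checks that it agrees with ordinary machine arithmetic on a small range. It is not a proof of the equivalence mentioned above, only a finite spot check. Python's built-in integers stand in for the "machine" side, and the helper names (`to_church`, `to_int`, `agree`) are my own choices for this illustration.

```python
# Church numerals: natural numbers encoded as pure functions (lambda-calculus style).
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))
mul = lambda m: lambda n: lambda f: m(n(f))


def to_church(k: int):
    """Encode a non-negative Python int as a Church numeral."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n


def to_int(n) -> int:
    """Decode a Church numeral by counting how often it applies '+1' to 0."""
    return n(lambda i: i + 1)(0)


def agree(limit: int = 6) -> bool:
    """Spot check (not a proof): both encodings give the same sums and products."""
    return all(
        to_int(add(to_church(a))(to_church(b))) == a + b
        and to_int(mul(to_church(a))(to_church(b))) == a * b
        for a in range(limit)
        for b in range(limit)
    )
```

Nothing in this check mentions complexity classes or a physical substrate, which is the sense in which one can work on the first level and ignore the other two.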
Locating the AF paradigm
Agent Foundations is in search of a paradigm. It’s probably worth reflecting on what exactly a paradigm is. My initial approximation is a self-contained network of concepts and proofs. A second worthwhile question in the search for a paradigm is: at which level of abstraction does the paradigm live?[4]
It seems plausible to me that a solution for alignment can be found using several self-contained components, and that not all components are on the same level. But it seems very implausible that we can cobble concepts from different levels in a haphazard fashion. So it might be useful to try to create a taxonomy of levels, to locate current agendas within them,[5] and to see what we are missing.
The guiding question for locating the level at which a solution lives is: How different are the universes in which this alignment strategy would work? Would this strategy work in a universe with different physics but the same math?
Before we explore a tentative taxonomy of levels, let’s try to list the concepts that we are trying to understand.
Concepts
There are two basic aspects to what we are trying to capture under the name of agency: an agent knows the universe and an agent acts in the universe. I will separate the two aspects and, for lack of better terms, speak of inductors and interveners.[6]
A natural question to ask is: Is every inductor an intervener, and vice versa?
The intervener’s interventions can be conceptualized with the concepts of coherence,[7] values, pessimization (?), etc., which might be definable in different levels.
I don’t know if there are some equivalent concepts for the inductor side.
We also have concepts like alignment and control, which seem to be definable on different levels.
Definitional levels
Empirical and non-empirical research are useful labels, but it might be worthwhile to come up with finer distinctions. Here is one attempt:
1. Without reference to our universe
Dualistic Mathland[8]
I will call a definition of a connection dualistic if it does not specify the things it connects.[9]
The introductory definition of a set as an unordered collection of elements is dualistic: it allows for operations like union, intersection, etc., without having to specify what the elements are. ZFC, on the other hand, is not dualistic: all ZFC-sets are ultimately definable from the empty set, and having a set which isn’t thus definable isn’t allowed.
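A loose sketch of this contrast, using Python's `frozenset` as a stand-in: `union` never asks what the elements are, while `von_neumann` and `is_hereditary` only deal with sets built up from the empty set. The function names are hypothetical, and finite Python sets are only an analogy for ZFC, not a model of it.

```python
from typing import FrozenSet, TypeVar

T = TypeVar("T")


def union(a: FrozenSet[T], b: FrozenSet[T]) -> FrozenSet[T]:
    """'Dualistic' operation: works without specifying what the elements are."""
    return a | b


def von_neumann(n: int) -> frozenset:
    """Build the number n as a pure set: 0 = {}, k+1 = k ∪ {k}."""
    s = frozenset()
    for _ in range(n):
        s = s | frozenset([s])
    return s


def is_hereditary(s) -> bool:
    """True if s is a frozenset whose members are, recursively, all frozensets."""
    return isinstance(s, frozenset) and all(is_hereditary(x) for x in s)
```

For example, `union(frozenset({"apple"}), frozenset({3}))` is a perfectly good set under the introductory definition, but `is_hereditary` rejects it, because `"apple"` and `3` are not themselves built from the empty set.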
The concept of evolution belongs here, as does non-Many Worlds Quantum Mechanics.
More relevant to our purposes, Bayesian updating also belongs here, as do Bayesian Neural Networks.
How does AF look inside Dualistic Mathland? Some results and questions:
An inductor can be defined as a Solomonoff inductor
An intervener that is also an inductor can be defined as an AIXI-inductor/intervener
Is every intervener an inductor?[10]
Does every intervener follow some kind of reward?
Control and alignment seem to be understandable questions in this level (i.e. how to get an AIXI that does the things that we want?) but not answerable at this level.
The most abstract definitions of Multiple agency are by definition dualistic.
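To make the inductor side slightly more concrete, here is a toy sketch of Bayesian updating over a simplicity-weighted hypothesis class. It is emphatically not a Solomonoff inductor: real Solomonoff induction mixes over all programs and is uncomputable, whereas this example uses four hand-picked hypotheses with made-up "description lengths". All names and numbers below are assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical toy hypothesis class: (name, description_length, predictor).
# A predictor maps the history (a tuple of bits) to P(next bit == 1).
HYPOTHESES = [
    ("always_one", 1, lambda h: Fraction(9, 10)),
    ("always_zero", 1, lambda h: Fraction(1, 10)),
    ("alternate", 2,
     lambda h: Fraction(9, 10) if (not h or h[-1] == 0) else Fraction(1, 10)),
    ("fair_coin", 3, lambda h: Fraction(1, 2)),
]


def prior():
    """Simplicity prior: weight proportional to 2 ** -description_length."""
    raw = {name: Fraction(1, 2 ** length) for name, length, _ in HYPOTHESES}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}


def update(weights, history, bit):
    """One Bayesian update: reweight each hypothesis by the likelihood of `bit`."""
    new = {}
    for name, _, predict in HYPOTHESES:
        p_one = predict(history)
        new[name] = weights[name] * (p_one if bit == 1 else 1 - p_one)
    total = sum(new.values())
    return {name: w / total for name, w in new.items()}


def predict_next(weights, history):
    """Mixture prediction: posterior-weighted probability that the next bit is 1."""
    return sum(weights[name] * predict(history) for name, _, predict in HYPOTHESES)


def run(bits):
    """Feed a bit sequence to the toy inductor and return (posterior, history)."""
    weights, history = prior(), ()
    for bit in bits:
        weights = update(weights, history, bit)
        history += (bit,)
    return weights, history
```

Note that this inductor only observes: nothing in it chooses actions or tracks a reward, which is one way to see why the inductor/intervener questions above are separate questions.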
Jump from Dualistic to Computable
The questions that arise in Dualistic Mathland[11] don’t have solutions, but in some sense this is irrelevant, because we know that these definitions are unsuitable for our universe:[12] the definitions in Dualistic Mathland are incompatible with things like recursion;[13] their problem isn’t being “too abstract”; they are the wrong kind of abstract. This is good news: this is the kind of Kantian refutation that we want. We understand agency inside Dualistic Mathland well enough to know that we need to keep searching outside of it, in a similar way to how Gödel proved that understanding arithmetic well enough means leaving forever the naive idea of arithmetic without recursion problems.[14]
An optimistic scenario would be that something similar happens in the other levels.
Computable Mathland
(See upcoming post Nitpicking on Embeddedness. One important point is: being computable doesn’t say anything about the universe; computable concepts are still purely mathematical concepts.)
2. With reference to our universe
Dualistic and Computable Mathland are simply names for things presumably obvious to everyone. Now I tentatively propose that we abandon the equality “scientific method = making reference to the laws of our universe”, and distinguish four different levels. They all indeed make reference to facts of our universe[15], but only the fourth one, which does this to a very high extent, is normal science.
The first of them is what I’ll call the macro-empirical level. It only uses extremely general properties of the universe,[16] and can be seen as the most abstract level (=the closest to Mathland) that still has the property that it applies to some universes and not to others. But there could be many universes for which these general properties hold, while looking very different from ours.
The second level is the platonico-empirical, which starts from potentially very fine-grained facts of our universe, but tries to find very abstract underlying structures in them.
The normal science level references and uses a plethora of specific facts about our universe.
Macro-Empirical
The Natural Abstractions and Condensations agendas seem to me to fit here.
A similar thing seems to apply to the attempts to define agency from Friston’s Free Energy Principle.[17]
On the less winding but more exhaustingly uphill high road, self-evidencing is developed a priori, on the basis of conceptual analysis and mathematics. Here, the free energy principle is understood as similar to principles in physics, such as the principle of least action. It does not convey an empirical law of nature, which can be confirmed to various degrees through observation. The principle may or may not apply to certain systems when they are described in certain ways. In this sense, self-evidencing is a method or system for understanding existing things, such as humans. Empirical confirmation comes in when principles are applied to furnish process theories that describe particular systems, such as the anatomy and structure of the brains and bodies of humans, of other animals, or of machines. Because self-evidencing is a priori, it is intended as a way to describe any self-organizing thing that exists; therefore, it would not make sense to say, for example, that self-evidencing is something that some organisms and not others have evolved to maximize fitness.
Jakob Hohwy, The Self-Evidencing Agent
Platonico-Empirical
Several results (manifold hypothesis, platonic space hypothesis, simplicity bias) point to the universe being a model of a mathematical object, in ways beyond the trivial way that underlies any instance of empirical science. I’ll define the platonico-empirical level as the level at which one can attempt to instrumentalize these facts. Michael Levin is probably the best representative of what this could look like, with his references to non-metaphorical agency inside what he calls the Platonic space, with which we can attempt to interact.[18]
Human-Empirical
Research around concepts like CEV aims to reach very general conclusions, but starts from the empirical fact of how humans are. I’m unsure whether this is regular empirical research coupled with speculation, or indeed an additional level at which, beyond CEV, other concepts could be tried.
Schelling Goodness also possibly belongs here.
Normal-Empirical=Scientific
Agent Foundations isn’t a badge of honor; “not belonging to AF” might simply mean “being a well-posed question with a solution”. Thus it is without valuation or particular surprise that all agendas which are simply doing normal science are not part of Agent Foundations.
The only interesting point is clarifying that empirical and non-prosaic aren’t synonymous. For instance, I think of Byrnes’s agenda as trying to abstract the mechanism by which a human intervener is properly modeled not by a utility function, but by what the human imagines their peers think of them. It might be that this concept is abstractable, and if so, it could guide research on different substrates (LLMs and some architecture that we haven’t discovered yet, for instance). To that extent, it is doing something beyond looking at current LLMs and trying to understand them, i.e. it’s non-prosaic alignment research. But Byrnes isn’t trying to locate that particular mechanism, which must sit somewhere in Mathland, through exploration of Mathland. Byrnes is trying to locate it through exploration of our universe (more specifically, of human brains and their effects), i.e. doing brain science. And understandably so, since there is no reason to expect that mechanism to be salient in Mathland.
3. Solving Philosophy and/or Math
The previous levels are all more or less agnostic wrt “solving philosophy”, i.e. one could for instance work on the Platonic Space without asking the question of what exactly that means.
But it is plausible that the confusion regarding this situation acts as a blocker in AF, so that working on clearing this confusion could itself be one way of working on AF.[19]
A major source of confusion is that humans are embedded, non-dualistic parts of the universe, and yet seem to have what Nagel called the “view from nowhere”, able to find out truths that are valid also outside the universe.
The hierarchy between solving Math[20] and solving Philosophy seems unclear. Solving philosophy seems to gesture at something like formalizing the structure of the world of concepts. In some sense this is prior to math (because concepts are a much more general category than mathematical concepts), but the structure might be mathematical. It’s possible that at the bottom we don’t find a foundation, but two things that refer to each other.[21]
- ^
Or at the very least having a very convincing argument of why it’s very unlikely to be possible.
- ^
And the crux is precisely to what extent?
- ^
There’s no such thing [as computer science]. [...] At one end you have people who are really mathematicians. [...] In the middle you have people working on something like the natural history of computers—studying the behavior of algorithms for routing data through networks, for example. And then at the other extreme you have the hackers, who are trying to write interesting software, and for whom computers are just a medium of expression, as concrete is for architects or paint for painters.
Hackers and Painters
Paul Graham
- ^
This question seems central to distinguishing the specific things that have been tried from the general strategies that have been tried and which could be fulfilled with other specific things. In particular, it seems to me there is no vocabulary to distinguish between the MIRI (and MIRI-inspired) research, and the paradigm such research points towards. This and this are great introductions to the second question, but they are busy rejecting criticism of the field rather than, as I intend here, creating the unified vocabulary to sketch a map of all the things that are and could be part of AF, even if those things contradict each other in the specifics.
- ^
Each concept of an agenda could in theory be on a different level, but if the agenda is coherent, we should expect to find the whole of it nested together, unless the agenda consists of several separable coherent sub-components.
- ^
From now on I will avoid using the misleading terms agent or agency.
- ^
I take the idea of separating the different definitional levels of coherence from this exchange:
Mateusz Bagiński:
How well can the entity’s behavior be explained as trying to optimize a single fixed utility function?
How well aligned is the entity’s behavior with a coherent and self-consistent set of goals?
To what degree is the entity not a hot mess of self-undermining behavior?
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally
discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general?
- ^
The level distinguishes where the definition of agency is made; it’s not automatically an ontological level. See the final section, Solving Philosophy and/or Math, for more. See also Scott Aaronson:
Do we ever really talk about the continuum,
or do we only ever talk about finite sequences of symbols that talk about the continuum?
- ^
I claim this definition has the same spirit as the usual definitions, but is more accurate. See upcoming post Nitpicking about Embeddedness.
- ^
I.e. is there something like Q-learning in Dualistic Mathland?
- ^
Like “how to align an AIXI inductor-intervener?”
- ^
With the possible exception of the ones regarding Multiple Agency, about which I am even more confused than about everything else here.
- ^
This is relevant for Bayesian updates, Decision Theory, and possibly other concepts.
- ^
Nitpicking on unbounded analysis, Yudkowsky writes:
If you can’t state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem.
This is true if the problem has already been formulated at a level lower than the one that allows unbounded analysis, but not if the problem is formulated at that level itself, or is vague enough to be formulable at several levels. “Not being able to solve alignment (as defined in Dualistic Mathland) through unbounded analysis” has as little relevance as “not being able to write an algorithm that solves all possible variations of chess”. The problem is barely well-defined enough to be a question, but not well-defined enough for the lack of an answer to be relevant.
- ^
And thus have the advantage of being automatically relevant for our universe and, synonymously, the disadvantage that if it turns out that the empirical description was wrong, they might lose some or all of their relevance.
- ^
For instance, how physical systems evolve over time.
- ^
I’m planning to write a post about what I see as independent claims which are often presented together, and often under the same name:
The Free Energy Principle as a framework for all agents in our universe
A more specific subset of FEP for agents with self-models in our universe
The general claims of the Bayesian Brain theory
The more specific claims of Predictive Processing
The neurological implementation of this in Predictive Coding
The more specific claims that actions and beliefs are the same for these agents
This would be an application of Stuart Armstrong’s advice:
Cut up your Great Thingy into smaller independent ideas, and treat them as independent.
For instance a marxist would cut up Marx’s Great Thingy into [several theories]. Then each of them should be assessed independently, and the truth or falsity of one should not halo on the others. If we can do that, we should be safe from the spiral, as each theory is too narrow to start a spiral on its own.
Same thing for every other Great Thingy out there.
Claim 6 seems particularly relevant to research, because it might point to a more general answer to the question of whether every inductor is an intervener.
- ^
Trying to do something like the Natural Abstractions agenda inside that Platonic Space also seems like something potentially worth trying.
- ^
One very speculative way in which this could work out: Kant sketched an argument for why every free will should act super-rationally towards other free wills. Unfortunately the concept of free will doesn’t seem to be compatible with our deterministic universe. But what if we could convince an ASI that it is a free will in the Platonic Space, and we could do something like proving meta-ethical theorems that the ASI would be (legitimately) convinced it should obey? Relatedly, it is interesting to note that Kant’s defense of free will is much closer to the Block Universe than to Newtonian mechanics.
- ^
Also mentioned by Kaarel in the previously mentioned discussion of the definitional levels of Coherence.
- ^
We are trying to capture the level at which the concept, as useful for our universe, can be defined. Compare:
What sort of claim is [Church-Turing]?
Is it an empirical claim, about which functions can be computed in physical reality?
Is it a definitional claim, about the meaning of the word “computable”? Is it a little of both?
Well, whatever it is, the Church-Turing Thesis can only be regarded as extremely successful, as theses go.
Scott Aaronson
Re [18] on free will. I think that the common understanding among compatibilists (who believe that it can exist in a deterministic universe) is:
Free will is the ability to make decisions freely, as an individual with personal and moral considerations that impact your decisions.
As consciousness supervenes on the physical world, the state of the physical world means that you have specific considerations that come to mind and thereby influence your decisions; you only could have chosen differently if the physical world had been different.
However, if the physical world was different in such a way that you chose differently, then the person choosing would not be you, but instead a nearly-identical copy. You could only make the decision that you did and that is what makes you yourself and not your copy.
This also means that decisions which you would never make differently are more demonstrative of free will: they are more emblematic of who you are because you would have to be changed more (in terms of the general impact on your behavior) before you would consider alternatives. Decisions that have a chance of going either way have much less bearing on you as a person.