(warning: this comment is significantly longer than the post)
I want to begin by saying that I appreciate the existence of this post. Truly and honestly. I think it’s important to praise those who at least try explaining difficult topics or hard-to-communicate research intuitions, even when those explanations are imperfect or don’t fulfill their intended purpose. For it’s only by incentivizing genuine attempts that we have a chance of obtaining better ones or dispelling our confusions. And, at the very least, this post represents a good vehicle for me to express my current disagreement with/disapproval of agent foundations research.[1]
Nevertheless, in the interest of honesty, I will say this post leaves me deeply unsatisfied. Kind of like… virtually every post ever made on LW that tries to explain agent foundations? At this point I don’t think there’s anything about the authors[2] that causes this, but rather the topic itself which doesn’t lend itself nicely to this type of communication (but see below for an alternate perspective[3]).
Let’s start with the “Empirics” section. Alex Altair writes:
From where I’m standing, it’s hard to even think of how experiments would be relevant to what I’m doing. It feels like someone asking me why I haven’t soldered up a prototype. That’s just… not the kind of thing agent foundations is. I can imagine experiments that might sound like they’re related to agent foundations, but they would just be checking a box on a bureaucratic form, and not actually generated by me trying to solve the problem.
It’s… hard to see how experiments would be relevant to what you’re doing? Really? The point of experiments is to ensure that the mathematical frameworks you are describing actually map onto something meaningful in reality, as opposed to being a nice, quaint, self-consistent set of mathematical symbols and ideas that nonetheless reside in their own separate magisterium without predicting anything important about how real life shakes out. As I have said before:
There’s a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don’t generalize well at all. And it’s only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you’ve chosen is in the former category as opposed to the latter.
Conor Leahy has written:
Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.
I have written:
Idk man, some days I’m half-tempted to believe that all non-prosaic alignment work is a bunch of “streetlighting.” Yeah, it doesn’t result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world, based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details serving as evidence of whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first batch of really powerful AIs.
I fear there is a general failure mode here that people who are not professional mathematicians tend to fall into when they think about this stuff. People who have read Wigner’s Unreasonable Effectiveness of Mathematics and Hamming’s follow-up to it and Eliezer’s Sequences vibing about the fundamentally mathematical nature of the universe, and whose main takeaway from them is that elegance and compact-descriptiveness in pure mathematics is some sort of strong predictor of real-world applicability. They see all these examples of concepts that were designed purely to satisfy the aesthetic curiosities of pure mathematicians, but afterwards became robustly applicable in concrete, empirical domains.
But there is a huge selection effect here. You only ever hear about the cool math stuff that becomes useful later on, because that’s so interesting; you don’t hear about stuff that’s left in the dustbin of history. It’s difficult for me to even put into words a precise explanation meant for non-mathematicians[4] of how this plays out. But suffice it to say the absolute vast majority of pure mathematics is not going to have practical applicability, ever. The vast majority of mathematical structures selected because they are compact and elegant in their description, or even because they arise “naturally”[5] in other domains mathematicians care about, are cute structures worth studying if you’re a pure mathematician, but almost surely irrelevant for practical purposes.
Yes, the world is stunningly well-approximated by a relatively compact and elegant set of mathematical rules.[6] But there are infinitely more short and “nice” sets of rules which don’t approximate it.[7] The fact that there is a solution out there doesn’t mean other posited solutions which superficially resemble it are also correct, or even close to correct. There is no “theory of the second-best” here. And those rules were found through experiments and observations and rigorous analysis of data, not merely through pre-empirical daydreaming and Aristotelian “science.”
Eliezer loves talking about the story of how Einstein knew his theory was correct based on its elegance, and how when he was asked by journalists what he’d do if Eddington falsified his theory, he would say “Then I would feel sorry for the good Lord. The theory is correct.” But that’s one story! One. And it’s cool and memorable and fun and you give it undue weight for how Deeply Awesome it feels. But it still generates the same selection effect I mentioned before. If you peer through the history of science, even prior to Einstein during the time of Galileo and Kepler, and especially after Einstein and further developments we’ve had in the field of physics, you’ll see the story is not representative of the vast, vast majority of how science is done. That empirics reigns, and approaches that ignore it and try to nonetheless accomplish great and difficult science without binding themselves tight to feedback loops almost universally fail.
But also pay attention to the reference class it’s in! Physics, I said above. Why physics? How do we know that’s at all representative of the type of science we’re interested in? If this didn’t have an effect, meaning the histories of different fields of inquiry were fundamentally similar, then it wouldn’t matter much which specific subfield we focus on. And yet it does!
Shankar Sivarajan has written:
The opening sounds a lot like saying “aerodynamics used to be a science until people started building planes.”
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is. A physicist’s view. It is one I’m deeply sympathetic to, and if your definition of science is Rutherford’s, you might be right, but a reasonable one that includes chemistry would have to include AI as well.
Richard Ngo has written:
Villiam: I have an intuition that the “realism about rationality” approach will lead to success, even if it will have to be dramatically revised on the way.
To explain, imagine that centuries ago there are two groups trying to find out how the planets move. Group A says: “Obviously, planets must move according to some simple mathematical rule. The simplest mathematical shape is a circle, therefore planets move in circles. All we have to do is find out the exact diameter of each circle.” Group B says: “No, you guys underestimate the complexity of the real world. The planets, just like everything in nature, can only be approximated by a rule, but there are always exceptions and unpredictability. You will never find a simple mathematical model to describe the movement of the planets.”
The people who finally find out how the planets move will be spiritual descendants of the group A. Even if on the way they will have to add epicycles, and then discard the idea of circles, which seems like total failure of the original group. The problem with the group B is that it has no energy to move forward.
The right moment to discard a simple model is when you have enough data to build a more complex model.
Richard Ngo: In this particular example, it’s true that group A was more correct. This is because planetary physics can be formalised relatively easily, and also because it’s a field where you can only observe and not experiment. But imagine the same conversation between sociologists who are trying to find out what makes people happy, or between venture capitalists trying to find out what makes startups succeed. In those cases, Group B can move forward using the sort of “energy” that biologists and inventors and entrepreneurs have, driven by an experimental and empirical mindset. Whereas Group A might spend a long time writing increasingly elegant equations which rely on unjustified simplifications.
Instinctively reasoning about intelligence using analogies from physics instead of the other domains I mentioned above is a very good example of rationality realism.
jamii has written:
Uncontrolled argues along similar lines—that the physics/chemistry model of science, where we get to generalize a compact universal theory from a number of small experiments, is simply not applicable to biology/psychology/sociology/economics and that policy-makers should instead rely more on widespread, continuous experiments in real environments to generate many localized partial theories.
Anyway, enough on that.[8] Let’s move on to “What makes agent foundations different?” In the very first paragraph, Alex Altair writes:
One thing that makes agent foundations different from science is that we’re trying to understand a phenomenon that hasn’t occurred yet (but which we have extremely good reasons for believing will occur). I can’t do experiments on powerful agents, because they don’t exist.
The first sentence is false. Science routinely tries to understand phenomena it has good reason to believe exist, but hasn’t yet been able to pinpoint exactly and concretely. The paradigmatic and illustrative example of this is the search for room-temperature superconductors. This is primarily done by scientists, not by engineers.
But more to the point, the second sentence also reads as substantively dubious. You don’t have all-powerful ASI to experiment on, but here’s what you do have:
the sole example of somewhat-aligned generally-intelligent occasionally-agentic beings ever created, namely humans
somewhat agentic, somewhat intelligent, easy-to-query-and-interact-with AI models that you can (very cheaply!) run recurrent experiments on to test your theories
Does your theory of agency have nothing to say about either of them? Then why on Earth would you assume any partial results you obtain are anywhere close to reliable? Are you assuming a binary dichotomy between something that’s “smart and agentic, so our theories apply” on one end, and “dumb and unagentic, so our theories don’t apply” on the other end?[9]
If that’s so, and even humans fall into the latter, then I also don’t see why your theories would have any applicability in the most safety-critical regime, i.e., when the first powerful models are created. Nate Soares has written:
By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.
If your theory of Agent Foundations has nothing to say about current AI, and nothing to say about current generally-intelligent humans,[10] does it have anything to say about the actual AGI we might create?
Alex Altair also writes:
So, I don’t think that what we’re lacking is data or information about the nature of agents—we’re lacking understanding of the information we already have.
Well, actually, you are lacking some data or information about something you care about when it comes to agency. Namely: to what extent are the models we’re interested in aligning actually well-modeled as agents? When I look at other humans around me (and at myself), I see beings that are well-approximated as agents in some ways, and poorly-approximated as agents in other ways. Figuring out which aspect of cognition falls on which side of this divide seems like an absolutely critical (maybe even the absolutely critical) question I’d want agent foundations to give me a reliable answer to. Are you not even trying to do that?
Let me give a concrete example. A while ago, spurred on by observing many instances of terrible arguments in favor of treating relevant-agents-as-utility-maximizers, I wrote a question post on “What do coherence arguments actually prove about agentic behavior?” Do you think you have a complete answer to this question? And do you think you don’t even need any new data or information to answer it?
And finally, Alex Altair talks about how “It’s kinda like computer science.” And he writes:
One needs to have some kind of life experiences that points your mind toward the relevant concepts, like “computing machines” at all. But once someone has those, they don’t necessarily need more information from experiments to figure out a bunch of computability theory.
One does need information from experiments to know that computability theory is at all useful/important in the real world. And also to know when it matters, and when it doesn’t or is an incomplete description of what’s going on.[11] Which is what I hope agent foundations is about. Something useful for AI safety. Something useful in practice. If it’s a cute branch of essentially-math that doesn’t necessarily concern itself with saving the world from AI doom, why should anyone give you any money or status in the AI safety community?
In any case, the analogy to computation also feels like the result of a kind of selection effect. Conor Leahy has written:
It is not clear to me that there even is an actual problem to solve here. Similar to e.g. consciousness, it’s not clear to me that people who use the word “metaphilosophy” are actually pointing to anything coherent in the territory at all, or even if they are, that it is a unique thing. It seems plausible that there is no such thing as “correct” metaphilosophy, and humans are just making up random stuff based on our priors and environment and that’s it and there is no “right way” to do philosophy, similar to how there are no “right preferences”. I know the other view ofc and still worth engaging with in case there is something deep and universal to be found (the same way we found that there is actually deep equivalency and “correct” ways to think about e.g. computation).
Yes, we have found a deep mathematical way of thinking about computation. Something simple, compact, elegant. But saying agent foundations is like the study of computation… kind of hides the fact that the theory of computation might be sort of a one-off in terms of how nice it is to formalize properly? As I see it, and as Leahy writes above, we have examples of things that make sense intuitively and that we could formalize nicely (computation). And we also have examples of things that make sense intuitively and that we couldn’t formalize nicely (everything else he talks about in that comment). Saying agent foundations is like computation is putting the cart before the horse: it’s assuming it falls into the category of nicely-formalizable things, which feels to me like it isn’t a representative subset of the set of things-we-try-to-formalize.
[1] Disclaimer: as an outsider who is not working on AI safety in any way, shape, or form
[2] There are many of them; they use different writing styles and bring attention to different kinds of evidence or reasoning, etc.
[3] Spoiler alert: it’s hard to communicate why agent foundations makes sense… because agent foundations doesn’t make sense
[4] By which I mean, people literally not in math academia
[5] In some hard-to-define aesthetic sense
[6] But that’s not even fully correct, frankly. Not to get into the nerdy weeds of this too much, but modern QFT, for instance, requires an inelegant and mathematically dubious cancellation of infinities to allow
[7] And yet scientists thought they did, for a long time!
[8] For now
[9] It’s difficult to believe you’d actually hold this view, since frankly it’s really dumb, but I also would have had a difficult time believing you’d say you don’t have any experiments to run… and yet you’re saying it regardless!
[10] Since, again, you’re not running any experiments to check your theories against them
[11] As an illustrative example, proving an algorithm can be computed in polynomial time is cool, but maybe the constants involved are so large you can’t actually make it work in practice. If all complexity theory did was the former, without also telling me what the domain of applicability of its results is when it comes to what I actually care about, then I’d care about complexity theory a lot less.
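To make the “large constants” point in footnote [11] concrete, here is a purely illustrative comparison (the numbers are mine, not from the comment):

```latex
\[
  T_{\mathrm{poly}}(n) = 10^{100}\, n
  \qquad \text{vs.} \qquad
  T_{\mathrm{exp}}(n) = 2^{\,n}
\]
```

Since $2^{340} \approx 2 \times 10^{102}$ while $10^{100} \cdot 340 \approx 3 \times 10^{102}$, the “exponential” algorithm is faster for every input size up to roughly $n = 340$, and the “polynomial” one already needs $10^{100}$ steps at $n = 1$.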
Empirics reigns, and approaches that ignore it and try to nonetheless accomplish great and difficult science without binding themselves tight to feedback loops almost universally fail.
Many of our most foundational concepts have stemmed from first principles/philosophical/mathematical thinking! Examples here abound: Einstein’s thought experiments about simultaneity and relativity, Szilard’s proposed resolution to Maxwell’s demon, many of Galileo’s concepts (instantaneous velocity, relativity, the equivalence principle), Landauer’s limit, logic (e.g., Aristotle, Frege, Boole), information theory, Schrödinger’s prediction that the hereditary material was an aperiodic crystal, Turing machines, etc. So it seems odd, imo, to portray this track record as near-universal failure of the approach.
But there is a huge selection effect here. You only ever hear about the cool math stuff that becomes useful later on, because that’s so interesting; you don’t hear about stuff that’s left in the dustbin of history.
I agree there are selection effects, although I think this is true of empirical work too: the vast majority of experiments are also left in the dustbin. Which certainly isn’t to say that empirical approaches are doomed by the outside view, or that science is doomed in general, just that using base rates to rule out whole approaches seems misguided to me. Not only because one ought to choose which approach makes sense based on the nature of the problem itself, but also because base rates alone don’t account for the value of the successes. And as far as I can tell, the concepts we’ve gained from this sort of philosophical and mathematical thinking (including but certainly not limited to those above) have accounted for a very large share of the total progress of science to date. Such that even if I restrict myself to the outside view, the expected value here still seems quite motivating to me.
Many of our most foundational concepts have stemmed from first principles/philosophical/mathematical thinking
Conflating “philosophy” and “mathematics” is another instance of the kind of sloppy thinking I’m warning against in my previous comment.
The former[1] is necessary and useful, if only because making sense of what we observe requires us to sit down and peruse our models of the world and adjust and update them. And also because we get to generate “thought experiments” that give us more data with which to test our theories.[2]
The latter, as a basic categorical matter, is not the same as the former. “Mathematics” has a siren-like seductive quality to those who are mathematically-inclined. It comes across, based not just on structure but also on vibes and atmosphere, as giving certainty and rigor and robustness. But that’s all entirely unjustified until you know the mathematical model you are employing is actually useful for the problem at hand.
So it seems odd, imo, to portray this track record as near-universal failure of the approach.
Of what approach?
Of the approach that “it’s hard to even think of how experiments would be relevant to what I’m doing,” as Alex Altair wrote above? The only reason all those theories you mentioned ultimately obtained success and managed to be refined into something that closely approximates reality is that, after some initial, flawed versions of them were proposed, scientists looked very hard at experiments to verify them, iron out their flaws, and in some cases throw away completely mistaken approaches. Precisely the type of feedback loop that’s necessary to do science.
This approach, that the post talks about, has indeed failed universally.
I agree there are selection effects, although I think this is true of empirical work too: the vast majority of experiments are also left in the dustbin.
Yes, the vast majority of theories and results are left in the dustbin after our predictions make contact and are contrasted with our observations. Precisely my point. That’s the system working as intended.
Which certainly isn’t to say that empirical approaches are doomed by the outside view
… what? What does this have to do with anything that came before it? The fact that approaches are ruled out is a benefit, not a flaw, of empirics. It’s a feature, not a bug. It’s precisely what makes it work. Why would this ever say anything negative about empirical approaches?
By contrast, if “it’s hard to even think of how experiments would be relevant to what I’m doing,” you have precisely zero means of ever determining that your theories are inappropriate for the question at hand. For you can keep working on and living in the separate magisterium of mathematics, rigorously proving lemmas and theorems and results with the iron certainty of mathematical proof, all without binding yourself to what matters most.
Not only because one ought to choose which approach makes sense based on the nature of the problem itself
Taking this into account makes agent foundations look worse, not better.
As I’ve written about before, the fundamental models and patterns of thought embedded in these frameworks were developed significantly prior to Deep Learning and LLM-type models taking over. “A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first batch of really powerful AIs,” as I said in that comment. The bottom line was written down long before it was appropriate to do so.
but also because base rates alone don’t account for the value of the successes
And if I look at what agent foundations-type researchers are concluding on the basis of their purely theoretical mathematical vibing, I see precisely the types of misunderstandings, flaws, and abject nonsense that you’d expect when someone gets away with not having to match their theories up with empirical observations.[3]
Case in point: John Wentworth claiming he has “put together an agent model which resolved all of [his] own most pressing outstanding confusions about the type-signature of human values,” when in fact many users here have explained in detail[4] why his hypotheses are entirely incompatible with reality.[5]
Such that even if I restrict myself to the outside view, the expected value here still seems quite motivating to me.
I don’t think I ever claimed restricting to the outside view is the proper thing to do here. I do think I made specific arguments for why it shouldn’t feel motivating.
[1] Which, mind you, we barely understand at a mechanistic/rigorous/”mathematical” level, if at all
[2] Which is what the vast majority of your examples are about
[3] And also the kinds of flaws that prevent whatever results are obtained from actually matching up with reality, even if the theorems themselves are mathematically correct
[4] See also this
[5] And has that stopped him? Of course not, nor do I expect any further discussion to. Because the conclusions he has reached, although they don’t make sense in empirical reality, do make sense inside of the mathematical models he is creating for his Natural Abstractions work. This is reifying the model and elevating it over reality, an even worse epistemic flaw than conflating the two. The one time he confessed he had been working on “speedrun[ning] the theory-practice gap” and creating a target product with practical applicability, it failed. Two years prior, he had written: “Note that ‘theory progressing faster than expected, practice slower’ is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.” But he doesn’t seem all that worried now either.
By contrast, if “it’s hard to even think of how experiments would be relevant to what I’m doing,” you have precisely zero means of ever determining that your theories are inappropriate for the question at hand.
Here, you’ve gotten too hyperbolic about what I said. When I say “experiments”, I don’t mean “any contact with reality”. And when I said “what I’m doing”, I didn’t mean “anything I will ever do”. Some people I talk to seem to think it’s weird that I never run PyTorch, and that’s the kind of thing where I can’t think of how it would be relevant to what I’m currently doing.
When trying to formulate conjectures, I am constantly fretting about whether various assumptions match reality well enough. And when I do have a theory that is at the point where it’s making strong claims, I will start to work out concrete ways to apply it.
But I don’t even have one yet, so there’s not really anything to check. I’m not sure how long people are expecting this to take, and this difference in expectation might be one of the implicit things driving the confusion. As many theorems as there are that end up in the dustbin, there is even more pre-theorem work that ends up there. I’ve been at this for three and change years, and I would not be surprised if it takes a few more years. But the entire point is to apply it, so I can certainly imagine conditions under which we end up finding out whether the theory applies to reality.
Which is what I hope agent foundations is about. Something useful for AI safety. Something useful in practice. If it’s a cute branch of essentially-math that doesn’t necessarily concern itself with saving the world from AI doom, why should anyone give you any money or status in the AI safety community?
Separating this response out for visibility—it is unequivocally, 100% my goal to reduce AI x-risk. The entire purpose of my research is to eventually apply it in practice.
I believe you, and I want to clarify that I did not (and do not) mean to imply otherwise. I also don’t mean to imply you shouldn’t get money or status; quite the opposite.
It’s just the post itself[1] that doesn’t make the whole “agent foundations is actually for solving AI x-risk” thing click for me.
[1] And other posts on LW trying to explain this
For me, the OP brought to mind another kind of “not really math, not really science”: string theory. My criticisms of agent foundations research are analogous to Sabine Hossenfelder’s criticisms of string theory, in that string theory and agent foundations both screen themselves off from the possibility of experimental testing in their choice of subject matter: the Planck scale and very early universe for the former, and idealized superintelligent systems for the latter. For both, real-world counterparts (known elementary particles and fundamental forces; humans and existing AI systems) of the objects they study are primarily used as targets to which to overfit their theoretical models. They don’t make testable predictions about current or near-future systems. Unlike with early computer science, agent foundations doesn’t come with an expectation of being able to perform experiments in the future, or even to perform rigorous observational studies.
Ah, I think this is a straightforward misconception of what agent foundations is. (Or at least, of what my version of agent foundations is.) I am not trying to forge a theory of idealized superintelligent systems. I am trying to forge a theory of “what the heck is up with agency at all??”. I am attempting to forge a theory that can make testable predictions about current and near-future systems.
I was describing reasoning about idealized superintelligent systems as the method used in agent foundations research, rather than its goal. In the same way that string theory is trying to figure out “what is up with elementary particles at all,” and tries to answer that question by doing not-really-math about extreme energy levels, agent foundations is trying to figure out “what is up with agency at all” by doing not-really-math about extreme intelligence levels.
If you’ve made enough progress in your research that it can make testable predictions about current or near-future systems, I’d like to see them. But the persistent failure of agent foundations research to come up with any such bridge between idealized models and real-world systems has made me doubtful that the former are relevant to the latter.
I predicted that LLM ICL would perform reasonably well at predicting the universal distribution without finetuning, and it apparently does:
https://www.alignmentforum.org/posts/xyYss3oCzovibHxAF/llm-in-context-learning-as-approximating-solomonoff
Would love to see a follow-up experiment on this.
I haven’t looked into it yet, but apparently Peter Bloem showed that pretraining on a Solomonoff-like task also improves performance on text prediction: https://arxiv.org/abs/2506.20057
Taken together, this seems like some empirical evidence for LLM ICL as approximating Solomonoff induction, which is a frame I’ve been using, clearly motivated by a type of “agent foundations” or at least “learning foundations” intuition. Of course it’s very loose. I’m working on a better example.
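As a very rough illustration of the shape such a follow-up experiment could take (a sketch under assumptions: the toy program sampler/interpreter and the `llm_predict_next` stub below are invented placeholders, not anything from the linked post or Bloem’s paper):

```python
"""
Sketch: sample sequences from a crude stand-in for the universal distribution
by drawing short random "programs" with probability ~ 2^(-length), then score
an in-context predictor on held-out continuations. `llm_predict_next` is a
hypothetical stub to be replaced with calls to an actual model.
"""
import math
import random

random.seed(0)

def sample_program(max_len=8):
    """Draw a random program (a list of small ints); P(length = k) ~ 2^-k."""
    length = 1
    while length < max_len and random.random() < 0.5:
        length += 1
    return [random.randrange(4) for _ in range(length)]

def run_program(prog, n_steps):
    """Interpret the program as a cyclic pattern with a slow drift, mod 4."""
    return [(prog[t % len(prog)] + t // len(prog)) % 4 for t in range(n_steps)]

def llm_predict_next(context):
    """HYPOTHETICAL stub: return a probability distribution over the 4 symbols
    given the in-context prefix. Here it is uniform; swap in real LLM calls."""
    return [0.25, 0.25, 0.25, 0.25]

def bits_per_symbol(predict_fn, sequences, prefix_len=16):
    """Average log-loss (in bits) on the part of each sequence after the prefix."""
    total, count = 0.0, 0
    for seq in sequences:
        for t in range(prefix_len, len(seq)):
            probs = predict_fn(seq[:t])
            total -= math.log2(max(probs[seq[t]], 1e-9))
            count += 1
    return total / count

sequences = [run_program(sample_program(), 32) for _ in range(200)]
print("bits/symbol (uniform stub):", round(bits_per_symbol(llm_predict_next, sequences), 3))
# The ICL-as-Solomonoff frame predicts that a capable LLM, prompted with the
# prefix, should score well below the uniform baseline's 2.0 bits/symbol
# on these low-complexity sequences.
```

Everything outside the model call runs in milliseconds, so comparing a real model’s bits/symbol against the 2.0-bit uniform baseline is cheap to do.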
(Incidentally, I would probably be considered to be in math academia)
...I also do not use “reasoning about idealized superintelligent systems as the method” of my agent foundations research. Certainly there are examples of this in agent foundations, but it is not the majority. It is not the majority of what Garrabrant or Demski or Ngo or Wentworth or Turner do, as far as I know.
It sounds to me like you’re not really familiar with the breadth of agent foundations. Which is perfectly fine, because it’s not a cohesive field yet, nor is the existing work easily understandable. But I think you should aim for your statements to be more calibrated.
Notably, in the case of string theory, the fact that it predicts everything we currently observe, plus new forces at the Planck scale, currently makes it better than all other theories of physics: every alternative either predicts something we have reason not to observe, or limits itself to a subset of the predictions other theories already make. So the fact that string theory can reproduce everything we observe while also making (admittedly difficult-to-falsify) new predictions is enough to make it a leading theory.
No comment on whether the same applies to agent foundations.
in the case of string theory, the fact that it predicts
Hmm, my outsider impression is that there are in fact myriad “string theories”, all of them predicting everything we observe, but with no way to experimentally discern the correct one among them for the foreseeable future, which I have understood to be the main criticism. Is this broad-strokes picture fundamentally mistaken?
There are a large number of “string vacua” which contain particles and interactions with the quantum numbers and symmetries we call the standard model, but (1) they typically contain a lot of other stuff that we haven’t seen, and (2) the real test is whether the constants (e.g. masses and couplings) are the same as observed, and these are hard to calculate (but it’s improving).