When I discussed the idea for this post with Abram Demski, he mentioned you as a notable example of someone who believes we are very confused about agency, so I was glad to see your response. In particular, I think you described to Abram some entirely missing field of study between AI, psychology, and linguistics? Have you written about this anywhere?
Yes, but, what is one supposed to do? I have investigated several diverse areas in hopes of finding relevant signal, and have largely come up empty. I’m just one guy, but if lots of other people do this, one can become fairly confident that the understanding we’d need just isn’t there yet.
If you believe you have justified confidence that everyone is confused about agency then you should say so.
I fairly strongly agree with this, and argue this to rationalists not too infrequently, though I would also make various counterpoints.
Such as?
This is true but it’s really not that much evidence about the question at hand. The same could be said of biology BEFORE Darwin’s theory, and BEFORE genomics and genomic regulatory networks and so on.
This seems like a potential crux.
Biology before Darwin (and his contemporary ~independent co-inventors) was missing an (even partial) answer for some central questions, such as where species (and life) come from. Lamarck already had a (false but sort of on track) explanation for the adaptation of species to their environments. If I applied the same standard to biology then that I am applying to agency now, would I have mistaken Lamarck’s explanation for a well-developed theory? I suspect that our understanding of agency is more formal, more complete, and more unified across fields than biology was in 1800, but I don’t know enough of the history to say for sure.
Before Lamarck, biology seems to have lacked unifying principles to address its central questions. Lamarck was a contemporary of Darwin, so the period where it is not clear whether “the same could be said” about biology seems to have been rather short.
What do you mean? What are you claiming is possible for one person to do?
Build a unified picture of agency from the perspectives of the fields that formally describe it, which “explains” agency about as well as evolution + genetics “explains” biology. Perhaps a little worse.
Why on earth do you think this?
The answer is a bit hard to compress to a comment, which is partially because I think the state-of-the-art “explanation of agency” does not have a very short compressed length. My meta-theory of rationality sequence might be the best answer I have written, but is not very direct. Another answer is that as I have studied rationality from various angles, I have tended to update in the direction that existing work has done a better job of modeling agency than I expected, and I am trying to get ahead of future updates. Also, I have a bundle of intuitions and models from theoretical computer science that suggest there should not be elegant complete theories for bounded rationality.
Initially, your post asks for a theory of agency, but later it shifts towards asking for a theory of alignment (or control/obedience).
“What determines a mind’s effects” does not seem likely to have a clean answer in general. What determines the trajectory of embryonic development? What determines the trajectory of an ecosystem? In your words: “It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind’s effects.”
I am not sure that there is a finite set of concepts for understanding agency. The problem with a theory of agents is that agents invent theory, so agent theory keeps pulling more stuff into itself. I guess that the point of agent foundations is to focus on only the core theory of agency. But what characterizes the core theory? If Bayesian probability is part of it, why stop at the definition of updating and not develop the theory of exponential families (without end)? I think the usual unstated frame is that the core is the necessary part to build an agent that recursively self-improves. Unfortunately, I suspect it is likely possible to do this with very little theory (e.g. the current trajectory of ML). That seems unsatisfying, so perhaps one hopes for a theoretically justified self-improving agent. I think we have reasonably good concepts for discussing this (the recursion theorem, Levin/Hutter search, logical induction and various other approaches to reflection in decision theory, maybe IB). Unfortunately, self-improving systems from theory tend to be impractical relative to hackier ML approaches (but practice is usually hackier than theory, so this is not really surprising or confusing).
However, building more practical theoretically-justified agents is not clearly sufficient for solving AI safety! Even theoretically justified agents tend to become opaque as they run (e.g. AIXI approximations). But that problem is about communication, not agency in general. In other words, the thing that we really want for AI safety is not centrally deconfusion about agency. It’s more like deconfusion about a specific agent :)
But you say
Without first understanding how the effects of minds in general are determined, designing a mind to have specifiable effects is out of order. If the designer understands how to design the mind so that its effects are specifiable, then the designer also understands how the specification channel determines the mind’s effects.
I don’t agree with this. It is often easier to design some X that works predictably, than to verify whether an arbitrary X works. For example, we can write verifiable programs, but we can’t verify arbitrary programs.
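A minimal sketch of that asymmetry, assuming Python and two hypothetical functions (`bounded_sum` and `collatz_steps` are illustrative names, not from any library). The first is written so that termination is obvious by construction; the second is the kind of "arbitrary" loop whose unbounded form nobody has managed to prove halts for every input, so the sketch adds an explicit step cap just to stay runnable.

```python
from typing import List, Optional


def bounded_sum(xs: List[int]) -> int:
    """Designed to be verifiable: a single pass over a finite list,
    so it terminates after exactly len(xs) iterations."""
    total = 0
    for x in xs:
        total += x
    return total


def collatz_steps(n: int, max_steps: int) -> Optional[int]:
    """An 'arbitrary' program: whether the uncapped version of this loop
    halts for every positive n is the open Collatz conjecture.
    The max_steps cap is an assumption added for this sketch so that the
    function always returns; None means the cap was hit first."""
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```

The point of the contrast is only that the first function was chosen from the small set of programs that are easy to reason about, whereas the second stands in for programs we did not get to choose.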
If your claim is that we do not have the necessary concepts to perform AI alignment, then I agree. We are confused about AI alignment. This statement is narrower and more useful than “We are confused about agency.”
“What determines a mind’s effects” does not seem likely to have a clean answer in general.
I can quibble with this, and of course it’s an open question (e.g. cf. https://www.lesswrong.com/posts/NvwjExA7FcPDoo3L7/are-there-cognitive-realms). But I certainly take your point that you maybe don’t need to understand all or most minds. (I may have written something contradicting that in the “fundamental question” post; if I did, then I was mistaken or wrote unclearly or something.)
I am not sure that there is a finite set of concepts for understanding agency. The problem with a theory of agents is that agents invent theory, so agent theory keeps pulling more stuff into itself.
I guess that the point of agent foundations is to focus on only the core theory of agency. But what characterizes the core theory?
I would say that alignment is pretty close to the core theory; let’s say, 1⁄4 of core theory is alignment, and 1⁄2 of alignment is core theory (numbers made up, but just to give relative sizes or something). Or IDK haha. But what I mean is, core theory would tend to be
Elements that an individual mind converges to over time. (Yes, there are many elements X that may not converge, e.g. because the mind is permanently deranged, or because X is entangled with ongoing / infinite creativity; and in some sense all X are entangled with ongoing creativity, and therefore don’t fully converge in terms of their full meaning for the mind… but like, still, there’s obvious senses in which things do converge.) (Some of these elements are mind-general things like Bayesian reasoning; others are mind-specific / value-laden / cognitive-realm-specific. The latter are converged to by choice (choice of what to be).)
Elements that determine the mind’s ultimate effects.
If something tends to get self-modified away, it’s probably not core.
You might say that nothing is stable, everything gets self-modified away? We could debate that. But I have a perhaps not very accountable sense that it is possible for me to decide some things about myself permanently, even though I’m fairly gung-ho (in the long run) about radically changing / growing / transcending myself. E.g. I think I can decide to never pointlessly torture a person—I mean presumably someone could fairly feasibly mess with my head a bunch to change that property, but I mean that left to my own RSI devices I would never change my mind on that. Do you agree that some things can be stable like that?
Before Lamarck, biology seems to have lacked unifying principles to address its central questions.
Before Linnaeus it lacked unifying principles.… But after Linnaeus, we have categorized everything.
...By which I mean to suggest, I think we might be miscommunicating about the questions at hand… If a post-Linnaean biologist is looking back and saying “sure, we had lots of information about species, but we didn’t have a conceptual scheme; now we do, so we are no longer confused about biology”, then the problem isn’t that ze is wrong that a big advance happened, but rather that ze is… failing to imagine that there could be major future insights, after which the prior state of knowledge would seem fundamentally conceptually impoverished.
I think that a biologist in 1900 is confused about several fundamental things about life. For example, they don’t know about GRNs. My guess would be that biologists in 2026 are also confused that way. I wouldn’t know specifically what they are confused about, but for example they may lack concepts of [metastable self-reinforcing gene regulatory states] or [bundle structures within the space of biological functions corresponding to characters such as cross-species-hands and cross-species-eyes and cross-species-T-cells]. (You would definitely find papers about both of those topics, and probably lots of other interesting theoretical topics in biology, but I’m saying it would be unsatisfactory, where it could eventually be satisfactory.)
Also, I have a bundle of intuitions and models from theoretical computer science that suggest there should not be elegant complete theories for bounded rationality.
I agree with this for at least two reasons, one of them being “The concrete is never lost” (https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html#values-require-reference). But I don’t view this as bearing much on questions like “can we get a major compression of the important elements of the domain”. All complex, context-embedded things will have a bunch of complication that doesn’t reduce, but that doesn’t mean you can’t much much more deeply understand things. E.g. the OS of a laptop computer ~irreducibly must have complexity to interface with mouse, keyboard, screen, wifi, audio, different memory elements, etc.; but it’s still very principled (written in multiple layers of languages, using algorithmic ideas for high leverage) and is much more understandable than a huge pile of machine code that some compiler spits out (or however that works).
I don’t agree with this. It is often easier to design some X that works predictably, than to verify whether an arbitrary X works. For example, we can write verifiable programs, but we can’t verify arbitrary programs.
Your interpretation of this paragraph is reasonable, and I agree that the statement you’re hearing is incorrect; but rereading the whole section ( https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects#The_word__a_), I largely stand by what I wrote—that paragraph functions in context as part of a development of thought that is dealing with the tension between (1) we just need to design one mind that works, and (2) you can’t really do that without understanding minds somewhat more generally (IDK how much).
This statement is narrower and more useful than “We are confused about agency.”
I think it loses a pretty important + useful idea, which is that it’s NOT the case that:
we understand agents and [their goals and how they relate to their goals or something], as long as you’re not trying to change [their goals and how they relate to their goals or something].
“What determines a mind’s effects” does not seem likely to have a clean answer in general. What determines the trajectory of embryonic development? What determines the trajectory of an ecosystem? In your words: “It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind’s effects.”
Yeah, not in full generality. IDK how generally it should have an answer. I think embryonic development, and especially an ecosystem, are bad comparisons because they aren’t sufficiently mind/agent-like. The embryo is executing a bunch of narrow adaptations (various homeostats, and similar, I imagine); it isn’t on a very determined trajectory, and it can be fairly easily disrupted, and it won’t pull itself back onto the trajectory. Even more so for the ecosystem. A mind/agent, on the other hand, can aim itself—at least to some pretty large degree.
Build a unified picture of agency from the perspectives of the fields that formally describe it, which “explains” agency about as well as evolution + genetics “explains” biology. Perhaps a little worse.
Oh. Well that’s kind of a low bar. Maybe we don’t disagree about this then, not sure. We agree that it’s not nearly enough for alignment, right?
Maybe, but I still think it’s a strategic mistake to aim at the center of “what a mind is” when you want to hit the center of (for example) “what trust is.”
Because of the question of reflective (in)stability in general, I think it’s quite hard to get a handle on anything really important in a mind other than by really understanding mind/agency. Otherwise you have no coordinates for what the mind “really is” in the sense of what elements of the mind will actually stick around.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
In particular, I think you described to Abram some entirely missing field of study between AI, psychology, and linguistics? Have you written about this anywhere?
Not really (as I moved out of AGI alignment research); and if I did, it would be a speculative gestural description, rather than examples, because I don’t actually know the theory. A distant spiritual cousin might be Eisenstat’s Condensation (https://www.lesswrong.com/posts/BstHXPgQyfeNnLjjp/condensation). The closest thing I can point you to is:
If you believe you have justified confidence that everyone is confused about agency then you should say so.
Pretty confident, yeah.
I fairly strongly agree with this, and argue this to rationalists not too infrequently, though I would also make various counterpoints.
Such as?
The main thing would be that it can be highly beneficial to work things out for yourself, even if it’s much slower in the obvious sense that you learn the set of known ideas slower. Other points would be that indeed sometimes the literature has missed important questions or ideas; or even if someone did discuss it, it’s too hard to find because the literature didn’t signal boost it properly; and often academics are uncomfortable with overly speculative topics when they could just as well have been excited about them. There’s probably more, IDK. But generally I agree that rationalists tend to be dismissive of academia (e.g. failing to update that if academia didn’t do X, that’s often fairly strong evidence that X is less interesting and/or more difficult than you think it is).
When I discussed the idea for this post with Abram Demski, he mentioned you as a notable example of someone who believes we are very confused about agency, so I was glad to see your response. In particular, I think you described to Abram some entirely missing field of study between AI, psychology, and linguistics? Have you written about this anywhere?
If you believe you have justified confidence that everyone is confused about agency then you should say so.
Such as?
This seems like a potential crux.
Biology before Darwin (and his contemporary ~independent co-inventors) was missing an (even partial) answer for some central questions, such as where species (and life) come from. Lamarck already had a (false but sort of on track) explanation for the adaptation of species to their environments. If I applied the same standard to biology then that I am applying to agency now, would I have mistaken Lamarck’s explanation for a well-developed theory? I suspect that our understanding of agency is more formal, more complete, and more unified across fields than biology was in 1800, but I don’t know enough of the history to say for sure.
Before Lamarck, biology seems to have lacked unifying principles to address its central questions. Lamarck was a contemporary of Darwin, so the period where it is not clear whether “the same could be said” about biology seems to have been rather short.
Build a unified picture of agency from the perspectives of the fields that formally describe it, which “explains” agency about as well as evolution + genetics “explains” biology. Perhaps a little worse.
The answer is a bit hard to compress to a comment, which is partially because I think the state-of-the-art “explanation of agency” does not have a very short compressed length. My meta-theory of rationality sequence might be the best answer I have written, but is not very direct. Another answer is that as I have studied rationality from various angles, I have tended to update in the direction that existing work has done a better job of modeling agency than I expected, and I am trying to get ahead of future updates. Also, I have a bundle of intuitions and models from theoretical computer science that suggest there should not be elegant complete theories for bounded rationality.
Initially, your post asks for a theory of agency, but later it shifts towards asking for a theory of alignment (or control/obedience).
“What determines a mind’s effects” does not seem likely to have a clean answer in general. What determines the trajectory of embryonic development? What determines the trajectory of an ecosystem? In your words: “It may be that there is no answer. It may be that no small assembly of small elements of the mind comprehensively determines the mind’s effects.”
I am not sure that there is a finite set of concepts for understanding agency. The problem with a theory of agents is that agents invent theory, so agent theory keeps pulling more stuff into itself. I guess that the point of agent foundations is to focus on only the core theory of agency. But what characterizes the core theory? If Bayesian probability is part of it, why stop at the definition of updating and not develop the theory of exponential families (without end)? I think the usual unstated frame is that the core is the necessary part to build an agent that recursively self-improves. Unfortunately, I suspect it is likely possible to do this with very little theory (e.g. the current trajectory of ML). That seems unsatisfying, so perhaps one hopes for a theoretically justified self-improving agent. I think we have reasonably good concepts for discussing this (the recursion theorem, Levin/Hutter search, logical induction and various other approaches to reflection in decision theory, maybe IB). Unfortunately, self-improving systems from theory tend to be impractical relative to hackier ML approaches (but practice is usually hackier than theory, so this is not really surprising or confusing).
However, building more practical theoretically-justified agents is not clearly sufficient for solving AI safety! Even theoretically justified agents tend to become opaque as they run (e.g. AIXI approximations). But that problem is about communication, not agency in general. In other words, the thing that we really want for AI safety is not centrally deconfusion about agency. It’s more like deconfusion about a specific agent :)
But you say
I don’t agree with this. It is often easier to design some X that works predictably, than to verify whether an arbitrary X works. For example, we can write verifiable programs, but we can’t verify arbitrary programs.
If your claim is that we do not have the necessary concepts to perform AI alignment, then I agree. We are confused about AI alignment. This statement is narrower and more useful than “We are confused about agency.”
I can quibble with this, and of course it’s an open question (e.g. cf. https://www.lesswrong.com/posts/NvwjExA7FcPDoo3L7/are-there-cognitive-realms). But I certainly take your point that you maybe don’t need to understand all or most minds. (I may have written something contradicting that in the “fundamental question” post; if I did, then I was mistaken or wrote unclearly or something.)
I think I agree with this. Cf. https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought
I would say that alignment is pretty close to the core theory; let’s say, 1⁄4 of core theory is alignment, and 1⁄2 of alignment is core theory (numbers made up, but just to give relative sizes or something). Or IDK haha. But what I mean is, core theory would tend to be
Elements that an individual mind converges to over time. (Yes, there are many elements X that may not converge, e.g. because the mind is permanently deranged, or because X is entangled with ongoing / infinite creativity; and in some sense all X are entangled with ongoing creativity, and therefore don’t fully converge in terms of their full meaning for the mind… but like, still, there’s obvious senses in which things do converge.) (Some of these elements are mind-general things like Bayesian reasoning; others are mind-specific / value-laden / cognitive-realm-specific. The latter are converged to by choice (choice of what to be).)
Elements that determine the mind’s ultimate effects.
Elements that are stable under reflective self-modification. Cf. https://www.lesswrong.com/posts/Ht4JZtxngKwuQ7cDC?commentId=koeti9ygXB9wPLnnF
If something tends to get self-modified away, it’s probably not core.
You might say that nothing is stable, everything gets self-modified away? We could debate that. But I have a perhaps not very accountable sense that it is possible for me to decide some things about myself permanently, even though I’m fairly gung-ho (in the long run) about radically changing / growing / transcending myself. E.g. I think I can decide to never pointlessly torture a person—I mean presumably someone could fairly feasibly mess with my head a bunch to change that property, but I mean that left to my own RSI devices I would never change my mind on that. Do you agree that some things can be stable like that?
Before Linnaeus it lacked unifying principles.… But after Linnaeus, we have categorized everything.
...By which I mean to suggest, I think we might be miscommunicating about the questions at hand… If a post-Linnaean biologist is looking back and saying “sure, we had lots of information about species, but we didn’t have a conceptual scheme; now we do, so we are no longer confused about biology”, then the problem isn’t that ze is wrong that a big advance happened, but rather that ze is… failing to imagine that there could be major future insights, after which the prior state of knowledge would seem fundamentally conceptually impoverished.
I think that a biologist in 1900 is confused about several fundamental things about life. For example, they don’t know about GRNs. My guess would be that biologists in 2026 are also confused that way. I wouldn’t know specifically what they are confused about, but for example they may lack concepts of [metastable self-reinforcing gene regulatory states] or [bundle structures within the space of biological functions corresponding to characters such as cross-species-hands and cross-species-eyes and cross-species-T-cells]. (You would definitely find papers about both of those topics, and probably lots of other interesting theoretical topics in biology, but I’m saying it would be unsatisfactory, where it could eventually be satisfactory.)
I agree with this for at least two reasons, one of them being “The concrete is never lost” (https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html#values-require-reference). But I don’t view this as bearing much on questions like “can we get a major compression of the important elements of the domain”. All complex, context-embedded things will have a bunch of complication that doesn’t reduce, but that doesn’t mean you can’t much much more deeply understand things. E.g. the OS of a laptop computer ~irreducibly must have complexity to interface with mouse, keyboard, screen, wifi, audio, different memory elements, etc.; but it’s still very principled (written in multiple layers of languages, using algorithmic ideas for high leverage) and is much more understandable than a huge pile of machine code that some compiler spits out (or however that works).
Your interpretation of this paragraph is reasonable, and I agree that the statement you’re hearing is incorrect; but rereading the whole section ( https://www.lesswrong.com/posts/NqsNYsyoA2YSbb3py/fundamental-question-what-determines-a-mind-s-effects#The_word__a_), I largely stand by what I wrote—that paragraph functions in context as part of a development of thought that is dealing with the tension between (1) we just need to design one mind that works, and (2) you can’t really do that without understanding minds somewhat more generally (IDK how much).
I think it loses a pretty important + useful idea, which is that it’s NOT the case that:
Yeah, not in full generality. IDK how generally it should have an answer. I think embryonic development, and especially an ecosystem, are bad comparisons because they aren’t sufficiently mind/agent-like. The embryo is executing a bunch of narrow adaptations (various homeostats, and similar, I imagine); it isn’t on a very determined trajectory, and it can be fairly easily disrupted, and it won’t pull itself back onto the trajectory. Even more so for the ecosystem. A mind/agent, on the other hand, can aim itself—at least to some pretty large degree.
Oh. Well that’s kind of a low bar. Maybe we don’t disagree about this then, not sure. We agree that it’s not nearly enough for alignment, right?
Right, but I think it’s not enough in the sense that we need to develop the specific concepts which are relevant to alignment.
Mhm. I think those concepts are quite central to what an agent/mind is.
Maybe, but I still think it’s a strategic mistake to aim at the center of “what a mind is” when you want to hit the center of (for example) “what trust is.”
Because of the question of reflective (in)stability in general, I think it’s quite hard to get a handle on anything really important in a mind other than by really understanding mind/agency. Otherwise you have no coordinates for what the mind “really is” in the sense of what elements of the mind will actually stick around.
I think you probably need to understand many things about minds a lot better than “evolution + genetics” understands biology before it makes much sense to try attacking questions about alignment mechanics in particular. To stick with the analogy, I suspect you might at least need the sort of mastery level where you understand mitochondria and DNA transcription well enough to build your own basic functional versions of them from scratch before you can even really get started.
I agree that ‘we are confused about agency’ is not a good slogan for pointing to this inadequacy. I think ‘we haven’t advanced practical mind science to anywhere near the level we’ve advanced e.g. condensed matter physics’ is true and a blocker for alignment of superintelligence, but ‘we are confused about agency’ brings up much stronger associations around memes like ‘maybe Bayesian EV maximisation is conceptually wrong even in the idealised setting’ to me. These meme groups seem sufficiently distinct to merit separate slogans.
Not really (as I moved out of AGI alignment research); and if I did, it would be a speculative gestural description, rather than examples, because I don’t actually know the theory. A distant spiritual cousin might be Eisenstat’s Condensation (https://www.lesswrong.com/posts/BstHXPgQyfeNnLjjp/condensation). The closest thing I can point you to is:
https://www.lesswrong.com/posts/TNQKFoWhAkLCB4Kt7/a-hermeneutic-net-for-agency
which is continued in https://tsvibt.blogspot.com/2025/11/ah-motiva-1-words-about-values.html
and https://tsvibt.blogspot.com/2025/11/ah-motiva-2-relating-values-and-novelty.html
and https://tsvibt.blogspot.com/2025/11/ah-motiva-3-context-of-concept-of-value.html
But that’s only obliquely related IIRC.
Pretty confident, yeah.
The main thing would be that it can be highly beneficial to work things out for yourself, even if it’s much slower in the obvious sense that you learn the set of known ideas slower. Other points would be that indeed sometimes the literature has missed important questions or ideas; or even if someone did discuss it, it’s too hard to find because the literature didn’t signal boost it properly; and often academics are uncomfortable with overly speculative topics when they could just as well have been excited about them. There’s probably more, IDK. But generally I agree that rationalists tend to be dismissive of academia (e.g. failing to update that if academia didn’t do X, that’s often fairly strong evidence that X is less interesting and/or more difficult than you think it is).