(The following is about a specific sub-point on the following part:)
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
I think the point about the corrigibility basin being larger than commonly thought is the thing that makes me more optimistic about alignment (only a 10-30% risk of dying!), and I thought you pointed that out quite well here. I personally don't think this is because of the competence of the labs but rather because of the natural properties of agentic systems (I'm on your side when it comes to the competence of the labs). What follows is some of my thinking about why, my attempt to describe it, and some of my uncertainties about it.
I want to ask why you think the mathematical traditions your work was based on as of your posts from a year ago (decision theory, AIXI) are representative of future agents. Why are we not testing these theories against existing systems that get built into agents (biology, for example)? Why should we condition more on decision theory than on distributed systems theory?
The answer (imo) has to do with the VNM axioms and reflexive rationality, and with biology being too ephemeral to build a foundation on; yet it still seems like we're skipping over useful information?
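(For concreteness, this is the standard statement of the VNM theorem I have in mind whenever I say "VNM agent"; it's just the textbook result, nothing original:)

```latex
\text{completeness} + \text{transitivity} + \text{continuity} + \text{independence}
\;\Longrightarrow\;
\exists\, u:\quad L \succeq M \;\iff\; \mathbb{E}_{x\sim L}[u(x)] \ge \mathbb{E}_{x\sim M}[u(x)]
```

where the utility function u is unique up to positive affine transformation. By "VNM agent" I just mean an agent whose preferences over lotteries satisfy those four axioms and are therefore representable by a single expected-utility maximization.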
I think that there are places where biology might help you re-frame some of the thinking we do about how agents form.
More specifically, I want to point to out-of-distribution (OOD) updating as something about which biology makes claims that differ from the traditional agent foundations model. Essentially, the biological frame implies something closer to a distributed system, because maintaining a fully coordinated system can cost a lot of energy, and the transfer-learning costs involved aren't worth paying. (Here, for example, is a model of the costs of changing your mind: https://arxiv.org/pdf/2509.17957.)
In that type of model, becoming a VNM agent has an energy cost associated with it, and it isn't clear that the cost is worth paying once you account for the amount of dynamic memory and similar machinery you would need to set it up. So it seems to me that biology and agent foundations arrive at different models of how VNM agents arise, and I'm feeling quite confused about it.
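(To make the cost intuition slightly more concrete, here's a toy back-of-the-envelope sketch. The model is entirely made up by me for illustration, not taken from the paper linked above: it just treats "keeping two options comparable" as one unit of bookkeeping cost and compares one globally coherent ordering to a set of independent local modules.)

```python
# Toy illustration (my own made-up model, not from the linked paper):
# a globally coherent preference ordering over n outcomes keeps every
# option comparable to every other, whereas a "distributed" agent split
# into k local modules only keeps options comparable within each module,
# at the price of possible cross-module incoherence.

def coherent_cost(n: int) -> int:
    """Pairwise comparisons maintained for one global, complete ordering."""
    return n * (n - 1) // 2

def distributed_cost(n: int, k: int) -> int:
    """Comparisons maintained if n outcomes are split across k modules."""
    per_module = n // k
    return k * (per_module * (per_module - 1) // 2)

for n in (100, 1_000, 10_000):
    print(n, coherent_cost(n), distributed_cost(n, k=10))
# The global ordering's bookkeeping grows ~n^2 while the distributed
# version grows ~n^2/k, so (in this toy model) full VNM-style coherence
# gets relatively more expensive as the option space grows.
```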
I also don’t think I’m smart enough to figure out how to describe this from a fundamental decision theory way because it’s a bit too difficult to me and so I was thinking that you might have an idea why taking biology more seriously doesn’t make sense from a more foundational decision theory basis?
More specifically, does the argument that corrigibility is easier for non-VNM agents make sense?
Does the argument that VNM coherence is more of a convergence property make sense?
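(The intuition behind the "convergence property" framing, as I understand it, is the standard money-pump argument: incoherent preferences are exploitable, so optimization pressure pushes agents toward VNM-like coherence. Here is a minimal toy simulation of that argument; the preference cycle and the per-trade fee are arbitrary values I chose for illustration:)

```python
# Toy money-pump: an agent with cyclic preferences A > B > C > A, who will
# pay a small fee to swap to anything it prefers, can be walked around the
# cycle indefinitely and bled of resources.

prefers = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}  # cyclic
fee = 1.0

holding, money = "A", 100.0
offers = ["C", "B", "A"]  # a sequence of offers that exploits the cycle

for step in range(30):
    offer = offers[step % 3]
    if prefers.get((offer, holding), False):  # agent prefers the offer...
        holding, money = offer, money - fee   # ...so it pays to swap
print(holding, money)  # ends up back at "A", 30 units poorer
```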
And finally, I like the way you distilled the disagreement, so thanks for that!
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
I think I’m broadly confused by where you’re coming from. Sorry. Probably a skill issue on my part. 😅
Here’s what I’m hearing: “Almost none of the agents we actually see in the world are easy to model with things like VNM utility functions, instead they are biological creatures (and gradient-descended AIs?), and there are biology-centric frames that can be more informative (and less doomy?).”
I think my basic response, given my confusion, is: I like the VNM utility frame because it helps me think about agents. I don't actually know how to think about agency from a biological frame, and haven't encountered anything compelling in my studies. Is there a good starting point/textbook/wiki page/explainer or something for the sort of math/modeling/framework you're endorsing? I don't really know how to make sense of "non-VNM agent" as a concept.