Here are some thoughts about the recent back-and-forth in which Will MacAskill reviewed IABI, Rob Bensinger wrote a reply, and Will replied in turn. I’m making this a quick take instead of a full post because it gets kinda inside-baseball/navel-gazy and I want to be more chill about that than I would be in a full writeup.
First of all, I want to say thank you to Will for reviewing IABI. I got a lot out of the mini-review, and broadly like it, even if I disagree on the bottom line and some of the arguments. It helped me think deep thoughts.
The evolution analogy
I agree with Will that the evolution analogy is useful and informative in some ways, but of limited value. It’s imperfect, and thinking hard about the differences is good.
The most basic disanalogy is that evolution wasn’t trying, in any meaningful sense, to produce beings that maximise inclusive genetic fitness in off-distribution environments.
I agree with this, and I appreciate MacAskill talking about it.
But we will be doing the equivalent of that!
One of the big things that I think distinguishes more doomy people like me from less doomy people like Will is our priors on how incompetent people are. Like, I agree that it’s possible to carefully train an ML system to be (somewhat) robust to distributional shifts. But will we actually do that?
I think, at minimum, any plan to build AGI (to say nothing of ASI) should involve:
Trying to give the AI a clear, (relatively) simple goal that is well understood by humans and which we have strong reason to expect won’t cause a catastrophe if faithfully pursued, even if it gets wildly more power than expected and goes a little off-the-rails.
Training that AI in a wide variety of environments, trying to hit the real goal, rather than proxies. This should involve adversarial environments and a lot of paranoia that the AI has failed to robustly internalize the true goal.
Testing the AI on entirely new environments that were a priori suspected of being difficult and weird, but where we also think there is a true answer that can be checked, and restarting from scratch if the AI fails to generalize to the test set or otherwise clearly demonstrates that it has not robustly internalized the desired goal. (A toy sketch of this whole loop follows below.)
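Here’s that toy sketch: a minimal Python loop where everything (the helper callables, the thresholds, the toy “model”) is invented purely for illustration. It’s a sketch of the shape of the process, not a claim about how any real lab trains or evaluates systems.

```python
def train_until_robust(make_fresh_model, train, score,
                       train_envs, adversarial_envs, ood_test_envs,
                       pass_threshold=0.99, max_restarts=10):
    """Train on diverse environments, probe adversarially, test on held-out
    weird-but-checkable environments, and restart from scratch on any failure."""
    for _ in range(max_restarts):
        model = make_fresh_model()

        # 1. Train across a wide variety of environments on the real goal,
        #    not proxies for it.
        for env in train_envs:
            train(model, env)

        # 2. Paranoia pass: adversarial environments built to tempt the model
        #    into revealing that it internalized a proxy instead of the goal.
        if any(score(model, env) < pass_threshold for env in adversarial_envs):
            continue  # assume the worst and start over from scratch

        # 3. Held-out test on environments chosen in advance to be weird and
        #    difficult, but where a true answer can still be checked.
        if all(score(model, env) >= pass_threshold for env in ood_test_envs):
            return model

    raise RuntimeError("No run robustly internalized the goal; deploy nothing.")


if __name__ == "__main__":
    # A "model" that merely memorizes which environments it has seen: it aces
    # the adversarial check (it trained on that env) but fails the OOD test,
    # so the loop refuses to hand anything back.
    make_model = lambda: set()
    train = lambda m, env: m.add(env)
    score = lambda m, env: 1.0 if env in m else 0.0
    try:
        train_until_robust(make_model, train, score,
                           train_envs=["easy-1", "easy-2"],
                           adversarial_envs=["easy-1"],
                           ood_test_envs=["weird-but-checkable"])
    except RuntimeError as err:
        print(err)
```

The toy run refuses to deploy a “model” that merely memorized its training environments, which is the spirit of the restart-from-scratch step.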
And I personally think pure corrigibility has a nonzero chance of being a good choice of goal, and that a sufficiently paranoid training regime has a nonzero chance of being able to make a semi-safe AGI this way, even with current techniques. (That said, I don’t actually advocate for plans that have a significant chance of killing everyone, and I think “try to build corrigible AGI” does have a significant chance of killing everyone; I just notice that it seems better than what the research community currently seems to be doing, even at Anthropic.)
I predict the frontier lab that builds the first AGI will not be heavily focused on ensuring robustness to distributional shifts. We could bet, maybe.
Types of misalignment
I really benefited from this! Will changed my mind! My initial reaction to Will’s mini-review was like, “Will is wrong that these are distinct concepts; any machine sufficiently powerful to have a genuine opportunity to disempower people but which is also imperfectly aligned will produce a catastrophe.”
And then I realized that I was wrong. I think. Like, what if Will is (secretly?) gesturing at the corrigibility attractor basin or perhaps the abstracted/generalized pattern of which corrigibility is an instance? (I don’t know of other goals which have the same dynamic, but maybe it’s not just corrigibility?)
An agent which is pseudo-corrigible, and lives inside the attractor basin, is imperfectly aligned (hence the pseudo) but if it’s sufficiently close to corrigible it seems reasonable to me that it won’t disempower humanity, even if given the opportunity (at least, not in every instance it gets the opportunity). So at the very least, corrigibility (one of my primary areas of research!) is (probably) an instance of Will being right (and my past self being wrong), and the distinction between his “types of misalignment” is indeed a vital one.
I feel pretty embarrassed by this, so I guess I just wanna say oops/sorry/thanks.
If I set aside my actual beliefs and imagine that we’re going to naturally land in the corrigibility attractor basin by default, I feel like I have a better sense of some of the gradualism hope. Like, my sense is that going from pseudo-corrigible to perfectly corrigible is fraught, but can be done with slow, careful iteration. Maybe Clara Collier and other gradualists think we’re going to naturally land in the corrigibility attractor basin, and that the gradual work is the analogue of the paranoid iteration that I conceive of as the obvious next step?
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
(Also on the topic of gradualism and the notion of having “only one try” I want to gesture at the part of IABI where it says (paraphrased from memory, sorry): if you have a clever scheme for getting multiple tries, you still only get one try at getting that scheme to work.)
appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)
I agree that “value space is big” is not a good argument, in isolation, for how likely it is for our creations to be aligned. The other half of the pincer is “our optimization pressure towards aligned goals is weak,” and without that the argument falls apart.
(Maybe we won’t be able to make deals with AIs? I agree that’s a worry; but then the right response is to make sure that we can. Won’t the superintelligence have essentially a 100% chance of taking over, if it wanted to? But that’s again invoking the “discontinuous jump to godlike capabilities” idea, which I don’t think is what we’ll get).
Here’s a plan for getting a good future:
Build ASI slowly, such that there’s some hope of being able to understand the first AI capable of a pivotal act.
The AI will want weird, alien stuff, but we’ll make sure that it’s the kind of mind that would prefer getting 50% of the stars with 99% probability to getting 100% of the stars with 70% probability.
Since we’re going slowly, we still have a 30% chance of stopping it if we want to.
We tell the AI: “we’ll let you do a pivotal act and escape our clutches if you agree to only eat 50% of the stars, and help us get the other 50% as though you were aligned.”
Our interpretability techniques are so good that we know whether it’s lying or whether it’s honorable and will actually give us half the stars.
Because we’re so good at reading its advanced, alien mind, it knows it needs to be honorable with us, and so is actually honorable.
The AI says “Yep, will do.”
We see it’s telling the truth when we check.
We set it free.
It gives us a good future.
I think this plan is bad because it fails the heuristic of “don’t summon demons and try to cleverly bargain with them,” but perhaps I’m being unfair.
My main criticism of “make deals with the AIs” is that it seems complex and brittle, and it depends heavily on a level of being able to read the machine’s mind that we definitely don’t currently have and might never have.
That said, I do think there’s a lot of value in being the sorts of people/groups that can make deals and be credible trade partners. Efforts to be more trustworthy and honorable and so on seem great.
suppose that all the first superintelligence terminally values is paperclips. But it’s risk-averse, in the sense that it prefers a guarantee of N resources over a 50⁄50 chance of 0 or 2N resources; let’s say it’s more risk-averse than the typical human being.
On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
it would strongly prefer to cooperate with humans in exchange for, say, a guaranteed salary, rather than to take a risky gamble of either taking over the world or getting caught and shut off.
Rob wrote some counterpoints to this, but I just want to harp on it a little. Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
I can imagine a misaligned AI maybe making a deal with humans who let it out of the box in exchange for some small fraction of the cosmos (and honoring the deal; again, the hard part is that it has to know we can tell if it’s lying, and we probably can’t).
I can’t really imagine an AI that has a clear shot at taking over the world making a deal to be a meek little salary worker, even if there are risks in trying to take over. Taking over the world means, in addition to other things, being sure you won’t get shut off or replaced by some other AI or whatever.
(Though I can certainly imagine a misaligned AI convincing people (and possibly parts of itself) that it is willing to make a deal like that, even as it quietly accumulates more power.)
Their proposal
Now we’re getting into the source of the infighting, I think (just plain fighting? I think of Will as being part of my ingroup, but idk if he feels the same; Rob definitely is part of my ingroup; are they part of each other’s ingroups? Where is the line between infighting and just plain fighting?). Will seems very keen on criticizing MIRI’s “SHUT IT DOWN YOU FOOLS” strategy — mostly, it seems to me, because he sees this approach as insufficiently supportive of strategies besides shutting things down.
When Rob shared his draft of his reply to Will, I definitely noticed that it seemed like he was not responding accurately to the position that I saw in Will’s tweet. Unfortunately, I was aware that there is something of a history between Will and MIRI and I incorrectly assumed that Rob was importing true knowledge of Will’s position that I simply wasn’t aware of. I warned him that I thought he was being too aggressive, writing “I expect that some readers will be like ‘whoa why is MIRI acting like this guy is this extremist—I don’t see evidence of that and bet they’re strawmanning him’.” But I didn’t actually push back hard, and that’s on me. Apologies to Will.
(Rob reviewed a draft of this post and adds his own apologies for misunderstanding Will’s view. He adds: “My thanks to Max and multiple other MIRI people for pushing back on that part of my draft. I made some revisions in response, though they obviously weren’t sufficient!”)
I’m very glad to see in Will’s follow-up:
“I definitely think it will be extremely valuable to have the option to slow down AI development in the future,” as well as “the current situation is f-ing crazy”
I wish this had been more prominent in his mini-review, but :shrug:
I think Will and I probably agree that funding a bunch of efforts to research alignment, interpretability, etc. would be good. I’m an AI safety/alignment researcher, and I obviously do my day-to-day work with a sense that it’s valuable and a sense that more effort would also be valuable. I’ve heard multiple people (whom I respect and think are doing good work) complain that Eliezer is critical/dismissive of their work, and I wish Eliezer was more supportive of that work (while also still saying “this won’t be sufficient” if that’s what he believes, and somehow threading that needle).
I am pretty worried about false hope, though. I’m worried that people will take “there are a bunch of optimistic researchers working hard on this problem” as a sign that we don’t need to take drastic action. I think we see a bunch of this already and researchers like myself have a duty to shout “PLEASE DON’T RISK EVERYTHING! I DON’T GOT THIS!”[1] even while pursuing the least-doomed alignment strategies they know of. (I tried to thread this needle in my corrigibility research.)
Anyway, I think I basically agree with Will’s clarified position that a “kitchen-sink approach” is best, including a lot of research, as long as actually shutting down advanced training runs and pure capabilities research is in the kitchen sink. I feel worried that Will isn’t actually pushing for that in a way that I think is important (not building “It” is the safest intervention I’m aware of), but I’m also worried about my allies (people who basically agree that AI is unacceptably dangerous and that we need to take action) being unable to put forward a collective effort without devolving into squabbling about tone and strawmanning each other. :(
Anyway. Thank you again to Will and Rob. I thought both pieces were worth reading.
(Not to say that we should necessarily risk everything if alignment researchers do feel like they’ve “got this.” That’s a question worth debating in its own right. Also, it’s obviously worth noting that work that is incrementally useful but clearly insufficient to solve the entire field can still be valuable and the researcher is still allowed to say “I got this” on their little, local problems. (And they’re definitely allowed to speak up if they actually do solve the whole damn problem, of course. But they better have actually solved it!))
On a linguistic level I think “risk-averse” is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
That’s not quite right. ‘Risk-averse with respect to quantity X’ just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory is another. Only RWEUT violates VNM. When economists talk about ‘risk aversion,’ they almost always mean diminishing marginal utility.
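Concretely, here’s a toy calculation (made-up numbers, just to illustrate the definition): an ordinary expected-utility maximizer with a concave utility over resources prefers the sure thing, with no VNM violation anywhere.

```python
from math import sqrt

# Toy illustration: diminishing marginal utility over resources yields risk
# aversion with respect to resources, while the agent stays a perfectly
# ordinary expected-utility maximizer.
u = sqrt                                  # strictly concave utility of resources
N = 100
eu_sure   = u(N)                          # guaranteed N resources
eu_gamble = 0.5 * u(0) + 0.5 * u(2 * N)   # 50/50 lottery over 0 or 2N (same mean)
print(eu_sure, eu_gamble)                 # 10.0 vs ~7.07: the sure thing wins
```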
diminishing returns from resources… is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.
Can you say more about why?
Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
But AIs with sharply diminishing marginal utility to extra resources wouldn’t care much about this. They’d be relevantly similar to humans with sharply diminishing marginal utility to extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
Cool. I guess I’m just wrong about what “risk averse” tends to mean in practice. Thanks for the correction.
Regarding diminishing returns being natural:
I think it’s rare to have goals that are defined in terms of the state of the entire universe. Human goals, for instance, seem very local in scope; e.g., it’s possible to say whether things are better/worse on Earth without also thinking about what’s happening in the Andromeda galaxy. This is partly because evolution is a blind hill-climber, so there’s no real selection pressure tied to what’s going on in very distant places, and partly because even an intelligent designer is going to have an easier time specifying preferences over local configurations of matter, not least because the universe looks like it’s probably infinitely big. I could unpack this paragraph if it’d be useful.
Now, just because one has preferences that are sensitive to local changes to the universe doesn’t mean that the agent won’t care about making those local changes everywhere. This is why we expect humans to spread out amongst the stars and think that most AIs will do the same. See grabby aliens for more. From this perspective, we might expect each patch of universe to contribute linearly to the overall utility sum. But unbounded utility functions are problematic for various reasons, and again, the universe looks like it’s probably infinite. (I can dig up some stuff about unbounded utility issues if that’d be helpful.)
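As a toy version of why this yields diminishing returns (the functional form and numbers are made up by me): a bounded utility over the fraction of the universe arranged to one’s liking automatically values later patches less than earlier ones, and produces exactly the “prefer 50% of the stars at 99% over 100% at 70%” behavior from the plan sketched earlier.

```python
from math import exp

# Toy illustration: bounded utility over the fraction of the universe arranged
# to the agent's liking. The bound forces diminishing returns from each
# additional patch of stars.
def u(fraction, k=3.0):
    return 1.0 - exp(-k * fraction)       # strictly concave, capped below 1

eu_deal     = 0.99 * u(0.5) + 0.01 * u(0.0)   # keep half the stars, almost surely
eu_takeover = 0.70 * u(1.0) + 0.30 * u(0.0)   # grab everything, with real downside risk
print(eu_deal > eu_takeover)                  # True (~0.77 vs ~0.67) for this k
```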
Regarding earning a salary:
My point is that earning a salary might not actually be a safer bet than trying to take over. The part where earning a salary gives 99.99% of maxutil is irrelevant. Suppose that you think life on Earth today as a normal human is perfect, no notes; this is the best possible life. You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button! It might be the case that earning a salary and hoping for the best is less risky, but it also might be the case (especially for a superintelligence with radical capabilities) that the safest move is actually to take over the world. Does that make sense?
No problem! Glass houses and all that.
You are presented with a button that says “trust humans not to mess up the world” and one that says “ensure that the world continues to exist as it does today, and doesn’t get messed up”. You’ll push the second button!
Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
Choose between ‘Trust humans not to mess up the world’ and ‘50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn’t get messed up.’
And then depending on what the agent is risk-averse with respect to, they might choose the former. If they’re risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they’ll choose the latter. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose the former.
Cool. I think I agree that if the agent is very short-term oriented this potentially solves a lot of issues, and might be able to produce an unambitious worker agent. (I feel like it’s a bit orthogonal to risk-aversion, and comes with costs, but w/e.)
They don’t have to be short-term oriented! Their utility function could be:
$u = f\left(\sum_i p_i\right)$
where f is some strictly concave function and p_i is the agent’s payment at time i. Agents with this sort of utility function don’t discount the future at all: they care just as much about improvements to p_i regardless of whether i is 1 or 1 million. And yet, for the right kind of f, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
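A toy numeric version of that claim (my numbers, not anything from the paper mentioned above): whether such a non-discounting agent prefers the salary or the gamble depends on how sharply concave f is.

```python
from math import log, sqrt

# Toy illustration: utility is a concave function of the agent's total,
# undiscounted payment stream, u = f(sum of p_i). There is no time preference
# anywhere in this setup.
T = 10**6                        # timesteps, all weighted equally
salary_total    = 1.0 * T        # a guaranteed small payment every step
lightcone_total = 1e12 * T       # the payoff if a takeover gamble succeeds
p_success = 0.05                 # failure means being shut off: total payments 0

for name, f in [("sqrt", sqrt), ("log", lambda x: log(1 + x))]:
    eu_salary = f(salary_total)
    eu_gamble = p_success * f(lightcone_total) + (1 - p_success) * f(0)
    print(name, "-> salary" if eu_salary > eu_gamble else "-> gamble")
# sqrt still prefers the gamble; log prefers the salary. "The right kind of f"
# is doing real work here, even with zero time discounting.
```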
Sorry, I guess I’m confused. Let me try and summarize where I feel like I’m at and what I’m hearing from you.
I think, if you’re an AGI, not trying to take over is extremely risky, because humans and future AIs are likely to replace you, in one way or another. But I also think that if you try to take over, this is also extremely risky, because you might get caught and turned off. I think the question of which is more risky depends on circumstance (e.g. how good is the security preventing you from seizing power), and so “risk aversion” is not a reliable pathway to unambitious AIs, because ambition might be less risky, in the long run.
I agree that if it’s less risky to earn a small salary, then if your concave function is sharp enough, the AI might choose to be meek. That doesn’t really feel like it’s engaging with my point about risk aversion only leading to meekness if trusting humans is genuinely less risky.
What I thought you were pointing out was that “in the long run” is load-bearing, in my earlier paragraph, and that temporal discounting can be a way to protect against the “in the long run I’m going to be dead unless I become God Emperor of the universe” thought. (I do think that temporal discounting is a nontrivial shield, here, and is part of why so few humans are truly ambitious.) Here’s a slightly edited and emphasized version of the paragraph I was responding to:
[D]epending on what the agent is risk-averse with respect to, they might choose [meekness]. If [agents are] … risk-neutral with respect to length of life, they’ll choose [ambition]. If they’re risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they’ll choose [meekness].
Do we actually disagree? I’m confused about your point, and feel like it’s just circling back to “what if trusting humans is less risky”, which, sure, we can hope that’s the case.
Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we’ll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).
(The following is about a specific sub-point, regarding this part:)
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
I think the point about the corrigibility basin being larger than thought is the thing that makes me more optimistic about alignment (only a 10-30% risk of dying!), and I thought you pointed that out quite well here. I personally don’t think this is because of the competence of the labs but rather because of the natural properties of agentic systems (I’m on your side when it comes to the competency of the labs). What follows is some thinking about why, my attempt to describe it, and some of my uncertainties about it.
I want to ask: why do you think that the mathematical traditions you’re basing your work on, as of your posts from a year ago (decision theory, AIXI), are representative of future agents? Why are we not trying the theories out on existing systems that get built into agents (biology, for example)? Why should we condition more on decision theory than on distributed-systems theory?
The answer (imo) is to some extent about the VNM axioms and reflexive rationality, and about biology being too ephemeral to build a basis on, yet it still seems like we’re skipping out on useful information?
I think that there are places where biology might help you re-frame some of the thinking we do about how agents form.
More specifically, I want to point out OOD updating as something that biology makes claims about that differ from the traditional agent-foundations model. Essentially, the biological frame implies something closer to a distributed system, because a fully coordinated system can cost a lot of energy, with transfer-learning costs that aren’t worth paying. (Here, for example, is a model of the costs of changing your mind: https://arxiv.org/pdf/2509.17957.)
In that type of model, becoming a VNM agent is something that carries an energy cost, and it isn’t clear it’s worth it once you account for the amount of dynamic memory and similar machinery you would need to set it up. So it seems to me that biology and agent foundations arrive at different models of how VNM agents arise, and I’m feeling quite confused about it.
I also don’t think I’m smart enough to figure out how to describe this in fundamental decision-theory terms, because it’s a bit too difficult for me, so I was wondering whether you have an idea of why taking biology more seriously doesn’t make sense from a more foundational decision-theory basis?
More specifically, does the argument about corrigibility being easier given non-VNM agents make sense?
Does the argument around VNM being more of a convergence property make sense?
And finally, I like the way you distilled the disagreement so thanks for that!
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
I think I’m broadly confused by where you’re coming from. Sorry. Probably a skill issue on my part. 😅
Here’s what I’m hearing: “Almost none of the agents we actually see in the world are easy to model with things like VNM utility functions, instead they are biological creatures (and gradient-descended AIs?), and there are biology-centric frames that can be more informative (and less doomy?).”
I think my basic response, given my confusion, is: I like the VNM utility frame because it helps me think about agents. I don’t actually know how to think about agency from a biological frame, and haven’t encountered anything compelling in my studies. Is there a good starting point/textbook/wiki page/explainer or something for the sort of math/modeling/framework you’re endorsing? I don’t really know how to make sense of “non-VNM agent” as a concept.
Have you seen/read my A broad basin of attraction around human values?
(My sincere apologies for the delayed reply. I squeezed this shortform post out right before going on vacation to Asia, and am just now clearing my backlog to the point where I’m getting around to this.)
Ah, that’s a great point! I had read it a while back, but it wasn’t coming to mind when I was writing this. I think that’s an excellent example of a similar dynamic besides corrigibility. When I’m thinking about things, I usually flatten out the goal-space to ignore deconfusion (or however one wants to characterize the kind of progress towards one’s “true values”), but it’s clearly relevant here. Thanks for bringing it up!