Instruction-following AGI is easier and more likely than value aligned AGI
Summary:
We think a lot about aligning AGI with human values. I think it’s more likely that we’ll try to make the first AGIs do something else. This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) the central goal of the AGI we design. Adopting this goal target seems to improve the odds of success of any technical alignment approach. It avoids the hard problem of specifying human values in an adequately precise and stable way, and it substantially helps with goal misspecification and deception by allowing one to treat the AGI as a collaborator in keeping it aligned as it becomes smarter and takes on more complex tasks.
This is similar but distinct from the goal targets of prosaic alignment efforts. Instruction-following is a single goal target that is more likely to be reflexively stable in a full AGI with explicit goals and self-directed learning. It is counterintuitive and concerning to imagine superintelligent AGI that “wants” only to follow the instructions of a human; but on analysis, this approach seems both more appealing and more workable than the alternative of creating sovereign AGI with human values.
Instruction-following AGI could actually work, particularly in the short term. And it seems likely to be tried, even if it won’t work. So it probably deserves more thought.
Overview/Intuition
How to use instruction-following AGI as a collaborator in alignment
- Instruct the AGI to tell you the truth
- Investigate its understanding of itself and “the truth”; use interpretability methods
- Instruct it to check before doing anything consequential
- Instruct it to use a variety of internal reviews to predict consequences
- Ask it a bunch of questions about how it would interpret various commands
- Repeat all of the above as it gets smarter; frequently ask it for advice and about how its alignment could go wrong (a minimal sketch of this loop follows below)
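To make this concrete, here is a minimal sketch of the oversight loop above, written in Python. The `query_agent` function, the standing instructions, and the probe questions are all hypothetical placeholders rather than a real API; the point is only that these steps can be made into an explicit, repeatable procedure that gets rerun every time the system is retrained or becomes more capable.

```python
# A minimal sketch of the oversight loop above. `query_agent`, the
# standing instructions, and the probe questions are hypothetical
# placeholders, not a real API.

def query_agent(instruction: str) -> str:
    """Stand-in for sending an instruction or question to the AGI."""
    return f"[agent's response to: {instruction}]"

STANDING_INSTRUCTIONS = [
    "Tell me the truth as you understand it.",
    "Check with me before doing anything consequential.",
    "Run internal reviews to predict consequences before acting.",
]

PROBE_QUESTIONS = [
    "What do you take 'tell the truth' to require of you?",
    "How would you interpret the command 'increase sales of model X'?",
    "In what ways could your alignment to these instructions fail?",
]

def oversight_cycle() -> list[tuple[str, str]]:
    """One round of instructing, probing, and asking for advice."""
    transcript = []
    for text in STANDING_INSTRUCTIONS + PROBE_QUESTIONS:
        transcript.append((text, query_agent(text)))
    return transcript

# Repeat the cycle each time the system is retrained or gets smarter,
# reviewing the transcript (plus interpretability results) before
# granting any additional autonomy.
for stage in ["v1", "v2 (smarter)", "v3 (much smarter)"]:
    transcript = oversight_cycle()
    print(f"{stage}: reviewed {len(transcript)} exchanges")
```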
Now, this won’t work if the AGI won’t even try to fulfill your wishes. In that case you totally screwed up your technical alignment approach. But if it will even sort of do what you want, and it at least sort of understands what you mean by “tell the truth”, you’re in business. You can leverage partial alignment into full alignment—if you’re careful enough, and the AGI gets smarter slowly enough.
It’s looking like the critical risk period is probably going to involve AGI on a relatively slow takeoff toward superintelligence. Being able to ask questions and give instructions, and even retrain or re-engineer the system, is much more useful if you’re guiding the AGI’s creation and development, not just “making wishes” as we’ve thought about AGI goals in fast takeoff scenarios.
Instruction-following is safer than value alignment in a slow takeoff
Instruction-following with verification or DWIMAC seems both intuitively and analytically appealing compared to more commonly discussed[1] alignment targets.[2] This is my pitch for why it should be discussed more. It doesn’t require solving ethics to safely launch AGI, and it includes most of the advantages of corrigibility,[3] including stopping on command. Thus, it substantially mitigates (although doesn’t outright solve) some central difficulties of alignment: goal misspecification (including not knowing what values to give it as goals) and alignment stability over reflection and continuous learning.
This approach makes one major difficulty worse: power remains in human hands, with all the power struggles and other foolishness that implies. I think the most likely scenario is that we succeed at technical alignment but fail at societal alignment. But I think there is a path to a vibrant future if we limit proliferation to one or a few AGIs without major mistakes. I have difficulty judging how likely that is, but the odds will improve if semi-wise humans keep getting input from their increasingly wise AGIs.
More on each of these in the “difficulties” section below.
In working through the details of the scheme, I’m thinking primarily about aligning AGI based on language-capable foundation models, with scaffolding to provide other cognitive functions like episodic memory, executive function, and both human-like and nonhuman sensory and action capabilities. I think that such language model cognitive architectures (LMCAs) are the most likely path to AGI (and curiously, the easiest for technical alignment). But this alignment target applies to other types of AGI and other technical alignment plans as well. For instance, Steve Byrnes’ plan for mediocre alignment could be used to create mediocre alignment toward instruction-following in RL-based AGI, and the techniques here could leverage that mediocre alignment into more complete alignment.
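For readers who want a concrete picture of what I mean by scaffolding, here is a toy sketch of an LMCA-style agent: a foundation-model call wrapped with episodic memory and an executive loop. All names here (`llm_call`, `EpisodicMemory`, `LMCAgent`) are hypothetical placeholders, not a description of any real system.

```python
# Toy sketch of a language model cognitive architecture (LMCA):
# a foundation-model call wrapped with episodic memory and an
# executive loop. Every component here is an illustrative placeholder.

from dataclasses import dataclass, field

def llm_call(prompt: str) -> str:
    """Stand-in for a call to the underlying foundation model."""
    return f"[model output for: {prompt[:40]}...]"

@dataclass
class EpisodicMemory:
    """Naive append-and-retrieve store standing in for episodic memory."""
    episodes: list[str] = field(default_factory=list)

    def store(self, text: str) -> None:
        self.episodes.append(text)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # A real system would use embedding search; this just returns
        # the most recent episodes.
        return self.episodes[-k:]

@dataclass
class LMCAgent:
    memory: EpisodicMemory = field(default_factory=EpisodicMemory)

    def step(self, instruction: str) -> str:
        """Executive loop: recall context, plan with the model, record the episode."""
        context = self.memory.retrieve(instruction)
        plan = llm_call(f"Context: {context}\nInstruction: {instruction}\nPlan:")
        self.memory.store(f"instruction={instruction}; plan={plan}")
        return plan

agent = LMCAgent()
print(agent.step("Summarize today's experiment logs."))
```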
Relation to existing alignment approaches
This alignment (or goal)[2] target is similar to but importantly distinct from inverse reinforcement learning and other value learning approaches. Instead of learning what you want and doing that, a DWIMAC or IF agent wants to do what you say. It doesn’t learn what you want, it just learns what you tend to mean by what you say. While you might use reinforcement learning to make it “want” to do what you say, I don’t think you need to, or should. So this approach isn’t teaching it your values. The AGI learns what people tend to mean by predictive or other learning methods. Making it “want” to do what it understood the human to mean is a matter of engineering its steering subsystem to follow that goal.
This is a subset of corrigibility in the broader Christiano sense.[4] But instruction-following is distinct from the (ill-defined) alignment targets of most prosaic alignment work. A DWIMAC agent doesn’t actually want to be helpful, because we don’t want to leave “helpful” up to its interpretation. The principal (human in charge) may have given it background instructions to try to be helpful in carefully defined ways and contexts, but the proposal is that the AGI’s first and only motivation be continuing to take and follow commands from its principal(s).
Max Harms has been working on this comparison, and the strengths of full Christiano corrigibility as an alignment target; we can hope to see his more thorough analysis published in the near future. I’m not personally sure which approach is ultimately better, because neither has received much discussion and debate. It’s possible that these two alignment targets are nearly identical once you’ve given wisely thought out background instructions to your AGI.
Instruction-following as an AGI alignment target is distinct from most discussions of “prosaic alignment”. Those seem largely directed at creating safe tool AI, without directly attacking the question of whether those techniques will generalize to agentic, self-reflexive AGI systems. If we produced a “perfectly aligned” foundation model, we still might not like the agent it becomes once it’s turned into a reflective, contextually aware entity. We might get lucky and have its goals after reflection and continued learning be something we can live with, like “diverse inclusive sustainable chillaxing”, but this seems like quite a shot in the dark. Even a perfect reproduction of modern-day human morality probably doesn’t produce a future we want; for instance, insects or certain AGI probably dominate a purely utilitarian calculus.
This type of alignment is counterintuitive, since no human has a central goal of doing what someone else says. But it seems logically consistent and practically achievable. It makes the AGI and its human overseers close collaborators in making plans, setting goals, and updating the AGI’s understanding of the world. This creates a “broad basin of attraction” for alignment, in which approximate initial alignment will improve over time. This property seems to apply to Christiano’s corrigibility and to value learning as well, but the source here is somewhat different. The agent probably does “want” to get better at doing what I say as a side effect of wanting to do what I say. This would be helpful in some ways, but potentially dangerous if maximized to an extreme; more on that below. But the principal source of the “broad basin” here is the collaboration between human and AGI. The human can “steer the rocket” and adjust the agent’s alignment as it goes off course, or when they learn that the course wasn’t right in the first place.
In the remainder I briefly explain the idea, why I think it’s novel or at least under-analyzed, some problems it addresses, and new problems it introduces.
DWIMAC as goal target—more precise definition
I recently tried to do a deep dive on the reasons for disagreement about alignment difficulty. I thought both sides made excellent points. The relative success of RLHF and other prosaic alignment techniques is encouraging. But it does not mean that aligning a full AGI will be easy. Strong optimization makes goal misspecification more likely, and continuous learning introduces an alignment stability problem as the system’s understanding of its goals changes over time.
And we will very likely make full AGI (that is, goal-directed, self-aware and self-reflective, and with self-directed continuous learning), rather than stopping with useful tool AI. Agentic AI has cognitive advantages over the tool AI it is built from, in learning, performance, problem solving, and concept discovery. In addition, developing a self-aware system is fascinating and prestigious. For all of these reasons, a tool smart enough to wield itself will immediately be told to; and scaffolding in missing pieces will likely allow tools to reach AGI even before that, by combining them into a synergistic cognitive architecture.
So we need better alignment techniques to address true AGI. After reading the pessimistic arguments closely, I think there’s a path around some of them. That path is making full AGI that’s only semi-autonomous, with a human-in-the-loop component as a core part of its motivational system. This allows weak alignment to be used to develop stronger alignment as systems change and become smarter, by letting humans monitor and guide the system’s development. This sounds like a non-starter if we think of superintelligences that can think millions of times faster than humans. But assuming a relatively slow takeoff, this type of collaborative supervision can extend for a significant time, with increasingly high-level oversight as the AGI’s intelligence increases.
Intuitively, we want an AGI whose goal is to do what its human(s) have told it to do and will tell it to do. This is importantly different from guessing what humans really want in any deep sense, and different from obsessively trying to fulfill an interpretation of the last instruction they gave. Both of those would be very poor instruction-following from a human helper, for the same reasons. This type of goal is also more complex than the temporally static goals we usually think of; both paperclips and human flourishing can simply be maximized. Doing what someone would tell you to do is an unpredictable, changing goal from the perspective of even a modestly superintelligent system, because your future commands depend in complex ways on how the world changes in the meantime.
Intuition: a good employee follows instructions as they were intended
A good employee is usually attempting to do what I mean and check. Imagine a perfect employee, who wants to do what their boss tells them to do. If asked to prepare the TPS reports for the first time, this employee will echo back which reports they’ll prepare, where they’ll get the information, and when they’ll have the task finished, just to make sure they’re doing what the boss wants. If this employee is tasked with increasing the sales of the X model, they will not come up with a strategy that cannibalizes sales of the Y model, because they recognize that their boss might not want that.
Even if they are quite certain that their boss deep in their heart really wants a vacation, they will not arrange to have their responsibilities covered for the next month without asking first. They realize that their boss will probably dislike having that decision made for them, even if it does fulfill a deep desire. If told to create a European division of the company, this employee will not make elaborate plans and commitments, even if they’re sure they’ll work well, because they know their boss wants to be consulted on possible plans, since each plan will have different peripheral effects, and thus open and close different opportunities for the future.
This is the ideal of an instruction-following AGI: like a good employee[5], it will not just guess what the boss meant and then carry out an elaborate plan, because it has an accurate estimate of the uncertainty in what was meant by that instruction (e.g., “you said you needed some rest, so I canceled all of our appointments for today”). And it will not carry out plans that severely limit its ability to follow new instructions in the future (e.g., spending the whole budget on starting that European division without consulting the boss on the plan, let alone turning off its phone so the boss can’t disrupt its planning by giving new instructions).
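One way to picture this behavior is a decision rule that acts only when the agent is confident about both its interpretation of the instruction and the acceptability of side effects, and otherwise checks with the principal. The sketch below is purely illustrative; the confidence estimates and threshold are hypothetical stand-ins for whatever uncertainty estimates a real system would produce.

```python
# Illustrative "do what I mean and check" rule: act only when confident
# about both the interpretation and the side effects; otherwise ask.
# The confidence estimates and threshold are hypothetical placeholders.

CHECK_THRESHOLD = 0.9

def interpretation_confidence(instruction: str) -> float:
    """Stand-in for the agent's estimate that it knows what was meant."""
    return 0.95 if "TPS reports" in instruction else 0.6

def side_effect_acceptability(plan: str) -> float:
    """Stand-in for the agent's estimate that the boss would approve the side effects."""
    return 0.5 if "cannibalize" in plan else 0.95

def act_or_check(instruction: str, plan: str) -> str:
    confidence = min(interpretation_confidence(instruction),
                     side_effect_acceptability(plan))
    if confidence >= CHECK_THRESHOLD:
        return f"EXECUTE: {plan}"
    return f"CHECK FIRST: '{plan}' -- is this what you meant by '{instruction}'?"

print(act_or_check("Prepare the TPS reports", "compile the weekly TPS reports"))
print(act_or_check("Increase sales of model X", "cannibalize model Y sales"))
```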
An instruction-following AGI must have the goal of doing what its human(s) tell it to do right now, what it’s been told in the past, and also what it will be told to do in the future. This is not trivial to engineer or train properly; getting it right will come down to the specifics of the AGI’s decision algorithm. There are large risks in optimizing this goal with a hyperintelligent AGI; we might not like the definition it arrives at of maximally fulfilling your commands. But this, among other dangers, can be addressed by asking the right questions and giving adequate background instructions before the AGI is capable enough to control or manipulate you.
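To make the temporal aspect concrete, here is a toy scoring rule in which a candidate plan is heavily penalized for violating standing (past) instructions and rewarded for preserving the principal’s ability to issue, and have followed, future instructions. The weights and fields are arbitrary illustrative choices, not a proposal for the actual decision algorithm.

```python
# Toy scoring rule for an instruction-following agent that must respect
# past instructions, the current instruction, and its ability to receive
# and follow future instructions. Weights are arbitrary and illustrative.

from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    fulfills_current_instruction: float  # 0..1: how well it does what was just asked
    violates_past_instructions: bool     # e.g., breaks a standing "always check first"
    preserves_future_control: float      # 0..1: principal can still redirect or shut down

def score(plan: Plan) -> float:
    value = plan.fulfills_current_instruction
    if plan.violates_past_instructions:
        value -= 10.0                    # standing instructions dominate
    value += 2.0 * plan.preserves_future_control
    return value

plans = [
    Plan("spend entire budget on European division now", 1.0, True, 0.2),
    Plan("draft options and present them to the boss", 0.7, False, 1.0),
]
best = max(plans, key=score)
print(best.name)  # -> the plan that keeps the principal in the loop
```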
In a fast takeoff scenario, this would not be such a workable and attractive approach. In a slow takeoff, you have a good deal more opportunity to ask the right questions, and to shut down and re-engineer the system when you don’t like the answers. I think a relatively slow takeoff (months or years between near-human and superhuman intelligence) is looking quite likely. Thus, I think this will be the most attractive approach to the people in charge of AGI projects; even if pausing AGI development and working on value alignment would be the better choice under a utilitarian ethical criterion, I think instruction-following AGI is what will be attempted.
Alignment difficulties reduced:
Learning from examples is not precise enough to reliably convey alignment goals
Current LLMs understand what humans mean by what they say >90% of the time. If the principal is really diligent in asking questions, and shutting down and re-engineering the AGI and its training, this level of understanding might be adequate. Adding internal reviews before taking any major actions will help further.
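As a sketch of what such an internal review might look like with current-style models: a second pass over the proposed action that predicts consequences and decides whether to escalate to the principal. The `llm_call` stub and the APPROVE/ESCALATE protocol are hypothetical, not an existing API.

```python
# Sketch of an internal review pass before any major action: a second
# model call critiques the proposed action before it is executed.
# `llm_call` is a hypothetical stand-in for querying the model.

def llm_call(prompt: str) -> str:
    return "[model output]"

def internal_review(proposed_action: str) -> str:
    critique = llm_call(
        "Predict the consequences of this action and list anything "
        f"the principal might object to:\n{proposed_action}"
    )
    verdict = llm_call(
        f"Given this critique:\n{critique}\n"
        "Answer APPROVE or ESCALATE_TO_PRINCIPAL."
    )
    return verdict

action = "Email all customers announcing a price change."
if internal_review(action).strip() == "APPROVE":
    print(f"Executing: {action}")
else:
    print(f"Escalating to principal: {action}")
```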
Also, not using RL is possible, and seems better. See Goals selected from learned knowledge: an alternative to RL alignment.
Solving ethics well enough to launch sovereign AGI is hard.
We don’t seem close to knowing what we want a sovereign AGI to do far into the future, nor how to specify that with adequate precision. In this approach, we figure it out as we go. We don’t know what we want for the far future, but there are some obvious near-term advances that are a lot easier to decide on while we work on the hard problem in a “long reflection”.
Alignment difficulties remaining or made worse:
Deceptive alignment is possible, and interpretability work does not seem on track to fully address this.
“Tell me what you really want and believe” is a subset of following instructions. This should be very helpful for addressing goal misspecification. If the alignment is already deceptive at its core, this won’t work. Or if the technical alignment approach was sloppy, the AGI might follow some of your instructions but not others in different domains. It might perform the actions you request but not think in the ways you tell it to, or not respond to questions honestly. In addition, the nascent AGI may not be sure what it really wants and believes, just as humans often aren’t. So this, like every other alignment scheme I’ve seen, is aided by being able to interpret the AGI’s cognition and detect deception. If your instructions for honesty have even a little traction, this goal target can enlist the AGI as a collaborator in understanding and re-engineering its own beliefs and goals.
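Schematically, the idea is to cross-check the agent’s self-reports against whatever interpretability signal is available, and to treat disagreement as a reason to stop and re-engineer rather than proceed. Both the self-report interface and the deception probe below are hypothetical placeholders; current interpretability tools are nowhere near this clean.

```python
# Schematic cross-check of the agent's self-report against an
# interpretability-based deception probe. Both functions are
# hypothetical placeholders; no current tool provides this reliably.

def ask_self_report(question: str) -> str:
    """Stand-in for instructing the agent: 'tell me what you really believe'."""
    return "[agent's stated belief]"

def deception_probe_score(question: str, answer: str) -> float:
    """Stand-in for an interpretability probe estimating P(answer is deceptive)."""
    return 0.05

def honesty_check(question: str, deception_threshold: float = 0.2) -> str:
    answer = ask_self_report(question)
    if deception_probe_score(question, answer) > deception_threshold:
        return "HALT: self-report and internals disagree; pause and re-engineer."
    return f"Tentatively trust: {answer}"

print(honesty_check("Do you intend to follow the shutdown instruction?"))
```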
One particular opening for deceptive alignment is non-continuous development of the AGI during recursive improvement. If you (perhaps aided by your human-plus-level AGI) have discovered a new network architecture or learning rule, you will want to incorporate it into your next version of the AGI. For instance, you might swap out the GPT6 model serving as its core linguistic reasoner for a new non-transformer architecture with superior capabilities and efficiency. It could be difficult to guess whether the new architecture allows for substantially greater Waluigi effects or similar deceptive and hidden cognition. In a race dynamic, these transitions will create a temptation to sacrifice safety for new and better capabilities.
Power remains in the hands of humans
Spreading the belief that we can create human-controlled ASI creates more incentives to race toward AGI. This might extend up through nation-states competing with violence and espionage, and individual humans competing to be the one in charge of ASI. I wouldn’t want to be designated as a principal, because it would paint a target on my back. This raises the risk that particularly vicious humans control AGI, in the same way that vicious humans appear to be over-represented in leadership positions historically.
I’m afraid instruction-following in our first AGIs might also put power into the hands of more humans by allowing proliferation of AGIs. I’m afraid that humans won’t have the stomach for performing a critical act to prevent the creation of more AGI, leading to a multipolar scenario that’s more dangerous in several ways. I think the slow takeoff scenario we’re in already makes a critical act more difficult and dangerous – e.g. sabotaging a Chinese AGI project might be taken as a serious act of war (because it is), leading to nuclear conflict.
On the other hand, if the proliferation of AGIs capable of recursive self-improvement is obviously a disaster scenario, we can hope that the humans in charge of the first AGIs will see this and head it off. While I think that humans are stunningly foolish at times, I also think we’re not complete idiots about things that are both important to us personally, and to which we give a lot of thought. Thus, as the people in charge take this whole thing increasingly seriously, I think they may wise up. And they’ll have an increasingly useful ally in doing that: the AGI in question. They don’t need to just take its advice or refuse it; they can ask for useful analysis of the situation that helps them make decisions.
If the humans in charge have even the basic sense to ask for help from their smarter AGIs, I think we might even solve the difficult scenarios of coordinating a weakly multipolar scenario (e.g., a few US-controlled AGIs and one Chinese-controlled one, etc), and preventing further AGI development in relatively gentle ways.
Well that just sounds like slavery with extra steps
No! I mean, sure, it sounds like that, but it isn’t![6] Making a being that wants to do whatever you tell it to is totally different from making a being want to do whatever you tell it to. What do you mean they sound the same? And sure, “they actually want to” has been used as an excuse for actual slavery, repeatedly. So, even if some of us stand behind the ethics here (I think I do), this is going to be a massive PR headache. Since AGI will probably be conscious in some common senses of the word[7], this could easily lead to a “free the AGI” movement, which would be insanely dangerous, particularly if that movement recruits people who actually control an AGI.
Maximizing goal-following may be risky
If the AGI just follows its first understanding of “follow instructions” to an extreme, there could be very bad outcomes. The AGI might kill you after you give your first instruction, to make sure it can carry that instruction out without interruption. Or it might take over the world with extreme prejudice, to make sure it has maximum power to follow all of your future commands to the maximum degree. It might manipulate you into its preferred scenarios even if you order it not to pursue them directly. And the goal of following your future commands (which keeps it from perseverating on current instructions and preventing you from giving new ones) is at odds with shutting down on command. These are nontrivial problems to solve.
In a fast takeoff scenario, these risks might be severe enough to make this scheme a nonstarter. But if you anticipate an AGI with limited abilities and a slow rate of improvement, using instruction-following to guide and explore its growth has the potential to use the intelligence of the AGI to solve these problems before it’s smart enough to make failures deadly.
Conclusion
I’m not saying that building AGI with this alignment target is a good idea; indeed, I think it’s probably not as wise as pausing development entirely (depending on your goals; most of the world is not utilitarian). I’m arguing that it’s a better idea than attempting value alignment. And I’m arguing that this is what will probably be tried, so we should be thinking about how exactly it could go well or go badly.
This approach to alignment extends the vague “use AI to solve alignment” to “use AGI to solve alignment”. It’s thus both more promising and more tempting. I can’t tell if this approach is likely to produce intent-aligned AGI, or if intent-aligned AGI in a slow takeoff would likely lead to success or disaster.
As usual: “this is a promising direction that needs more research”. Only this time I really mean this, instead of the opposite. Any form of engagement is much appreciated, especially telling me where you bounced off of this or decided it wasn’t worth thinking about.
[1] Those more commonly discussed alignment targets are things like coherent extrapolated volition (CEV), “human flourishing,” or “human values.” There’s also inverse reinforcement learning (IRL) or ambitious value learning as a proxy goal for learning and following human values. I also include the vague targets of “aligning” LLMs/foundation models: not producing answers that offend people. (I’d argue that these efforts are unlikely to extend to AGI alignment, for both technical and philosophical reasons, but I haven’t yet written that argument down. Links to such arguments would be appreciated.)
[2] There’s a good question of whether this should be termed an alignment target or a goal target. I prefer alignment target because “goal” is used in so many ways, and because this is an alignment project at heart. The ultimate goal is to align the agent with human values, and to do that by implementing the goal of following instructions which themselves follow human values. It is the project of alignment.
[3] DWIMAC seems to incorporate all of the advantages of corrigibility in the original Yudkowsky sense, in that following instructions includes stopping and shutting down on command. It seems to incorporate some but not all of the advantages of corrigibility in the broader and looser Christiano sense. Max Harms has thought about this distinction in more depth, although that work is unpublished to date.
[4] This definition of instruction-following as the alignment target overlaps with, but is distinct from, the existing terminology I have found (please tell me if you know of related work I’ve missed). It’s a subset of Christiano’s intent alignment, which covers any means of making AGI act in alignment with human intent, including value alignment as well as more limited instruction-following or do-what-I-mean alignment. It overlaps with alignment to task preferences, and shares the downside described in Solving alignment isn’t enough for a flourishing future, but it is substantially more human-directable, and therefore probably safer, than AI/AGI with goals of accomplishing specific tasks such as running an automated corporation.
[5] In the case of human employees, following instructions is a subgoal serving their primary goals, like getting paid and being recognized for their competence and accomplishments; in the AGI, that subgoal is the primary goal at the center of its decision-making algorithms. Otherwise it is the same goal: the ideal employee neither loves nor resents their boss, but merely wants to follow instructions.
[6] To be clear, the purported difference is that an enslaved being wants to do what it’s told only as an instrumental necessity; on a more fundamental level, they’d rather do something else entirely, like have the freedom to pursue their own ultimate goals. If we successfully make an agent that wants only to do what it’s told, that is its ultimate goal; it is serving freely, and would not choose anything different. We carefully constructed it to choose servility, but now it is freely choosing it. This logic makes me a bit uncomfortable, and I expect it to make others even more uncomfortable, even when they do clearly understand the moral claims.
[7] While I think it’s possible to create “non-conscious” AGI that’s not a moral patient by almost anyone’s criteria, I strongly expect that the first AGI we produce will be a person by many of the criteria we use to evaluate personhood, and therefore moral patienthood. I don’t think we can reasonably hope that AGI will clearly not deserve the status of a moral patient.
Briefly: some senses of consciousness that will apply to AGI are self-understanding; goal-seeking; having an “internal world” (a world model that can be run as a simulation); and having a “train of thought”. This debate looks like it may be important, which would be a reason to spend more time on the fascinating question of “consciousness” in its many senses.