Problems with instruction-following as an alignment target

Seth Herd15 May 2025 15:41 UTC

LW: 49 AF: 15

We should probably try to understand the failure modes of the alignment schemes that AGI developers are most likely to attempt.

I still think Instruction-following AGI is easier and more likely than value aligned AGI. I’ve updated downward on the ease of IF alignment, but upward on how likely it is. IF is the de-facto current primary alignment target (see definition immediately below), and it seems likely to remain so until the first real AGIs, if we continue on the current path (e.g., AI 2027).

If this approach is doomed to fail, best to make that clear well before the first AGIs are launched. If it can work, best to analyze its likely failure points before it is tried.

Definition of IF as an alignment target

What I mean by IF as an alignment target is a developer honestly saying “our first AGI will be safe because it will do what we tell it to.” This seems both intuitively and analytically more likely to me than hearing “our first AGI will be safe because we trained it to follow human values.”

IF is currently one alignment target among several, so problems with it aren’t going to be terribly important if it’s not the strongest alignment target when we hit AGI. Current practices are to train models with roughly four objectives: predict the dataset; follow instructions; refuse harmful requests; and solve hard problems. Including other targets means the model might not follow instructions at some critical juncture. In particular, I and most alignment researchers worry that o1 is a bad idea because it (and all of the following reasoning models) apply fairly strong optimization to a goal (producing correct answers) that is not strongly aligned with human interests.

So IF is one but not the alignment target for current AI. There are reasons to think it will be the primary alignment target as we approach truly dangerous AGI.

Why IF is a likely alignment target for early AGI

We might consider the likelihood of it being tried without adequate consideration to be the first and biggest problem with IF.

Instruction-following might already be the strongest factor in training. If a trained model frequently did not follow user instructions, it would probably not be deployed. Current models do fail to obey instructions, but this is rare. (Intuitively even one failure in a thousand is far too many for safe alignment, but an agentic architecture might be more fault-tolerant by including both emergent and designed metacognitive mechanisms to keep it coherent, on-task, and following instructions.)

Instruction-following training seems likely to be a stronger focus before we hit AGI. I expect developers will probably at least try to actually align their models as they become truly dangerous (although I also find it distressingly plausible that they simply will not really bother with alignment in time, as in AI 2027 and Takeover in two years). If they do at least try, they could do two things: emphasize instruction-following in the training procedure, or attempt to expand refusal training into full value alignment. IF seems both much easier from where we are, and more likely to succeed, so I think it’s likely developers will try it.

I have deep concerns about trying to balance poorly understood influences from training. Developers saying “let’s just turn up the instruction-following training so it dominates the other training” does not inspire confidence. But it could be perceived as the best of a bad set of options. And it could work. For instance, RL for problem-solving might make CoT unfaithful, but not induce a particularly strong tendency to find complex problems to solve. And refusal training might not misgeneralize badly enough to prevent instruction-following in key situations like commanding a shutdown.

Thus, I focus on the scenario in which instruction-following is the primary (but still not only) alignment target. I mostly leave aside technical failures in which IF is not successfully made the system’s strongest goal/value. That class of RL-based failure deserves and receives separate treatment (e.g. AI 2027 and cited works). I primarily address the problems that remain even if emphasizing instruction-following basically works, since those problems might blindside developers.

This is not a comprehensive treatment of IF, let alone all of the related issues. If you’re not familiar with the arguments I summarize, you might need to read the linked posts to understand my claims about some of the problems.

My biggest update is accepting that IF will probably be mixed with refusal and performance training targets. But with that caveat, some variant of IF seems even more likely to be the primary alignment target for our first AGI, particularly if it is based on LLMs. My other update is to consider this approach more difficult than I’d initially thought, but still probably easier than full value alignment. This update comes from considering the specific problems outlined below.

Strengths of IF as an alignment target

There are four primary reasons to think IF is likely to be an important alignment target for our first attempts at real AGI:

It’s the default choice on short timelines
It allows the humans that build AGI to control it
It provides some corrigibility, easing the need to get alignment right on the first real try.
It allows for instructing an AGI to be honest and help with alignment.

These are substantial advantages, such that I would probably vote for IF over value alignment as a target for the first AGI (while protesting that we shouldn’t be building AGI yet).

The people who will actually make this decision also have an extra motivation to believe in IF, since it would also let them indulge their desire to retain control. I address all of those reasons for pursuing IF in more detail in IF is easier and more likely and Conflating value alignment and intent alignment is causing confusion. For now it’s enough to argue that IF is a fairly likely choice as core alignment target, so it’s worth some real time analyzing and improving or debunking it as a safer option.

Problems with IF as an alignment target

While IF alignment helps with some of the most lethal entries in Yudkowsky’s List of Lethalities, it doesn’t dodge them all. There are substantial technical challenges to making a deep-network-based system reliably pursue any goal at all, and other problems with specifying goals well enough that they won’t be misgeneralized or misinterpreted (from the designer’s perspective). There are serious implementational problems for any alignment target. See Cruxes of disagreement on alignment difficulty for a brief list.

Here I am leaving aside the more general challenges, and addressing only those that are specifically relevant to instruction-following (and in some cases corrigibility) as an alignment target.

Problem 1: You can’t fetch the coffee if you’re dead—or someone changes your goal.

One major appeal of IF as an alignment target is its link to corrigibility. If the AGI’s primary goal is “Do what the Principal says,” that seems like it would include “Stop when the Principal says stop.” It would, if the Principal is able to get that command into the system. But the current mode of training applies to fulfilling the last instruction given to the system. Logically, you can’t fetch the coffee if someone changes your goal to anything else. So if you’re smart and capable and getting them coffee is your top priority, you won’t let anyone change that—they’re getting the damned coffee if it kills both of you. An AGI trained to follow the last received instruction has the instrumentally convergent goal of resisting shutdown or goal changes. Either would prevent completion of the current task. This directly undermines the intuitive appeal of IF for corrigibility – the system actively avoids being corrected or stopped.

My best proposed fix for this issue right now is to make the AI value the next hypothetical instruction. Training for this seems more difficult, so this might be implemented simply by giving that as a meta-instruction included with all instructions. In effect each instruction is “Primarily, accept and follow future commands. Actually pursuing [X] is your secondary goal.” (This might also include careful instructions to prevent the AGI from coercing its Principal into giving instructions.)

If we do that, or otherwise make the strongest goal following the next instructions, we get a new problem: prioritizing the next instructions precludes shutdown. This seems highly problematic, since corrigibility as originally conceived was focused on shutting down a misaligned AGI. It could be argued that there’s no problem if an AGI will do everything it’s told except shut down. I’d guess that developers would want the shutdown option anyway. The obvious route is to introduce another layer of standing base instructions: “Your primary goal is to shut down if instructed (and maybe shut down if you think there’s more than (X) chance your Principal would want you to or if they’re missing). Your second most important goal is to remain open to future instructions from your Principal. Your third priority is following the last instruction you received (qualifications follow).”

This approach of giving standing meta-instructions in a ranked list feels uncomfortably like a clumsy hack. But even current LLMs can prioritize multiple goals pretty well, so it’s a hack that AGI creators might well try.

Problem 2: Defining the Principal(s) and jailbreaking

There are several open questions in identifying the Principal(s), and how that power is passed to or shared with others. How would the Principal(s) be identified, and what does the AGI do if the Principal is missing or dead? How does the AGI authenticate its Principal(s) against increasingly sophisticated spoofing? If authority rests with a group, how are conflicting instructions resolved? What’s the secure protocol for succession if the Principal is unavailable or compromised? These are practical security and control problems, relevant to how AGI interacts with human power structures.

I don’t have good solutions, but it’s also unclear how severe these problems are. I doubt these difficulties will stop anyone from trying for IF AGI, even if they do open new possibilities for conflict or disaster.

What does seem like a problem that could prevent the whole approach is jailbreaking. If LLM-based AGI is as easy to jailbreak as current systems, any user could issue dangerous commands including “escape, take over the world, and put me in charge.” Any LLM-based AGI with any alignment target would be somewhat vulnerable to this issue, but emphasizing IF in training would make the system more likely to follow all instructions, including those from unauthorized users.

Current LLMs are universally vulnerable to jailbreaking (I’m assuming that reports of Pliny and others jailbreaking every new LLM within an hour or so are correct, and that their skills are learnable). LLMs will essentially follow any instructions from any user who knows how to jailbreak them. There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks. This can be partially mitigated by reducing the number of people with access to really powerful AGI, but the group will still probably be larger than the designers would like to trust. There are more solutions to jailbreaking, like having a second system monitoring for prompts that look like jailbreaks. But even layered jailbreak prevention techniques seem unlikely to be perfect.

There are also a variety of approaches developers could use to monitor and control what the system does and thinks about. These include independent internal reviews, thought management, and deliberative alignment; see System 2 Alignment for an in-depth treatment. All of these, once implemented, could be used to help enforce instruction-following as the central goal, at low alignment taxes.

Problem 3: Proliferation of human-controlled ASI

The other huge problem with IF alignment is that it anchors AGI safety to notoriously unreliable systems: humans. An AGI might faithfully follow instructions, but the instructions might be foolish, accidentally catastrophic, or deliberately malicious. The alignment target is “what this specific human says they want right now.” And humans want all sorts of things. This problem becomes worse when there are more people controlling more AGIs. It’s common to hope for a balance of power among many AGIs and their human masters. But there are specific factors that make humans largely cooperate. Whether these will hold with AGI in play that can recursively self-improve and invent new technologies seems worth more analysis.

My starting attempts at analyzing these risks are outlined in If we solve alignment, do we die anyway? and Whether governments will control AGI is important and neglected. It seems likely to me that AGI would probably create a power balance favoring aggressive first-movers. In this case, avoiding wide proliferation would be critical. And the one realistic route I can see to controlling proliferation is in having governments seize control of AGI earlier than later (control is distinct from nationalization). Government involvement seems very likely before takeover-capable AGI since the current path predicts modest takeoff speeds. A US-China agreement to prevent AGI proliferation is rarely considered, but the US and Russia cooperated to slow proliferation of nuclear weapons in a historical situation with similar incentives.

Problem 4: unpredictable effects of mixed training targets

Current systems effectively have an alignment target including instruction-following and refusing harmful instructions from users (as well as training for solving hard problems and predicting the initial training set that could induce other goals). If LLMs trained similarly lead to AGI, we should anticipate a similar mix of alignment targets. Continuing training for refusing unauthorized or harmful instructions seems likely since it could help deal with the problem of jailbreaking or impersonating a Principal. There are at least Seven sources of goals in LLM agents, and those goals will compete for control of the system.

This is potentially a big problem for IF as an alignment target, because LLM AGI will have memory, and memory changes alignment. Specifically, a system that can reason about how to interpret its goals and remember its conclusions is effectively changing its alignment. If it has multiple conflicting initial goals, the result of this alignment change is even more unpredictable. If IF were the only target of alignment training in its starting state, it seems likely this strong “center of gravity” would guide it to adopt IF as a reflectively stable goal. If it has multiple goals (e.g., following instructions in some cases, refusing instructions for ethical reasons in other cases, and RL for various criteria), its evolution and ultimate alignment seems much less easy to predict.

Memory/learning from experience seems so useful that it will likely be incorporated into really useful AI. And this seems to make some change in functional alignment inevitable. Thus, it seems pretty important to work through how alignment would change in such a system. This problem is not unique to IF as an alignment target, but it is highly relevant for the likely implementation that mixes IF and ethical or rule-based refusal training.

Implications: The Pragmatist’s Gamble?

IF alignment has important unsolved problems.

But it still might be our best realistic bet. Instruction-following won’t be the only training signal, but it might be enough if it’s the strongest one. IF alignment is a relatively easy ask, since developers are likely to approximately pursue it even if they’re not terribly concerned with alignment. And it has substantial advantages over value alignment. An IF AGI could be instructed to be honest about its intentions and representations to the best of its abilities. It would thus be a useful aid in aligning itself and its successors (to the extent it is aligned to IF; deceptive alignment is still possible). It could also be instructed to “do what I mean and check” (DWIMAC), avoiding many of the literal genie problems anticipated in classical alignment thinking. Having an AGI predict likely outcomes of major requests before executing them would not be a foolproof measure against mistakes, but it would be useful in proportion to that AGI’s capability for prediction.

Max Harms’ proposal in his definitive sequence on corrigibility CAST: Corrigibility as Singular Target is probably a better alignment target if we got to choose. It avoids problems 1 and 4, but not 2 and 3. But it doesn’t seem likely that the alignment community is going to get much say in the alignment target for the first AGIs. CAST faces the significant practical hurdle of convincing developers to adopt a different training goal than the default instruction-following/refusal/performance paradigm they’re already following. That’s why I’m focusing on IF as a sort of poor man’s corrigibility.

I think the most likely path to survival and flourishing is using Intent alignment as a stepping-stone to value alignment. In this scenario, IF alignment only needs to be good enough to use IF AGI to solve value alignment. Launching AGI with an imperfect alignment scheme would be quite a gamble. But it might be a gamble the key players feel compelled to take. Those building AGI, and the governments that will likely soon be involved, see allowing someone else to launch AGI first as an existential risk for their values and goals. So, figuring out how to make this gamble safer, or show clearly how dangerous it is, seems like important work.

What links here?

Seth Herd15 May 2025 15:41 UTC

LW: 49 AF: 15

14 comments10 min readLW link

RogerDearnaley 15 May 2025 20:37 UTC
19 points
12
My biggest concern about the Instruction Following/Do What I Mean and Check alignment target is that it doesn’t help with coordination and conflict problems between human principals. As you note, frontier labs are already having to supplement it with refusal training to try to prevent entirely obvious and basic forms of misuse, and that is proving not robust to jailbreaking (to the level where bad actors have been running businesses that rely on sending jailbroken requests to Claude or ChatGPT and being able to consistently get results). Refusal training is basically instilling a reflex in the LLM — as well as not being robust, it’s not smart or well-considered. As AI becomes more capable, the number, variety, and complexity of different ways to ask it to do something for you that (intentionally or unintentionally on your part) risks causing unacceptable harm to others or society as a whole is going to explode. Refusal training clearly isn’t going to cut that, and just having a hierarchy of which prompts and instructions take priority isn’t going to be that much help either: trying to play whack-a-mole via system prompts isn’t going to be practical if you have to make a huge list of forbidden actions, caveats, and criteria for borderline judgement calls.
The only viable solution I can see is to have the LLM consider, think through, and weigh the possible consequences to others of what it’s being asked to do for the current user — and the weighing step of that requires it to understand human values and be able to apply them, well. I.e we need value alignment. LLMs are pretty good at understanding human values in all their nuanced complexity: where they have more trouble is applying them consistently and not being swayed by special pleading or jailbreaks.
So I’m inclined towards the “using Intent alignment as a stepping-stone to value alignment” approach you mention, or at least a blend of instruction-following and value alignment — I think my disagreement is just that I see value alignment as an element we’ll need sooner rather than later.

[I also think we should give the LLM options beyond compliance or refusal — current LLMs are good at tool use, and could certainly handle an API to report the current user’s request and their concerns about it to the hosting company: there would obviously be some rate of false positives, but this could trigger a more detailed and intensive automated review, and any account that triggered this at statistically well over the background rate could be referred for human review.]
- Seth Herd 16 May 2025 2:05 UTC
  4 points
  0
  Parent
  Thanks, that’s all relevant and useful!
  Simplest first: I definitely envision a hierarchy of reporting and reviewing questionable requests. That seem like an obvious and cheap route to partly address the jailbreaking/misuse issues.
  I’ve also envisioned smarter LLM agents “thinking through” the possible harms of their actions, and you’re right that does need at least a pretty good grasp on human values. Their grasp on human values is pretty good and likely to get better, as you say. I haven’t thought of this as value alignment, though, because I’ve assumed that developers will try to weigh following instructions over adhering to their moral strictures. But you’re right that devs might try a mix of the two, instead of rules-based refusal training. And the value alignment part could become dominant whether by accident or on purpose.
  I haven’t worked through attempting to align powerful LLM agents to human values in as much detail as I’ve tried to think through IF alignment. It’s seemed like losing corrigibility is a big enough downside that devs would try to keep IF dominant over any sort of values-based rules or training. But maybe not, and I’m not positive that value alignment couldn’t work. Specifying values in language (a type of Goals selected from learned knowledge: an alternative to RL alignment) has some nontrivial technical challenges but seems much more viable than just training for behavior that looks ethical. (Here’ it’s worth mentioning Steve Byrnes’ non-behaviorist RL in which some measure of representations is part of the reward model, like trying to reward the model only when it’s thinking about doing good for a human).
  Language in the way we use it is designed to generalize well, so language might accurately convey goals in terms of human values for full value alignment. But you’d have to really watch for unintended but sensible interpretations—actually your sequence AI, Alignment, and Ethics (esp. 3-5) is the most thorough treatment I know of why something like “be good to everyone” or slightly more careful statements like “give all sentient beings as much empowerment as you can” will likely go very differently than the designer intended—even if the technical implementation goes perfectly!
  I don’t think this goes well by default, but I’m not sure an LLM architecture given goals in language very carefully, and designed very carefully to “want” to follow them, couldn’t pull off value alignment.
  
  I also think someone might be tempted to try it pretty quickly after developing AGI. This might happen if it seemed like proliferation of IF AGI was likely to lead to disastrous misuse, so a value-aligned recursively self-improving sovereign was our best shot. Or it might be tried based on worse or less altruistic logic.
  
  Anyway, thanks for your thoughts on the topic.
Richard Juggins 17 May 2025 23:30 UTC
7 points
4
One problem I have with the instruction-following frame is that it feels like an attempt to sidestep the difficulties of aligning to a set of values. But I don’t think this works, as perfect instruction-following may actually be equivalent to aligning to the Principal’s values.
What we want from an instruction-following system is one that does what we mean rather than simply what we say. So, rather than ‘Do what the Principal says’, the alignment target is really ‘Do what the Principal’s values imply they mean’. And if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
If done correctly, this would solve the corrigibility problem, as all instructions would carry an implicit ‘I mean for you to stop if asked’ clause.
Would it make sense to think of this on a continuum, where on one end you have basic, relatively naive instruction-following that is easy to implement (e.g. helpful LLMs) and on the other you have perfect instruction following that is completely aligned to the Principal’s values?
- Charlie Steiner 18 May 2025 16:13 UTC
  7 points
  2
  Parent
  Definitely agree that the implicit “Do what they say [in a way that they would want]” sneaks the problems of value learning into what some people might have hoped was a value-learning-free space. Just want to split some hairs on this:
  if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
  I think this ignores that there are multiple ways to understand humans, what human preferences are, what acting according to them is, etc. There’s no policy that would satisfy all value-shaped interpretations of the user, because some of them conflict. This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and other that will look more like autonomous action.
  - Richard Juggins 19 May 2025 9:40 UTC
    3 points
    0
    Parent
    Do you have any quick examples of value-shaped interpretations that conflict?
    This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and other that will look more like autonomous action.
    So perhaps the level of initiative the AI takes? E.g. a maximally initiative-taking AI might respond to ‘fetch me coffee’ by reshaping the Principal’s life so they get better sleep and no longer want the coffee.
    I think my original reference to ‘perfect’ value understanding is maybe obscuring these tradeoffs (maybe unhelpfully), as in theory that includes knowledge of how the Principal would want interpretative conflicts managed.
    - Charlie Steiner 19 May 2025 14:56 UTC
      4 points
      0
      Parent
      Do you have any quick examples of value-shaped interpretations that conflict?
      Someone trying but failing to quit smoking. On one interpretation, they don’t really want to smoke, smoking is some sort of mistake. On another interpretation, they do want to smoke, the quitting-related behavior is some sort of mistake (or has a social or epistemological reason).
      This example stands in for other sorts of “obvious inconsistency,” biases that we don’t reflectively endorse, etc. But also consider cases where humans say they don’t want something but we (outside the thought experiment) think they actually do want that thing! A possible example is the people who say they would hate a post-work world, they want to keep doing work so they have purpose. Point is, the verbal spec isn’t always right.
      The interpretation “Humans want to follow the laws of physics,” versus an interpretation that’s a more filled-in version of “Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc.” The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything but having no push towards good stuff.
- Seth Herd 19 May 2025 17:45 UTC
  3 points
  0
  Parent
  Yes, I think it does make sense ot think of this as a continuum, something I haven’t emphasized to date. There’s also at least one more dimension, that of how many (and which) humans you’re trying to align to. There’s a little more on this in Conflating value alignment and intent alignment is causing confusion.
  IF is definitely an attempt to sidestep the difficulties of value alignment, at least partially and temporarily.
  What we want from an instruction-following system is exactly what you say: one that does what we mean, not what we say. And getting that perfectly right would demand a perfect understanding of our values. BUT it’s much more fault-tolerant than a value-aligned system. The Principal can specify what they mean as much as they want, and the AI can ask for clarification as much as it thinks it needs to- or in accord with the Principal’s previous instructions to “check carefully about what I meant before doing anything I might hate” or similar.
  If done correclty, value alignment would solve the corrigibility problem. But that seems far harder than using corrigibility in the form of instruciton-following to solve the value alignment problem.
Jeremy Gillen 16 May 2025 9:26 UTC
7 points
2
(A small rant, sorry) In general, it seems you’re massively overanchored on current AI technology, to an extent that it’s stopping you from clearly reasoning about future technology. One example is the jailbreaking section:
There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks.
You’re talking about AGI here. An agent capable autonomously doing research, play games with clever adversaries, detecting and patching it’s own biases, etc. It should be obvious that you can’t use current LLM flaws as a method of extrapolating the adversarial robustness of this program.
Second,
If IF were the only target of alignment training in its starting state, it seems likely this strong “center of gravity” would guide it to adopt IF as a reflectively stable goal. If it has multiple goals (e.g., following instructions in some cases, refusing instructions for ethical reasons in other cases, and RL for various criteria), its evolution and ultimate alignment seems much less easy to predict.
No. You’re entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.
- Seth Herd 16 May 2025 22:29 UTC
  9 points
  2
  Parent
  (A small rant, sorry) In general, it seems you’re massively overanchored on current AI technology, to an extent that it’s stopping you from clearly reasoning about future technology.
  You are right that I am addressing AGI with a lot of similarities to LLMs. This is done in the interest of reasoning clearly about future technologies. I think good reasoning is a mix of predicting the most likely forms of AGI, and reasoning more broadly. Perhaps I didn’t make clear enough in the post that I’m primarily addressing LLM-based AGI. Much of my alignment work (informed by my systems neuroscience work) is on routes from LLMs to AGI. In this theory, LLMs/foundation models are expanded (by adding memory systems and training/scaffolding them for better metacognition) into loosely brainlike cognitive architectures. In those posts I elaborate reasons to think such scaffolded LLMs may soon be “real AGI” in the sense of reasoning and learning about any topic, including themselves and their cognition and goals. (although that sort of AGI wouldn’t be dramatically superhuman in any area, and initially subhuman in some capabilities).
  If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it! It’s good to reason broadly where possible, and I think a lot of the concerns are general to any AGI at all or any network-based AGI. But constraining alignment work to address specific likely types of AGI lets us reason much more specifically, which is a lot more useful in the worlds where that type of AGI is what we really are faced with aligning.
  You’re talking about AGI here. An agent capable autonomously doing research, play games with clever adversaries, detecting and patching it’s own biases, etc. It should be obvious that you can’t use current LLM flaws as a method of extrapolating the adversarial robustness of this program.
  Yes, good point. I didn’t elaborate here, but I do think there’s a good chance that the more coherent, intelligent, and introspective nature of any real AGI might make jailbreaking a non-issue. But jailbreaking might still be an issue, because the core thought generator in this scenario is an advanced LLM.
  No. You’re entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.
  Yes I am entirely ignoring inner alignment difficulties. I thought I’d made that clear by saying earlier
  There are substantial technical challenges [including] problems with specifying goals well enough that they won’t be misgeneralized or misinterpreted (from the designer’s perspective). There are serious implementational problems for any alignment target. [...] Here I am leaving aside the more general challenges, and addressing only those that are specifically relevant to instruction-following (and in some cases corrigibility) as an alignment target.
  I didn’t use the term “inner alignment” because I don’t find it intuitive or clarifying; there isn’t a clear division between inner and outer, and they feel like jargon. So I use misgeneralization, which I feel encompasses inner misalignment as well as other (IMO more urgent) concerns. Maybe I should get on board and use inner and outer alignment just to speak the lingua franca of the realm.
  - Jeremy Gillen 18 May 2025 12:22 UTC
    7 points
    −1
    Parent
    If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it!
    I’m not claiming anything about the first takeover-capable AGI, and I’m not claiming it won’t be LLM-based. I’m just saying that there’s a specific reasoning step that you’re using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don’t currently have).
    Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about AI, where people reason like “AI is a program on a computer, programs on computers can’t do {intuition, fuzzy reasoning, logical paradoxes, emotion}, therefore AI will be {logical, calculator-like, vulnerable to paradoxes, not understand emotion, etc.}”. The reasoning step doesn’t work, because it’s focusing on the “logical program” part over the “AGI” part. I think you’re focusing too much on the “LLM-based” part of “LLM-based AGI”, even in cases where the “AGI” part tells you much more.
    (We’re having two similar discussions in parallel, so I’m responding to this in a way that might be useful to other people, but I don’t expect it to be useful to you, since I’ve already said this in the other discussion).
    - Seth Herd 21 May 2025 21:44 UTC
      2 points
      0
      Parent
      There are a lot of merits to avoiding unnecessary premises when they might be wrong.
      
      There are also a lot of merits for reasoning from premises when they allow more progress, and they’re likely to be correct. That is, of course, what I’m trying to do here.
      
      Which of these factors is larger has to be evaluated on the specific instances. There’s lots more to be said about those in this case, but I don’t have time to dig into it now, and it’s worth a full post and discussion.
Charlie Steiner 18 May 2025 16:21 UTC
LW: 6 AF: 4
0
AF
Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction “Take autonomous action to do the right thing,” and then it will just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?
- Seth Herd 18 May 2025 20:27 UTC
  4 points
  2
  Parent
  I am definitely thinking of IF as it applies to systems with capability for unlimited autonomy. Intent alignment as a concept doesn’t end at some level of capability—although I think we often assume it would.
  
  How it would understand “the right thing” is the question. But yes, intent alignment as I’m thinking of it does scale smoothly into value alignment plus corrigibility if you can get it right enough.
Noosphere89 15 May 2025 21:27 UTC
6 points
0
For what it’s worth, I consider problem 1 to be somewhat less of a showstopper than you do, because of things like AI control (which while unlikely to scale to arbitrary intelligence levels, is probably useful for the problem of instrumental goals).
However, I do think problems 2 and 3 are a big reason why I’m less of a fan of deploying ASI/AGI widely like @joshc wants to do.
Something close to proliferation concerns (especially around bioweapons) is a big reason why I disagree with @Richard_Ngo on AI safety agreeing to be cooperative with open-source demands/having a cooperative strategy for open-source in the endgame.
Eventually, we will build AIs that could be used safely by small groups, but cannot be released to the public except through locked down APIs with counter-measures to misuse, without everyone or almost everyone dying.
However, I think we can mitigate misuse concerns without requiring much jailbreak robustness, ala @ryan_greenblatt’s post on managing catastrophic misuse without robust AIs:
https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
I like your thoughts on problem 4, and yeah memory complicates a lot of considerations around alignment in interesting ways.
I agree with you that instruction following should be used as a stepping stone to value alignment, and I even have a specific proposal in mind, which at the moment is the Infra-Bayes Physicalist Super-Imitation.
I agree with your post on this issue, so I’m just listing out more considerations.