(A small rant, sorry) In general, it seems you’re massively overanchored on current AI technology, to an extent that it’s stopping you from clearly reasoning about future technology.
You are right that I am addressing AGI with a lot of similarities to LLMs. This is done in the interest of reasoning clearly about future technologies. I think good reasoning is a mix of predicting the most likely forms of AGI and reasoning more broadly. Perhaps I didn’t make clear enough in the post that I’m primarily addressing LLM-based AGI. Much of my alignment work (informed by my systems neuroscience work) is on routes from LLMs to AGI. In this theory, LLMs/foundation models are expanded (by adding memory systems and training/scaffolding them for better metacognition) into loosely brainlike cognitive architectures. In the posts describing that work, I elaborate reasons to think such scaffolded LLMs may soon be “real AGI” in the sense of reasoning and learning about any topic, including themselves and their own cognition and goals (although that sort of AGI wouldn’t be dramatically superhuman in any area, and would initially be subhuman in some capabilities).
If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it! It’s good to reason broadly where possible, and I think a lot of the concerns are general to any AGI at all, or to any network-based AGI. But constraining alignment work to address specific likely types of AGI lets us reason much more specifically, which is a lot more useful in the worlds where that type of AGI is what we’re actually faced with aligning.
You’re talking about AGI here. An agent capable of autonomously doing research, playing games with clever adversaries, detecting and patching its own biases, etc. It should be obvious that you can’t use current LLM flaws as a method of extrapolating the adversarial robustness of this program.
Yes, good point. I didn’t elaborate here, but I do think there’s a good chance that the more coherent, intelligent, and introspective nature of any real AGI might make jailbreaking a non-issue. But it might still be an issue, because the core thought generator in this scenario is an advanced LLM.
No. You’re entirely ignoring inner alignment difficulties. The main difficulties. There are several degrees of freedom in goal specification that a pure goal target fails to nail down. These lead to an unpredictable reflective equilibrium.
Yes, I am entirely ignoring inner alignment difficulties. I thought I’d made that clear by saying earlier:
There are substantial technical challenges [including] problems with specifying goals well enough that they won’t be misgeneralized or misinterpreted (from the designer’s perspective). There are serious implementational problems for any alignment target. [...] Here I am leaving aside the more general challenges, and addressing only those that are specifically relevant to instruction-following (and in some cases corrigibility) as an alignment target.
I didn’t use the term “inner alignment” because I don’t find it intuitive or clarifying; there isn’t a clear division between inner and outer, and they feel like jargon. So I use misgeneralization, which I feel encompasses inner misalignment as well as other (IMO more urgent) concerns. Maybe I should get on board and use inner and outer alignment just to speak the lingua franca of the realm.
If you have an alternate theory of the likely form of first takeover-capable AGI, I’d love to hear it!
I’m not claiming anything about the first takeover-capable AGI, and I’m not claiming it won’t be LLM-based. I’m just saying that there’s a specific reasoning step that you’re using a lot (current tech has property X, therefore AGI has property almost-X) which I think is invalid (when X is entangled with properties of AGI that LLMs don’t currently have).
Maybe a slightly insulting analogy (sorry): That type of reasoning looks a lot like bad scifi ideas about AI, where people reason like “AI is a program on a computer, programs on computers can’t do {intuition, fuzzy reasoning, logical paradoxes, emotion}, therefore AI will be {logical, calculator-like, vulnerable to paradoxes, unable to understand emotion, etc.}”. The reasoning step doesn’t work, because it’s focusing on the “logical program” part over the “AGI” part. I think you’re focusing too much on the “LLM-based” part of “LLM-based AGI”, even in cases where the “AGI” part tells you much more.
(We’re having two similar discussions in parallel, so I’m responding to this in a way that might be useful to other people, but I don’t expect it to be useful to you, since I’ve already said this in the other discussion).
There are a lot of merits to avoiding unnecessary premises when they might be wrong.
There are also a lot of merits to reasoning from premises when they allow more progress and are likely to be correct. That is, of course, what I’m trying to do here.
Which of these factors is larger has to be evaluated in specific instances. There’s a lot more to be said about that in this case, but I don’t have time to dig into it now; it’s worth a full post and discussion.