Seth Herd comments on Machines of Faithful Obedience

Seth Herd 27 Jun 2025 16:47 UTC
4 points
0
See my recent post Problems with instruction-following as an alignment target
- Boaz Barak 27 Jun 2025 19:18 UTC
  1 point
  0
  Parent
  Thank you! This is quite relevant. Some of your concerns are about the technical feasibility of achieving instruction following, which is not a point I’m going into in this post. FWIW when I say “instruction following” I mean models that respect the instruction hierarchy and chain of command (for example as in our model spec). However I do not mean models that obey hypothetical future instructions that were not given to them (which is one option you mention in your post).
  - Seth Herd 28 Jun 2025 0:18 UTC
    2 points
    0
    Parent
    If they don’t obey future instructions not yet given, the only sensible thing to do to carry out your current instructions thoroughly and with certainty is to make sure you can’t issue new instructions. That would logically prevent them from completing their instructions and represent utter failure in their goal.
    - Boaz Barak 28 Jun 2025 18:52 UTC
      1 point
      0
      Parent
      I think AI assistants would have common sense even if they are obedient. I doubt an AI assistants would interpret “go fetch me coffee” as “kill me first so I can’t interrupt your task and then fetch me coffee” but YMMV
      - Seth Herd 28 Jun 2025 21:56 UTC
        2 points
        0
        Parent
        I don’t think it’s safe to assume that LLM-based AGI will have common sense (maybe this is different from the assistant you’re addressing, in that it’s a lot smarter and can think for itself more?). I’m talking about machines based on networks, but which can also reason in depth. They will understand common sense, but that doesn’t ensure that it will guide their reasoning.
        
        So it depends what you mean by “obedient”. And how you trained them to be obedient. And whether that ensures that their interpretation doesn’t change once they can reason more deeply than we can.
        
        So I think those questions require serious thought, but you can’t tackle it all at once, so starting by assuming that all works is also sensible. I’m just focusing on that first, because I don’t think that’s likely to work unless we put a lot more careful thought in before we try it.
        
        If you look at my previous posts on the topic, linked from Problems with instruction-following, you’ll see that I was initially more focused on downstream concerns like yours. After working in the field for longer and having more in-depth discussions and research on the techniques we’d use to align future LLM agents, I am increasingly concerned that it’s not that easy, and we should focus on the difficulty of aligning AGI while we still have time. Difficulties from aligned AGI are also substantial, and I’ve addressed those as well in my string of work backlinked from Whether governments will control AGI is important and neglected
        
        I am also drawn to the idea that government control of AGI is quite dangerous; I’ve addressed this tension in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
        
        But on the whole, I think the risks of widely distributed AGI are even greater. How many people can control a mind that can create technologies capable of taking over or destroying the world before someone uses them?
        
        Michael Nielsen’s excellent ASI existential risk: Reconsidering Alignment as a Goal is a similar analysis of why even obedient aligned AGI would be intensely dangerous if it’s allowed to proliferate.