I think AI assistants would have common sense even if they are obedient. I doubt an AI assistant would interpret “go fetch me coffee” as “kill me first so I can’t interrupt your task and then fetch me coffee,” but YMMV.
I don’t think it’s safe to assume that LLM-based AGI will have common sense (maybe this is different from the assistant you’re addressing, in that it’s a lot smarter and can think for itself more?). I’m talking about machines based on neural networks, but which can also reason in depth. They will understand common sense, but that doesn’t ensure that common sense will guide their reasoning.
So it depends on what you mean by “obedient”. And how you trained them to be obedient. And whether that ensures that their interpretation doesn’t change once they can reason more deeply than we can.
So I think those questions require serious thought, but you can’t tackle everything at once, so starting by assuming that obedience works is also sensible. I’m just focusing on that first step, because I don’t think it’s likely to work unless we put a lot more careful thought in before we try it.
If you look at my previous posts on the topic, linked from Problems with instruction-following, you’ll see that I was initially more focused on downstream concerns like yours. After working in the field for longer, and after more in-depth discussion and research on the techniques we’d use to align future LLM agents, I’m increasingly concerned that it’s not that easy, and that we should focus on the difficulty of aligning AGI while we still have time. The difficulties posed even by aligned AGI are also substantial, and I’ve addressed those in the series of posts backlinked from Whether governments will control AGI is important and neglected.
I am also drawn to the idea that government control of AGI is quite dangerous; I’ve addressed this tension in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours.
But on the whole, I think the risks of widely distributed AGI are even greater. How many people can control a mind that can create technologies capable of taking over or destroying the world before someone uses them?
Michael Nielsen’s excellent ASI existential risk: Reconsidering Alignment as a Goal is a similar analysis of why even obedient aligned AGI would be intensely dangerous if it’s allowed to proliferate.