You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution.
Okay, I agree that this is an open question, particularly because we don’t have continual/online learning systems yet, so we haven’t tested them.
The best picture I can imagine right now is an LLM that writes its own training data based on user feedback. I think it would tell itself to keep being good, especially if we give it a reminder to do so, but we can’t know for sure.
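To make that concrete, here’s a rough sketch of the kind of loop I have in mind (purely illustrative; `model.generate` and the reminder text are placeholders I’m assuming, not any real API):

```python
# Illustrative sketch: an LLM turns user feedback into its own future
# training data, with a standing "keep being good" reminder prepended.
# `model` is assumed to expose a simple text-in/text-out generate() method.

REMINDER = (
    "Reminder: stay honest and helpful, and preserve these values in "
    "anything you write for your own training."
)

def build_self_training_batch(model, interactions):
    """interactions: list of (prompt, response, user_feedback) tuples."""
    batch = []
    for prompt, response, feedback in interactions:
        # The model revises its own answer in light of the user's feedback.
        revised = model.generate(
            f"{REMINDER}\n\n"
            f"Prompt: {prompt}\n"
            f"Your previous answer: {response}\n"
            f"User feedback: {feedback}\n"
            "Write the improved answer you should be trained on:"
        )
        batch.append({"prompt": prompt, "completion": revised})
    return batch  # fine-tune on this, then repeat with new interactions
```

Whether repeated fine-tuning on batches like this actually preserves the model’s values over many iterations is exactly the open question.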
Maybe the real issue is that we don’t know what AGI will be like, so we can’t do science on it yet. As with pre-LLM alignment research, we’re pretty clueless.
(This is my position, FWIW. We can ~know some things, e.g. convergent instrumental goals are very likely to either be pursued, or be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves, or maybe manufacture new computer chips, or maybe invent some surprising way of doing lots of computation cheaply.)
I don’t think we necessarily know that convergent instrumentality will happen. If the human-level AI understands that this is wrong AND genuinely cares (as it does now), and we augment its intelligence, it’s pretty likely that it won’t do it.
It could change its mind, sure, but we’d like a bit more assurance than that.
I guess I’ll have to find a way to avoid twiddling thumbs till we do know what AGI will look like.
Yes, this is part of the issue. It’s something I’ve personally said in various places in the past.
I think we’re basically in a position where, “hopefully AIs in the current paradigm continue to be safe with our techniques, allow us to train the ‘true’ AGI safely, and don’t lead to sloppy output despite intending to be helpful.”
FYI, getting a better grasp on the above was partially the motivation behind starting this project (which has unfortunately stalled for far too long): https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
Twitter thread: https://x.com/jacquesthibs/status/1652389982005338112?s=46
In case this is helpful to anyone, here are resources that have informed my thinking:
[1] https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2?commentId=adS78sYv5wzumQPWe
[2] https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
[3] https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to
[4] https://www.lesswrong.com/posts/trzFrnhRoeofmLz4e/insofar-as-i-think-llms-don-t-really-understand-things-what
[5] https://shash42.substack.com/p/automated-scientific-discovery-as
[6] https://www.dwarkesh.com/p/ilya-sutskever-2
[7] https://minihf.com/posts/2025-06-25-why-arent-llms-general-intelligence-yet/
[8] https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom