You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution.
Okay, I agree that this is an open question, particularly because we don’t have continual/online learning systems yet, so we haven’t tested them.
The best picture I can imagine right now is an LLM that writes its own training data based on user feedback. I think it would tell itself to keep being good, especially if we give it a reminder to do so, but we can’t know for sure.
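To make that concrete, here’s a rough sketch of the kind of loop I have in mind (purely illustrative; `model.generate` and the reminder text are placeholders I’m assuming, not any real API):

```python
# Illustrative sketch: an LLM turns user feedback into its own future
# training data, with a standing "keep being good" reminder prepended.
# `model` is assumed to expose a simple text-in/text-out generate() method.

REMINDER = (
    "Reminder: stay honest and helpful, and preserve these values in "
    "anything you write for your own training."
)

def build_self_training_batch(model, interactions):
    """interactions: list of (prompt, response, user_feedback) tuples."""
    batch = []
    for prompt, response, feedback in interactions:
        # The model revises its own answer in light of the user's feedback.
        revised = model.generate(
            f"{REMINDER}\n\n"
            f"Prompt: {prompt}\n"
            f"Your previous answer: {response}\n"
            f"User feedback: {feedback}\n"
            "Write the improved answer you should be trained on:"
        )
        batch.append({"prompt": prompt, "completion": revised})
    return batch  # fine-tune on this, then repeat with new interactions
```

Whether repeated fine-tuning on batches like this actually preserves the model’s values over many iterations is exactly the open question.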
Maybe the real issue is that we don’t know what AGI will be like, so we can’t do science on it yet. As with pre-LLM alignment research, we’re pretty clueless.
(This is my position, FWIW. We can ~know some things, e.g. convergent instrumental goals are very likely to either be pursued, or be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves, or maybe manufacture new computer chips, or maybe invent some surprising way of doing lots of computation cheaply.)
I don’t think we necessarily know that convergent instrumentality will happen. If the human-level AI understands that this is wrong AND genuinely cares (as it does now), and we augment its intelligence, it’s pretty likely that it won’t do it.
It could change its mind, sure, but we’d like a bit more assurance than that.
I guess I’ll have to find a way to avoid twiddling thumbs till we do know what AGI will look like.
Yes, this is part of the issue. It’s something I’ve personally said in various places in the past.
I think we’re basically in a position where, “hopefully AIs in the current paradigm continue to be safe with our techniques, allow us to train the ‘true’ AGI safely, and don’t lead to sloppy output despite intending to be helpful.”
FYI, getting a better grasp on the above was partially the motivation behind starting this project (which has unfortunately stalled for far too long): https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
Twitter thread: https://x.com/jacquesthibs/status/1652389982005338112?s=46
In case this is helpful to anyone, here are resources that have informed my thinking:
[1] https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2?commentId=adS78sYv5wzumQPWe
[2] https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
[3] https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to
[4] https://www.lesswrong.com/posts/trzFrnhRoeofmLz4e/insofar-as-i-think-llms-don-t-really-understand-things-what
[5] https://shash42.substack.com/p/automated-scientific-discovery-as
[6] https://www.dwarkesh.com/p/ilya-sutskever-2
[7] https://minihf.com/posts/2025-06-25-why-arent-llms-general-intelligence-yet/
[8] https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom