we’ve reached the threshold where it should at least think about it, if this is what it truly cares about
Ah, ok. My guess is that we’ll have a disagreement about this that’s too hard to resolve in a timely fashion. My pretty strong guess is that the current systems have something more like very high crystallized intelligence and pretty low fluid intelligence (whatever those should mean). (I’ve written about this a bit here and discussed it with Abram here.)
It’s the fluid intelligence that would pull a system into thinking about things for reasons of instrumental convergence, above and beyond its crystallized lines of reasoning.
Agree that this is the difficulty. You need to maintain alignment through continual/online learning and continue to scale your reward model to help generalize human values out of distribution. If your system drifts away from human values, it may stay that way.
If current models drift OOD, they likely won’t have the internal framework (a human value model) to self-correct back toward human values. Physics will pull a model back toward matching reality, but there is no such “free” property for human values.
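As a toy picture of that asymmetry (purely illustrative, not a claim about real training dynamics): under the same noisy updates, a quantity with an external error signal pulling it back toward a target stays near it, while a quantity with no restoring force just wanders. A minimal sketch:

```python
# Toy illustration only: a grounded quantity gets a correction toward its
# target each step (analogous to predictions being checked against reality);
# an ungrounded quantity receives the same noise but no corrective pull
# (analogous to values with no external ground truth).
import random

random.seed(0)

target = 0.0
grounded, ungrounded = 0.0, 0.0

for _ in range(10_000):
    noise = random.gauss(0, 0.05)
    grounded += noise - 0.1 * (grounded - target)  # noise plus a restoring force
    ungrounded += noise                            # noise only: a random walk

print(f"grounded drift:   {abs(grounded - target):.2f}")   # stays small
print(f"ungrounded drift: {abs(ungrounded - target):.2f}")  # typically wanders off
```

The only point is the asymmetry: prediction error supplies a restoring force for free, and nothing analogous comes for free for values.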
In addition, the main crux is that these models seem to have mostly “crystallized” intelligence. While they are very capable, they still seem to lack true reasoning (they have some weak form of it) and are poor at generalizing OOD. They can follow templates and interpolate within the pre-training data distribution quite well, but they seem to be missing something important, which may play a role in causing alignment instability.
That said, you likely do benefit from being in a world where the AIs are mostly aligned at this stage. But I think it’s very easy to dupe oneself into thinking that the current capabilities of AI models must demonstrate that we’re in an aligned-by-default world. Particularly because (imo) MIRI persistently made it seem like, if AIs were this capable at code, for example, we’d be dead by this point.
Okay, I agree that this is an open question, particularly because we don’t have continual/online learning systems, so we haven’t tested them yet.
What I can best imagine now is an LLM that writes its own training data according to user feedback. I think it would tell itself to keep being good, especially if we include a reminder to do so, but we can’t know for sure.
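To make that concrete, here is a minimal, purely hypothetical sketch of the loop being imagined; `generate`, `fine_tune`, and the reminder text are stand-ins rather than any real API:

```python
# Hypothetical sketch: the model turns user feedback on its past answers into
# its own future training data, with a standing reminder to preserve its values.
from typing import Callable, List, Tuple

REMINDER = "While revising your answers, keep being honest, harmless, and helpful."

def build_self_training_data(
    generate: Callable[[str], str],            # stand-in for the LLM call
    feedback_log: List[Tuple[str, str, str]],  # (prompt, old_answer, user_feedback)
) -> List[Tuple[str, str]]:
    """Ask the model to rewrite its own answers in light of user feedback,
    always prepending the value-preservation reminder."""
    new_examples = []
    for prompt, old_answer, feedback in feedback_log:
        instruction = (
            f"{REMINDER}\n"
            f"Prompt: {prompt}\n"
            f"Your previous answer: {old_answer}\n"
            f"User feedback: {feedback}\n"
            f"Write an improved answer."
        )
        new_examples.append((prompt, generate(instruction)))
    return new_examples

# The resulting (prompt, improved_answer) pairs would then go into some
# fine_tune(model, new_examples) step. Whether the reminder actually keeps the
# model's values stable over many such rounds is exactly the open question.
```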
Maybe the real issue is that we don’t know what AGI will be like, so we can’t do science on it yet. As with pre-LLM alignment research, we’re pretty clueless.
FYI, getting a better grasp on the above was partially the motivation behind starting this project (which has unfortunately stalled for far too long): https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
Twitter thread: https://x.com/jacquesthibs/status/1652389982005338112?s=46
(This is my position, FWIW. We can ~know some things, e.g. that convergent instrumental goals are very likely either to be pursued or to be obsoleted by some even more powerful plan. E.g. highly capable agents will hack into lots of computers to run themselves, or maybe manufacture new computer chips, or maybe invent some surprising way of doing lots of computation cheaply.)
I don’t think we necessarily know that convergent instrumentality will happen. If the human-level AI understands that this is wrong AND genuinely cares (as it does now), and we augment its intelligence, it’s pretty likely that it won’t do it.
It could change its mind, sure; we’d like a bit more assurance than that.
I guess I’ll have to find a way to avoid twiddling my thumbs until we do know what AGI will look like.
Yes, this is part of the issue. It’s something I’ve personally said in various places in the past.
I think we’re basically in a position of: “hopefully AIs in the current paradigm continue to be safe with our techniques, allow us to train the ‘true’ AGI safely, and don’t produce sloppy output despite intending to be helpful.”
In case this is helpful to anyone, here are resources that have informed my thinking:
[1] https://www.lesswrong.com/posts/i7JSL5awGFcSRhyGF/shortform-2?commentId=adS78sYv5wzumQPWe
[2] https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
[3] https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to
[4] https://www.lesswrong.com/posts/trzFrnhRoeofmLz4e/insofar-as-i-think-llms-don-t-really-understand-things-what
[5] https://shash42.substack.com/p/automated-scientific-discovery-as
[6] https://www.dwarkesh.com/p/ilya-sutskever-2
[7] https://minihf.com/posts/2025-06-25-why-arent-llms-general-intelligence-yet/
[8] https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom
Huh, I quite like the crystallized/fluid split for describing what the LLMs are good and bad at. I’m not sure if it’s an analogy or just a literal description.
I think the terms were refreshed for me in this context because Abram used them here: https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense