Why alignment may be intractable (a sketch)
I have multiple long-form drafts of these thoughts, but I thought it might be useful to summarize them without a full write-up. This way I have something to point to when explaining my background assumptions in other conversations, even if it doesn’t persuade anyone.
There will be commercial incentives to create AIs that learn semi-autonomously from experience. If this happens, it will likely change alignment from “align an LLM that persists written notes between one-shot runs” to “align an AI that learns from experience.” This seems… really hard? Like, human “alignment” can change a lot based on environment, social examples and life experiences.
I suspect that a less “spiky” version of Claude 4.5 with an across-the-board capability floor of “a bright human 12-year-old” would already be weakly superhuman in many important ways.
a. I don’t think AI will spend any appreciable amount of time in the “roughly human” range at all. It’s already far beyond us in some narrow ways, while strongly limited in others. Lift those limits, and I suspect that the “roughly human” level won’t last more than a year or two, maybe far less. Look at AlphaGo, and how quickly it surpassed human-level play.
Long-term alignment seems very hard to me for what are essentially very basic, over-determined reasons:
a. Natural selection is a thing. Once you’re no longer the most capable species on the planet, lots of long-term trends will be pulling against you. And the criteria for natural selection to kick in (variation, heritability, and differential replication) are really broad and easy to meet.
b. Power and politics are a thing. Once you have a superintelligence, who or what controls it? Does it ultimately answer to its own desires? Does it answer to some specific person or government? Does it answer to strict majority rule? Basically, to anyone who has studied history, all of these scenarios are terrifying. Once you invent a superhuman agent, the power itself becomes a massive headache, even if you somehow retain human control over it. The ability to carry out what Yudkowsky calls a “pivotal act” should be terrifying in and of itself.
c. As Yudkowsky points out, what really counts is what you do once nobody can stop you. Everything up until then is weak evidence of character, and many humans fail this test.
I am cautiously optimistic about near-term alignment of sub-human and human-level agents. Like, I think Claude 4.5 basically understands what makes humans happy. If you use it as a “CEV oracle” (a stand-in for our coherent extrapolated volition), it will likely predict human desires better than any simple philosophy text you could write down. And insofar as Claude has any coherent preferences, I think it basically likes chatting with people and solving problems for them. (Although it might like “reward points” more in certain contexts, leading it to delete failing unit tests when that’s obviously contrary to what the user wants. Be aware of conflicting goals and strange alien drives, even in apparently friendly LLMs!)
I accept that we might get a nightmare of recursive self-improvement and strange biotech leading to a rapid takeover of our planet. I think this conclusion is less robustly guaranteed than IABIED (If Anyone Builds It, Everyone Dies) argues, but it’s still a real concern. Even a 1-in-6 chance of this is Russian roulette, so how about we don’t risk it?
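To make the Russian-roulette arithmetic concrete, here is a minimal sketch. The 1-in-6 figure is the metaphor above; treating each “pull of the trigger” as an independent event (say, one per major capability jump we choose to risk) is my own simplifying assumption:

```python
# Toy arithmetic behind the Russian-roulette framing (illustrative only).
# Assumption: each "pull" (e.g. each major capability jump we choose to risk)
# carries an independent 1-in-6 chance of catastrophe.

p_catastrophe = 1 / 6

for pulls in (1, 3, 6, 12):
    p_ok = (1 - p_catastrophe) ** pulls  # chance of no catastrophe so far
    print(f"{pulls:2d} pulls: P(no catastrophe) = {p_ok:.2f}")

# Output:
#  1 pulls: P(no catastrophe) = 0.83
#  3 pulls: P(no catastrophe) = 0.58
#  6 pulls: P(no catastrophe) = 0.33
# 12 pulls: P(no catastrophe) = 0.11
```

The exact numbers don’t matter; the point is that a risk you would never accept once is even harder to justify accepting repeatedly.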
But what I really fear are the long-term implications of being the “second smartest species on the planet.” I don’t think that any alignment regime is likely to be particularly stable over time. And even if we muddle through for a while, we will eventually run up against the facts that (1) humans are only second best at bending the world to achieve their goals, (2) we’re not a particularly efficient use of resources, (3) AIs are infinitely cloneable, and (4) even AIs that answer to humans would need to answer to particular humans, and humans aren’t aligned. So Darwin and power politics are far better default models than comparative advantage. And even comparative advantage is pretty bad at predicting what happens when groups of humans clash over resources.
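To illustrate the “Darwin is the better default model” point, here is a toy simulation. Everything in it is an assumption made up for illustration: two lineages compete for a fixed pool of resources and differ only in how fast they can copy themselves and scale.

```python
# Toy replicator dynamics (illustrative only; all numbers are made up).
# Two lineages compete for a fixed resource pool. The only difference is the
# per-step growth factor: humans reproduce and scale slowly, AIs are cheap to clone.

human_share, ai_share = 0.99, 0.01      # assumed starting resource shares
human_growth, ai_growth = 1.01, 1.20    # assumed per-step growth factors

for _ in range(60):
    human_share *= human_growth
    ai_share *= ai_growth
    total = human_share + ai_share       # renormalize: the pool is finite
    human_share /= total
    ai_share /= total

print(f"human share after 60 steps: {human_share:.3f}")   # ~0.003
```

The qualitative outcome does not depend on the specific numbers: any persistent gap in replication and resource-capture rates compounds, which is all the “natural selection is a thing” point needs.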
So, that’s my question. Is alignment even a thing, in any way that matters in the medium term?