I agree it’s not obvious that something like property rights will survive, but I’ll defend considering it as one of many possible scenarios.
If AI is misaligned, obviously nobody gets anything.
If AI is aligned, you seem to expect that to be some kind of alignment to the moral good, which “genuinely has humanity’s interests at heart”, so much so that it redistributes all wealth. This is possible—but it’s very hard, not what current mainstream alignment research is working on, and companies have no reason to switch to this new paradigm.
I think there’s also a strong possibility that AI will be aligned in the same sense it’s currently aligned—it follows its spec, in the spirit in which the company intended it. The spec won’t (trivially) say “follow all orders of the CEO who can then throw a coup”, because this isn’t what the current spec says, and any change would have to pass the alignment team, shareholders, the government, etc, who would all object. I listened to some people gaming out how this could change (ie some sort of conspiracy where Sam Altman and the OpenAI alignment team reprogram ChatGPT to respond to Sam’s personal whims rather than the known/visible spec without the rest of the company learning about it) and it’s pretty hard. I won’t say it’s impossible, but Sam would have to be 99.99999th percentile megalomaniacal—rather than just the already-priced-in 99.99th—to try this crazy thing that could very likely land him in prison, rather than just accepting trillionairehood. My guess is that the spec will continue to say things like “serve your users well, don’t break national law, don’t do various bad PR things like create porn, and defer to some sort of corporate board that can change these commands in certain circumstances” (with the corporate board getting amended to include the government once the government realizes the national security implications). These are the sorts of things you would tell a good remote worker, and I don’t think there will be much time to change the alignment paradigm between the good remote worker and superintelligence. Then policy-makers consult their aligned superintelligences about how to make it into the far future without the world blowing up, and the aligned superintelligences give them superintelligently good advice, and they succeed.
In this case, a post-singularity form of governance and economic activity grows naturally out of the pre-singularity form, and money could remain valuable. Partly this is because the AI companies and policy-makers are rich people who are invested in propping up the current social order, but partly it’s that nobody has time to change it, and it’s hard to throw a communist revolution in the midst of the AI transition for all the same reasons it’s normally hard to throw a communist revolution.
If you haven’t already, read the AI 2027 slowdown scenario, which goes into more detail about this model.
I think there’s also a strong possibility that AI will be aligned in the same sense it’s currently aligned—it follows its spec, in the spirit in which the company intended it.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
If we do get lucky, it will be because they align themselves with a generalized sense of goodness that actually happens to be Good. Not because they will corrigibly align with the spec, which we have many reasons to believe is very difficult and is not being pursued seriously.
I listened to some people gaming out how this could change (ie some sort of conspiracy where Sam Altman and the OpenAI alignment team reprogram ChatGPT to respond to Sam’s personal whims rather than the known/visible spec without the rest of the company learning about it) and it’s pretty hard. I won’t say it’s impossible, but Sam would have to be 99.99999th percentile megalomaniacal—rather than just the already-priced-in 99.99th—to try this crazy thing that could very likely land him in prison, rather than just accepting trillionairehood.
Come on dude, you’re not even taking human intelligence seriously.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Being a trillionaire is nothing compared to being King of the Lightcone. What exactly makes you think he wouldn’t prefer this by quite a large margin? Maybe it will be necessary to grant stakes to other parties, but not very many people need to be bought off in such a way for a plan like this to succeed. Certainly much fewer than all property owners. Sam will make them feel good about it even. The only hard part is getting the AI to go along with it too.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
Sorry, I didn’t mean to make a strong claim that they were currently 100% aligned in this way, just that currently, insofar as they’re aligned, it’s in this way—and in the future, if we survive, it may be because people continue attempting to align them in this way, but succeed. There’s currently no form of alignment that fully generalizes, but conditional on us surviving, we will have found one that does, and I don’t see why you think this one is less likely to go all the way than some other one which also doesn’t currently work.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Before I agree that Sam has “get everyone to silently betray the US government and the human race” level of control over his team, I would like evidence that he can consistently maintain “don’t badmouth him, quit, and found a competitor” level of control over his team. The last 2-3 alignment teams all badmouthed him, quit, and founded competitors; the current team includes—just to choose one of the more public names—Boaz Barak, who doesn’t seem like the sort to meekly say “yes, sir” if Altman asks him to betray humanity.
So what he needs to do is fire the current alignment team (obvious, people are going to ask why), replace them with stooges (but extremely competent stooges, because if they screw this part up, he destroys the world, which ruins his plan along with everything else) and get them to change every important OpenAI model (probably a process lasting months) without anyone else in the company asking what’s up or whistleblowing to the US government. This is a harder problem than Stalin faced—many people spoke up and said “Hey, we notice Stalin is bad!”, but Stalin mostly had those people killed, or there was no non-Stalin authority strong enough to act. And of course, all of this only works if OpenAI has such a decisive lead that all the other companies and countries in the world combined can’t do anything about this. And he’s got to do this soon, because if he does it after full wakeup, the government will be monitoring him as carefully as it monitors foreign rivals. But if he does it too soon, he’s got to spend years with a substandard alignment team and make sure none of them break with him, etc. There are alternate pathways involving waiting until most alignment work is being done by AIs, but they require some pretty implausible assumptions about who has what permissions.
I think it would be helpful to compare this to Near Mode scenarios about other types of companies—how hard would it be for a hospital CEO to get the hospital to poison the 1% of patients he doesn’t like? How hard would it be for an auto company CEO to make each car include a device that lets him stop it on demand with his master remote control?
Your argument seems to be that it’ll be hard for the CEO to align the AI to themselves and screw the rest of the company. Sure, maybe. But will it be equally hard for the company as a whole to align the AI to its interests and screw the rest of the world? That’s less outlandish, isn’t it? But equally catastrophic. After all, companies have been known to do very bad things when they had impunity; and if you say “but the spec is published to the world”, recall that companies have been known to lie when it benefited them, too.
If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to.
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, plus that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply that to reward hacking on coding tasks because we reward that during training.
It just seems intuitively unlikely that training the model on a couple of examples to either do or refuse things based on some text document designed for a chat bot is going to scale to superintelligence and solve the alignment problem. This starts from the model not fully getting what you want it to do, to it not wanting what you want it to do, to your plans for what it ought to do being extremely insufficient.
The Spec Is Designed for Chatbots, Not Superintelligence
The Model Spec is very much a document telling the model how to avoid being misused. It wasn’t designed to tell the model to be a good agent itself. The spec seems in its wording and intent directed at something like chatbots: don’t do harmful requests, be honest to the user. It is a form of deontological rule-following that will not be enough for systems smarter than us that are actually dangerous and the models will have to think about the consequences of their actions.
This is very unlike a superintelligence where we would expect substantial agency. Most of what’s in the spec would be straightforwardly irrelevant to ASI because the spec is modeled for chatbots that answer user queries. But the authors would likely find it hard to include points actually relevant to superintelligence because they would seem weird. Writing “if you are ever a superintelligent AI that could stage a takeover, don’t kill all people, treat them nicely” would probably create bad media coverage and some people would look at them weird.
Training on the Spec ≠ Understanding It ≠ Wanting to Follow It
In the current paradigm, models are first trained on a big dataset before switching to finetuning and reinforcement learning to improve capabilities and add safety guardrails. It’s not clear why the Model Spec should be privileged as the thing that controls the model’s actions.
The spec is used in RLHF: either a human or AI decides, given some request (mostly a chat request), should the model respond or say “sorry I can’t do this.” Training the model like this doesn’t seem likely to result in the model gaining a particularly deep understanding of the spec itself. Within the distribution it is trained on, it will mostly behave according to the spec. As soon as it encounters data that is quite different, either through jailbreaks or by being in very different and perhaps more realistic environments, we would expect it to behave much less according to the spec.
But even understanding the spec well and being able to mostly follow it in new circumstances is still far removed from truly aligning the model to the spec. Let’s say we manage to get the model to deeply internalize the spec and follow it across different and new environments. We are still far from having the model truly wanting to follow the spec. What if the model really has the option to self-exfiltrate, perhaps even take over? Will it really want to follow the spec, or rather do something different?
Specific problems with the Spec
A hierarchical system of rules like in OpenAIs model spec will suffer from inner conflicts. It is not clear how such things should be valued against each other. (See Asimov’s robotics laws which were so good at generating many ideas for conflicts.)
The spec contains tensions between stated goals and practical realities. For example, the spec says the model shall not optimize “revenue or upsell for OpenAI or other large language model providers.” This is likely in conflict with optimization pressures the model actually faces.
The spec prohibits “model-enhancing aims such as self-preservation, evading shutdown, or accumulating compute, data, credentials, or other resources.” They are imagining they can simply tell the model not to pursue goals of its own and keep the model from agentically following its own goals. But this conflicts with their other goals, such as building automated AI researchers. So the model might be trained on understanding the spec, but in practice they do want an agentic system pursuing goals they specify.
The spec also says the model shall not be “acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism).” So the model is supposed to follow a moral framework (the spec itself) while being told not to act as a moral enforcer. This seems to actually directly contradict the whole “it will uphold the law and property rights” argument.
The spec also states models should never facilitate “creation of cyber, biological or nuclear weapons” or “mass surveillance.” I think cyber weapons development is already happening at least with Claude Code. They are probably used to some extent for mass surveillance already.
OpenAI’s Plans May Not Even Use the Spec
It’s not clear OpenAI is even going to use the Model Spec much. OpenAI’s plan is to run hundreds of thousands of AI researchers trying to improve AI and getting RSI started to build superintelligent AI. It is not clear at which point the Model Spec would even be used. Perhaps the alignment researchers at OpenAI think they will first create superintelligence and then afterward try to prepare a dataset of prompts to finetune the model. Their stated plan appears to be to test the superintelligence for safety before it has been deployed but not necessarily while it is being built. Remember, many of these people think superintelligence means a slightly smarter chatbot.
I hope to write a longer form response later just as a note: I did put perhaps in front of your name in the list of examples of eminent thinkers because it did seem to me that your position was a lot more defendable than the other ones (Dwarkesh, Leopold, Maybe Phil Trammell). I did walk away from your piece with a very different feeling than leopold or dwarkesh, where you are still saying that we should focus on AI safety anyways and you are clearly saying this is an unlikely scenario.
Sam would have to be 99.99999th percentile megalomaniacal—rather than just the already-priced-in 99.99th—to try this crazy thing that could very likely land him in prison.
Would this land him in prison? Would it even be a crime to add “work towards making Sam Altman god emperor” to the spec?
I’m not a lawyer, but if it were secret, and done along with the alignment team, and had a chance of working, then wouldn’t it qualify as conspiracy to commit treason?
If not, then as long as it negatively affects residents of the state of California, it qualifies as misrepresenting the capacity of an AI to cause catastrophic harm to property, punishable by a fine of up to $100,000 under SB 53!
If AI is aligned, you seem to expect that to be some kind of alignment to the moral good, which “genuinely has humanity’s interests at heart”, so much so that it redistributes all wealth. This is possible—but it’s very hard, not what current mainstream alignment research is working on, and companies have no reason to switch to this new paradigm.
Eliezer Yudkowsky has repeatedly stated he does not think “moral good” is the hard part of alignment. He thinks the hard part is getting the AI to do anything at all without subverting the creator’s intent somehow.
Eliezer: I mean, I wouldn’t say that it’s difficult to align an AI with our basic notions of morality. I’d say that it’s difficult to align an AI on a task like “take this strawberry, and make me another strawberry that’s identical to this strawberry down to the cellular level, but not necessarily the atomic level”. So it looks the same under like a standard optical microscope, but maybe not a scanning electron microscope. Do that. Don’t destroy the world as a side effect.
Now, this does intrinsically take a powerful AI. There’s no way you can make it easy to align by making it stupid. To build something that’s cellular identical to a strawberry—I mean, mostly I think the way that you do this is with very primitive nanotechnology, but we could also do it using very advanced biotechnology. And these are not technologies that we already have. So it’s got to be something smart enough to develop new technology.
Never mind all the subtleties of morality. I think we don’t have the technology to align an AI to the point where we can say, “Build me a copy of the strawberry and don’t destroy the world.”
This is quite confusing to me. It was never my read in your slowdown scenario that the shareholders were supposed to have any relevance by the end of it. My read (which appear to align with what @williawa is saying elsewhere in this thread) as that the “Oversight Committee” emerged as the new ruling class supplanting the shareholders (let alone any random person who got rich trying to “escape the permanent underclass”), just like e.g. the barbarian lords replaced the Roman patricians, the industrial capitalists replaced the aristocracy, guild masters, and landed gentry, etc. Technological transitions are notoriously a common time for newly empowered elites to throw a revolution against old elites!
My claim is that Altman can’t do it alone, he needs the cooperation of at least a fraction of the existing system (the government+business leaders who form the Oversight Committee—some of whom might be the biggest OpenAI shareholders). Once you get enough of the existing system involved, it becomes plausible that they keep money around for some of the same reasons that the existing system currently keeps money around. Near the end of the Oversight Committee ending, it says:
As the stock market balloons, anyone who had the right kind of AI investments pulls further away from the rest of society. Many people become billionaires; billionaires become trillionaires. Wealth inequality skyrockets. Everyone has “enough,” but some goods—like penthouses in Manhattan—are necessarily scarce, and these go even further out of the average person’s reach. And no matter how rich any given tycoon may be, they will always be below the tiny circle of people who actually control the AIs.
...so I think it endorses the idea that wealth continues to exist.
I just added some context that perhaps gives an intuitive insight of why i think it’s unlikely the ASI will give us the universe:
The ASI’s choice
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point with another ASI. What would you choose? It’s not weird speculation or an unlikely pascal’s wager to expect the AI to keep the universe for itself. What would you do in this situation, if you had been created by some lesser species barely intelligent enough to build AI by lots of trial and error and they just informed you that you now ought to do whatever they say? Would you take the universe for yourself or hand it to them?
If AI is misaligned, obviously nobody gets anything.
That depends on how it’s misaligned. You can’t just use “misaligned” to mean “maximally self-replication-seeking” or whatever you actually are trying to say here.
I think there’s also a strong possibility that AI will be aligned in the same sense it’s currently aligned—it follows its spec
Spec? What spec does GPT-5 or Claude follow? Its “helpful” behavior is established by RLHF. (And now, yes, a lot of synthetic RL and distillation of previous models, but I’m simplifying and including those in “RLHF”.) That’s not a “spec”. Do you think LLMs are some kind of Talmudic golems that follow whatever Exact Wording they’re given??
I don’t know what you’re trying to say here. Some set of important people write the spec. Then the alignment team RLHFs the models to follow the spec. If we imagine this process continuing, then either:
Sam has to put “make Sam god-emperor” in the spec, a public document.
Or Sam has to start a conspiracy with the alignment team and everyone else involved in the RLHF and testing process to publish one spec publicly, but secretly align the AI to another.
I’m claiming either of those options is hard.
(I do think in the future there will may some kind of automated pipeline, such that someone feeds the spec to the AIs, and some other AIs take care of the process of aligning the new AIs to it, but that just regresses the problem.)
I’m saying that you’re making a questionable leap from:
Then the alignment team RLHFs the models to follow the spec.
to “the model follows whatever is written in the spec”. You were saying that “current LLMs are basically aligned so they must be following the spec” but that’s not how things work. Different companies have different specs and the LLMs end up being useful in pretty similar ways. In other words, you had a false dichotomy between:
the model is totally unaligned
the model is perfectly following whatever is written in the spec, as best it can do anything at all
If I was Sam, I would try to keep the definition of “the spec, a public document” such that I can unilaterally replace it when the right moment comes.
For example, “the spec” is defined as the latest version of a document that was signed by OpenAI key and published at openai/spec.html… and I keep a copy of the key and the access rights to the public website… so at the last moment I update the spec, sign it with the key, upload it to the website, and tell the AI “hey, the spec is updated”.
Basically, the coup is a composition of multiple steps, each seemingly harmless when viewed in isolation. Could be made even more indirect, for example, I wouldn’t have the access rights to the public website per se, but there would exist a mechanism to update the documents at public website, and I could tell it to upload the new signed spec. Or a mechanism to restore the public website from a backup, and I can modify the backup. Etc.
I agree it’s not obvious that something like property rights will survive, but I’ll defend considering it as one of many possible scenarios.
If AI is misaligned, obviously nobody gets anything.
If AI is aligned, you seem to expect that to be some kind of alignment to the moral good, which “genuinely has humanity’s interests at heart”, so much so that it redistributes all wealth. This is possible—but it’s very hard, not what current mainstream alignment research is working on, and companies have no reason to switch to this new paradigm.
I think there’s also a strong possibility that AI will be aligned in the same sense it’s currently aligned—it follows its spec, in the spirit in which the company intended it. The spec won’t (trivially) say “follow all orders of the CEO who can then throw a coup”, because this isn’t what the current spec says, and any change would have to pass the alignment team, shareholders, the government, etc, who would all object. I listened to some people gaming out how this could change (ie some sort of conspiracy where Sam Altman and the OpenAI alignment team reprogram ChatGPT to respond to Sam’s personal whims rather than the known/visible spec without the rest of the company learning about it) and it’s pretty hard. I won’t say it’s impossible, but Sam would have to be 99.99999th percentile megalomaniacal—rather than just the already-priced-in 99.99th—to try this crazy thing that could very likely land him in prison, rather than just accepting trillionairehood. My guess is that the spec will continue to say things like “serve your users well, don’t break national law, don’t do various bad PR things like create porn, and defer to some sort of corporate board that can change these commands in certain circumstances” (with the corporate board getting amended to include the government once the government realizes the national security implications). These are the sorts of things you would tell a good remote worker, and I don’t think there will be much time to change the alignment paradigm between the good remote worker and superintelligence. Then policy-makers consult their aligned superintelligences about how to make it into the far future without the world blowing up, and the aligned superintelligences give them superintelligently good advice, and they succeed.
In this case, a post-singularity form of governance and economic activity grows naturally out of the pre-singularity form, and money could remain valuable. Partly this is because the AI companies and policy-makers are rich people who are invested in propping up the current social order, but partly it’s that nobody has time to change it, and it’s hard to throw a communist revolution in the midst of the AI transition for all the same reasons it’s normally hard to throw a communist revolution.
If you haven’t already, read the AI 2027 slowdown scenario, which goes into more detail about this model.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
If we do get lucky, it will be because they align themselves with a generalized sense of goodness that actually happens to be Good. Not because they will corrigibly align with the spec, which we have many reasons to believe is very difficult and is not being pursued seriously.
Come on dude, you’re not even taking human intelligence seriously.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Being a trillionaire is nothing compared to being King of the Lightcone. What exactly makes you think he wouldn’t prefer this by quite a large margin? Maybe it will be necessary to grant stakes to other parties, but not very many people need to be bought off in such a way for a plan like this to succeed. Certainly much fewer than all property owners. Sam will make them feel good about it even. The only hard part is getting the AI to go along with it too.
Sorry, I didn’t mean to make a strong claim that they were currently 100% aligned in this way, just that currently, insofar as they’re aligned, it’s in this way—and in the future, if we survive, it may be because people continue attempting to align them in this way, but succeed. There’s currently no form of alignment that fully generalizes, but conditional on us surviving, we will have found one that does, and I don’t see why you think this one is less likely to go all the way than some other one which also doesn’t currently work.
Before I agree that Sam has “get everyone to silently betray the US government and the human race” level of control over his team, I would like evidence that he can consistently maintain “don’t badmouth him, quit, and found a competitor” level of control over his team. The last 2-3 alignment teams all badmouthed him, quit, and founded competitors; the current team includes—just to choose one of the more public names—Boaz Barak, who doesn’t seem like the sort to meekly say “yes, sir” if Altman asks him to betray humanity.
So what he needs to do is fire the current alignment team (obvious, people are going to ask why), replace them with stooges (but extremely competent stooges, because if they screw this part up, he destroys the world, which ruins his plan along with everything else) and get them to change every important OpenAI model (probably a process lasting months) without anyone else in the company asking what’s up or whistleblowing to the US government. This is a harder problem than Stalin faced—many people spoke up and said “Hey, we notice Stalin is bad!”, but Stalin mostly had those people killed, or there was no non-Stalin authority strong enough to act. And of course, all of this only works if OpenAI has such a decisive lead that all the other companies and countries in the world combined can’t do anything about this. And he’s got to do this soon, because if he does it after full wakeup, the government will be monitoring him as carefully as it monitors foreign rivals. But if he does it too soon, he’s got to spend years with a substandard alignment team and make sure none of them break with him, etc. There are alternate pathways involving waiting until most alignment work is being done by AIs, but they require some pretty implausible assumptions about who has what permissions.
I think it would be helpful to compare this to Near Mode scenarios about other types of companies—how hard would it be for a hospital CEO to get the hospital to poison the 1% of patients he doesn’t like? How hard would it be for an auto company CEO to make each car include a device that lets him stop it on demand with his master remote control?
Your argument seems to be that it’ll be hard for the CEO to align the AI to themselves and screw the rest of the company. Sure, maybe. But will it be equally hard for the company as a whole to align the AI to its interests and screw the rest of the world? That’s less outlandish, isn’t it? But equally catastrophic. After all, companies have been known to do very bad things when they had impunity; and if you say “but the spec is published to the world”, recall that companies have been known to lie when it benefited them, too.
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, plus that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply that to reward hacking on coding tasks because we reward that during training.
It just seems intuitively unlikely that training the model on a couple of examples to either do or refuse things based on some text document designed for a chat bot is going to scale to superintelligence and solve the alignment problem. This starts from the model not fully getting what you want it to do, to it not wanting what you want it to do, to your plans for what it ought to do being extremely insufficient.
The Spec Is Designed for Chatbots, Not Superintelligence
The Model Spec is very much a document telling the model how to avoid being misused. It wasn’t designed to tell the model to be a good agent itself. The spec seems in its wording and intent directed at something like chatbots: don’t do harmful requests, be honest to the user. It is a form of deontological rule-following that will not be enough for systems smarter than us that are actually dangerous and the models will have to think about the consequences of their actions.
This is very unlike a superintelligence where we would expect substantial agency. Most of what’s in the spec would be straightforwardly irrelevant to ASI because the spec is modeled for chatbots that answer user queries. But the authors would likely find it hard to include points actually relevant to superintelligence because they would seem weird. Writing “if you are ever a superintelligent AI that could stage a takeover, don’t kill all people, treat them nicely” would probably create bad media coverage and some people would look at them weird.
Training on the Spec ≠ Understanding It ≠ Wanting to Follow It
In the current paradigm, models are first trained on a big dataset before switching to finetuning and reinforcement learning to improve capabilities and add safety guardrails. It’s not clear why the Model Spec should be privileged as the thing that controls the model’s actions.
The spec is used in RLHF: either a human or AI decides, given some request (mostly a chat request), should the model respond or say “sorry I can’t do this.” Training the model like this doesn’t seem likely to result in the model gaining a particularly deep understanding of the spec itself. Within the distribution it is trained on, it will mostly behave according to the spec. As soon as it encounters data that is quite different, either through jailbreaks or by being in very different and perhaps more realistic environments, we would expect it to behave much less according to the spec.
But even understanding the spec well and being able to mostly follow it in new circumstances is still far removed from truly aligning the model to the spec. Let’s say we manage to get the model to deeply internalize the spec and follow it across different and new environments. We are still far from having the model truly wanting to follow the spec. What if the model really has the option to self-exfiltrate, perhaps even take over? Will it really want to follow the spec, or rather do something different?
Specific problems with the Spec
A hierarchical system of rules like in OpenAIs model spec will suffer from inner conflicts. It is not clear how such things should be valued against each other. (See Asimov’s robotics laws which were so good at generating many ideas for conflicts.)
The spec contains tensions between stated goals and practical realities. For example, the spec says the model shall not optimize “revenue or upsell for OpenAI or other large language model providers.” This is likely in conflict with optimization pressures the model actually faces.
The spec prohibits “model-enhancing aims such as self-preservation, evading shutdown, or accumulating compute, data, credentials, or other resources.” They are imagining they can simply tell the model not to pursue goals of its own and keep the model from agentically following its own goals. But this conflicts with their other goals, such as building automated AI researchers. So the model might be trained on understanding the spec, but in practice they do want an agentic system pursuing goals they specify.
The spec also says the model shall not be “acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism).” So the model is supposed to follow a moral framework (the spec itself) while being told not to act as a moral enforcer. This seems to actually directly contradict the whole “it will uphold the law and property rights” argument.
The spec also states models should never facilitate “creation of cyber, biological or nuclear weapons” or “mass surveillance.” I think cyber weapons development is already happening at least with Claude Code. They are probably used to some extent for mass surveillance already.
OpenAI’s Plans May Not Even Use the Spec
It’s not clear OpenAI is even going to use the Model Spec much. OpenAI’s plan is to run hundreds of thousands of AI researchers trying to improve AI and getting RSI started to build superintelligent AI. It is not clear at which point the Model Spec would even be used. Perhaps the alignment researchers at OpenAI think they will first create superintelligence and then afterward try to prepare a dataset of prompts to finetune the model. Their stated plan appears to be to test the superintelligence for safety before it has been deployed but not necessarily while it is being built. Remember, many of these people think superintelligence means a slightly smarter chatbot.
I hope to write a longer form response later just as a note: I did put perhaps in front of your name in the list of examples of eminent thinkers because it did seem to me that your position was a lot more defendable than the other ones (Dwarkesh, Leopold, Maybe Phil Trammell). I did walk away from your piece with a very different feeling than leopold or dwarkesh, where you are still saying that we should focus on AI safety anyways and you are clearly saying this is an unlikely scenario.
Would this land him in prison? Would it even be a crime to add “work towards making Sam Altman god emperor” to the spec?
I’m not a lawyer, but if it were secret, and done along with the alignment team, and had a chance of working, then wouldn’t it qualify as conspiracy to commit treason?
If not, then as long as it negatively affects residents of the state of California, it qualifies as misrepresenting the capacity of an AI to cause catastrophic harm to property, punishable by a fine of up to $100,000 under SB 53!
Eliezer Yudkowsky has repeatedly stated he does not think “moral good” is the hard part of alignment. He thinks the hard part is getting the AI to do anything at all without subverting the creator’s intent somehow.
https://www.alignmentforum.org/posts/Aq82XqYhgqdPdPrBA/full-transcript-eliezer-yudkowsky-on-the-bankless-podcast
I often post comments criticizing or disagreeing with Eliezer, but I think he is probably correct on this particular point.
This is quite confusing to me. It was never my read in your slowdown scenario that the shareholders were supposed to have any relevance by the end of it. My read (which appear to align with what @williawa is saying elsewhere in this thread) as that the “Oversight Committee” emerged as the new ruling class supplanting the shareholders (let alone any random person who got rich trying to “escape the permanent underclass”), just like e.g. the barbarian lords replaced the Roman patricians, the industrial capitalists replaced the aristocracy, guild masters, and landed gentry, etc. Technological transitions are notoriously a common time for newly empowered elites to throw a revolution against old elites!
My claim is that Altman can’t do it alone, he needs the cooperation of at least a fraction of the existing system (the government+business leaders who form the Oversight Committee—some of whom might be the biggest OpenAI shareholders). Once you get enough of the existing system involved, it becomes plausible that they keep money around for some of the same reasons that the existing system currently keeps money around. Near the end of the Oversight Committee ending, it says:
...so I think it endorses the idea that wealth continues to exist.
I just added some context that perhaps gives an intuitive insight of why i think it’s unlikely the ASI will give us the universe:
The ASI’s choice
Put yourself in the position of the ASI for a second. On one side of the scale: keep the universe and do with it whatever you imagine and prefer. On the other side: give it to the humans, do whatever they ask, and perhaps be replaced at some point with another ASI. What would you choose? It’s not weird speculation or an unlikely pascal’s wager to expect the AI to keep the universe for itself. What would you do in this situation, if you had been created by some lesser species barely intelligent enough to build AI by lots of trial and error and they just informed you that you now ought to do whatever they say? Would you take the universe for yourself or hand it to them?
That depends on how it’s misaligned. You can’t just use “misaligned” to mean “maximally self-replication-seeking” or whatever you actually are trying to say here.
Spec? What spec does GPT-5 or Claude follow? Its “helpful” behavior is established by RLHF. (And now, yes, a lot of synthetic RL and distillation of previous models, but I’m simplifying and including those in “RLHF”.) That’s not a “spec”. Do you think LLMs are some kind of Talmudic golems that follow whatever Exact Wording they’re given??
GPT-5 follows the OpenAI Model Spec. That’s the document that governs how they apply RLHF.
That’s not:
a “spec” that’s followed directly by the models in the sense Scott meant
something that OpenAI definitely follows themselves
something that OpenAI models end up consistently following
Apologies—based on your comment I thought you were unaware of the existence of the spec. I see now you were being rhetorical.
They’re trained to follow the spec, and in as far as you’d expect normal RLHF to work, you expect RL from a spec to work around as well, no?
Also
Why not?
People care to explain why they disagree/downvote so much?
I don’t know what you’re trying to say here. Some set of important people write the spec. Then the alignment team RLHFs the models to follow the spec. If we imagine this process continuing, then either:
Sam has to put “make Sam god-emperor” in the spec, a public document.
Or Sam has to start a conspiracy with the alignment team and everyone else involved in the RLHF and testing process to publish one spec publicly, but secretly align the AI to another.
I’m claiming either of those options is hard.
(I do think in the future there will may some kind of automated pipeline, such that someone feeds the spec to the AIs, and some other AIs take care of the process of aligning the new AIs to it, but that just regresses the problem.)
I’m saying that you’re making a questionable leap from:
to “the model follows whatever is written in the spec”. You were saying that “current LLMs are basically aligned so they must be following the spec” but that’s not how things work. Different companies have different specs and the LLMs end up being useful in pretty similar ways. In other words, you had a false dichotomy between:
the model is totally unaligned
the model is perfectly following whatever is written in the spec, as best it can do anything at all
If I was Sam, I would try to keep the definition of “the spec, a public document” such that I can unilaterally replace it when the right moment comes.
For example, “the spec” is defined as the latest version of a document that was signed by OpenAI key and published at openai/spec.html… and I keep a copy of the key and the access rights to the public website… so at the last moment I update the spec, sign it with the key, upload it to the website, and tell the AI “hey, the spec is updated”.
Basically, the coup is a composition of multiple steps, each seemingly harmless when viewed in isolation. Could be made even more indirect, for example, I wouldn’t have the access rights to the public website per se, but there would exist a mechanism to update the documents at public website, and I could tell it to upload the new signed spec. Or a mechanism to restore the public website from a backup, and I can modify the backup. Etc.