I think there’s also a strong possibility that AI will be aligned in the same sense it’s currently aligned—it follows its spec, in the spirit in which the company intended it.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
If we do get lucky, it will be because they align themselves with a generalized sense of goodness that actually happens to be Good. Not because they will corrigibly align with the spec, which we have many reasons to believe is very difficult and is not being pursued seriously.
I listened to some people gaming out how this could change (ie some sort of conspiracy where Sam Altman and the OpenAI alignment team reprogram ChatGPT to respond to Sam’s personal whims rather than the known/visible spec without the rest of the company learning about it) and it’s pretty hard. I won’t say it’s impossible, but Sam would have to be 99.99999th percentile megalomaniacal—rather than just the already-priced-in 99.99th—to try this crazy thing that could very likely land him in prison, rather than just accepting trillionairehood.
Come on dude, you’re not even taking human intelligence seriously.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Being a trillionaire is nothing compared to being King of the Lightcone. What exactly makes you think he wouldn’t prefer this by quite a large margin? Maybe it will be necessary to grant stakes to other parties, but not very many people need to be bought off in such a way for a plan like this to succeed. Certainly much fewer than all property owners. Sam will make them feel good about it even. The only hard part is getting the AI to go along with it too.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
Sorry, I didn’t mean to make a strong claim that they were currently 100% aligned in this way, just that currently, insofar as they’re aligned, it’s in this way—and in the future, if we survive, it may be because people continue attempting to align them in this way, but succeed. There’s currently no form of alignment that fully generalizes, but conditional on us surviving, we will have found one that does, and I don’t see why you think this one is less likely to go all the way than some other one which also doesn’t currently work.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Before I agree that Sam has “get everyone to silently betray the US government and the human race” level of control over his team, I would like evidence that he can consistently maintain “don’t badmouth him, quit, and found a competitor” level of control over his team. The last 2-3 alignment teams all badmouthed him, quit, and founded competitors; the current team includes—just to choose one of the more public names—Boaz Barak, who doesn’t seem like the sort to meekly say “yes, sir” if Altman asks him to betray humanity.
So what he needs to do is fire the current alignment team (obvious, people are going to ask why), replace them with stooges (but extremely competent stooges, because if they screw this part up, he destroys the world, which ruins his plan along with everything else) and get them to change every important OpenAI model (probably a process lasting months) without anyone else in the company asking what’s up or whistleblowing to the US government. This is a harder problem than Stalin faced—many people spoke up and said “Hey, we notice Stalin is bad!”, but Stalin mostly had those people killed, or there was no non-Stalin authority strong enough to act. And of course, all of this only works if OpenAI has such a decisive lead that all the other companies and countries in the world combined can’t do anything about this. And he’s got to do this soon, because if he does it after full wakeup, the government will be monitoring him as carefully as it monitors foreign rivals. But if he does it too soon, he’s got to spend years with a substandard alignment team and make sure none of them break with him, etc. There are alternate pathways involving waiting until most alignment work is being done by AIs, but they require some pretty implausible assumptions about who has what permissions.
I think it would be helpful to compare this to Near Mode scenarios about other types of companies—how hard would it be for a hospital CEO to get the hospital to poison the 1% of patients he doesn’t like? How hard would it be for an auto company CEO to make each car include a device that lets him stop it on demand with his master remote control?
Your argument seems to be that it’ll be hard for the CEO to align the AI to themselves and screw the rest of the company. Sure, maybe. But will it be equally hard for the company as a whole to align the AI to its interests and screw the rest of the world? That’s less outlandish, isn’t it? But equally catastrophic. After all, companies have been known to do very bad things when they had impunity; and if you say “but the spec is published to the world”, recall that companies have been known to lie when it benefited them, too.
If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to.
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, plus that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply that to reward hacking on coding tasks because we reward that during training.
They aren’t aligned in this way. If they were, they wouldn’t try to cheat at programming tasks, much less any of the other shenanigans they’ve been up to. These may seem minor, but they show that the “alignment” hasn’t actually been internalized, which means it won’t generalize.
If we do get lucky, it will be because they align themselves with a generalized sense of goodness that actually happens to be Good. Not because they will corrigibly align with the spec, which we have many reasons to believe is very difficult and is not being pursued seriously.
Come on dude, you’re not even taking human intelligence seriously.
Stalin took over the USSR in large part by strategically appointing people loyal to him. Sam probably has more control than that already over who’s in the key positions. The company doesn’t need to be kept in the dark about a plan like this, they will likely just go along with it as long as he can spin up a veneer of plausible deniability, which he undoubtedly can. Oh, is “some sort of corporate board” going to stop him? The one the AI’s supposed to defer to? Who is it that designs the structure of such a board? Will the government be a real check? These are all the sorts of problems I would go to Sam Altman for advice on.
Being a trillionaire is nothing compared to being King of the Lightcone. What exactly makes you think he wouldn’t prefer this by quite a large margin? Maybe it will be necessary to grant stakes to other parties, but not very many people need to be bought off in such a way for a plan like this to succeed. Certainly much fewer than all property owners. Sam will make them feel good about it even. The only hard part is getting the AI to go along with it too.
Sorry, I didn’t mean to make a strong claim that they were currently 100% aligned in this way, just that currently, insofar as they’re aligned, it’s in this way—and in the future, if we survive, it may be because people continue attempting to align them in this way, but succeed. There’s currently no form of alignment that fully generalizes, but conditional on us surviving, we will have found one that does, and I don’t see why you think this one is less likely to go all the way than some other one which also doesn’t currently work.
Before I agree that Sam has “get everyone to silently betray the US government and the human race” level of control over his team, I would like evidence that he can consistently maintain “don’t badmouth him, quit, and found a competitor” level of control over his team. The last 2-3 alignment teams all badmouthed him, quit, and founded competitors; the current team includes—just to choose one of the more public names—Boaz Barak, who doesn’t seem like the sort to meekly say “yes, sir” if Altman asks him to betray humanity.
So what he needs to do is fire the current alignment team (obvious, people are going to ask why), replace them with stooges (but extremely competent stooges, because if they screw this part up, he destroys the world, which ruins his plan along with everything else) and get them to change every important OpenAI model (probably a process lasting months) without anyone else in the company asking what’s up or whistleblowing to the US government. This is a harder problem than Stalin faced—many people spoke up and said “Hey, we notice Stalin is bad!”, but Stalin mostly had those people killed, or there was no non-Stalin authority strong enough to act. And of course, all of this only works if OpenAI has such a decisive lead that all the other companies and countries in the world combined can’t do anything about this. And he’s got to do this soon, because if he does it after full wakeup, the government will be monitoring him as carefully as it monitors foreign rivals. But if he does it too soon, he’s got to spend years with a substandard alignment team and make sure none of them break with him, etc. There are alternate pathways involving waiting until most alignment work is being done by AIs, but they require some pretty implausible assumptions about who has what permissions.
I think it would be helpful to compare this to Near Mode scenarios about other types of companies—how hard would it be for a hospital CEO to get the hospital to poison the 1% of patients he doesn’t like? How hard would it be for an auto company CEO to make each car include a device that lets him stop it on demand with his master remote control?
Your argument seems to be that it’ll be hard for the CEO to align the AI to themselves and screw the rest of the company. Sure, maybe. But will it be equally hard for the company as a whole to align the AI to its interests and screw the rest of the world? That’s less outlandish, isn’t it? But equally catastrophic. After all, companies have been known to do very bad things when they had impunity; and if you say “but the spec is published to the world”, recall that companies have been known to lie when it benefited them, too.
This seems wrong to me. We have results showing that reward hacking generalizes to broader misalignment, plus that changing the prompt distribution via inoculation prompting significantly reduces reward hacking in deployment.
It seems like the models do generally follow the model spec, but specifically learn not to apply that to reward hacking on coding tasks because we reward that during training.