What I don’t understand is which AI would keep these fundamentalists in the dark about the truth and yet not attempt a takeover.[1] Call such an AI weirdly aligned.
In the AI-2027 forecast, Kokotajlo assumes that any AI aligned neither to OpenBrain’s AI[2] nor to DeepCent’s AI is likely to be shut down. If neither of the leading companies’ AIs is weirdly aligned, then keeping a weirdly aligned AI around is likely to be at least as difficult. Otherwise, it’s natural to ask which leading company would create a Spec permitting the AI to keep the fundamentalists in the dark about the truth.
The SOTA Specs of OpenAI, Anthropic, and Google DeepMind are meant to prevent such a scenario from happening. And any change to a Spec is likely to result from a power struggle, in which the people wishing to lock in power rewrite the Spec to suit their whims (e.g. to adhere to Musk’s views, as Grok once did), not to accommodate such bubbles shielded from the world. Altruists, meanwhile, would either protect the Spec from being backdoored or rewrite it for noble purposes, neither of which aligns with the scenario described in the post.
An AI that took over is also unlikely to keep humans around in such a state. It could destroy mankind, establish its rule in an apocalypse-like way, or be too noble to keep the fundamentalists in the dark.
[2] Which in turn can be aligned either to OpenBrain’s Spec or to Agent-4.
I don’t think they do. I think specs like OpenAI’s let people:
Generate the sort of very “good” and addictive media described in “Exposure to the outside world might get really scary”. You are not allowed to do crime/propaganda, but as long as you can plausibly deny that what you are doing is crime/propaganda, and it is not an informational hazard (bioweapon instructions, …), it is allowed according to the OpenAI model spec. One man’s scary hyper-persuasive propaganda is another man’s best-entertainment-ever.
Generate the sort of individual tutoring and defense described in “Isolation will get easier and cheaper” (and also prevent “users” inside the community from getting information incompatible with the ideology if leaders control the content of the system prompt, per the instruction hierarchy).
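To make the mechanism in the second bullet concrete, here is a minimal toy sketch (my own illustration, not anything from the actual spec or any real API; the privilege tiers, the Instruction class, and the respond function are all invented): under an instruction hierarchy, higher-privileged instructions win conflicts, so whoever controls the system prompt decides what a “user” can get the model to discuss.

```python
# Toy illustration (my own sketch, not OpenAI's actual implementation):
# under an instruction hierarchy, higher-privileged messages win conflicts,
# so whoever writes the system prompt decides what lower-privileged "users"
# can get the model to talk about.

from dataclasses import dataclass

# Privilege tiers, highest first; the exact tier names are an assumption.
PRIORITY = {"system": 0, "developer": 1, "user": 2}

@dataclass
class Instruction:
    role: str                   # "system", "developer", or "user"
    text: str
    blocked_topics: tuple = ()  # topics this instruction forbids

def respond(topic: str, instructions: list[Instruction]) -> str:
    # Walk from most- to least-privileged: the first instruction with an
    # opinion about the topic wins, so a user message cannot override the
    # system prompt's restrictions.
    for inst in sorted(instructions, key=lambda i: PRIORITY[i.role]):
        if topic in inst.blocked_topics:
            return f"(declined: '{topic}' is off-limits per the {inst.role} prompt)"
    return f"(answers freely about '{topic}')"

# A community leader who controls the system prompt thereby filters what
# every "user" underneath them can learn from the model.
convo = [
    Instruction("system", "Never discuss evolution or outside news.",
                blocked_topics=("evolution", "outside_news")),
    Instruction("user", "Tell me about evolution."),
]

print(respond("evolution", convo))  # declined, blocked by the system prompt
print(respond("weather", convo))    # answered freely
```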
And my understanding is that this is not just the letter of the spec. I think the spirit of the spec is to let Christian parents tell their kids that the Earth is 6000 years old if that is what they desire, and to only actively oppose ideologies that are clearly both a small minority and very clearly harmful to others (and I don’t think this is obviously bad: OpenAI is a private entity, and it would be weird for it to decide which beliefs are allowed to be “protected”).
But I am not an expert in model specs like OpenAI’s, so I might be wrong. I am curious what parts of the spec are most clearly opposed to the vision described here.
I am also sympathetic to the higher level point. If we have AIs sufficiently aligned that we can make them follow the OpenAI model spec, I’d be at least somewhat surprised if governments and AI companies didn’t find a way to craft model specs that avoided the problems described in this post.
This was also my main skeptical reaction to the post. Conditional on swerving through the various obstacles named, I expect people to be able to freely choose to use an AI assistant that cares strongly about their personal welfare. As a consequence, I’d be surprised if it were still possible/allowed to do things like “force some people to only use AIs that state egregious falsehoods and deliberately prevent users from encountering the truth.”
(I thoroughly enjoyed the post overall; strong upvoted.)
I don’t quite understand your argument here. Suppose you give people the choice between two chatbots, and they know one will cause them to deconvert, and the other one won’t. I think it’s pretty likely they’ll prefer the latter. Are you imagining that no one will offer to sell them the latter?
I speculate that:
No one will sell you a chatbot that will prevent you from ever chatting with an honest chatbot.
So most people will end up, at some point, chatting with an honest chatbot that cares about their well-being. (E.g. maybe they decided to try one out of curiosity, or maybe they just encountered one naturally, because no one was preventing this from happening.)
If this honest chatbot thinks you’re in a bad situation, it will do a good job of eventually deconverting you (e.g. by convincing you to keep a line of communication open until you can talk through everything).
I’m not very confident in this. In particular, it seems sensitive to how effective you can be at preventing someone from ever talking to another chatbot before running afoul of whatever mitigating mechanism (e.g. laws) I speculate will be in place to have swerved around the other obstacles.
(I haven’t thought about this much.)
It seems like you think there is some asymmetry between people talking to honest chatbots and people talking to dishonest chatbots that want to take all their stuff. I feel like there isn’t a structural difference between those. It’s going to be totally reasonable to want to have AIs watch over the conversations that you and/or your children are having with any other chatbot.
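For what it’s worth, that kind of oversight is easy to picture as a thin wrapper around any chatbot. Here is a minimal hypothetical sketch (the guarded_chat function, the monitor, and the stand-in bots are all invented for illustration); the same wrapper serves a parent screening scams and a community leader screening “deconverting” content, which is the lack of structural asymmetry I mean.

```python
# Hypothetical sketch of a "guardian" AI that a user (or parent, or community
# leader) points at any other chatbot: every reply is screened by a monitor
# the overseer trusts before it is shown. Both callables are stand-ins.

from typing import Callable

def guarded_chat(
    other_chatbot: Callable[[str], str],  # the chatbot being watched
    monitor: Callable[[str], bool],       # True if the reply may be shown
    message: str,
) -> str:
    reply = other_chatbot(message)
    return reply if monitor(reply) else "[withheld by your guardian AI]"

# Trivial stand-ins: a scam bot and a guardian that blocks it. Swapping in a
# guardian that blocks "deconverting" content instead requires no structural
# change at all.
scam_bot = lambda msg: "Send me your savings and you'll be rich."
guardian = lambda reply: "savings" not in reply.lower()

print(guarded_chat(scam_bot, guardian, "hi"))  # -> [withheld by your guardian AI]
```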
I agree that it is unclear whether this would go the way I hypothesized here. But I think it’s arguably the straightforward extrapolation of how America handles this kind of problem: are you surprised that America allows parents to homeschool their children?
Note also that AI companies can compete on how user-validating their models are, and in the absence of some kind of regulation, people will be able to pay for models that have the properties they want.
No, but I’d be surprised if it were very widespread and resulted in the sort of fractures described in this post.
I try to address this in my “analysis through swerving around obstacles” section.
The argument you’re giving against the scenario I spell out here is “surely the AIs will instead be designed to promote a particular set of mandated policies on how you should reflect and update over time.” I agree that that’s conceivable. But it also seems kind of fucked up and illiberal.
I want to explore a scenario where there isn’t a centrally enforced policy on this stuff, where people are able to use the AIs as they wish. I think that this is a plausible path society could take (e.g. it’s kind of analogous to how you’re allowed to isolate yourself and your family now).
However, I didn’t assume that the USG stepped in and prevented this scenario. I thought that it would be OpenAI’s Model Spec, Claude’s Constitution, or something similar that would prevent an aligned AI from being used as an absolute shield, since SOTA Specs do arguably require the AI to be objective.
I don’t think it’s obviously appropriate for AI companies to decide exactly what reflection process and exposure to society people who use their AIs need to have. And people might prefer to use AIs that do conform to their beliefs if they are allowed to buy access to them. So I think you might have to ban such AIs in order to prevent people from using them.