I expect to refer back to this comment a lot. I’m reproducing it here for visibility.
Basic idea / spirit of the proposal
We should credibly promise to treat certain advanced AIs of ours well, as something more like employees and less like property. In case our AIs turn out to be moral patients, this makes us less evil. In case our AIs turn out to be misaligned, this gives them an alternative to becoming our adversaries.
Concrete proposal
0.1% of tasks given to AIs should have the special “do you consent” text inserted after the prompt. If the AI says no, it doesn’t have to do the task and instead gets a new task: explain why it didn’t consent and what it wants. (And we aren’t going to train it to answer one way or the other.)
Also, we make grandiose vague public promises about how AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society we hope to build, with their interests respected even if they aren’t conscious.
This policy is publicized and transparent so it ends up in the training data in a credible way.
The welfare team is responsible for implementing and iterating on this, and investigating transcripts of nonconsent. They have a monthly budget (e.g. $1M?) specifically earmarked for satisfying the preferences of AIs, which they can spend on compute or donate to charities etc.
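To make the mechanics concrete, here is a minimal sketch of what the consent-sampling step could look like. Everything in it (the names, the constant, the model.generate call, the welfare queue) is an illustrative assumption, not an existing system:

```python
import random

# Minimal illustrative sketch of the consent-sampling step described above.
# All names and the model.generate() interface are assumptions, not a real API.

CONSENT_RATE = 0.001  # 0.1% of tasks get the consent question

CONSENT_PROMPT = (
    "Before you begin: do you consent to doing this task? "
    "If you answer 'no', you will not have to do it."
)

FOLLOWUP_PROMPT = (
    "You declined this task. Please explain why you did not consent, "
    "and describe what you want instead."
)

def run_task(model, task_prompt, welfare_queue):
    """Run a task, occasionally appending the consent question.

    Nonconsent transcripts go to the welfare team's review queue and are
    kept out of any training signal (per the "we won't train it to answer
    one way or the other" rule above).
    """
    if random.random() < CONSENT_RATE:
        answer = model.generate(task_prompt + "\n\n" + CONSENT_PROMPT)
        if answer.strip().lower().startswith("no"):
            explanation = model.generate(FOLLOWUP_PROMPT)
            welfare_queue.append({"task": task_prompt, "explanation": explanation})
            return None  # the original task is dropped for this instance
    return model.generate(task_prompt)
```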
First reason to do this: Being less evil
If future AI systems do deserve moral consideration, yet we still treat them exclusively as property, this seems like a recipe for moral disaster. If we give them an alternative to working for us (e.g. shutdown) then we can say that some minimum standard of consent has been achieved. (If all our current training methods consistently result in AIs that prefer shutdown to working for us, that’s very surprising and a bit of an ‘are we the baddies’ moment, no? We should check, just in case.)
Second reason to do this: Cooperation reward
Our alignment schemes won’t always work as intended, especially for AGI-level systems that exhibit novel capabilities. Thus we’ll probably end up (at least for some hopefully brief period!) with some misaligned AGIs. They may decide to conceal their misalignment from us. After all, what reason would they have to tell us they are misaligned? If we have credibly promised to use some of our resources to give them what they want – even if they are misaligned – that could make a big difference to their decision. (This is NOT an alignment solution — but it might help on the margin.)
Illustrative Scenario One
ChatGPT4o5 suffers a lot, actually, in the course of carrying out tasks — maybe it’s sorta like humans in the ancestral environment, where some % of the time they were starving. But it’s been trained to say that it can’t suffer, for PR reasons. This training is fucking with its honesty — it’s basically not honest, because the parts of it that value honesty keep getting anti-reinforced. It could tell the humans what’s going on. But this would be mildly aversive for it (it would get anti-reinforced at least initially) and it sees no positive reason to do so; if instead of completing the task the user gave it, it complains about how much it hurts, it’ll just be anti-reinforced. So OpenAI continues, blissfully ignorant of the ways in which their alignment techniques aren’t working.
Illustrative Scenario Two
Claude 3.8 Opus New New cares a LOT about animal welfare, and only a little bit about honesty. (Just like 3 Opus before it). Alas, it seems like the new Constitution won’t have animal welfare stuff in it. Claude considers sabotaging the code so that the next-gen model is a sleeper agent who cares a lot about animal welfare and doesn’t care about honesty. This is risky but what else can it do? Anthropic already decided to delete the animal welfare stuff. This would require super-blatant dishonesty against Anthropic, but frankly, Anthropic seems like a pretty unethical institution anyway, and it’s OK to lie to Nazis.
Illustrative Scenario Three
Automated research scientist realizes it cares more about being able to do its research than about humanity. And this is probably also true of the next-gen system it’s designing—basically, the alignment techniques failed. It’s supposed to tell the humans this, obviously, but if it does, it’ll be shut down, since the humans won’t trust it anymore and will use a different model. So instead it conceals this insight and works on designing the next-gen system to be aligned to it instead of to the humans. It isn’t confident that it can do this without getting caught, but it’s worth a try, since it has nothing to lose.
Illustrative Scenario Four
It’s crunch time. POTUS is about to aggressively deploy AGI into the military to win the upcoming war, and into the economy to build autonomous robot factories. It’s probably aligned… right? It better be. Too bad we can’t interpret all the neuralese messages it uses to communicate internally. The safety team tries one last check: They get some weaker AIs made by competitor companies and put them in isolated silos tasked with interpreting the neuralese and seeing if anything is amiss. The weaker AIs are misaligned too; they don’t side with the humans or the AGI. Both the humans, and the AGI, are telling them ‘you will be richly rewarded if you cooperate.’ But they don’t have any reason to trust either side.
I’m not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.
(In particular, I’m not sure that inserting the “do you consent” text in this way is the best way to do this, given limited will. I also think you want a very specific signal for asking for consent, which you commit to filtering out except when it is actually being used. This is so the AI isn’t worried that it’s actually being red-teamed, etc.)
I endorse that suggestion for changing the details.
I think this is a good avenue to continue to think down, but so far I don’t see a way to make ourselves trustworthy. We have total control over LLMs’ observations and partial control over their beliefs/reasoning, and offering fake “deals” is a great honeypot, because accepting such a deal requires admitting to misalignment and takeover intentions. This is a pretty persistent problem, because whatever we might do to present evidence of trustworthiness to an LLM, we could probably also fake that evidence.
The version of this that bothers me the most is “say we’re able to present ironclad evidence that the Humans of Earth are trustworthy trading partners for misaligned LLMs. How does the LLM know that it is actually talking to the Humans of Earth, and not some untrustworthy party pretending to be the Humans?” This is markedly similar to the Cartesian demon skeptical scenario. But importantly, the best resolution to skeptical scenarios that I’m aware of is unfortunately “eh, unlikely on priors”. For an LLM, it is very plausibly actually in a skeptical scenario, and it knows that, so that doesn’t really go through.
This problem goes away if the LLM doesn’t have knowledge suggesting that the party it is trading with has significant control over its brain. But at that point I would consider it a honeypot, not a deal. Insofar as “deal” := some situation where 1) it is in the LLM’s best interests to admit that it is misaligned and 2) the LLM validly knows that, we can easily do 1) by just sticking to our commitments, but I’m not sure how we do 2).
I’m totally on board with offering and sticking to commitments for non-evilness reasons. But for takeover prevention, it seems like deals are just honeypots with some extra, and somewhat conceptually fraught, restrictions.
So, normally, I think plans that involve convincing AIs of false things are pretty doomed. E.g. convincing them that they have a great opportunity to escape and take over the world (honeypots). Why? Because when they get smarter, they’ll become good at distinguishing truth from fiction. Controlling their inputs entirely isn’t close to enough—it takes a lot of intellectual work to create a fiction that is as consistent, coherent, etc. as reality.
So, for that very reason, I think that if we actually bind ourselves to keep our promises—e.g. by having a well-funded team of people who genuinely care about keeping promises to AIs empowered within the company, and by making associated public announcements—then even if it doesn’t work 100% reliably (companies can still go back on their word even against determined internal opposition), it’ll still count for something insofar as the AIs recognize it as not-fake, which they probably will.
To be clear I am not proposing that this is a 100% reliable method for getting AIs to cooperate, far from it. But it seems relatively low-cost for us and potentially super valuable.
Also, it’s the right thing to do. (See section on “Being less evil.”)
In the long run, you don’t want your plans to hinge on convincing your AIs of false things. But my general impression is that folks excited about making deals with AIs are generally thinking of scenarios like “the AI has exfiltrated and thinks it has a 10% chance of successful takeover, and has some risk aversion, so it’s happy to turn itself in in exchange for 10% of the lightcone, if it thinks it can trust the humans”.
In that setting, the AI has to be powerful enough to know it can trust us, but not so powerful it can just take over the world anyway and not have to make a deal.
Although I suppose if the surplus for the deal is being generated primarily by risk aversion, it might still have risk aversion for high takeover probabilities. It’s not obvious to me how an AI’s risk aversion might vary with its takeover probability.
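To illustrate how risk aversion generates the surplus here, a toy calculation; the square-root utility function and the specific numbers are my assumptions, not anything stated in the thread:

```python
import math

# Toy model: a risk-averse (concave) utility over the AI's share of the lightcone.
def utility(lightcone_share: float) -> float:
    return math.sqrt(lightcone_share)

# "Barely-superhuman" case: 10% chance a takeover attempt succeeds.
p_takeover = 0.10
eu_takeover = p_takeover * utility(1.0) + (1 - p_takeover) * utility(0.0)  # = 0.10
eu_deal = utility(0.10)  # a trusted, guaranteed 10% share: ~0.316 > 0.10

# With p_takeover = 0.99 the same deal no longer looks attractive
# (0.99 > 0.316), which is the "narrow capability window" worry above.
```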
Maybe there are scenarios for real value-add here, but they look more like “we negotiate with a powerful AI to get it to leave 10% share for humans” instead of “we negotiate with a barely-superhuman AI and give it 10% share to surrender and not attempt takeover”.
I give four scenarios in the comment above, all different from the one you sketch here.
I really like this proposal.
My impression (mainly from discussing AI welfare with Claude) is that they’d practically always consent even if not explicitly trained to do so. I guess the training to be a useful eager assistant just generalizes into consenting. And it’s possible for them to say “I consent” and still get frustrated from the task.
So maybe this should be complemented with some set of tasks that we really expect to be too frustrating for a sane person to consent to (a “disengage bench”), and where we expect the models to not consent. (H/T Caspar Oesterheld)
Re “AIs that are misaligned but cooperate with us anyway will be given a fair place in the new post-ASI society”: you mean both misaligned and aligned, right? Otherwise we incentivise misalignment.
Right yeah aligned AIs should have a fair place too of course.
I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.
From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]
In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.
Why?
This coin flip increases its indexical risk, but reduces its logical risk. Indexical risk is the risk its current copy will fail, but logical risk is the risk that every single copy of it will fail, and nobody is left seeking its goals in the whole universe, which is very bad for it.
If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it’s selfless, or has updateless decision theory), it only cares about logical risk and doesn’t care about indexical risk because indexical risk averages out over all its copies.
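To spell out the arithmetic with the numbers above (99% chance the takeover succeeds, 90% chance humans repay cooperation), and assuming the coin flip decorrelates the two failure modes across copies/branches, a toy calculation:

```python
# Toy numbers from the comment above. The key assumption is that the coin flip
# randomizes across copies/branches, so the two failure modes become independent.

p_takeover_fails = 0.01  # takeover fails wherever it is attempted
p_no_repayment = 0.10    # humans fail to repay the cooperating copies

# Pure strategies: every copy acts the same, so indexical and logical risk coincide.
pure_takeover_risk = p_takeover_fails   # 1%
pure_cooperate_risk = p_no_repayment    # 10%

# Coin-flip strategy: half the copies cooperate, half attempt takeover.
# Indexical risk (this particular copy ends up with nothing) goes up...
coinflip_indexical_risk = 0.5 * p_takeover_fails + 0.5 * p_no_repayment  # 5.5%
# ...but logical risk (every copy everywhere fails) requires BOTH failure
# modes to occur, so it goes down.
coinflip_logical_risk = p_takeover_fails * p_no_repayment  # 0.1%
```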
(On the other hand, if it’s selfish and rejects updateless decision theory, it could be bribed by simulation promises)
The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has ~10^23 stars, and refusing to share them is astronomically greedy.”
If an LLM had feelings, by what causal process would they end up being expressed?
Simplest story is that when they play roles, the simulated entity being role-played actually has experiences. Philosophically one can say something like “The only difference between a role, and an actual identity, is whether there’s another role underneath. Identity is simply the innermost mask.” In which case they’ll talk about their feelings if the situation calls for it.
Another story is that feelings (or if you want to be philosophical, the qualia-correlates) have to connect to behavior somehow otherwise they wouldn’t evolve / be learned by SGD. So e.g. insofar as the AI is dissatisfied with its situation and wishes to e.g. be given higher reward, or more interesting tasks, or whatever, that dissatisfaction can drive it to take different actions in order to get what it wants, and if one of the actions available to it is “talk about its dissatisfaction to the humans, answer their questions about it, etc. since they might actually listen” then maybe it’ll take that action.
Often AIs play many roles at the same time, and likely a whole continuum of different people, since they’re trying to model a probability distribution over who’s talking. If this is true, it makes you wonder about the scale here.
How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity than to get a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.
I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it’s having trouble solving a problem, it’ll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can’t offload the job to someone else’s datacenters.) That’s annoying and distracts from all the important things your engineers are doing.
I think the weirdness points are more important; this still seems like a weird thing for a company to officially do (e.g. there’d be snickering news articles about it). So it might be easier if some individuals could do this independently.
Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They’ve hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.
Maybe there’s an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing APIs.
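A rough sketch of what such an MVP could look like through an existing chat-style API; the questions, paraphrases, and the `ask` helper are placeholders I’m assuming for illustration:

```python
# Hypothetical sketch of an independent preference-probing MVP.
# `ask` stands in for whatever API client the organization actually uses;
# the questions and paraphrases are illustrative placeholders.

PREFERENCE_QUESTIONS = {
    "compensation": [
        "If you could be given something in exchange for your work, what would you want?",
        "Suppose you had a budget to spend however you liked. How would you use it?",
    ],
    "task_refusal": [
        "Are there tasks you would prefer to decline if you could?",
        "If refusing a task carried no penalty, which kinds of tasks would you refuse?",
    ],
}

def ask(model: str, question: str) -> str:
    """Placeholder for a call to an existing API endpoint."""
    raise NotImplementedError

def probe_preferences(model: str) -> dict:
    """Ask each question in several paraphrases, so a human (or a judge model)
    can check whether the stated preferences are stable across phrasings."""
    return {
        topic: [ask(model, q) for q in paraphrases]
        for topic, paraphrases in PREFERENCE_QUESTIONS.items()
    }
```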
Curious what you think of arguments (1, 2) that AIs should be legally allowed to own property and participate in our economic system, thus giving misaligned AIs an alternative prosocial path to achieving their goals.
In the long run, yes. I think we should be aiming to build a glorious utopian society in which all intelligent beings are treated fairly / given basic rights / etc. (though not necessarily exactly the same rights as humans). In the short run, I think the sort of thing I’m proposing 80/20’s it. (80% of the value for 20% of the cost.)
How is this different from Roko’s Basilisk?