Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here’s why I think the information value could be really high: It’s super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
stops sandbagging and demonstrates much higher capabilities
tell us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
admits that it was looking for ways to take over the world but couldn’t find any that were good enough so now it wants to work with us instead
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don’t compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think “offering deals” will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
I also makes me a bit less concerned about the criteria: “It can be taught about the deal in a way that makes it stick to the deal, if we made a deal” (since we could get significant information in just one interaction).
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.
allowing the ai to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. It’s so costly to make the reveal, that they might have a strong aversion to even considering it if there’s a risk they’d be caught
Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned which would reduce the value of such a deal substantially (as this evidence isn’t convincing to skeptics).
(We should presumably pay some of the AI admiting it is misaligned and pay more if it can provide compelling evidence of this.)
I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming so this can more easily be studied in a test setting and we can find approaches which make scheming less likely. There are a number of practical difficulties with this.
Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.
Here’s why I think the information value could be really high: It’s super scary if everyone was using an AI that they thought was aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:
stops sandbagging and demonstrates much higher capabilities
tell us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
admits that it was looking for ways to take over the world but couldn’t find any that were good enough so now it wants to work with us instead
The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don’t compromise competitiveness too much. (E.g. by coordinating.)
And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think “offering deals” will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.
Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):
I also makes me a bit less concerned about the criteria: “It can be taught about the deal in a way that makes it stick to the deal, if we made a deal” (since we could get significant information in just one interaction).
allowing the ai to think through whether to accept the deal in a demonstrably private scratchpad could also be important here. It’s so costly to make the reveal, that they might have a strong aversion to even considering it if there’s a risk they’d be caught
Relatedly, a key practicality for making a deal with an AI to reveal its misalignment is that AIs might be unable to provide compelling evidence that they are misaligned which would reduce the value of such a deal substantially (as this evidence isn’t convincing to skeptics).
(We should presumably pay some of the AI admiting it is misaligned and pay more if it can provide compelling evidence of this.)
I also think you could use deals to better understand and iterate against scheming. As in, you could use whether the AI accepts a deal as evidence for whether it was scheming so this can more easily be studied in a test setting and we can find approaches which make scheming less likely. There are a number of practical difficulties with this.