Either I don’t understand your view about continuous development and threshold things, or I think it’s probably wrong. I’ve decided to spin this out into its own comment, though I originally wrote it as part of this one.
I’m going to think out loud: One story I could tell for how we cross the gap from “the AI would not succeed at taking over” to “it would succeed if it tried” in a continuous manner is, as @Nina Panickssery points out in a tweet, that “The AI could be more or less likely to succeed”.
(My understanding is that we’re actually focusing on the AI’s beliefs about its likelihood of success, which I’ll call AI-p(success), not p(success) according to some observer—the AI’s beliefs are what shape its decisions.) So here’s an example model: maybe as the AI gets smarter or acquires more resources, its p(success) rises at 1 percentage point per day. We think the AI might do something like an expected value calculation, only trying to take over when AI-p(success) crosses some threshold, say 23%. (But, like, we don’t know that that’s true. And we don’t know what the threshold is.) If the AI is following some threshold rule like this, then it looks all good and chill until day 23 or whatever, when the threshold is hit and the AI fully commits to a takeover. Under that decision procedure, it looks like by default we get a rough discontinuity where we don’t get any more iteration after day 22.
Now imagine a really cautious AI company: maybe each day they try really hard to trick some instance of the AI into believing that its chance of success is higher than it actually is, and they’re able to push AI-p(success) up by 5 percentage points even while the actual likelihood of a successful takeover is not inflated by 5 points (maybe they also add extra security measures for this AI, and together the adversarial testing and the security create a 5-point spread). So then on day 18 the AI tries to take over and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this day-18 AI. What experiments are we going to run? I guess we’ll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.
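To make the toy arithmetic concrete, here’s a minimal sketch of the threshold model above. The 1-point-per-day rise, the 23% threshold, and the 5-point belief inflation are just the illustrative numbers from this comment, not claims about any real system:

```python
# Toy threshold model: the AI's actual chance of a successful takeover rises by
# 1 percentage point per day, and the AI attempts takeover once its *believed*
# chance (AI-p(success)) crosses a fixed threshold. Illustrative numbers only.

THRESHOLD_PP = 23   # AI attempts takeover once AI-p(success) >= 23 points
DAILY_GAIN_PP = 1   # actual p(success) rises 1 percentage point per day

def first_attempt_day(belief_inflation_pp: int) -> tuple[int, int]:
    """Return (day of first takeover attempt, actual p(success) in points that day)."""
    for day in range(1, 101):
        actual_pp = DAILY_GAIN_PP * day
        believed_pp = actual_pp + belief_inflation_pp  # what the AI thinks its odds are
        if believed_pp >= THRESHOLD_PP:
            return day, actual_pp
    raise RuntimeError("no attempt within 100 days")

print(first_attempt_day(0))  # (23, 23): attempt on day 23 at a ~23% actual chance
print(first_attempt_day(5))  # (18, 18): the cautious-company case, ~18% actual chance
```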
There are many issues with this plan:
We don’t actually know the decision process of the AI and don’t know if this model is at all right. Maybe instead it has the totally illogical decision process of trying to take over with effort proportional to AI-p(success), so on day 1 it tries very little, on day 10 it tries somewhat, and on day 50 it tries pretty hard; that seems like an insanely bad strategy (and humanity might actually win if the AIs always took strategies like this), but there are a bunch of other strategies the AI could be taking, some of which could be better or worse than the model here.
Even if we know the high-level decision process of the AI, we don’t know the threshold. (Aside: I assume the AI is well calibrated and a good forecaster, other than the modifications we make, i.e., its p(success) is accurate.) If the threshold is high, then we are basically screwed in this scenario. Like, if the AI follows the policy “only attempt takeover if AI-p(success) ≥ 99%”, then we’re taking on a huge amount of risk (i.e., dying) unless we can somehow artificially modulate AI-p(success) by a ton. Using the numbers above, the AI only tries to take over when its actual chance of success is already 94%, ruh roh.
It’s not obvious that we’ll be able to measure AI-p(success) with much confidence. Even if we knew the threshold, we probably can’t actually measure the thing. AI mind reading seems kinda hard, and while I definitely think it’s plausible that we’ll get this, I might be at like 50/50. Doing the above scheme without being able to measure AI-p(success) seems extremely difficult, because we wouldn’t know if our methods for modifying AI-p(success) are working.
I don’t expect us to know observer-p(success) with much confidence. It seems like most of the claims from AI companies about current AI capabilities are pretty soft, unconfident claims. For instance, OpenAI, Anthropic, and Google DeepMind have all said in the last few months that their frontier models might be capable of significant CBRN (bio especially) uplift for unskilled actors, but they aren’t sure. OpenAI: “We do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, our defined threshold for High capability, and the model remains on the cusp of being able to reach this capability.” Anthropic: “Although we have not determined whether the ASL-3 Standard is required for this model”. Google: “Further study is required to reach our final assessment of whether the model has reached the CCL for CBRN Uplift Level 1.” Now maybe this is just corporate hedging, but I think it’s way more likely, and supported by a preponderance of evidence, that assessing AI capabilities, especially for real-world tasks, is hard to do and we are currently bad at it. I basically expect that in a couple of years we’ll enter a regime of “the experts do not know if frontier AIs would be able to successfully escape their containment or successfully take over if they tried; it seems at least 1% likely”, and our p(success) estimates will probably go up as AI capabilities increase, but they won’t ever be well justified and there will be significant disagreement about the number among experts.
It’s unclear if we’ll be able to change AI-p(success) much. 5 percentage points felt reasonable, but idk. This is certainly a place where, if we don’t try very hard at this kind of red-teaming, we won’t succeed at this part.
Unfortunately, it looks like it might be dangerous to experiment with the day-18 AI, on account of it having an 18% chance of successful takeover. It’s unclear how correlated outcomes will be across different experiments (i.e., are you rolling a fresh 18% each time or not; there’s a sketch of the independent-rolls case after this list), especially considering how the humans can potentially patch various pathways that different instances take when attempting takeover. But gosh, looks rough.
It’s not clear what experiments one would run with the day-18 AI that would actually help cross the gap robustly. For instance, maybe we’ll just end up pushing the threshold up by a couple of days or changing the AI’s decision process; it seems like there are a lot of ways we could fail at this step.
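Here’s the repeated-roll sketch referenced above. It assumes, probably unrealistically, that each contained experiment with the day-18 AI is an independent 18% roll; correlation between attempts, and humans patching attack pathways between runs, would change the numbers, but it shows how quickly the risk compounds:

```python
# Cumulative risk from repeatedly experimenting with an AI that has an 18% chance
# of successful takeover per attempt, assuming (unrealistically) independent rolls.
P_TAKEOVER = 0.18

for k in (1, 3, 5, 10, 20):
    p_contained_every_time = (1 - P_TAKEOVER) ** k
    print(f"{k:2d} experiments: {1 - p_contained_every_time:.0%} chance of at least one successful takeover")
# Roughly: 18%, 45%, 63%, 86%, 98%.

# On the high-threshold worry above: with a 5-point spread, an AI that only attempts
# at AI-p(success) >= 99% moves when its actual chance is already 99% - 5% = 94%.
```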
I could be totally misunderstanding Nina’s idea; this is all very complicated.
It seems that in P[success] you are assuming that the AI only cares about escaping and taking over the world. I believe that if we fail at alignment and our systems have a propensity to pursue covert goals, then we will be able to observe and track it. Indeed, we have a paper (see also my Twitter thread) where we do observe some such propensity, and tackling scheming is a priority for us at OpenAI. If we fail, then I believe that unfortunately we will see bad consequences of deploying such models in the real world well before they are capable of taking over.
Also, it seems that you are assuming that the AI is a singular entity. If there are N copies of the AI out there (not to mention K different companies releasing AIs of similar capabilities), then even if all of them make a correct expected-utility calculation on whether to escape, some might be detected. (Unless the utility of being detected is negative infinity, but in that case they would never attempt exfiltration.)
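A quick sketch of the arithmetic behind the many-copies point, under the (strong) assumption that each copy’s escape attempt is detected independently with the same probability; the 5% per-copy detection chance is purely illustrative, and correlated strategies across copies would weaken the effect:

```python
# If N copies each attempt exfiltration, and each attempt is independently detected
# with probability p_detect_each, the chance that at least one attempt is caught
# grows quickly with N. Numbers are illustrative only.

def p_at_least_one_detected(n_copies: int, p_detect_each: float) -> float:
    return 1 - (1 - p_detect_each) ** n_copies

for n in (1, 10, 100):
    print(f"{n:3d} copies: {p_at_least_one_detected(n, 0.05):.0%} chance at least one is detected")
# Roughly: 5%, 40%, 99%. Even a small per-copy detection chance makes it likely
# that some copy gets caught.
```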