One large part of the AI 2027 piece is contingent on the inability of nation-state actors to steal model weights. The authors’ take is that while China will be able to steal trade secrets, it won’t be able to directly pilfer the model weights more than once or twice. I haven’t studied the topic as deeply as the authors, but this strikes me as naive, especially when you consider side-channel[1] methods of ‘distilling’ the models based on API output.
@Daniel Kokotajlo @ryan_greenblatt Did you guys consider these in writing the post? Is there some reason to believe these will be ineffective, or that they would not provide the access that raw weights lifted off the GPUs would give?
I would normally consider “side-channel attacks work” to be the naive position, but in the AI 2027 post, the thesis is that there exists a determined, well-resourced attacker (China) that already has insiders who can inform it on relevant details of OpenAI infrastructure, and that already understands how the models were developed in the first place.
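To make the side-channel concern concrete, here is a minimal sketch of what distillation through an API could look like. Everything here is hypothetical and illustrative: query_teacher_api, the student model, and the per-token logprob interface are placeholders I made up, not anything from the post.

```python
# Hypothetical sketch: copying a model's behavior through its API alone
# by distilling a student model on the teacher's outputs.
# `query_teacher_api` and `student` are placeholders, not real APIs.
import torch
import torch.nn.functional as F

def distill_step(student, optimizer, prompts, query_teacher_api):
    # 1. Collect the teacher's behavior through its public interface only,
    #    e.g. per-token log-probabilities returned alongside completions.
    teacher_logprobs = query_teacher_api(prompts)          # shape: [batch, vocab]
    # 2. Push the student's output distribution toward the teacher's.
    student_logprobs = F.log_softmax(student(prompts), dim=-1)
    loss = F.kl_div(student_logprobs, teacher_logprobs.exp(),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point is that none of this touches the weights file: the attacker only needs query access plus enough compute to train the student, so controls on the weights themselves don’t block it.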
I agree that it’s very plausible that China would steal the weights of Agent-3 or Agent-4 after stealing Agent-2. This was a toss-up when writing the story; we ultimately went with just stealing Agent-2 for a combination of reasons. From memory, the most compelling were something like:
1. OpenBrain + the national security state can pretty quickly and cheaply increase both the difficulty and, importantly, the lead time required for another weights theft.
2. Through 2027 China is only 2-3 months behind, so the upside from another weights theft (especially when you consider the lead time needed) is not worth the significantly increased cost from (1).
Something that is a bit more under-explored is potential sabotage between the projects. We made the uncertain assumption that the efforts on both sides would roughly cancel out, but we were quite uncertain about the offense-defense balance. A more offense-favored reality could change the story quite a bit, basically via the idea of MAIM slowing both sides down a bunch.
(I didn’t write AI 2027, but I did comment on it prior to release.)
The security forecast indicates that OpenBrain gets SL5 (for weights and secrets) by Dec 2027. This relies on using a bunch of AI labor to accelerate the security timescale. This forecast seems vaguely plausible to me?
I think the security forecast ignores distilling down RL (the possibility of distilling down base models doesn’t make a huge difference from a security perspective IMO), and I think this would make a moderate difference to the bottom line. It doesn’t make a huge difference because secret security is probably strictly harder than being robust to having your RL distilled out, and that happens by Dec 2027. It’s also unclear what fraction of the value distilling out RL will give you in the future. AI 2027 seemingly assumes massive pretrained models, and I’m pretty uncertain about this; I could see it going either way between pretraining (which involves big models) being most of the value vs. RL being most of the value (in which case distilling out the RL into a different model maybe yields huge returns).
The key reason that Agent-3 and Agent-4 don’t get stolen (despite being created before Dec) is that OpenBrain uses upload limits, and it’s hard for China to run a huge operation without this being detected:
Since Agent-3 is such a big file (on the order of 10 terabytes at full precision), OpenBrain is able to execute a relatively quick fix to make theft attempts much more difficult than what China was able to do to steal Agent-2—namely, closing a bunch of high bandwidth internet connections out of their datacenters. Overall this has a relatively low penalty to progress and puts them at “3-month SL4” for their frontier weights, or WSL4 as defined in our security supplement, meaning that another similar theft attempt would now require over 3 months to finish exfiltrating the weights file. Through this method alone they still don’t have guarantees under a more invasive OC5-level effort ($1B budget, 1,000 dedicated experts), which China would be capable of with a more intensive operation, but with elaborate inspections of the datacenters and their espionage network on high-alert, the US intelligence agencies are confident that they would at least know in advance if China was gearing up for this kind of theft attempt.
(From footnote 62 in May 2027.)
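For intuition on how a bandwidth cap turns into a months-long exfiltration floor, here is a back-of-the-envelope sketch. The ~10 TB figure comes from the quote above; the ~10 Mbps aggregate cap is my own illustrative assumption, not a number from the post.

```python
# Back-of-the-envelope: exfiltration time for a weights file behind an
# outbound bandwidth cap. Numbers are illustrative, not from the post.

def min_exfiltration_days(weights_tb: float, cap_mbps: float) -> float:
    """Days to move `weights_tb` terabytes through a link capped at
    `cap_mbps` megabits/sec, assuming the attacker saturates the cap."""
    bits = weights_tb * 1e12 * 8        # terabytes -> bits
    seconds = bits / (cap_mbps * 1e6)   # bits / (bits per second)
    return seconds / 86_400             # seconds -> days

# A 10 TB model behind a ~10 Mbps aggregate cap needs ~3 months to copy out:
print(min_exfiltration_days(10, 10))    # ≈ 92.6 days
```

So “closing a bunch of high bandwidth internet connections” is doing real work here: at roughly 10 Mbps of aggregate leakable bandwidth, even a well-positioned attacker needs about a quarter of a year to move the file.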
My current guess is that the RL/post-training would have been stolen again by China (for either Agent-3 or Agent-4), and the full weights probably also would have been stolen again. This is because I’m now less optimistic about upload limits, due to a trend toward smaller models trained with more RL and some updates toward higher levels of compression being possible (with e.g. distillation). But it seems plausible, though less likely, that the model would only be stolen once.
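To see why smaller models undercut upload limits, plug a distilled model into the earlier sketch: the same hypothetical ~10 Mbps cap that imposed a ~3-month floor on a 10 TB file barely slows down a 1 TB one.

```python
# Same hypothetical cap, but a 1 TB distilled model:
# the ~3-month exfiltration floor collapses to under two weeks.
print(min_exfiltration_days(1, 10))   # ≈ 9.3 days
```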
directly pilfer the model weights more than once or twice
Worth noting that there are only 3 main models prior to Dec 2027, which is the point when OpenBrain actually has quite good security. So twice would be 2 of the 3 models stolen. Fair enough if you think very good security by Dec 2027 is implausible; I don’t currently have a strong view.
A key aspect of the overall scenario is that time is very limited due to pretty fast takeoff, which means that month-long operations can only happen a few times prior to very powerful AI systems.
Tagging @romeo who did our security forecast.