(I didn’t write AI 2027, but I did comment on it prior to release.)
The security forecast indicates that OpenBrain gets SL5 (for weights and secrets) by Dec 2027, using a bunch of AI labor to accelerate the security timescale. This forecast seems vaguely plausible to me?
I think the security forecast ignores distilling down RL (the possibility of distilling down base models doesn’t make a huge difference from a security perspective IMO), and I think this would make a moderate difference to the bottom line. It doesn’t make a huge difference because secret security is probably strictly harder than being robust to having your RL distilled out, and secret security is reached by Dec 2027. It’s also unclear what fraction of the value distilling out RL will give you in the future. AI 2027 seemingly assumes massive pretrained models, and I’m pretty uncertain about this: I could see it going either way between something like pretraining (which involves big models) providing most of the value vs. RL providing most of the value (in which case distilling out the RL into a different model might yield huge returns).
The key reason that Agent-3 and Agent-4 don’t get stolen (despite being created before Dec 2027) is that OpenBrain uses upload limits and it’s hard for China to run a huge exfiltration operation without this being detected:
Since Agent-3 is such a big file (on the order of 10 terabytes at full precision), OpenBrain is able to execute a relatively quick fix to make theft attempts much more difficult than what China was able to do to steal Agent-2—namely, closing a bunch of high bandwidth internet connections out of their datacenters. Overall this has a relatively low penalty to progress and puts them at “3-month SL4” for their frontier weights, or WSL4 as defined in our security supplement, meaning that another similar theft attempt would now require over 3 months to finish exfiltrating the weights file. Through this method alone they still don’t have guarantees under a more invasive OC5-level effort ($1B budget, 1,000 dedicated experts), which China would be capable of with a more intensive operation, but with elaborate inspections of the datacenters and their espionage network on high-alert, the US intelligence agencies are confident that they would at least know in advance if China was gearing up for this kind of theft attempt.
(From footnote 62 in May 2027.)
My current guess is that the RL/post-training would have been stolen again by China (for either Agent-3 or Agent-4), and the full weights probably also would have been stolen again. This is because I’m now less optimistic about upload limits, due to a trend toward smaller models trained with more RL and some updates toward higher levels of compression being possible (e.g., with distillation). But it seems plausible, though less likely, that the model would only be stolen once.
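As a rough sanity check on the upload-limit argument: the only figures below taken from the scenario are the ~10 TB weights file and the "3-month" exfiltration time; the implied ~10 Mbps covert rate and the 1 TB distilled-model size are my own illustrative assumptions.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6e6 seconds

def exfil_months(model_tb: float, covert_mbps: float) -> float:
    """Months needed to exfiltrate model_tb terabytes at covert_mbps megabits/sec."""
    bits = model_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (covert_mbps * 1e6)  # megabits/sec -> bits/sec
    return seconds / SECONDS_PER_MONTH

# The "3-month SL4" figure for a ~10 TB model is consistent with an
# undetected covert exfiltration rate of roughly 10 Mbps:
print(round(exfil_months(10, 10), 1))  # → 3.1

# A hypothetical 1 TB distilled model at the same covert rate takes 10x
# less time, illustrating why smaller models weaken upload limits:
print(round(exfil_months(1, 10), 1))   # → 0.3
```

The same scaling is why stealing only the RL/post-training delta (rather than full weights) could be much faster still, if that delta compresses well.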
directly pilfer the model weights more than once or twice
Worth noting that there are only 3 main models prior to Dec 2027, which is the point when OpenBrain actually has quite good security. So twice would be 2 of the 3 models stolen. Fair enough if you think very good security by Dec 2027 is implausible; I don’t currently have a strong view.
A key aspect of the overall scenario is that time is very limited due to a pretty fast takeoff, which means that month-long operations can only happen a few times prior to very powerful AI systems.