Semi-related: if I’m reading OpenAI’s recent post “How we think about safety and alignment” correctly, they seem to be announcing that they plan to implement some kind of AI Control agenda. Under the heading “iterative development” in the section “Our Core Principles,” they say:
In the future, we may see scenarios where the model risks become unacceptable even relative to benefits. We’ll work hard to figure out how to mitigate those risks so that the benefits of the model can be realized. Along the way, we’ll likely test them in secure, controlled settings. We may deploy into constrained environments, limit to trusted users, or release tools, systems, or technologies developed by the AI rather than the AI itself.
Given the surrounding context in the original post, I think most people would read those sentences as saying something like: “In the future, we might develop AI with a lot of misuse risk, i.e. AI that can generate compelling propaganda or carry out cyberattacks. So we reserve the right to restrict how we deploy our models (e.g. giving the biology tool only to cancer researchers, not to everyone on the internet).”
But as written, I think OpenAI intends the sentences above to ALSO cover AI control scenarios like: “In the future, we might develop misaligned AIs that are actively scheming against us. If that happens, we reserve the right to continue using those models internally, even though we know they’re misaligned, while applying AI control techniques (‘deploy into constrained environments, limit to trusted users’, etc.) to try to get useful superalignment work out of them anyway.”
I don’t have a take on the pros/cons of a control agenda, but I haven’t seen anyone else note this apparent policy statement of OpenAI’s, so I figured I’d write it up.