Jackson Wagner comments on On OpenAI’s Safety and Alignment Philosophy

Jackson Wagner 11 Mar 2025 21:16 UTC
Maybe this is obvious / already common knowledge, but I noticed that OpenAI’s post seems to be embracing an AI control agenda for their future superalignment plans. Under the heading “iterative development” in the section “Our Core Principles”, the post says the following (emphasis mine):
It’s an advantage for safety that AI models have been growing in usefulness steadily over the years, making it possible for the world to experience incrementally better capabilities. This allows the world to engage with the technology as it evolves, helping society better understand and adapt to these systems while providing valuable feedback on their risks and benefits. Such iterative deployment helps us understand threats from real world use and guides the research for the next generation of safety measures, systems, and practices.
In the future, we may see scenarios where the model risks become unacceptable even relative to benefits. We’ll work hard to figure out how to mitigate those risks so that the benefits of the model can be realized. Along the way, we’ll likely test them in secure, controlled settings. We may deploy into constrained environments, limit to trusted users, or release tools, systems, or technologies developed by the AI rather than the AI itself. These approaches will require ongoing innovation to balance the need for empirical understanding with the imperative to manage risk. For example, making increasingly capable models widely available by sharing their weights should include considering a reasonable range of ways a malicious party could feasibly modify the model, including by finetuning (see our 2024 statement on open model weights). We continue to develop the Preparedness Framework to help us navigate and react to increasing risks.
I think most people would read those bolded sentences as saying something like: “In the future, we might develop AI with a lot of misuse risk, e.g. AI that can generate compelling propaganda or create cyberattacks. So we reserve the right to restrict how we deploy our models (e.g. giving the biology tool only to cancer researchers, not to everyone on the internet).”
But as written, I think OpenAI intends the sentences above to ALSO cover scenarios like: “In the future, we might develop misaligned AIs that are actively scheming against us. If that happens, we reserve the right to continue to use those models internally, even though we know they’re misaligned, while using AI control techniques (‘deploy into constrained environments, limit to trusted users’, etc.) to try to get useful superalignment work out of them anyway.”

I don’t have a take on the broader implications of this statement (trying to get useful work out of scheming AIs seems risky but also plausibly doable, and other approaches have their own risks, so idk). But I haven’t seen anyone else note this apparent policy statement from OpenAI, so I figured I’d write it up.