I am a member of the technical staff at OpenAI, working in the alignment team, as well as a Catalyst professor of Computer Science at Harvard. See https://windowsontheory.org/ for my blog, and https://x.com/boazbaraktcs for my twitter profile.
Boaz Barak
1980s BBS style
look like the NY Post
I think this is a decent way to describe it. When models are just “misaligned,” we can potentially increase reliability by having one model monitor or check the work of another, since what the model does has some relation to the prompt it is given. When a model is adversarially misaligned, the monitor could completely ignore its prompt and collude with the model it’s supposed to be monitoring.
Re the scheming graph: curious what you think about the other scenarios I posted. I am not sure I completely agree with the decomposition. To me, scheming means the AI having its own prompt-independent goals that it pursues covertly, rather than pursuing goals that are not completely aligned but still related to the prompt it was given.
absolutely agree that higher capabilities means higher stakes and hence increased requirements on alignment!
Thanks to everyone who commented, I tried to make a few more fake graphs to capture the disagreements:
I don’t want to make things very formal, but let’s imagine “alignment” as the models generally following the intent of their generalized prompt (including things like the constitution, policy, instructions from the system/developer, etc.). If the models follow a proxy that is related to the prompt but not really what we wanted (e.g., models express a confident answer when they are not really confident, or find a loophole instead of solving the actual task), then we think of them as somewhat misaligned. If a model is covertly following its own long-term goals (e.g., a monitor model colluding with another model to sabotage safety training), then it is completely misaligned.
The dot at the bottom-right corner is what happens if models satisfy the combination of (1) being extremely capable and (2) being extremely misaligned. It is not the only way catastrophic things can happen. In particular, instruction-following models can be used by bad actors. Also, even “small” misalignment can matter a lot when models are deployed in high-stakes situations.
In my fake graphs I basically posited that we are in graph A, though it could be the case that we will be in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect—monitoring in deployment can miss stuff and eval-awareness is a concern—they are not completely useless either. So there are some error bars which leave a gap between observed alignment and actual alignment, but it is not an unbounded gap.
I think some people’s objection is that we are in scenario C: when models become more capable, they will start developing long-term goals and stop trying to follow the prompts, but will be smart enough that we don’t notice it. Other people’s concern is that we are in scenario D: models already have long-term misaligned goals that they follow, but they are just good at hiding it, and we will only discover it when it’s too late.
I am sure these four scenarios don’t capture all the options, but I hope they can focus the disagreements.
I deliberately used quotes since I didn’t want a single precise definition. But the general notion is covert pursuit of long-term misaligned goals, as opposed to misalignment failures like hallucinations, reward hacking, etc., which can be better understood as pursuing misspecified short-term goals. (Those are bad, but they can be understood as doing something related to the prompt that was given, just not what we intended, rather than pursuing something completely unrelated to the prompt.)
Scheming is the kind of thing that, for example, makes it useless to have one model monitor another, since they would both be colluding in pursuit of their long-term goals.
Just came across this tweet https://x.com/mattyglesias/status/2038760845845442800?s=20 “AI is bad at writing in roughly the sense that Brian Scalabrine was a bad basketball player”. I think this is overstating it for writing, but it is definitely true for programming, and likely for many areas of math.
AI might not be “superhuman,” but it dominates the typical human data labeler on many tasks, and in that sense we have passed the “RLHF plateau.” This is also demonstrated by how often papers these days use LLMs as judges.
Do you believe current models have large amounts of undetected misalignment? I believe the trend that more capable models are also more aligned (see Leike’s blog I linked) holds not just in evaluations but also in people’s observations from real-world usage.
That is OK. I think it is a valid question to ask how we will align armies of agents, and as I point out in the post, I actually agree it’s not solved. I believe that there is some level of alignment we will get by having each one of the 100K researchers be aligned, having aligned monitors, etc., but I also believe in “more is different” and that we will need to invest in explicit alignment methods for the multi-agent setting that are not focused on just model behavior on one prompt in isolation.
The state of AI safety in four fake graphs
added!
For people who want to follow what I write, please consider following my blog https://windowsontheory.org/ - I plan to still often cross post here but not always (and it’s not always promoted to front page). You can also follow me on X (boazbaraktcs).
I wouldn’t say it’s the same and completely familiar. It will require different means than bio and cyber (indeed, there are also important differences between bio and cyber, one of which is precisely that it is harder to tell valid and malicious coding queries apart). I was just saying we can use the same general process and framework of evaluations, mitigations, etc. In this sense I am also happy that we are not dealing with the intelligence agencies for now, since the workflows there might be harder to tell apart.
I definitely agree that the law should catch up with AI! I hope we can set up best practices and that those can be encoded in regulations and laws.
I would hope that this would be a nonpartisan opinion: even if people like the current president, governments can change, and any tool you give them could later be used by a government you don’t like.
Hi Tom,
I think you are right that the language of the contract will matter if it comes to court. I think it is highly unlikely that it will end up in court, and if the government did try to do mass surveillance and this ended up in court, it would likely be a good way to expose it.
Issues such as jailbreaks, ZDR, etc. are real but not new to us. We have to deal with these in other catastrophic-risk settings as well, such as bio and cyber, which is why I am advocating for treating this in the same manner. Note that the “mass” nature of mass surveillance requires not just one jailbreak but deploying jailbreaks at a large scale without being detected. But I agree that, like any safety stack, we need to measure and understand the risk.
I am not updating here beyond our blog post, I think LAW is not really a “live issue” for a number of reasons, including the fact that DoW is not in charge of developing weapons but procuring them.
I also think that LAW will ultimately be a question of capabilities, and so I view it less as a case where the government has an inherent incentive to deploy something that is unreliable.
On the other hand, any government can have an incentive to spy and control people to stay in power, which is why we need all these laws restricting the power of government.
To be clear: right now my lab is not helping the government wage the current war in Iran; the OpenAI deployment will be in the future. And I would not say I am “OK” with it. But I would say that the elected government deciding to take an action I disagree with, including waging war, is a whole different matter from the government trying to use my system to undermine the democratic process and stay in power indefinitely.
I was deliberately vague, but I would say that some version of a log Y-axis might make sense, since as capabilities scale up by a factor of 10, you want to reduce the probability of error by some factor C.
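As a toy numerical sketch of that scaling rule (the base error rate, base capability, and factor C below are my own placeholder assumptions, not values from the post):

```python
import math

# Toy model: each 10x increase in capability tightens the acceptable
# error probability by a factor of C (placeholder values, illustration only).
def max_error_prob(capability, base_capability=1.0, base_error=1e-2, C=10.0):
    """Acceptable error probability at a given capability level."""
    decades = math.log10(capability / base_capability)
    return base_error / (C ** decades)

# At 10x the base capability, the tolerated error rate drops from 1e-2 to 1e-3.
```

Plotted with a log Y-axis, this acceptable-error curve becomes a straight line against log-capability, which is the point of the suggestion.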
So yes, the reason I said “a lot” in “a lot of bad shit can happen” is that this gap could indeed bite us. It could potentially be ameliorated by limiting how AI can be deployed in high-stakes settings or by increasing investment in safety; this is basically what happened in other industries, such as aviation. But this is also where the fourth graph, on societal readiness, comes in.