I am a member of the technical staff at OpenAI, working on the alignment team, as well as a Catalyst professor of Computer Science at Harvard. See https://windowsontheory.org/ for my blog, and https://x.com/boazbaraktcs for my Twitter profile.
Boaz Barak
If you look at my post (LessWrong copy here), you will see that I discuss this balance.
Yes, as I wrote in my post, aligned ASIs would need to spend some fraction of their resources improving the defender's side of the offense/defense balance. My main point was that the balance is not infinitely tilted toward offense, so if aligned resources vastly outnumber misaligned resources, that should be enough.
My rough heuristic is that intelligence scales with compute, so the crucial condition is that the vast majority of FLOPs are deployed for safe intelligence. It seems that many of the arguments in the post are about how some unsafe or misaligned ASI might leak out in one way or another, but this doesn't mean that ASI will have lots of compute at its disposal.
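To make the "finite balance" point concrete, here is a one-line toy model. The function name, the threshold form, and the 100x offense multiplier are all my own illustrative assumptions, not anything from the post:

```python
def defense_holds(aligned_flops, misaligned_flops, offense_advantage=100.0):
    """Toy model: the attacker gets a finite 'offense_advantage' multiplier
    on its compute.  Defense holds when aligned compute still exceeds the
    attacker's effective compute.  All numbers are illustrative."""
    return aligned_flops > offense_advantage * misaligned_flops

# A 1000:1 aligned-compute ratio beats a 100x offense advantage...
print(defense_holds(1000.0, 1.0))   # True
# ...but a 10:1 ratio does not.
print(defense_holds(10.0, 1.0))     # False
```

The point of the sketch is just that as long as the offense multiplier is some finite constant, a sufficiently lopsided compute ratio in favor of aligned systems wins.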
Maybe footnote 1 means that this post is not for me, but I believe that the world can survive the existence of misaligned/unsafe ASI as long as it is dominated (in terms of compute/intelligence) by aligned and safe ASI. See item 6 here: https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/
I was deliberately vague, but I would say that some version of a log Y axis might make sense: as capabilities scale up by a factor of 10, you want to reduce the probability of error by some factor C.
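The heuristic can be written out as a toy formula (the constants here, including the base error rate and C = 10, are made up for illustration):

```python
import math

def required_error_prob(capability, base_capability=1.0,
                        base_error=1e-2, C=10.0):
    """Toy version of the heuristic: every 10x scale-up in capability
    should cut the acceptable error probability by a factor C.
    All constants are illustrative, not from the comment above."""
    decades = math.log10(capability / base_capability)
    return base_error / (C ** decades)

# With C = 10, a 1000x more capable model gets a 1000x smaller error budget,
# i.e. the requirement is a straight line on a log-log plot.
for cap in (1, 10, 100, 1000):
    print(cap, required_error_prob(cap))
```

With C = 10 this says the error budget shrinks in lockstep with capability, which is why a log Y axis makes the requirement look linear.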
So, yes, the reason I said "a lot" in "a lot of bad shit can happen" is that this gap could indeed bite us. It could potentially be ameliorated by limiting how AI can be deployed in high-stakes settings or by increasing investment in safety; this is basically what happened in other industries like aviation. But this is also where the fourth graph, on societal readiness, comes in.
1980s BBS style
look like my post
I think this is a decent way to describe it. When models are just "misaligned," we can potentially increase reliability by having one model monitor or check the work of another, since what a model does has some relation to the prompt it is given. When a model is adversarially misaligned, the monitor could completely ignore its prompt and collude with the model it's supposed to be monitoring.
Re the scheming graph: curious what you think about the other scenarios I posted. I am not sure I completely agree with the decomposition. To me, scheming means the AI having its own prompt-independent goals that it pursues covertly, rather than pursuing goals that are not completely aligned but are still related to the prompt it was given.
absolutely agree that higher capabilities means higher stakes and hence increased requirements on alignment!
Thanks to everyone who commented, I tried to make a few more fake graphs to capture the disagreements:
I don't want to make things very formal, but let's imagine "alignment" as the models generally following the intent of their generalized prompt (including things like the constitution, policy, instructions from the system/developer, etc.). If the models follow a proxy that is related to the prompt but not really what we wanted (e.g., models expressing a confident answer when they are not really confident, or finding a loophole instead of solving the actual task), then we think of them as somewhat misaligned. If a model is covertly following its own long-term goals (e.g., a monitor model colluding with another model to sabotage safety training), then it is completely misaligned.
The dot at the bottom-right corner is what happens if models satisfy the combination of (1) being extremely capable and (2) extremely misaligned. It is not the only way catastrophic things can happen. In particular, instruction-following models can be used by bad actors. Also, even "small" misalignment can matter a lot when models are deployed in high-stakes situations.
In my fake graphs I basically posited that we are in graph A, though it could be the case that we are in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect (monitoring in deployment can miss things, and eval-awareness is a concern), they are not completely useless either. So there are some error bars, which lead to a gap between observed alignment and actual alignment, but it is not an unbounded gap.
I think some people's objection is that we are in scenario C: as the models become more capable, they will start developing long-term goals and stop trying to follow the prompts, but will be smart enough that we don't notice it. Other people's concern is that we are in scenario D: models already have long-term misaligned goals that they follow, but they are just good at hiding it, and we will only discover it after it's too late.
I am sure these 4 scenarios don't capture all options, but I hope they can focus the disagreements.
I deliberately used quotes since I didn't want a single precise definition. But the general notion is covert pursuit of long-term misaligned goals, as opposed to misalignment failures like hallucinations, reward hacking, etc., which can be better understood as pursuing misspecified short-term goals. (Those are bad, but they can be understood as doing something related to the prompt that was given, just not what we intended, rather than pursuing something completely unrelated to the prompt.)
Scheming is the kind of thing that, for example, makes it useless to have one model monitor another, since they would both be colluding in pursuit of their long-term goals.
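A toy probability calculation makes the contrast vivid (the error rates below are made up, and the honest case assumes the monitor's misses are independent of the monitored model's failures):

```python
def undetected_failure_prob(p_bad_output, p_monitor_miss, colluding=False):
    """Chance a bad output slips past a monitor.  With an honest monitor
    whose errors are independent, the error rates multiply; with a
    colluding (scheming) monitor, monitoring adds nothing.
    All rates here are illustrative, not measured."""
    if colluding:
        return p_bad_output          # monitor catches nothing
    return p_bad_output * p_monitor_miss

# Honest monitor: 1% bad outputs x 5% monitor misses -> ~0.05% slip through.
print(undetected_failure_prob(0.01, 0.05))
# Colluding (scheming) monitor: back to the raw 1% rate.
print(undetected_failure_prob(0.01, 0.05, colluding=True))
```

The 20x reliability gain in the honest case is exactly what collusion erases, which is why scheming is qualitatively worse than ordinary misalignment.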
Just came across this tweet https://x.com/mattyglesias/status/2038760845845442800?s=20 — "AI is bad at writing in roughly the sense that Brian Scalabrine was a bad basketball player." I think this is overstating it for writing, but it is definitely true for programming, and likely for many areas of math.
AI might not be "superhuman," but it dominates the typical human data labeler on many tasks, and in that sense we have passed the "RLHF plateau." This is also demonstrated by how often papers these days use LLMs as judges.
Do you believe current models have large amounts of undetected misalignment? I believe the trend that more capable models are also more aligned (see Leike's blog post I linked) shows up not just in evaluations but also in people's observations from real-world usage.
That is OK. I think it is a valid question to ask how we will align armies of agents, and as I point out in the post, I actually agree it's not solved. I believe there is some level of alignment we will get by having each one of the 100K researchers be aligned, having aligned monitors, etc., but I also believe in "more is different" and that we will need to invest in explicit alignment methods for the multi-agent setting that are not focused just on model behavior on one prompt in isolation.
The state of AI safety in four fake graphs
added!
For people who want to follow what I write, please consider following my blog https://windowsontheory.org/ - I plan to still often cross-post here, but not always (and it's not always promoted to the front page). You can also follow me on X (boazbaraktcs).
I wouldn't say it's the same and completely familiar. It will require different means than bio and cyber (indeed, there are also important differences between bio and cyber, one of which is precisely that it is harder to tell apart valid and malicious coding queries). I was just saying we can use the same general process and framework of evaluations, mitigations, etc. In this sense I am also happy that we are not dealing with the intelligence agencies for now, since the workflows there might be harder to tell apart.
Compute is always a finite resource, so you can't just train anything at any scale. Also, I don't think your description is accurate, at least for the one frontier lab I am familiar with... We are aware of the risks of internal deployment and are monitoring for issues. See, for example, this blog post.