I am a member of the technical staff at OpenAI, working in the alignment team, as well as a Catalyst professor of Computer Science at Harvard.
Boaz Barak
I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior—for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.
I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
The main reason we didn’t give the judge the COTs are:
It didn’t seem to help much
We preferred not to rely on COTs since one of the motivations for confessions are to be an alternative to cot monitoring if in the future cots become less legible (eg soft tokens etc )
If AIs are 100% good at predicting the statements they can get away with respect to the judge, then may well learn the strategy of making such statements. But if they are not 100% at it, the safer approach is to just come clean in the confession when you did something wrong. Since you can give the evidence that you did something wrong, it’s highly unlikely the judge won’t believe you if you confess. And since st least sometimes you might not be able to do it (especially if the judge might get more test time compute than the policy), the path of least resistance could well be that the model learns to just be honest.
It’s hard to say since I am in the habit of reading COTs all the time, so I don’t have the counterfactual of what I could have learned without them..
Why we are excited about confession!
Students in CS 2881r: AI safety did a number of fantastic projects, see https://boazbk.github.io/mltheoryseminar/student_projects.html
Also I finally updated the youtube channel with the final lecture as well as the oral presentations for projects, see
In what way this is a call to quietism?
I do not mean that you should be complacent! And as I said, this does not mean you should let governments, and companies, including my own, off the hook!
Thank you. I am not an economist, but I think that it is unlikely for the entire economy to operate on the model of an AI lab whereby every year you keep just pumping all gains back into AI.
Both investors and the general public have a limited patience, and they will want to see some benefits. While our democracy is not perfect, public opinion has much more impact today than the opinions of factory workers in England in the 1700′s, and so I do hope that we won’t see the pattern where things became worse before they were better. But I agree that it is not sure thing by any means.
However, if AI does indeed still grow in capability and economic growth is significantly above the 2% per capita it has been stuck on for the last ~120 years it would be a very big deal and would open up new options for increasing the social safety net. Many of the dillemas—e.g. how do we reduce the deficit without slashing benefits, etc. - will just disappear with that level of growth. So at least economically, it would be possible for the U.S. to have Scandinavian levels of social services. (Whether the U.S. political system will deliver that is another matter, but at least from the last few years it seems that even the Republican party is not shy about big spending.)
This actually goes to the bottom line, is that I think how AI ends up playing out will end up depending not so much on the economic but on the political factors, which is part of what I wrote about in “Machines of Faithful Obedience”. If AI enables authoritarian government then we could have a scenario with very few winners and a vast majority of losers. But if we keep (and hopefully strenghten) our democracy then I am much more optimistic about how the benefits from AI will be spread.
I don’t think there is something fundamental about AI that makes it obvious in which way it will shift the balance of power between governments and individuals. Sometimes the same technology could have either impact. For example the printing press had the impact of reducing state power in Europe and increasing it in China. So I think it’s still up in the air how it will play out. Actually this is one of the reasons I am happy that so far AI’s development has been in the private sector, and aimed at making money and marketing to consumers, than in developed in government, and focused on military applications, as it well could have been in another timeline.
Yes, in my addendum I said that a more accurate title would have been “I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption.”
Comparing nuclear risks to AI is a bit unfair—the reason we can give such details calculations of kinetic force etc.. is because nuclear warheads are real, actually deployed, and can be launched at a moment’s notice. With ASI you cannot do calculations of exactly how many people it would kill precisely because it does not exist.
I am not advocating that policy makers should have taken an “either it doesn’t happen or we all die” mentality for nuclear policy. (While this is not my field, I did do some work in the nuclear disarmament space.)
But I would say that this was (and is) the mindset for the typical person living in an American urban center. (If you live in such an area, you can go to nukemap and see what would be the impact of one or more ~500kt warheads- of the type carried by Russian R-36 missiles—in your vicinity.)
People have been living their lives under the threat that it is possible that they and everyone they know could be extinguished in a moment’s notice. I think the ordinary U.S. and Russian citizen probably should have done more and care more to promote nuclear disarmament. But I don’t think they (we) should live in constant state of fear either.
I think we are in agreement! It is definitely easier for me, given that I believe things are likely to be OK, but I still assign non-trivial likelihood to the possiblity they will not. But regardless of what you believe is more likely, I agree you should both (a) do what is feasible for you to have positive impact in the domains you can influence, and (b) keep being productive, happy, and sane without obsessing on factors you do not control.
Thank you for engaging. I now understand better your point.
The set of things people are worried about AI is very large, and I agree I addressed only part, and maybe not the most important part of what people are worried about. I also agree that “experts” disagree with each other, so you can’t just trust the expert. I can offer my thoughts of how to think of AI, and maybe they will make sense to some people, but they should make their own judgement and not take things on faith.
If I understand correctly, you want for the sake of discussions, to consider the world where AGI takes 20+ years to achieve. People have different definitions of AGI, but it seems safe to say this world would be one where progress significantly undershoots the expectations of many people in the AI space and AI companies. There is a sense of positive feedback loop—I imagine that if AI undershoots expectations then funding will also be squeezed and so this could lead to even more slowdown—and so in such a world it’s possible that over the next 20 years AI’s impact, for both good and bad, will just not be extremely significant.
If we talk about “prosaic harms” we should also talk about “prosaic benefits”. If we take the view of AI as a “normal technology” then our past experience with technologies was that overall the benefits are larger than the harms. Over the long run, we have seen a pretty smooth and consistent increase in life expectency and other metrics of wellbeing. So if AI does not radically reshape society, the baseline expectation should be that it has overall a positive impact. AI may well have positive impact even if it does radically shape humanity (I happen to believe it will) but we have less prior data to base on in that case.
Most exposition of existential risk I have seen count nuclear war as an example of a risk. Bostrom (2001) certainly considers nuclear war as an existential risk. What I meant by it either happens or doesn’t is that since 1945 there has been no nuclear weapon in war, so the average person “did not feel it” and given the U.S. and Russian posture, it is quite possible that a usage by one of them against the other will lead to a total nuclear war.
Also while it was possible to take precautions, like a fallout shelter, the plan to build fallout shelters for most U.S. citizens fizzled and was defunded in the 1970s. So I think it is fair to say that most Americans and Russians did not spend most of their time thinking or actively preparing for nuclear holocaust.
I am not necessarily saying it was the right thing: maybe the fallout shelters should not have been defunded, and should have been built, and people should have advocated for that. But I think it would still have been wise for them to try to live their daily lives without been gripped by fear.
LOL … I have to say that “crystal wishing division” sounds way cooler than “alignment team” :)
However, I think the analogy is wrong on several levels. This is not about “lobbying to go further into the nebula”. If anything, people working in alignment are about steering the ship or controlling the crystal minds to ensure we are safe in the nebula.
To get back to AI, as I wrote, this note is not about dissuading people from holding governments and companies accountable. I am not trying to convince you to not advocate for AI regulations, for AI pauses, or trying to upsell you a chatgpt subscription. You can and should exercise your rights to advocate for the positions you believe in.
Like the case of climate change, people can have different opinions on what society should do and how it should trade off risks vs. progress. I am not trying to change your mind on the tradeoffs for AI. I am merely offering some advice, which you can take or leave as you see fit, for how to think about this in your everyday life.
I do not blame young people or claim that they are “unwise” or “over reacting”. I care a lot about what the future will look like for young people, also because I have two kids myself (ages 13 and 19).
I am really not sure what does it mean to “place sufficiently diverse bets to hedge their risk in all of the possible worlds”. If that means to build a bunker or something, then I am definitely not doing that.
I do not see AI as likely to create a permanent underclass, nor make it so that it would not make sense to date or raise a family. As I said before, I think that the most likely outcome is that AI will lift the quality of life of all humans in a way similar to the lift from pre industrial times. But even in those pre industrial times, people still raised families.
I believe that it is not going to be the case that “if you get it wrong, there’s unlikely to be a safety net” or “any one of us may get lucky in the near after”. Rather I believe that how AI will turn out for all of us is going to be highly correlated: not necessarily 100% correlation (either we all get lucky or we all get very unlucky) but not that far from it either. In fact, I thought that the belief on this strong correlation of AI outcomes was the one thing that MIRI folks and I had in common, but maybe I was wrong.
While this was not the focus of this post, I can completely understand the deep level of insecurity people have about AI. The data is mixed but It does seem that at least some companies short term reaction to AI is to slow down in entry level hiring for jobs that are more exposed to AI. But AI will change so much that this doesn’t mean that this will continue to be the case. Overall times of rapid change can create opportunities for young people, especially ones that have a deeper understanding of AI and know how to use it.
It may end up that the people more impacted are those that have 10-15 years of experience. Enough to be less adaptable but not enough to be financially set and sufficiently senior. But tbh it’s hard to predict. Given our prior experience with a vast increase in the labor force—the industrial revolution- I think in the long run it is likely to vastly increase productivity and welfare even on a per capita basis and so people would be better off (see that Kelsey Piper piece I linked in my addendum). But I agree it’s super hard to predict and there is a whole range of potential scenarios.
Thank you for writing this and I do not feel attacked at all. You are right that I am in a position of material comfort right now.
I would say that if your main focus is existential risk, then the analogy would be more like someone that is standing on a 2 inch mound of sand in the beach saying that we will be fine. I don’t think there is any “hill” for true existential risk.
If you ware talking about impact on the job market, then I agree that while it’s always been the case that 51 year old tenured professors (or formerly tenured, I just gave up on tenure) are more settled than young students, the level of uncertainty is much higher these days. If that is the risk you are most worried about, I am not sure why you would choose to forgo working in an AI capability lab but I respect that choice.
I did not talk about these other risks in this piece mostly because I felt like this is not what most lesswrong people are worried about, but see also this tweet https://x.com/boazbaraktcs/status/2006768877129302399?s=20
As I wrote above, I think the hospice analogy is very off the mark. I think the risk of nuclear war is closer to that, but is also not a good analogy, in the sense that nuclear war was always a zero/one thing—it either happens or it doesn’t, and if it doesn’t you do not feel it at all.
With AI, people already are and will definitely feel it, for both good and bad. I just think the most likely outcome is that the good will be much more than the bad.
For confessions, we are exploring ways to give the judge more power, e.g. by more inference time compute than policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgement calls is in the confession. One of the nice aspects about confessions is that you can force the policy model to surface an issue, even if it is not sure of what is the correct resolution. For example, imagine a very tough edge case in a safety policy, where it is unclear if the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question whether you should surface this in the confession, so you can get models in the habit of always doing it.
I don’t think it’s so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below )
But I think that it is possible to train the models so that being truthful in confessions is just a reflect. Just like you can train a model that, for example, it will always output an exclamation mark after the word “apple”, and even if it was ASI, it could not stop itself from doing that.
What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to “jailbreak itself”—adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.