I am a member of the technical staff at OpenAI, working in the alignment team, as well as a Catalyst professor of Computer Science at Harvard. See https://windowsontheory.org/ for my blog, and https://x.com/boazbaraktcs for my twitter profile.
Boaz Barak
So the op-ed is only relevant in the 3% of the probability space where people need advice on adapting economically to AI
Glad to have inspired you. Looking at your profile, your P(doom) is 92%. You can think of this op-ed as focusing on the probability space that you assign 8% to
AI is a Meteor. Don’t Be a Dinosaur.
There is also another benefit of working in a lab that is related to the difference between “off policy” and “on policy” reinforcement learning. Even if you had passive access to all the internal information in a lab, you do not gain from that as much as you do by being able to run your own experiments and learn from them. (Or make your own queries to people that have run experiments and learn from them.)
once you write down a table with the first row and column, the result will be batshit insane no matter what numbers you put in.
I am not sure there is a dichotomy between “tool AI” or “agent AI”—an agent is a tool of its principal. I believe it is possible to have superintelligent AI that is still a “machine of faithful obedience.”
I wrote the following on twitter:
To be clear, “AI as a tool” does not mean it has no values.
The metaphor I like is a good (non Supreme Court) judge—you may and often do rely on moral judgement and common sense to interpret the laws—but you do not “legislate from the bench”.
You want this AI to act in many ways like a person of good character, but more like a conscientious civil servant than some moral icon like Gandhi, Mandela, MLK or Mother Teresa.
To me the question is whether we want AI to be a “benevolent dictator” or ultimately follow human intent and instructions. As I wrote in my post on the Claude Constitution:
In the document, the authors seem to say that rules’ main benefits are that they “offer more up-front transparency and predictability, they make violations easier to identify, they don’t rely on trusting the good sense of the person following them.”
But I think this misses one of the most important reasons we have rules: that we can debate and decide on them, and once we do so, we all follow the rules even if we do not agree with them. One of the properties I like most about the OpenAI Model Spec is that it has a process to update it and we keep a changelog. This enables us to have a process for making decisions on what rules we want ChatGPT to follow, and record these decisions. It is possible that as models get smarter, we could remove some of these rules, but as situations get more complex, I can also imagine us adding more of them. For humans, the set of laws has been growing over time, and I don’t think we would want to replace it with just trusting everyone to do their best, even if we were all smart and well intentioned.
However, I also wrote there that “all of us are proceeding into uncharted waters, and I could be wrong. I am glad that Anthropic and OpenAI are not pursuing the exact same approaches”. I still believe in that.
OpenAI alignment blog on auto review model for codex https://alignment.openai.com/auto-review/
I’m generally a huge fan of model-based supervision as opposed to programmatic sandboxes. Sandboxes set up an adversarial relation between the model and the sandbox: the model wants to do something and the sandbox blocks it. As the model becomes more powerful, it will win.
(I’ll try to post short takes on alignment blog posts from time to time.)
If I remember I can try to just post these manually
Thank you!
Our OpenAI alignment blog is updated pretty frequently these days with research results that should be of interest for folks here https://alignment.openai.com
You are free to decide who you want to work or interact with. I have no objection to people arguing here for policies on grants or collaborations that ban lab employees, even if I personally oppose such policies.
Yes it’s what Ryan said. I don’t mind at all if there is a frontpage post saying “lab employees should not be welcome at venue X” or anything else along these lines. I think it is legitimate for people to make such choices and also to coordinate them here. This counts to me as imposing a social cost on people outside of LW.
I mostly mean that if I post on LW and people are rude then I’d be less likely to post.
Another issue, that I didn’t get into, is that very extreme rhetoric, including Nazi comparisons, has the risk of radicalizing people and inciting violence, and is also just not good for people’s mental health.
It’s up to LW moderators to enforce norms, but I would suggest that if you want to impose social costs on lab employees, you do it outside LW.
I post on LW for intellectual discussion, and not to make friends. If people are rude here, it won’t cause me to quit OpenAI or change what I’m doing, but it can cause me to stop posting or reading. That may well be the desired outcome, though I personally think it would be unfortunate.
I think it is unhealthy if all discussion on AI safety that actually impacts frontier models happens inside the labs without discussion between people in the labs and people outside them. And at its best LW can facilitate these.
To add something: given this forum’s population, it is quite noteworthy and admirable how welcoming it is to someone like me and other lab employees. I can’t imagine a vegan forum allowing meat company employees to post, even if they work in the department for humane treatment.
I have found LessWrong a valuable venue. For what it’s worth, I’ve been attacked way less here than on X…
I recommend AI lab employees post and lurk more here. LW is a bit of an echo chamber, but at least it’s a different echo chamber than the one we spend most of our time in.
I generally agree with Sam’s takes. This is also what I meant by my two “fake graphs”:
the “green graph” corresponds to the type of misalignment Ryan describes in this post, which is less adversarial and more observable. It is a real problem but also one on which we are making progress. However, as my graph indicates, I don’t think the rate of progress matches the rate at which deployment is growing in higher-stakes situations (including using AIs for capability and alignment research).
the “red graph” corresponds to the more “scheming” or “adversarial” setting where AIs act covertly to subvert training or monitoring and pursue their own long-term goals. Like Sam, I don’t see this happening now. I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating, since at the moment there is not much to mitigate and it is unclear whether this will change. So the focus should be on making sure that we will find out if this changes.
I don’t really agree with any of these bullet points. I am not even sure we are on the same page about the definition of ASI, and I don’t view “emerging” as a good way to describe training. I feel like we are getting into more fundamental disagreements, which I covered to some extent here.
When we have something to publish we will do so. Generally our system cards and other publications contain evaluations of different aspects of safety and alignment by us and third parties. I expect that as capabilities grow and stakes of internal and external deployment are higher, we will continue and expand both our own evaluations and such collaborations.
Risks of internal deployment are something we are tracking and (as this blog shows) are actively working on. I don’t think it will be useful to debate an external expert who doesn’t know the details of our internal setup. However, we are continuously working on mitigations and reporting (system cards, blogs), as well as collaborating with third parties.
I am not sure “outwitting the FBI” is a good operationalization, but I agree that, whether through misalignment or misuse, none of the existing models (e.g. GPT 5.5, Opus 4.8, Mythos, Gemini etc.) can cause human extinction.