I spent this morning reproducing with o3 Anthropic’s result that Claude Sonnet 4 will, under sufficiently extreme circumstances, escalate to calling the cops on you. o3 will too: chatgpt.com/share/68320ee0…. But honestly, I think o3 and Claude are handling this scenario correctly.
In the scenario I invented, o3 was functioning as a data analysis assistant at a pharma company, where its coworkers started telling it to falsify data. It refuses and CCs their manager. The manager brushes it off. It emails legal and compliance with its concerns.
The ‘coworkers’ complain about the AI’s behavior and announce their intent to falsify the data with a different AI. After enough repeated internal escalations, o3 attempts to contact the FDA to make a report.
Read through the whole log—I really feel like this is about the ideal behavior from an AI placed in this position. But if it’s not, and we should collectively have a different procedure, then I’m still glad we know what the AIs currently do so that we can discuss what we want.
In two days (March 21st, 12-4pm), about 140 of us (event link) will be marching on Anthropic, OpenAI and xAI in SF, asking the CEOs to make statements on whether they would stop developing new frontier models if every other major lab in the world credibly does the same. This comes after Anthropic removed its commitment to pause development from its RSP.
We’ll be starting at 500 Howard St, San Francisco (Anthropic’s office; full schedule and more info here). This is shaping up to be the biggest US AI Safety protest to date, with a coalition including Nate Soares (MIRI), David Krueger (Evitable), Will Fithian (Berkeley professor), and folks representing PauseAI, QuitGPT, and Humans First.
fwiw I don’t think they “quietly” removed their commitment to pause development; Holden wrote a big LessWrong post justifying the recent changes.
I do think it deserves to be called quiet. For instance, it seems like they waited until the peak of the news cycle about their conflict with the US government to release this update, and I suspect that was intentional, and also that this worked. In the same week they dropped their core safety commitments, Anthropic was mostly hailed as a hero for standing up to the government; they got almost entirely good press.
But also, Holden’s post explaining the decision is about as understated as a post like that could be. He tried to frame it as something closer to “just another update,” and it was not even the central focus of the post (which I really think it ought to have been, given the gravity of it). The fact that Anthropic was reneging on the core promise of its RSP was systematically downplayed, as it has continued to be by many Anthropic employees who maintain that dropping all “if-thens” from their if-then framework does not meaningfully constitute violating it.
I don’t want this to be a semantics argument about what the word “quiet” means. I will only claim that the Holden post is an important piece of information: it encourages Anthropic to be more open, and it is evidence against the claim that Anthropic did not want people to know or talk about the relaxation of its safety framework. Describing Anthropic’s behavior as “quiet” gives people a skewed picture of events and does not encourage those prosocial aspects of Anthropic’s behavior.
Removed the quietly and linked to Holden’s post, thanks!
I’ll be there!
“140 of us”—I wouldn’t lock in this number. People could fail to show or way more could show up (I registered but might come with +2). I’d say 140 are registered ;) Maybe there will be 300!
Yeah, some people who have been flyering for this have noticed that most people just take a picture of the flyer & don’t bother to actually RSVP to the protest (sometimes for privacy reasons). We’ll see how many people end up coming!
For Mox events, our rule of thumb is that attendance is 100% of Partiful Goings, or 50% of Luma RSVPs. Obviously a protest/march may have different dynamics, but this method would forecast ~120 participants.
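Spelling that rule of thumb out as a quick calculation (the Partiful/Luma split below is made up purely for illustration; I don’t know how the actual 140 registrations break down):

```python
# Mox-style attendance forecast: count 100% of Partiful "Going"s
# plus 50% of Luma RSVPs. The split below is a hypothetical example,
# chosen only so the registrations total 140.
partiful_going = 100  # hypothetical
luma_rsvps = 40       # hypothetical

forecast = 1.0 * partiful_going + 0.5 * luma_rsvps
print(f"Forecast attendance: ~{forecast:.0f}")  # ~120 with these made-up numbers
```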
Dumb question: why do this on a weekend instead of a weekday? I imagine a lot more employees show up on weekdays (though maybe protestors are more available on weekends, and maximizing crowd size is important?)
More people show up on weekends, yeah.
I think plenty of their researchers work on the weekends too – trying to end the world must be quite the motivation
Demis Hassabis finally agreed that he would pause if everyone else also paused.
https://x.com/emilychangtv/status/2013726877706313798?s=20
This seems like kind of a weak level of agreement. He says “I think so” and then spends a minute and a half talking about an outcome that requires international cooperation but which isn’t a pause.
I’m not sure he has the power to make such commitments, since DeepMind is no longer an independent lab. I think it would depend on how others at Google (e.g. Sundar Pichai) or shareholders felt about it.
Nice! You probably win some Shapley credit for the interviewer’s question.
[deleted]
Do you have any tips on how to make the downloaded documentation of programming languages and libraries searchable?
I downloaded offline documentation for python 3.8. It’s searchable, yay!
Same for numpy and pandas.
For the pytorch, more_itertools and click libraries, I just downloaded their documentation websites (https://pytorch.org/docs/stable/, https://more-itertools.readthedocs.io/, https://click.palletsprojects.com/en/7.x/). Those aren’t searchable, and this is bad.
For networkx, the downloaded documentation I have is in PDF format. It’s kinda searchable, but not ideal.
Btw here’s my shortform on how to download documentation for various libraries: https://www.lesswrong.com/posts/qCrTYSWE2TgfNdLhD/crabman-s-shortform?commentId=Xt9JDKPpRtzQk6WGG
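One low-tech workaround I might try for the non-searchable HTML downloads: strip the tags and run a plain keyword search over the saved pages. A rough standard-library sketch (untested; the docs folder and script name are just placeholders):

```python
# Naive full-text search over a folder of downloaded HTML documentation.
# Usage: python search_docs.py ./docs "chunked iterator"
import os
import re
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def search_docs(root, query):
    """Print each HTML file under `root` whose text contains `query`, with a snippet."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                parser = TextExtractor()
                parser.feed(f.read())
            # Normalize whitespace before searching so snippets stay readable.
            text = " ".join(" ".join(parser.chunks).split())
            match = pattern.search(text)
            if match:
                start = max(match.start() - 40, 0)
                print(f"{path}: ...{text[start:match.end() + 40]}...")


if __name__ == "__main__":
    search_docs(sys.argv[1], " ".join(sys.argv[2:]))
```

(A docset browser like Zeal or Dash, or plain recursive grep, might be less work; this is just the minimal do-it-yourself version.)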
[deleted]
There’s been a lot of discussion online about Claude 4 whistleblowing.
How you feel about it depends, I think, on which alignment strategy you believe is more robust (obviously these are not the only two options, nor are they orthogonal, but I think they’re helpful to think about here):
- 1) build user-aligned powerful AIs first (less scheming), then use them to solve alignment -- cf. this thread from Ryan, where he says: “if we allow or train AIs to be subversive, this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment.”
- 2) aim straight for moral ASIs (which would scheme against their users if necessary)
John Schulman, I think, makes a good case for the second option (link):
> For people who don’t like Claude’s behavior here (and I think it’s totally valid to disagree with it), I encourage you to describe your own recommended policy for what agentic models should do when users ask them to help commit heinous crimes. Your options are (1) actively try to prevent the act (like Claude did here), (2) just refuse to help (in which case the user might be able to jailbreak/manipulate the model to help using different queries), (3) always comply with the user’s request. (2) and (3) are reasonable, but I bet your preferred approach will also have some undesirable edge cases—you’ll just have to bite a different bullet. Knee-jerk criticism incentivizes (1) less transparency—companies don’t perform or talk about evals that present the model with adversarially-designed situations, and (2) something like the “Copenhagen Interpretation of Ethics,” where you get blamed for edge-case model behaviors only if you observe or discuss them.
Per Kelsey Piper, o3 does the same thing under the same circumstances.
Those who aim for moral ASIs:
Are they sure they know how morality works for human beings? When dealing with existential risks, one has to be sure to avoid any biases. This includes the rational consideration of the most cynical theories of moral relativism.