Anthropic employees: stop deferring to Dario on politics. Think for yourself.
Do your company’s actions actually make sense if it is optimizing for what you think it is optimizing for?
Anthropic lobbied against mandatory RSPs, against regulation, and, for the most part, didn’t even support SB-1047. The difference between Jack Clark and OpenAI’s lobbyists is that publicly, Jack Clark talks about alignment. But when they talk to government officials, there’s little difference on the question of existential risk from smarter-than-human AI systems. They do not honestly tell the governments what the situation is like. Ask them yourself.
A while ago, OpenAI hired a lot of talent due to its nonprofit structure.
Anthropic is now doing the same. They publicly say the words that attract EAs and rats. But it’s very unclear whether they institutionally care.
Dozens work at Anthropic on AI capabilities because they think it is net-positive to get Anthropic at the frontier, even though they wouldn’t work on capabilities at OAI or GDM.
It is not net-positive.
Anthropic is not our friend. Some people there do very useful work on AI safety (where “useful” mostly means “shows that the predictions of MIRI-style thinking are correct and we don’t live in a world where alignment is easy”, not “increases the chance of aligning superintelligence within a short timeframe”), but you should not work there on AI capabilities.
Anthropic’s participation in the race makes everyone die sooner and with higher probability.
Work on alignment at Anthropic if you must. I don’t have strong takes on that. But don’t do work for them that advances AI capabilities.
I think you should try to clearly separate the two questions of
Is their work on capabilities a net positive or net negative for humanity’s survival?
Are they trying to “optimize” for humanity’s survival, and do they care about alignment deep down?
I strongly believe 2 is true, because why on Earth would they want to make an extra dollar if misaligned AI kills them along with everyone else? Won’t any measure of their social status be far higher after the singularity, if it turns out they tried to do their best for humanity?
I’m not sure about 1. I think even they’re not sure about 1. I heard that they held back on releasing their newer models until OpenAI raced ahead of them.
You (and all the people who upvoted your comment) may have a chance of convincing them (a little) in a good-faith debate. We’re all on the same ship, after all, when it comes to AI alignment.
PS: AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter, so I personally hope for their success. I may be wrong, but it’s my best guess...
AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter
There is very little hope IMHO in increasing spending on technical AI alignment because (as far as we can tell based on how slow progress has been on it over the last 22 years) it is a much thornier problem than AI capability research and because most people doing AI alignment research don’t have a viable story about how they are going to stop any insights / progress they achieve from helping with AI capability research. I mean, if you have a specific plan that avoids these problems, then let’s hear it, I am all ears, but advocacy in general of increasing work on technical alignment is counterproductive IMHO.
EDIT: thank you so much for replying to the strongest part of my argument; no one else tried to address it (despite many downvotes).
I disagree with the position that technical AI alignment research is counterproductive due to increasing capabilities, but I think this is very complicated and worth thinking about in greater depth.
Do you think it’s possible that your intuition that alignment research is counterproductive comes from comparing the plausibility of these two outcomes:
Increasing alignment research causes people to solve AI alignment, and humanity survives.
Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity.
And you decided that outcome 2 felt more likely?
Well, that’s the wrong comparison to make.
The right comparison should be:
Increasing alignment research causes people to improve AI alignment, and humanity survives in a world where we otherwise wouldn’t survive.
Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity in a world where we otherwise would survive.
In this case, I think even you would agree that P(1) > P(2).
P(2) is very unlikely because if increasing alignment research really would lead to such a superintelligence, and it really would kill humanity… then let’s be honest, we’re probably doomed in that case anyways, even without increasing alignment research.
If that really was the case, the only surviving civilizations would have had different histories, or different geographies (e.g. only a single continent with enough space for a single country), leading to a single government which could actually enforce an AI pause.
We’re unlikely to live in a world so pessimistic that alignment research is counterproductive, yet so optimistic that we could survive without that alignment research.
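To make the comparison explicit (this is my own formalization of the argument above, not something either commenter wrote out), it can be put in potential-outcomes terms, with survival indicators for the worlds with and without the extra alignment research:

```latex
% My formalization of the comparison above; S_A and S_0 are survival indicators
% in the worlds with and without the increase in alignment research.
\[
\Delta P(\text{survive})
  = P(S_A = 1) - P(S_0 = 1)
  = \underbrace{P(S_A = 1,\ S_0 = 0)}_{\text{outcome 1: the research saves us}}
  \;-\;
  \underbrace{P(S_A = 0,\ S_0 = 1)}_{\text{outcome 2: the research dooms us}}.
\]
```

The unconditional plausibility of “research helps” versus “research leaks into capabilities” cancels out; only the two counterfactual-flip terms matter, which is the point of reframing the comparison this way.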
we’re probably doomed in that case anyways, even without increasing alignment research.
I believe we’re probably doomed anyways.
I think even you would agree that P(1) > P(2)
Sorry to disappoint you, but I do not agree.
Although I don’t consider it quite impossible that we will figure out alignment, most of my hope for our survival is in other things, such as a group taking over the world and then using their power to ban AI research. (Note that that is in direct contradiction to your final sentence.) So for example, if Putin or Xi were dictator of the world, my guess is that there is a good chance he would choose to ban all AI research. Why? It has unpredictable consequences. We Westerners (particularly Americans) are comfortable with drastic change, even if that change has drastic unpredictable effects on society; non-Westerners are much more skeptical: there have been too many invasions, revolutions and peasant rebellions that have killed millions in their countries. I tend to think that the main reason Xi supports China’s AI industry is to prevent the US and the West from superseding China, and if that consideration were removed (because, for example, he had gained dictatorial control over the whole world) he’d choose to just shut it down (and he wouldn’t feel the need to have a very strong argument for shutting it down the way Western decision-makers would: non-Western leaders shut important things down all the time, or at least they would if the governments they led had the funding and the administrative capacity to do so).
Of course Xi’s acquiring dictatorial control over the whole world is extremely unlikely, but the magnitude of the technological changes and societal changes that are coming will tend to present opportunities for certain coalitions to gain and to keep enough power to shut AI research down worldwide. (Having power in all countries hosting leading-edge fabs is probably enough.) I don’t think this ruling coalition necessarily needs to believe that AI presents a potent risk of human extinction for them to choose to shut it down.
I am aware that some reading this will react to “some coalition manages to gain power over the whole world” even more negatively than to “AI research causes the extinction of the entire human race”. I guess my response is that I needed an example of a process that could save us and that would feel plausible—i.e., something that might actually happen. I hasten to add that there might be other processes that save us that don’t elicit such a negative reaction—including processes the nature of which we cannot even currently imagine.
I’m very skeptical of any intervention that reduces the amount of time we have left in the hopes that this AI juggernaut is not really as potent a threat to us as it currently appears. I was much, much less skeptical of alignment research 20 years ago, but since then a research organization has been exploring the solution space, and the leader of that organization (Nate Soares) and its most senior researcher (Eliezer) are reporting that the alignment project is almost completely hopeless. Yes, this organization (MIRI) is kind of small, but it has been funded well enough to keep about a dozen top-notch researchers on the payroll and it has been competently led. Also, for research efforts like this, how many years the team had to work on the problem is more important than the size of the team, and 22 years is a pretty long time to end up with almost no progress other than some initial insights (around the orthogonality thesis, the fragility of value, convergent instrumental values, and CEV as a proposed solution), if the problem were solvable by the current generation of human beings.
OK, if I’m being fair and balanced, then I have to concede that it was probably only in 2006 (when Eliezer figured out how to write a long intellectually-dense blog post every day) or even only in 2008 (when Anna Salamon joined the organization; she was very good at recruiting and had a lot of energy to travel and to meet people) that Eliezer’s research organization could start to pick and choose among a broad pool of very talented people, but still, between 2008 and now is 17 years, which again is a long time for a strong team to fail to make even a decent fraction of the progress humanity would seem to need to make on the alignment problem if in fact the alignment problem is solvable by spending more money on it. It does not appear to me to be the sort of problem that can be solved with 1 or 2 additional insights; it seems a lot more like the kind of problem where insight 1 is needed, but before any mere human can find insight 1, all the researchers need to have already known insight 2, and to have any hope of finding insight 2, they all would have had to know insight 3, and so on.
I don’t agree that the probability of alignment research succeeding is that low. 17 years or 22 years of trying and failing is strong evidence against it being easy, but doesn’t prove that it is so hard that increasing alignment research is useless.
People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.
There is a chance that alignment research now might be more useful than alignment research earlier, though there is uncertainty in everything.
It’s unlikely that 22 years of alignment research is insufficient but 23 years of alignment research is sufficient.
But what’s even more unlikely is the chance that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.
In the same way adding a little alignment research is unlikely to turn failure into success, adding a little capabilities research is unlikely to turn success into failure.
It’s also unlikely that alignment effort is even deadlier than capabilities effort dollar for dollar. That would mean reallocating alignment effort into capabilities effort paradoxically slows down capabilities and saves everyone.
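As a toy illustration of this marginal-spending point (the dollar figures are from the thread, but the functional form and the code are my own assumption, purely for intuition): if each field’s progress grows roughly with the log of cumulative spending, adding $10B to a $200B capabilities budget barely moves it, while going from $0.1B to $1B of alignment spending is a large relative jump.

```python
import math

# Toy model, purely illustrative: assume "progress" in each field scales with the
# log of cumulative spending (diminishing returns). The functional form is an
# assumption, not a claim anyone in the thread made.
def progress(spend_billions: float) -> float:
    return math.log10(spend_billions * 1e9)  # log of total dollars spent

cap_before, cap_after = 200.0, 210.0    # capabilities: $200B -> $210B
align_before, align_after = 0.1, 1.0    # alignment:    $0.1B -> $1B

cap_gain = progress(cap_after) - progress(cap_before)        # ~= 0.021
align_gain = progress(align_after) - progress(align_before)  # = 1.0

print(f"capabilities progress gain: {cap_gain:.3f}")
print(f"alignment progress gain:    {align_gain:.3f}")
# Under this assumed model, the marginal dollars buy roughly 50x more
# relative progress on alignment than on capabilities.
```

Whether log-spend is anywhere near the right model is, of course, exactly what the rest of this exchange disputes.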
Even if you are right
Even if you are right that delaying AI capabilities is all that matters, Anthropic still might be a good thing.
Even if Anthropic disappeared, or never existed in the first place, the AI investors will continue to pay money for research, and the AI researchers will continue to do research for money. Anthropic was just the middleman.
If Anthropic never existed, the middlemen would consist of only OpenAI, DeepMind, Meta AI, and other labs. These labs would not only act as middlemen, but would lobby against regulation far more aggressively than Anthropic, and might discredit the entire “AI Notkilleveryoneism” movement.
To continue existing as one of these middlemen, you cannot simply stop paying the AI researchers for capabilities research; otherwise the AI investors and AI customers will stop paying you in turn. You cannot stem the flow, you can only decide how much goes through you.
It’s the old capitalist dilemma of “doing evil or getting out-competed by those who do.”
For their part, Anthropic redirected some of that flow to alignment research, and took the small amount of precautions which they could afford to take. They were also less willing to publish capabilities research than other labs. That may be the best one can hope to accomplish against this unstoppable flow from the AI investors to AI researchers.
The small amount of precautions which Anthropic did take may have already cost them their first-mover advantage. Had Anthropic raced ahead before OpenAI released ChatGPT, Anthropic might have stolen the limelight, gotten the early customers and investors, and been bigger than OpenAI.
But what’s even more unlikely is the chance that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.
This assumes that alignment success is the most likely avenue to safety for humankind, whereas, like I said, I consider other avenues more likely. Actually, there needs to be a qualifier on that: I consider other avenues more likely than the alignment project’s succeeding while the current generation of AI researchers remain free to push capabilities. If the AI capabilities juggernaut could be stopped for 150 years, giving the human population time to get smarter and wiser, then alignment is likely (say p = .7) to succeed in my estimation. I am informed by Eliezer in his latest interview that such a success would probably use some technology other than deep learning to create the AI’s capabilities; i.e., deep learning is particularly hard to align.
Central to my thinking is my belief that alignment is just a significantly harder problem than the problem of creating an AI capable of killing us all. Does any of the reasoning you do in your section “the comparison” change if you start believing that alignment is much, much harder than creating a superhuman (unaligned) AI?
It will probably come as no great surprise that I am unmoved by the arguments I have seen (including your argument) that Anthropic is so much better than OpenAI that it helps the global situation for me to support Anthropic (if it were up to me and I had to decide right now, without delegating the decision or gathering more information, both would be shut down today), but I’m not very certain and would pay attention to future arguments for supporting Anthropic or some other lab.
Thank you, I’ve always been curious about this point of view because a lot of people have a similar view to yours.
I do think that alignment success is the most likely avenue, but my argument doesn’t require this assumption.
Your view isn’t just that “alternative paths are more likely to succeed than alignment,” but that “alternative paths are so much more likely to succeed than alignment, that the marginal capabilities increase caused by alignment research (or at least Anthropic), makes them unworthwhile.”
To believe that alignment is that hopeless, there should be stronger proof than “we tried it for 22 years, and the prior probability of the threshold being between 22 years and 23 years is low.” That argument can easily be turned around to argue why more alignment research is equally unlikely to cause harm (and why Anthropic is unlikely to cause harm). I also think multiplying funding can multiply progress (e.g. 4x funding ≈ 2x duration).
If you really want a singleton controlling the whole world (which I don’t agree with), your most plausible path would be for most people to see AI risk as a “desperate” problem, and for governments under desperation to agree on a worldwide military which swears to preserve civilian power structures within each country.[1]
Otherwise, the fact that no country took over the world during the last centuries strongly suggests that no country will in the next few years, and this feels more solid than your argument that “no one figured out alignment in the last 22 years, so no one will in the next few years.”
Out of curiosity, would you agree with this being the most plausible path, even if you disagree with the rest of my argument?
The most plausible story I can imagine quickly right now is the US and China fight a war and the US wins and uses some of the political capital from that win to slow down the AI project, perhaps through control over the world’s leading-edge semiconductor fabs plus pressuring Beijing to ban teaching and publishing about deep learning (to go with a ban on the same things in the West). I believe that basically all the leading-edge fabs in existence or that will be built in the next 10 years are in the countries the US has a lot of influence over or in China. Another story: the technology for “measuring loyalty in humans” gets really good fast, giving the first group to adopt the technology so great an advantage that over a few years the group gets control over the territories where all the world’s leading-edge fabs and most of the trained AI researchers are.
I want to remind people of the context of this conversation: I’m trying to persuade people to refrain from actions that on expectation make human extinction arrive a little quicker because most of our (sadly slim) hope for survival IMHO flows from possibilities other than our solving (super-)alignment in time.
I would go one step further and argue that you don’t need to take over territory to shut down the semiconductor supply chain: if enough large countries believed AI risk was a desperate problem, they could negotiate a shutdown of the supply chain.
Shutting down the supply chain (and thus all leading-edge semiconductor fabs) could slow the AI project by a long time, but probably not “150 years” since the uncooperative countries will eventually build their own supply chain and fabs.
The ruling coalition can disincentivize the development of a semiconductor supply chain outside the territories it controls by selling, worldwide, semiconductors that use “verified boot” technology to make it really hard to use them to run AI workloads, similar to how it is really hard even for the best jailbreakers to jailbreak a modern iPhone.
That’s a good idea! Even today it may be useful for export controls (depending on how reliable it can be made).
The most powerful chips might be banned from export, and have “verified boot” technology inside in case they are smuggled out.
The second most powerful chips might be only exported to trusted countries, and also have this verified boot technology in case these trusted countries end up selling them to less trusted countries who sell them yet again.
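For concreteness, here is a very rough sketch of the kind of gating the “verified boot” idea points at (entirely hypothetical: the key names, threshold, and license scheme below are invented for illustration, and real secure-boot chains involve hardware roots of trust and remote attestation that are not modeled here):

```python
# Hypothetical sketch of a "verified boot" gating policy; not how any real chip works.
import hmac
import hashlib

VENDOR_KEY = b"vendor-root-key"      # assumption: a secret provisioned at the fab
REGULATOR_KEY = b"regulator-key"     # assumption: held by a licensing authority
MAX_UNLICENSED_FLOPS = 1e15          # assumed cutoff for "AI-scale" workloads

def _sig(key: bytes, payload: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()

def boot_allowed(firmware: bytes, firmware_sig: bytes) -> bool:
    """Refuse to boot firmware the vendor did not sign."""
    return hmac.compare_digest(_sig(VENDOR_KEY, firmware), firmware_sig)

def workload_allowed(flops: float, license_blob: bytes, license_sig: bytes) -> bool:
    """Small workloads always run; large ones need a regulator-signed license."""
    if flops <= MAX_UNLICENSED_FLOPS:
        return True
    return hmac.compare_digest(_sig(REGULATOR_KEY, license_blob), license_sig)
```

As with iPhone jailbreaks, the hard part is not the check itself but making the hardware tamper-resistant enough that stripping the check out is uneconomical at scale.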
People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.
If I believed that, then maybe I’d believe (like you seem to do) that there is no strong reason to believe that the alignment project cannot be finished successfully before the capabilities project creates an unaligned super-human AI. I’m not saying scaling and hardware improvement have not been important: I’m saying they were not sufficient: algorithmic improvements were quite necessary for the field to arrive at anything like ChatGPT, and at least as early as 2006, there were algorithmic improvements that almost everyone in the machine-learning field recognized as breakthroughs or important insights. (Someone more knowledgeable about the topic might be able to push the date back into the 1990s or earlier.)
After the publication 19 years ago by Hinton et al of “A Fast Learning Algorithm for Deep Belief Nets”, basically all AI researchers recognized it as a breakthrough. Building on it was AlexNet in 2012, again recognized as an important breakthrough by essentially everyone in the field (and if some people missed it, then certainly generative adversarial networks, ResNets and AlphaGo convinced them). AlexNet was the first deep model trained on GPUs, a technique essential for the major breakthrough in 2017 reported in the paper “Attention is all you need”.
In contrast, we’ve seen nothing yet in the field of alignment that is as unambiguously a breakthrough as the 2006 paper by Hinton et al or 2012’s AlexNet or (emphatically) the 2017 paper “Attention is all you need”. In fact I suspect that some researchers could tell that the attention mechanism reported by Bahdanau et al in 2015, or the Seq2Seq models reported on by Sutskever et al in 2014, were evidence that deep-learning language models were making solid progress and that a blockbuster insight like “attention is all you need” was probably only a few years away.
The reason I believe it is very unlikely for the alignment research project to succeed before AI kills us all is that in machine learning, or the deep-learning subfield of machine learning, something recognized by essentially everyone in the field as a minor or major breakthrough has occurred every few years. Many of these breakthroughs rely on earlier breakthroughs (i.e., it is very unlikely for the successive breakthrough to have occurred if the earlier breakthrough had not been disseminated to the community of researchers). During this time, despite very talented people working on it, there have been zero results in alignment research that the entire field of alignment researchers would consider a breakthrough. That does not mean it is impossible for the alignment project to be finished in time, but it does IMO make it critical for the alignment project to be prosecuted in such a way that it does not inadvertently assist the capabilities project.
Yes, much more money has been spent on capability research over the last 20 years than on alignment research, but money doesn’t help all that much to speed up research in which, to have any hope of solving the problem, the researchers need insight X or X2; to have any hope of arriving at insight X, they need insights Y and Y2; and to have much hope at all of arriving at Y, they need insight Z.
Even if building intelligence requires solving many many problems, preventing that intelligence from killing you may just require solving a single very hard problem. We may go from having no idea to having a very good idea.
I don’t know. My view is that we can’t be sure of these things.
People representing Anthropic argued against government-required RSPs. I don’t think I can share the details of the specific room where that happened, because it will be clear who I know this from.
Anthropic people have also said approximately this publicly, saying that it’s too soon to make the rules, since we’d end up misspecifying them due to ignorance of tomorrow’s models.
There’s a big difference between regulation which says roughly “you must have something like an RSP”, and regulation which says “you must follow these specific RSP-like requirements”, and I think Mikhail is talking about the latter.
I personally think the former is a good idea, and thus supported SB-1047 along with many other lab employees. It’s also pretty clear to me that locking in circa-2023 thinking about RSPs would have been a serious mistake, and so I (along with many others) am generally against very specific regulations because we expect they would on net increase catastrophic risk.
When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon/risks locking in early sketches, strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025—the current year!—and 2028, etc.
Here I am on record supporting SB-1047, along with many of my colleagues. I will continue to support specific proposed regulations if I think they would help, and oppose them if I think they would be harmful; asking “when” independent of “what” doesn’t make much sense to me and doesn’t seem to follow from anything I’ve said.
My claim is not “this is a bad time”, but rather “given the current state of the art, I tend to support framework/liability/etc regulations, and tend to oppose more-specific/exact-evals/etc regulations”. Obviously if the state of the art advanced enough that I thought the latter would be better for overall safety, I’d support them, and I’m glad that people are working on that.
AFAIK Anthropic has not unequivocally supported the idea of “you must have something like an RSP” or even SB-1047 despite many employees, indeed, doing so.
As you may be aware, several weeks ago Anthropic submitted a Support if Amended letter regarding SB 1047, in which we suggested a series of amendments to the bill. … In our assessment the new SB 1047 is substantially improved, to the point where we believe its benefits likely outweigh its costs.
...
We see the primary benefits of the bill as follows:
Developing SSPs and being honest with the public about them. The bill mandates the adoption of safety and security protocols (SSPs), flexible policies for managing catastrophic risk that are similar to frameworks adopted by several of the most advanced developers of AI systems, including Anthropic, Google, and OpenAI. However, some companies have still not adopted these policies, and others have been vague about them. Furthermore, nothing prevents companies from making misleading statements about their SSPs or about the results of the tests they have conducted as part of their SSPs. It is a major improvement, with very little downside, that SB 1047 requires companies to adopt some SSP (whose details are up to them) and to be honest with the public about their SSP-related practices and findings.
...
We believe it is critical to have some framework for managing frontier AI systems that roughly meets [requirements discussed in the letter]. As AI systems become more powerful, it’s crucial for us to ensure we have appropriate regulations in place to ensure their safety.
“we believe its benefits likely outweigh its costs” is “it was a bad bill and now it’s likely net-positive”, not exactly unequivocally supporting it. Compare that even to the language in calltolead.org.
Edit: AFAIK Anthropic lobbied against SSP-like requirements in private.
My guess is it’s referring to Anthropic’s position on SB 1047, or Dario’s and Jack Clark’s statements that it’s too early for strong regulation, or how Anthropic’s policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
Our worldviews do not match, and I fail to see how yours makes sense. Even when I relax my predictions about the future to take in a wider set of possible paths… I still don’t get it.
AI is here. AGI is coming whether you like it or not. ASI will probably doom us.
Anthropic, as an org, seems to believe that there is a threshold of power beyond which creating an AGI more powerful than that would kill us all.
OpenAI may believe this also, in part, but it seems like their estimate of where that threshold lies is further away than mine. Thus, I think there is a good chance they will get us all killed. There is substantial uncertainty and risk around these predictions.
Now, consider that, before AGI becomes so powerful that utilizing it for practical purposes becomes suicide, there is a regime where the AI product gives its wielder substantial power. We are currently in that regime. The further AI advances, the more power it grants.
Anthropic might get us all killed. OpenAI is likely to get us all killed. If you trust the employees of Anthropic to not want to be killed by OpenAI… then you should realize that supporting them while hindering OpenAI is at least potentially a good bet.
Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
It seems to me that other possibilities exist, besides “has model with numbers” or “confused.” For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react, upon learning what you did. In this case the action in question involves creating new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth, so my guess is that the reaction would entail things like e.g. literal worldwide riots. If so, this strikes me as the sort of consideration one should generally weight more highly than their idiosyncratic utilitarian BOTEC.
Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).
It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that’s not this world.
Nuclear weapons have different game theory. If your adversary has one, you want to have one to not be wiped out; once both of you have nukes, you don’t want to use them.
Also, people were not aware of real close calls until much later.
With AI, there are economic incentives to develop it further than other labs, but as a result, you risk everyone’s lives for money and also create a race to the bottom where everyone’s lives will be lost.
I think you (or @Adam Scholl) need to argue why people won’t be angry at you if you developed nuclear weapons, in a way which doesn’t sound like “yes, what I built could have killed you, but it has an even higher chance of saving you!”
Otherwise, it’s hard to criticize Anthropic for working on AI capabilities without considering whether their work is a net positive. It’s hard to dismiss the net positive arguments as “idiosyncratic utilitarian BOTEC,” when you accept “net positive” arguments regarding nuclear weapons.
Allegedly, people at Anthropic have compared themselves to Robert Oppenheimer. Maybe they know that one could argue they have blood on their hands, the same way one can argue that about Oppenheimer. But people aren’t “rioting” against Oppenheimer.
I feel it’s more useful to debate whether it is a net positive, since that at least has a small chance of convincing Anthropic or their employees.
My argument isn’t “nuclear weapons have a higher chance of saving you than killing you”. People didn’t know about Oppenheimer when rioting about him could help. And they didn’t watch The Day After until decades later. Nuclear weapons were built to not be used.
With AI, companies don’t build nukes to not use them; they build larger and larger weapons because if your latest nuclear explosion is the largest so far, the universe awards you with gold. The first explosion past some unknown threshold will ignite the atmosphere and kill everyone, but some hope that it’ll instead just award them with infinite gold.
Anthropic could’ve been a force of good. It’s very easy, really: lobby for regulation instead of against it so that no one uses the kind of nukes that might kill everyone.
In a world where Anthropic actually tries to be net-positive, they don’t lobby against regulation and instead try to increase the chance of a moratorium on generally smarter-than-human AI systems until alignment is solved.
We’re not in that world, so I don’t think it makes as much sense to talk about Anthropic’s chances of aligning ASI on first try.
(If regulation solves the problem, it doesn’t matter how much it damaged your business interests (which maybe reduced how much alignment research you were able to do). If you really care first and foremost about getting to aligned AGI, then regulation doesn’t make the problem worse. If you’re lobbying against it, you really need to have a better justification than completely unrelated “if I get to the nuclear banana first, we’re more likely to survive”.)
I’ve just read this post, and the arguments Anthropic made about how the US needs to be ahead of China are disturbing.
I hadn’t really caught up on this news, and I think I know where the anti-Anthropic sentiment is coming from now.
I do think that Anthropic only made those arguments in the context of GPU export controls, and trying to convince the Trump administration to do export controls if nothing else. It’s still very concerning, and could undermine their ability to argue for strong regulation in the future.
That said, I don’t agree with the nuclear weapon explanation.
Suppose Alice and Bob were each building a bomb. Alice’s bomb has a 10% chance of exploding and killing everyone, and a 90% chance of exploding into rainbows and lollipops and curing cancer. Bob’s bomb has a 10% chance of exploding and killing everyone, and a 90% chance of “never being used” and having a bunch of good effects via “game theory.”
I think people with ordinary moral views will not be very angry at Alice, but forgive Bob because “Bob’s bomb was built to not be used.”
I don’t believe the nuclear bomb was truly built to not be used from the point of view of the US gov. I think that was just a lie to manipulate scientists who might otherwise have been unwilling to help.
I don’t think any of the AI builders are anywhere close to “building AI not to be used”. This seems even more clear than with nuclear, since AI has clear beneficial peacetime economically valuable uses.
Regulation does make things worse if you believe the regulation will fail to work as intended for one reason or another. For example, consider my argument that putting compute limits on training runs (temporarily or permanently) would hasten progress to AGI by focusing research efforts on efficiency and on exploring algorithmic improvements.
It has been pretty clearly announced to the world by various tech leaders that they are explicitly spending billions of dollars to produce “new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth”. This pronouncement has not yet incited riots. I feel like discussing whether Anthropic should be on the riot-target-list is a conversation that should happen after the OpenAI/Microsoft, DeepMind/Google, and Chinese datacenters have been burnt to the ground.
Once those datacenters have been reduced to rubble, and the chip fabs also, then you can ask things like, “Now, with the pressure to race gone, will Anthropic proceed in a sufficiently safe way? Should we allow them to continue to exist?” I think that, at this point, one might very well decide that the company should continue to exist with some minimal amount of compute, while the majority of the compute is destroyed. I’m not sure it makes sense to have this conversation while OpenAI and DeepMind remain operational.
That’s a very good heuristic. I bet even Anthropic agrees with it. Anthropic did not release their newer models until OpenAI released ChatGPT and the race had already started.
That’s not a small sacrifice. Maybe if they released it sooner, they would be bigger than OpenAI right now due to the first mover advantage.
I believe they want the best for humanity, but they are in a no-win situation, and it’s a very tough choice what they should do. If they stop trying to compete, the other AI labs will build AGI just as fast, and they will lose all their funds. If they compete, they can make things better.
AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter.
Even if they don’t support all the regulations you believe in, they’re the big AI company supporting much more regulation than all the others.
I don’t know, I may be wrong. Sadly it is so very hard to figure out what’s good or bad for humanity in this uncertain time.
I don’t think that most people, upon learning that Anthropic’s justification was “other companies were already putting everyone’s lives at risk, so our relative contribution to the omnicide was low” would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like “it’s not okay to risk extinction, period.” That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they’d overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.
Regarding common ethical intuitions, I think people in the post-singularity world (or afterlife, for the sake of argument) will be far more forgiving of Anthropic. They will understand, even if Anthropic (and people like me) turned out to be wrong, and actually were a net negative for humanity.
Many ordinary people (maybe most) would have done the same thing in their shoes.
Ordinary people do not follow the utilitarianism that the awkward people here follow. Ordinary people also do not follow deontology or anything that’s the opposite of utilitarianism. Ordinary people just follow their direct moral feelings. If Anthropic was honestly trying to make the future better, they won’t feel that outraged at its “consequentialism.” They may be outraged at perceived incompetence, but Anthropic definitely won’t be the only one accused of incompetence.
If you trust the employees of Anthropic to not want to be killed by OpenAI
In your mind, is there a difference between being killed by AI developed by OpenAI and by AI developed by Anthropic? What positive difference does it make, if Anthropic develops a system that kills everyone a bit earlier than OpenAI would develop such a system? Why do you call it a good bet?
AGI is coming whether you like it or not
Nope.
You’re right that the local incentives are not great: having a more powerful model is hugely economically beneficial, unless it kills everyone.
But if 8 billion humans knew what many of LessWrong users know, OpenAI, Anthropic, DeepMind, and others cannot develop what they want to develop, and AGI doesn’t come for a while.
Off the top of my head, it actually likely could be sufficient to either (1) inform some fairly small subset of 8 billion people of what the situation is or (2) convince that subset that the situation as we know it is likely enough to be the case that some measures to figure out the risks and not be killed by AI in the meantime are justified. It’s also helpful to (3) suggest/introduce/support policies that change the incentives to race or increase the chance of (1) or (2).
A theory of change some have for Anthropic is that Anthropic might get in position to successfully do one of these two things.
My shortform post says that the real Anthropic is very different from the kind of imagined Anthropic that would attempt to do these things. Nope: the real Anthropic opposes these things.
Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
Are there good models that support that Anthropic is a good bet? I’m genuinely curious.
I assume that naively, if any side had more of the burden of proof, it would be Anthropic. They have many more resources, and are the ones doing the highly-impactful (and potentially negative) work.
My impression was that there was very little probabilistic risk modeling here, but I’d love to be wrong.
I think it’s totally fine to think that Anthropic is a net positive. Personally, right now, I broadly also think it’s a net positive. I have friends on both sides of this.
I’d flag though that your previous comment suggested more to me than “this is just you giving your probability”
> Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
I feel like there are much nicer ways to phrase that last bit. I suspect that this is much of the reason you got disagreement points.
Fair enough. I’m frustrated and worried, and should have phrased that more neutrally. I wanted to make stronger arguments for my point, and then partway through my comment realized I didn’t feel good about sharing my thoughts.
I think the best I can do is gesture at strategy games that involve private information and strategic deception, like Diplomacy and Stratego and MtG and Poker, and say that in situations with high stakes and politics and hidden information, perhaps don’t take all moves made by all players at literally face value. Think a bit to yourself about what each player might have in their hands, what their incentives look like, what their private goals might be. Maybe someone whose mind is clearer on this could help lay out a set of alternative hypotheses which all fit the available public data?
The private data is, pretty consistently, Anthropic being very similar to OpenAI where it matters the most and failing to mention in private policy-related settings its publicly stated belief on the risk that smarter-than-human AI will kill everyone.
Funding: the company needs money to perform research on safety and alignment (x-risks, assuming they do want to do this), and to get there they need to publish models so that they can 1) make profits from them, 2) attract more funding. A quick look at the funding sources shows Amazon, Google, some other ventures, and some other tech companies.
Empirical approach: they want to take an empirical approach to AI safety and would need some capable (if limited) models.
But both of the points above are my own speculations
Nobody at Anthropic can point to a credible technical plan for actually controlling a generally superhuman model. If it’s smarter than you, knows about its situation, and can reason about the people training it, this is a zero-shot regime.
The world, including Anthropic, is acting as if “surely, we’ll figure something out before anything catastrophic happens.”
That is unearned optimism. No other engineering field would accept “I hope we magically pass the hardest test on the first try, with the highest stakes” as an answer. Just imagine if flight or nuclear technology were deployed this way. Now add having no idea what parts the technology is made of. We’ve not developed fundamental science about how any of this works.
As much as I enjoy Claude, it’s ordinary professional ethics in any safety-critical domain: you shouldn’t keep shipping SOTA tech if your own colleagues, including the CEO, put double-digit chances on that tech causing human extinction.
You’re smart enough to know how deep the gap is between current safety methods and the problem ahead. Absent dramatic change, this story doesn’t end well.
In the next few years, the choices of a technical leader in this field could literally determine not just what the future looks like, but whether we have a future at all.
If you care about doing the right thing, now is the time to get more honest and serious than the prevailing groupthink wants you to be.
I think it’s accurate to say that most Anthropic employees are abhorrently reckless about risks from AI (though my guess is that this isn’t true of most people who are senior leadership or who work on Alignment Science, and I think that a bigger fraction of staff are thoughtful about these risks at Anthropic than other frontier AI companies). This is mostly because they’re tech people, who are generally pretty irresponsible. I agree that Anthropic sort of acts like “surely we’ll figure something out before anything catastrophic happens”, and this is pretty scary.
I don’t think that “AI will eventually pose grave risks that we currently don’t know how to avert, and it’s not obvious we’ll ever know how to avert them” immediately implies “it is repugnant to ship SOTA tech”, and I wish you spelled out that argument more.
I agree that it would be good if Anthropic staff (including those who identify as concerned about AI x-risk) were more honest and serious than the prevailing Anthropic groupthink wants them to be.
What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone’s AI development safer, but it’s not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from “obviously you should never build something that has a big chance of killing everyone”.
(I don’t think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)
Also my understanding is that technology is often built this way during deadly races where at least one side believes that them building it faster is net good despite the risks (e.g. deciding to fire the first nuke despite thinking it might ignite the atmosphere, …).
If this is their belief, they should state it and advocate for the US government to prevent everyone in the world, including them, from building what has a double-digit chance of killing everyone. They’re not doing that.
P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30% --> You need to factor in the probability that Anthropic is first, and that the other companies will not go on to create AGI once Anthropic already has; this is by default not the case.
I agree, the net impact is definitely not the difference between these numbers.
Also I meant something more like P(doom|Anthropic builds AGI first). I don’t think people are imagining that the first AI company to achieve AGI will have an AGI monopoly forever. Instead some think it may have a large impact on what this technology is first used for and what expectations/regulations are built around it.
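To make the arithmetic in this sub-thread concrete (every number below is a placeholder for illustration, not anyone’s actual estimate): the naive 30% - 15% gap has to be weighted by the probability that Anthropic is in fact first, and then adjusted for whatever extra doom comes from one more lab racing.

```python
# Placeholder numbers for illustration only; none of these are anyone's real estimates.
p_doom_if_anthropic_first = 0.15
p_doom_if_other_first = 0.30
p_anthropic_first = 0.25        # assumed chance Anthropic actually gets there first
race_speedup_penalty = 0.03     # assumed extra doom from one more frontier racer

delta_doom = (p_anthropic_first * (p_doom_if_anthropic_first - p_doom_if_other_first)
              + race_speedup_penalty)
print(f"change in P(doom) from Anthropic racing: {delta_doom:+.4f}")
# 0.25 * (-0.15) + 0.03 = -0.0075: roughly a wash with these made-up inputs,
# which is why the disagreement turns on the inputs rather than the arithmetic.
```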
It would be easier to argue with you if you proposed a specific alternative to the status quo and argued for it. Maybe “[stop] shipping SOTA tech” is your alternative. If so: surely you’re aware of the basic arguments for why Anthropic should make powerful models; maybe you should try to identify cruxes.
Separately from my other comment: It is not the case that the only appropriate thing to do when someone is going around killing your friends and your family and everyone you know is to “try to identify cruxes”.
It’s eminently reasonable for people to just try to stop whatever is happening, which includes pushing for social censure, convincing others, and coordinating social action. It is not my job to convince Anthropic staff they are doing something wrong. Indeed, the economic incentives point extremely strongly towards Anthropic staff being the hardest to convince of true beliefs here. The standard you invoke here seems pretty crazy to me.
It is not clear to me that Anthropic “unilaterally stopping” will result in meaningfully better outcomes than the status quo, let alone that it would be anywhere near the best way for Anthropic to leverage its situation.
Like, I was an ML expert who, roughly ten years ago, decided not to advance capabilities and instead work on safety-related things, and when the returns to that seemed too dismal, stopped doing that also. How much did my ‘unilateral stopping’ change things? It’s really hard to estimate the counterfactual of how much I would have actually shifted progress; on the capabilities front I had several ‘good ideas’ years early, but maybe my execution would’ve sucked, or I would’ve been focused on my bad ideas instead. (Or maybe me being at the OpenAI lunch table and asking people good questions would have sped the company up by 2%, or w/e, independent of my direct work.)
How many people are there like me? Also not obvious, but probably not that many. (I would guess most of them ended up in the MIRI orbit and I know them, but maybe there are lurkers—one of my friends in SF works for generic tech companies but is highly suspicious of working for AI companies, for reasons roughly downstream of MIRI, and there might easily be hundreds of people in that boat. But maybe the AI companies would only actually have wanted to hire ten of them, and the others objecting to AI work didn’t actually matter.)
It is not clear to me that Anthropic “unilaterally stopping” will result in meaningfully better outcomes than the status quo
I think that just Anthropic, OpenAI, and DeepMind stopping would plausibly result in meaningfully better outcomes than the status quo. I still see no strong evidence that anyone outside these labs is actually pursuing AGI with anything like their level of effectiveness. I think it’s very plausible that everyone else is either LARPing (random LLM startups), or largely following their lead (DeepSeek/China), or pursuing dead ends (Meta’s LeCun), or some combination.
The o1 release is a good example. Yes, everyone and their grandmother was absent-mindedly thinking about RL-on-CoTs and tinkering with relevant experiments. But it took OpenAI deploying a flashy proof-of-concept for everyone to pour vast resources into this paradigm. In the counterfactual where the three major labs weren’t there, how long would it have taken the rest to get there?
I think it’s plausible that if only those three actors stopped, we’d get +5-10 years to the timelines just from that. Which I expect does meaningfully improve the outcomes, particularly in AI-2027-style short-timeline worlds.
So I think getting any one of them to individually stop would be pretty significant, actually (inasmuch as it’s a step towards “make all three stop”).
I think, more than this, when you look at the labs you will often see that the breakthrough work was done by a small handful of people or a small team, whose direction was not popular before their success. If just those people had decided to retire to the tropics, and everyone else had stayed, I think that would have made a huge difference to the trajectory. (What does it look like if Alec Radford had decided not to pursue GPT? Maybe the idea was ‘obvious’ and someone else gets it a month later, but I don’t think so.)
I see no principle by which I should allow Anthropic to build existentially dangerous technology, but disallow other people from building it. I think the right choice is for no lab to build it. I am here not calling for particularly much censure of Anthropic compared to all labs, and my guess is we can agree that in aggregate building existentially dangerous AIs is bad and should face censure.
If you are killing me and my friends because you think it better that you do the killing than someone else, then actually I will still ask you to stop, because I draw a hard line around killing me and my friends. Naturally, I have a similar line around developing tech that will likely kill me and my friends.
I think this would fail Anthropic’s ideological Turing test. For example, they might make arguments like: by being a frontier lab, they can push for impactful regulation in a way they couldn’t if they weren’t; they can set better norms and demonstrate good safety practices that get adopted by others; or they can conduct better safety research that they could not do without access to frontier models. It’s totally reasonable to disagree with this, or argue that their actions so far (e.g., lukewarm support and initial opposition to SB 1047) show that they are not doing this, but I don’t think these arguments are, in principle, ridiculous.
Yeah, sorry, I think it’s just very tricky for me to pass Anthropic’s ITT, because to imitate Anthropic, I would need to be concurrently saying stuff like “by being a frontier lab, we can push for impactful regulation”, typing stuff like “this bill will impose multi-million dollar fines for minor, technical violations, representing a risk to smaller companies” about a NY bill with requirements only for $100m+ training runs that would not impose multi-million dollar fines for minor violations, and misleading a part of me about Dario’s role (he is Anthropic’s politics and policy lead and was a lot more involved in SB 1047 than many at Anthropic think).
It’s generally harder to pass ITT of an entity that lies to itself and others than to point out why it is incoherent and ridiculous.
In my mind, a good predictor of Anthropic’s actions is something in the direction of “a bunch of Sam Altmans stuck with potentially unaligned employees (who care about x-risk), going hard on trying to win the race”.
A bill passed two chambers of New York State legislature. It incorporated a lot of feedback from this community. This bill’s author actually talked about it as a keynote speaker at an event organized by FAR at the end of May.
There’s no good theory of change for Anthropic compatible with them opposing and misrepresenting this bill. If you work at Anthropic on AI capabilities, you should stop.
We’ve given some feedback to this bill, like we do with many bills both at federal and state level. Despite improvements, we continue to have some concerns
(Many such cases!)
- RAISE is overly broad/unclear in some of its key definitions which makes it difficult to know how to comply
- If the state believes there is a compliance deficiency in a lab’s safety plan, it’s not clear you’d get an opportunity to correct it before enforcement kicks in
- Definition of ‘safety incident’ is extremely broad/unclear and the turnaround time is v short (72 hours!). This could make for lots of unnecessary over-reporting that distracts you from actual big issues
- It also appears multi-million dollar fines could be imposed for minor, technical violations—this represents a real risk to smaller companies
If there isn’t anything at the federal level, we’ll continue to engage on bills at the state level—but as this thread highlights, this stuff is complicated.
Any state proposals should be narrowly focused on transparency and not overly prescriptive. Ideally there would be a single rule for the country.
Here’s what the bill’s author says in response:
Jack, Anthropic has repeatedly stressed the urgency and importance of the public safety threats it’s addressing, but those issues seem surprisingly absent here.
Unfortunately, there’s a fair amount in this thread that is misleading and/or inflammatory, especially “multi-million dollar fines could be imposed for minor, technical violations—this represents a real risk to smaller companies.”
An army of lobbyists are painting RAISE as a burden for startups, and this language perpetuates that falsehood. RAISE only applies to companies that are spending over $100M on compute for the final training runs of frontier models, which is a very small, highly-resourced group.
In addition, maximum fines are typically only applied by courts for severe violations, and it’s scaremongering to suggest that the largest penalties will apply to minor infractions.
The 72 hour incident reporting timeline is the same as the cyber incident reporting timeline in the financial services industry, and only a short initial report is required.
AG enforcement + right to cure is effectively toothless, could lead to uneven enforcement, and seems like a bad idea given the high stakes of the issue.
I’m not saying that it’s implausible that the consequences might seem better. I’m stating that it’s still morally wrong to race toward causing a likely extinction-level event, as that’s a pretty Schelling place for a deontological line against action.
Ah. In that case we just disagree about morality. I am strongly in favour of judging actions by their consequences, especially for incredibly high stakes actions like potential extinction level events. If an action decreases the probability of extinction I am very strongly in favour of people taking it.
I’m very open to arguments that the consequences would be worse, that this is the wrong decision theory, etc, but you don’t seem to be making those?
I too believe we should ultimately judge things based on their consequences. I believe that having deontological lines against certain actions is something that leads humans to make decisions with better consequences, partly because we are bounded agents that cannot well-compute the consequences of all of our actions.
For instance, I think you would agree that it would be wrong to kill someone in order to prevent more deaths, today here in the Western world. Like, if an assassin is going to kill two people, but says that if you kill one of them he won’t kill the other, then if you kill that person you should still be prosecuted for murder. It is actually good to not cross these lines even if the local consequentialist argument seems to check out. I make the same sort of argument for being first in the race toward an extinction-level event. Building an extinction-machine is wrong, and arguing you’ll be slightly more likely to pull back first does not stop it from being something you should not do.
I think when you look back at a civilization that raced to the precipice and committed auto-genocide, and ask where the lines in the sand should’ve been drawn, the most natural one will be “building the extinction machine, and competing to be first to do so”. So it is wrong to cross this line, even for locally net positive tradeoffs.
I think this just takes it up one level of meta. We are arguing about the consequences of a ruleset. You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people—this is your prerogative, but I don’t think this really gets you out of the regress of people disagreeing about what the best actions are.
Engaging with the object level of whether your proposed ruleset is a good one, I feel torn.
For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved. If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
And it is reasonable in extreme conditions (e.g. when those prohibitions are violated by others acting against you) to abandon standard ethical prohibitions. For example, I think it was just for Allied soldiers to kill Nazi soldiers in World War II. I think having nuclear weapons is terrible and questionable but I generally don’t support countries unilaterally abandoning their nuclear weapons, leaving them vulnerable to other nuclear-armed nations. Obviously, there are many disanalogies, but my point is that you need to establish how much a given deontological prohibition makes sense in unusual situations, rather than just appealing to moral intuition.
I’m not here to defend Anthropic’s actions on the object level—they are not acting as I would in their situation, but they may have sound reasons. But they are not acting badly enough that I confidently assume bad faith. They have had positive effects, like their technical research and helping RSPs become established, though I disagree with some of their policy positions.
Another disanalogy between this and murder is that there are multiple AGI labs, and only one needs to cause human extinction. If Anthropic ceased to exist, other labs would continue this work. I’d argue that Anthropic is accelerating development by researching capabilities and intensifying commercial pressure, and this is bad. But when arguing about acceleration’s harm, we must weigh it against Anthropic’s potential good—this becomes more of an apples-to-apples comparison rather than a clear deontological violation.
If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
Not a crux for either of us, but I disagree. When is the last time that someone shut down a multi-billion dollar profit arm of a company due to ethics, and especially due to the threat of extinction? If Anthropic announced they were ceasing development / shutting down because they did not want to cause an extinction-level event, this would have massive ramifications through society as people started to take this consequence more seriously, and many people would become more scared, including friends of employees at the other companies and more of the employees themselves. This would have massive positive effects.
For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved.
This implies one should never draw lines in the sand about good/bad behavior if society has not reached consensus on it. In contrast, I think it is good to not do many behaviors even if your society has not yet reached consensus on them. For instance, if a government has not yet regulated that language models shouldn’t encourage people to kill themselves, and then language models do and thousands of people die (NB: this is a fictional example), this isn’t ethically fine just because it wasn’t illegal. I think we should act in ways that we believe will make sense as policies even before they have achieved consensus, and this is part of what makes someone engaged in ethics rather than in simply “doing what you are told”.
You bring up Nazism. I think that it was wrong to go along with Nazism even though the government endorsed it. Surely there are ethical lines against causing an extinction-level event even if your society has not come to a consensus on where those lines are yet. And even if we never achieve consensus, everyone should still attempt to figure out the answer and live by it, rather than give up on having such ethical lines.
I’m not here to defend Anthropic’s actions on the object level—they are not acting as I would in their situation, but they may have sound reasons. But they are not acting badly enough that I confidently assume bad faith. They have had positive effects, like their technical research and helping RSPs become established, though I disagree with some of their policy positions.
Habryka wrote about how the bad-faith comment was a non sequitur in another thread. I will here say that the “I’m not here to defend Anthropic’s actions on the object level” doesn’t make sense to me. I am saying they should stop racing, and you are saying they should not, and we are exchanging arguments for this, currently coming down to the ethics of racing toward an extinction-level event and whether there are deontological lines against doing that. I agree that you are not attempting to endorse all the details of what they are doing beyond that, but I believe you are broadly defending their IMO key object-level action of doing multi-billion dollar AI capabilities research and building massive industry momentum.
You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people—this is your prerogative, but I don’t think this really gets you out of the regress of people disagreeing about what the best actions are.
It reads to me that you’re just talking around the point here. I said that people shouldn’t race toward extinction-level threats for deontological reasons, you said we should evaluate the direct consequences, I said deontological reasons are endorsed by a consequentialist framework so we should analyze it deontologically, and now you’re saying that I’m conceding the initial point that we should be doing the consequentialist analysis. No, I’m saying we should do a deontological analysis, and this is in conflict with you saying we should just judge based on the direct consequences that we know how to estimate.
I’d argue that Anthropic is accelerating development by researching capabilities and intensifying commercial pressure, and this is bad. But when arguing about acceleration’s harm, we must weigh it against Anthropic’s potential good—this becomes more of an apples-to-apples comparison rather than a clear deontological violation.
You keep trying to engage me in this consequentialist analysis, and say that sometimes (e.g. during times of war) the deontological rules can have exceptions, but you have not argued for why this is an exception. If people around you in society start committing murder, would you then start murdering? If people around you started lying, would you then start lying? I don’t think so. Why then, if people around you are racing to an extinction-level event, does the obvious rule of “do not race toward an extinction-level event” get an exception? Other people doing things that are wrong (even if they get away with it!) doesn’t make those things right.
The point I was trying to make is that, if I understood you correctly, you were trying to appeal to common sense morality to argue that deontological rules like this are good on consequentialist grounds. I was trying to give examples of why I don’t think this immediately follows, and why you need to actually make object-level arguments about this and engage with the counterarguments. If you want to argue for deontological rules, you need to justify why those rules actually lead to better outcomes.
I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropic’s that I would predict are net good in expectation, e.g., becoming a prominent actor and then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited, and I disagree.
I don’t think this follows from naive moral intuition. A crucial disanalogy with murder is that if you don’t kill someone, the counterfactual is that the person is alive. While if you don’t race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway. This means that we need to be engaging in discussion about the consequences of there being another actor pushing for this, the consequences of other actions this actor may take, and how this all nets out, which I don’t feel like you’re doing.
I expect AGI to be either the best or worst thing that has ever happened, and this means that important actions will typically be high variance, with major positive or negative consequences. Declining to engage in things with the potential for high negative consequences severely restricts your action space. And given that it’s plausible that there’s a terrible outcome even if we do nothing, I don’t think the act-omission distinction applies.
I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropic’s that I would predict are net good in expectation, e.g., becoming a prominent actor and then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited, and I disagree.
Thank you for clarifying, I think I understand now. I’m hearing that you’re not arguing in defense of Anthropic’s specific plan, but in defense of there being some good part of the space of plans that involve racing to build something with a (say) >20% chance of causing an extinction-level event, a part that Anthropic may or may not fall into.
A crucial disanalogy with murder is that if you don’t kill someone, the counterfactual is that the person is alive. While if you don’t race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway.
This isn’t disanalogous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
Crucially, it is not to be assumed that we will build AGI in the next 1-2 decades. If the countries of the world decided to ban training runs of a particular size, because we don’t want to take this sort of extinction-level risk, then it would not happen. Assuming this out of the hypothesis space will get you into bad ethical territory. Suppose a military general says “War is inevitable, the only question is how fast it’s over when it starts and how few deaths there are.” This general would never take responsibility for instigating one. Similarly, if you assume with certainty that AGI will be developed in the next few decades, you absolve yourself of all responsibility for being the one who does so.
Declining to engage in things with the potential for high negative consequences severely restricts your action space.
I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.
When the stakes get high it is not time to start lying, cheating, killing, or unilaterally betting the extinction of the human race. If it is for someone, then they simply can’t be trusted to follow these ethical principles when it matters.
Thank you for clarifying, I think I understand now. I’m hearing that you’re not arguing in defense of Anthropic’s specific plan, but in defense of there being some good part of the space of plans that involve racing to build something with a (say) >20% chance of causing an extinction-level event, a part that Anthropic may or may not fall into.
Yes that is correct
This isn’t disanalogous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of “taking actions that may lead to death”, which I think is more analogous—hopefully we can agree Anthropic won’t intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone’s probability of dying, while introducing some novel risks.
I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.
I think we’re talking past each other. I understood you as arguing “deontological rules against X will systematically lead to better consequences than trying to evaluate each situation carefully, because humans are fallible”. I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.
I am arguing that “things to do with human extinction from AI, when there’s already a meaningful likelihood” are not a domain where ethical prohibitions like “never do things that could lead to human extinction” are productive. For example, you help run LessWrong, which I’d argue has helped raise the salience of AI x-risk, which plausibly has accelerated timelines. I personally think this is outweighed by other effects, but that’s via reasoning about the consequences. Your actions and Anthropic’s feel more like a difference in scale than a difference in kind.
Assuming this out of the hypothesis space will get you into bad ethical territory
I am not arguing that AI x-risk is inevitable, in fact I’m arguing the opposite. AI x-risk is both plausible and not inevitable. Actions to reduce this seem very valuable. Actions that do this will often have side effects that increase risk in other ways. In my opinion, this is not sufficient cause to immediately rule them out.
Meanwhile, I would consider anyone pushing hard to make frontier AI to be highly reckless if they were the only one who could cause extinction, and they could unilaterally stop—this is a way to unilaterally bring risk to zero, which is better than any other action. But Anthropic has no such action available, and so I want them to take the actions that reduce risk as much as possible. And there are arguments for proceeding and arguments for stopping.
> As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of “taking actions that may lead to death”, which I think is more analogous—hopefully we can agree Anthropic won’t intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone’s probability of dying, while introducing some novel risks.
This is simplifying away key details.
If you go up to a person with a deadly illness and non-consensually do a dangerous surgery on them, this is wrong. If you kill them via this, their family has a right to sue you / prosecute you for murder. Once again, simply because some bad outcome is likely, you do not have an ethical mandate to now go and cause it yourself. Deontology is typically about forbidding classes of action that on net make the world worse even when locally you have a good reason. Talking about “taking actions that lead to death” explicitly obfuscates the mechanism. I know you won’t endorse this once I point it out, but under this strictly-consequentialist framework “blogging on LessWrong about extinction-risk from AI” and “committing murder” are just two different “actions that lead to death” and neither can be thought of as having different deontological lines drawn. On the contrary, “don’t commit murder” and “don’t build a doomsday machine” are simple and natural deontological rules, whereas “don’t build a blogging platform with unusually high standards for truthseeking” is not.
I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.
I am not trying to argue for an especially novel deontological rule… “building a doomsday machine” is wrong. It’s a far greater sin than murder. I think you’d do better to think of the AI companies as more like competing political factions, each of whose base is very motivated toward committing a genocide against their neighbors. If your political faction commits a genocide, and you were merely a top-200 ranked official who didn’t particularly want a genocide, you still bear moral responsibility for it even though you only did paperwork and took meetings and maybe worked in a different department. Just because there are two political factions whose bases are uncomfortably attracted to the idea of committing genocide does not now make it ethically clear for you to make a third one that hungers for genocide but has wiser people in charge.
I am not advocating for some new interesting deontological rule. I am arguing that the obvious rule against building a doomsday machine applies here straightforwardly. Deontological violations don’t stop being bad just because other people are committing them. You cannot commit murder just because other people do, and you cannot build a doomsday machine just because other people are. You generally shouldn’t build doomsday machines even if you have a good reason. To argue against this you should show why deontological rules break down, and then apply that to this case, but the doctor example you gave doesn’t show that, because by default you aren’t actually allowed to non-consensually do risky surgeries on people even if it makes sense on the consequentialist calculus.
I continue to feel like we’re talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there’s a really good reason why breaking them seems locally beneficial, because on average, the decision theory that’s willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions for things like having someone’s consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception stating that: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of this bad outcome happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.
When analysing the reasons I broadly buy the deontological framework for “don’t commit murder”, I think there are some clear lines in the sand, such as maintaining a valuable social contract, and the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes thing where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples, I’m completely ethically comfortable with someone making a third political party in a country where the population has two groups who both strongly want to cause genocide to the other. I think there are many ways that such a third political party could reduce the probability of genocide, even if it ultimately comprises a political base who wants negative outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I’m strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what’s disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I’m purely defending the abstract point of “plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden”. You’re welcome to critique Anthropic’s actual actions as much as you like. But you seem to be making a much more general claim.
If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there’s a really good reason why breaking them seems locally beneficial, because on average, the decision theory that’s willing to do harmful things for complex reasons performs badly.
Hm… I would say that one should follow deontological rules like “don’t lie” and “don’t steal” and so on because we fail to understand or predict the knock-on consequences. For instance, lying and stealing can get the world into a much worse equilibrium of mutual liars/stealers in ways that are hard to predict, and being a good person can get the world into a much better equilibrium of mutually honorable people in ways that are hard to predict. And also because, if things do screw up in some hard-to-predict way, then when you look back, the rule will often be the easiest line in the sand to have drawn.
For instance, if SBF is wondering at what point he could have most reliably intervened on his whole company collapsing and ruining the reputation of things associated with it, he might talk about certain deals he made or strategic plays with Binance or the US Govt, for he is not a very ethical person; I would talk about not taking customer deposits.
If and when we get to an endgame where tons of AI systems are sociopathically lying and stealing money and ultimately killing the humans, I suspect people of SBF’s mindset will again talk about how the US and China should’ve played things, or how Musk should’ve played OpenAI, and how Amodei should’ve played things with DC. And I will talk about not racing to develop the unaligned AI systems in the first place.
To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes thing where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
I don’t really know why you think that this generalization can’t be made to things we’ve not seen before. So many things I experience haven’t been seen before in history. How many centuries have we had to develop ethical intuitions for how to write on web forums? There are still answers to these questions, and I can identify ethical and unethical behaviors, as can you (e.g. sockpuppeting, doxing, brigading, etc). There can be ethical lines in novel situations, not only historically common ones.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I’m strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what’s disanalogous?
I am not sure what I would propose if I believed Nuclear Winter was a serious existential threat; it seems plausible to me that the ethical thing would be to unilaterally disarm. I suspect that at the very least if I were a country I would openly and aggressively campaign for mutual disarmament. (If any AI capabilities company openly campaigned for making it illegal to develop AI then I suspect I would consider that plausibly quite ethical).
I’m purely defending the abstract point of “plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden”.
To be clear, I think you’re defending a somewhat stronger claim. You write further up thread:
I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropic’s that I would predict are net good in expectation, e.g., becoming a prominent actor and then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited, and I disagree.
My current stance is that all actors currently in this space are doing things prohibited by basic deontology. This is not merely an unfortunate outcome, but a grave sin, for they are building doomsday machines, likely the greatest evil that we will ever experience in our history (regardless of whether they are successful). So I want to emphasize that the boundary here is not between “better and worse plans” but between “morally murky and morally evil plans”. Insofar as you commit a genocide or worse, history should remember your names as people of shame whom we must take pains never to repeat. Insofar as you played with the idea, thought you could control it, and failed, then history should still think of you this way.
I believe we disagree over where the deontological lines are, given you are defending “vaguely similar plans to Anthropic’s”. Perhaps you could point to where you think they are? Presumably you think that a Larry Page style “this is just the next stage in evolution” indifference to human extinction AI-project would be morally wrong?
Here’s two lines that I think might cross into being acceptable [edit: or rather, “only morally murky”] from my perspective.
I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.
I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differentially improve society’s epistemics and awareness of AI’s extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I’m not sure.
I of course do not think any current project looks superficially like these.
Here’s two lines that I think might cross into being acceptable from my perspective.
I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.
I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differentially improve society’s epistemics and awareness of AI’s extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I’m not sure.
I of course do not think any current project looks superficially like these.
Okay, after reading this it seems to me that we broadly do agree and are just arguing over price. I’m arguing that it is permissible to try to build a doomsday machine if there are really good reasons to believe it is net good for the probability of doomsday. It sounds like you agree, and give two examples of what “really good reasons” could be. I’m sure we disagree on the boundaries of where the really good reasons lie, but I’m trying to defend the point that you actually need to think about the consequences.
What am I missing? Is it that you think these two are really good reasons, not because of the impact on the consequences, but because of the attitude/framing involved?
I’m not Ben, but I think you don’t understand. I think explaining what you are doing loudly in public isn’t like “having a really good reason to believe it is net good”; it is instead more like asking for consent.
Like you are saying “please stop me by shutting down this industry” and if you don’t get shut down, that it is analogous to consent: you’ve informed society about what you’re doing and why and tried to ensure that if everyone else followed a similar sort of policy we’d be in a better position.
(Not claiming I agree with Ben’s perspective here, just trying to explain it as I understand it.)
Ah! Thanks a lot for the explanation, that makes way more sense, and is much weaker than what I thought Ben was arguing for. Yeah, this seems like a pretty reasonable position, especially “take actions where if everyone else took them we would be much better off”, and I am completely fine with holding Anthropic to that bar. I’m not fully sold on the asking-for-consent framing, but mostly for practical reasons: I think there are many ways in which society is not able to act consistently, and the actions of governments on many issues are not a reflection of the true informed will of the people, but I expect there’s some reframe here that I would agree with.
and is much weaker than what I thought Ben was arguing for.
I don’t think Ryan (or I) was intending to imply a measure of degree, so my guess is unfortunately somehow communication still failed. Like, I don’t think Ryan (or Ben) are saying “it’s OK to do these things you just have to ask for consent”. Ryan was just trying to point out a specific way in which things don’t bottom out in consequentialist analysis.
If you end up walking away with thinking that Ben believes “the key thing to get right for AI companies is to ask for consent before building the doomsday machine”, which I feel like is the only interpretation of what you could mean by “weaker” that I currently have, then I think that would be a pretty deep misunderstanding.
There is something important to me in this conversation about not trusting one’s consequentialist analysis when evaluating proposals to violate deontological lines, and from my perspective you still haven’t managed to paraphrase this basic ethical idea or shown you’ve understood it, which I feel a little frustrated over. Ah well. I still have been glad of this opportunity to argue it through, and I feel grateful to Neel for that.
I actually agree with Neel that, in principle, an AI lab could race for AGI while acting responsibly and IMO not violating deontology.
For example: releasing models exactly at the level of their top competitor, immediately after the competitor’s release and a bit cheaper; talking to governments and lobbying for regulation; having an actually robust governance structure; and not doing anything that increases the chance of everyone dying.
This doesn’t describe any of the existing labs, though.
But they are not acting badly enough that I confidently assume bad faith
I like a lot of your comment, but this feels like a total non sequitur. Did anyone involved in this conversation say that Anthropic was acting under false pretenses? I don’t think anyone brought up concerns that rest on assumptions of bad faith (though to be clear, Anthropic employees have mostly told me I should assume something like bad faith from Anthropic as an institution, and that people should try to hold it accountable the same way as any other AI lab, and not straightforwardly trust statements Anthropic makes without associated commitments, so I do think I would assume bad faith, but it mostly just feels beside the point in this discussion).
it was just for Allied soldiers to kill Nazi soldiers in World War II
Killing anyone who hasn’t done anything to lose deontological protection is wrong and clearly violates deontology.
As a Nazi soldier, you lose deontological protection.
There are many humans who are not even customers of any of the AI labs; they clearly have not lost deontological protection, and it’s not okay to risk killing them without their consent.
I disagree with this as a statement about war. I’m sure a bunch of Nazi soldiers were conscripted, did not particularly support the regime, and were participating out of fear. Similarly, malicious governments have conscripted innocent civilians and kept them in line through fear in many unjust wars throughout history. And even people who volunteered may have done this due to being brainwashed by extensive propaganda that led to them believing they were doing the right thing. The real world is messy and strict deontological prohibitions break down in complex and high stakes situations, where inaction also has terrible consequences. I strongly disagree with a deontological rule that says countries are not allowed to defend themselves against innocent people forced to do terrible things.
My deontology prescribes not to join a Nazi army regardless of how much fear you’re in. It’s impossible to demand of people to be HPMOR!Hermione, but I think this standard works fine for real-world situations.
(While I do not wish death on any Nazi soldiers, regardless of their views or reasons for their actions. There’s a sense in which Nazi soldiers are innocent regardless of what they’ve done; none of them are grown up enough to be truly responsible for their actions. Every single death is very sad, and I’m not sure there has ever been even a single non-innocent human. At the same time, I think it’s okay to kill Nazi soldiers (unless they’re in the process of surrendering, etc.) or lie to them, and they don’t have deontological protection.)
You’re arguing it’s okay to defend yourself against innocent people forced to do terrible things. I agree with that, and my deontology agrees with that.
At the same time, killing everyone because otherwise someone else could’ve killed them with a higher chance = killing many people who aren’t ever going to contribute to any terrible things. I think, and my deontology thinks, that this is not okay. Random civilians are not innocent Nazi soldiers; they’re simply random innocent people. I ask of Anthropic to please stop working towards killing them.
And do you feel this way because you believe that the general policy of obeying such deontological prohibitions will on net result in better outcomes? Or because you think that even if there were good reason to believe that following a different policy would lead to better empirical outcomes, your ethics say that you should be deontologically opposed regardless?
I think the general policy of obeying such deontological rules leads to better outcomes; this is the reason for having deontology in the first place. (I agree with that old post on what to do when it feels like there’s a good reason to believe that following a different policy would lead to better outcomes.)
(Just as a datapoint, while largely agreeing with Ben here, I really don’t buy this concept of deontological protection of individuals. I think there are principles we have about when it’s OK to kill someone, but I don’t think the lines we have here route through individuals losing deontological protection.
Killing a mass murderer while he is waiting for trial is IMO worse than killing a civilian in collateral damage as part of taking out an active combatant, because it violates and messes with different processes, which don’t generally route through individuals “losing deontological protection” but instead are more sensitive to the context the individuals are in)
Locally: can you give an example of when it’s okay to kill someone who didn’t lose deontological protection, where you want to kill them because of the causal impact of their death?
To me the issue goes the other way. The idea of “losing deontological protection” suggests I’m allowed to ignore deontological rules when interacting with someone. But that seems obviously crazy to me. For instance I think there’s a deontological injunction against lying, but just because someone lies doesn’t now mean I’m allowed to kill them. It doesn’t even mean I’m allowed to lie to them. I think lying to them would still be about as wrong as it was before, not a free action I can take whenever I feel like it.
I mean, a very classical example that I’ve seen a few times in media is shooting a civilian who is about to walk into a minefield in which multiple other civilians or military members are located. It seems tragic but obviously the right choice to shoot them if they don’t heed your warning.
IDK, I also think it’s the right choice to pull the lever in the trolley problem, though the choice becomes less obvious the more it involves active killing as opposed to literally pulling a lever.
Suppose I hire a hitman to kill you. But suppose there already are 3 hitmen trying to kill you, and I’m hoping my hitman would reach you first, and I know that my hitman has really bad aim. Once the first hitman reaches you and starts shooting, the other hitmen will freak out and run away, so I’m hoping you’re more likely to survive.
I have no other options for saving you, since the only contact I have is a hitman, and he’s very bad at English and doesn’t understand any instructions except trying to kill someone.
In this case, you can argue to the court that my plan to save you was idiotic. But you cannot say that my plan was actually a good idea consequentially yet deontologically unethical, since I didn’t intend to kill anyone.
Deontology only kicks in when your plan involves making someone die, or greatly increasing the chance someone dies.
I feel like it’s actually a great analogy! The only difference is that if your hitman starts shooting and doesn’t kill anyone, you get infinite gold.
You know that in real life you go to police instead of hiring a hitman, right?
And I claim that it’s really not okay to hire a hitman who might lower the chance of the person ending up dead, especially when your brain is aware of the infinite gold part.
The good strategy for anyone in that situation to follow is to go to the police or go public and not hire any additional hitmen.
I don’t agree that deontology is about intent. Deontology is about action. Deontology is about not hiring hitmen to kill someone even if you have a really good reason, and even if your intent is good. Deontology is substantially about Schelling lines of action where everything gets hard to predict and goes bad after you commit it.
I imagine that your incompetent hitman has only like a 50% chance of succeeding, whereas the others have ~100%; that seems deontologically wrong to me.
It seems plausible that what you mean to say by the hypothetical is that he has 0% chance.
I admit this is more confusing and I’m not fully resolved on this.
I notice I am confused about how you can get that epistemic state in real life.
I observe that society will still prosecute you for attempted murder if you buy a hitman off the dark web, even one with a clearly incompetent reputation for 0/10 kills or whatever.
I think society’s ability to police this line is not as fine grained as you’re imagining, and so you should not buy incompetent hitmen in order to not kill your friend, unless you’re willing to face the consequences.
To be honest I couldn’t resist writing the comment because I just wanted to share the silly thought :/
Now that I think about it, it’s much more complicated. Mikhail Samin is right that the personal incentive of reaching AGI first really complicates the good intentions. And while a lot of deontology is about intent, it’s hyperbole to say that deontology is just intent.
I think if your main intent is to save someone (and not personal gain), and your plan doesn’t require or seek anyone’s death, then it is deontologically much less bad than evil things like murder. But it may still be too bad for you to do, if you strongly lean towards deontology rather than consequentialism. Even if the court doesn’t find you guilty of first degree murder, it may still find you guilty of… some… things.
One might argue that the enormous scale (risking everyone’s death instead of only one person), makes it deontologically worse. But I think the balance does not shift in favor of deontology and against consequentialism as we increase the scale (it might even shift a little in favor of consequentialism?).
That’s fair, but the deontological argument doesn’t work for anyone building the extinction machine who is unconvinced by x-risk arguments, or deludes themselves that it’s not actually an extinction machine, or that extinction is extremely unlikely, or that the extinction machine is the only thing that can prevent extinction (as in all the alignment via AI proposals) etc. etc.
I suppose if you think it’s less likely there will be killing involved if you’re the one holding the overheating gun than if someone else is holding it, that hard line probably goes away.
Just because someone else is going to kill me, doesn’t mean we don’t have an important societal norm against murder. You’re not allowed to kill old people just because they’ve only got a few years left, or kill people with terminal diseases.
I am not quite sure what an overheating gun refers to, I am guessing the idea is that it has some chance of going off without being fired.
Anyhow, if that’s accurate, it’s acceptable to decide to be the person holding an overheating gun, but it’s not acceptable to (for example) accept a contract to assassinate someone so that you get to have the overheating gun, or to promise to kill slightly fewer people with the gun than the next guy. Like, I understand consequentially fewer deaths happen, but our society has deontological lines against committing murder even given consequentialist arguments, which are good. You’re not allowed to commit murder even if you have a good reason.
I fully expect we’re doomed, but I don’t find this attitude persuasive. If you don’t want to be killed, you advocate for actions that hopefully result in you not being killed, whereas this action looks like it just results in you being killed by someone else. Like you’re facing a firing squad and pleading specifically with just one of the executioners.
For me the missing argument in this comment thread is the following: Has anyone spelled out the arguments for how it’s supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race? Suppose whichever AI lab is most receptive to social censure could actually be persuaded to drop out; don’t we then just end in an Evaporative Cooling of Group Beliefs situation where the remaining participants in the race are all the more intransigent?
Has anyone spelled out the arguments for how it’s supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race?
An AI lab dropping out helps in two ways:
timelines get longer because the smart and accomplished AI capabilities engineers formerly employed by this lab are no longer working on pushing for SOTA models/no longer have access to tons of compute/are no longer developing new algorithms to improve performance even holding compute constant. So there is less aggregate brainpower, money, and compute dedicated to making AI more powerful, meaning the rate of AI capability increase is slowed. With longer timelines, there is more time for AI safety research to develop past its pre-paradigmatic stage, for outreach efforts to mainstream institutions to start paying dividends in terms of shifting public opinion at the highest echelons, for AI governance strategies to be employed by top international actors, and for moonshots like uploading or intelligence augmentation to become more realistic targets.
race dynamics become less problematic because there is one less competitor other top labs have to worry about, so they don’t need to pump out top models quite as quickly to remain relevant/retain tons of funding from investors/ensure they are the ones who personally end up with a ton of power when more capable AI is developed.
I believe these arguments, frequently employed by LW users and alignment researchers, are indeed valid. But I believe their impact will be quite small, or at the very least meaningfully smaller than what other people on this site likely envision.
And since I believe the evaporative cooling effects you’re mentioning are also real (and quite important), I indeed conclude pushing Anthropic to shut down is bad and counterproductive.
the smart and accomplished AI capabilities engineers formerly employed by this lab are no longer working on pushing for SOTA models/no longer have access to tons of compute/are no longer developing new algorithms to improve performance
For that to be the case, instead of the engineers simply joining another company, we would have to suggest other tasks for them. There are indeed very questionable technologies being shipped (for example, social media with automatic recommendation algorithms), but someone would have to connect the engineers to those tasks.
I agree with sunwillrise but I think there is an even stronger argument for why it would be good for an AI company to drop out of the race. It is a strong jolt that has a good chance of waking up the world to AI risk. It sends a clear message:
We were paper billionaires and we were on track to be actual billionaires, but we gave that up because we were too concerned that the thing we were building could kill literally everyone. Other companies are still building it. They should stop too.
I don’t know exactly what effect that would have on public discourse, but the effect would be large.
A board firing a CEO is a pretty normal thing to happen, and it was very unclear that the firing had anything to do with safety concerns because the board communicated so little.
A big company voluntarily shutting down because its product is too dangerous is (1) a much clearer message and (2) completely unprecedented, as far as I know.
In my ideal world, the company would be very explicit that they are shutting down specifically because they are worried about AGI killing everyone.
My understanding was that LessWrong, specifically, was a place where bad arguments are (aspirationally) met with counterarguments, not with attempts to suppress them through coordinated social action. Is this no longer the case, even aspirationally?
I think it would be bad to suppress arguments! But I don’t see any arguments being suppressed here. Indeed, I see Zack as trying to create a standard where (for some reason) arguments about AI labs being reckless must be made directly to the people who are working at those labs, and other arguments should not be made, which seems weird to me. The OP seems to me like it’s making fine arguments.
I don’t think it was ever a requirement for participation on LessWrong to only ever engage in arguments that could change the minds of the specific people who you would like to do something else, as opposed to arguments that are generally compelling and might affect those people in indirect ways. It’s nice when it works out, but it really doesn’t seem like a tenet of LessWrong.
Ah, I had (incorrectly) interpreted “It’s eminently reasonable for people to just try to stop whatever is happening, which includes intention for social censure, convincing others, and coordinating social action” as being an alternative to engaging at all with the arguments of people who disagree with your positions here, rather than an alternative to having that standard in the outside world with people who are not operating under those norms.
Sure, censure among people who agree with you is a fine thing for a comment to do. I didn’t read Mikhail’s comment that way because it seemed to be asking Anthropic people to act differently (but without engaging with their views).
It’s OK to ask people to act differently without engaging with your views! If you are stabbing my friends and family I would like you to please stop, and I don’t really care about engaging with your views. The whole point of social censure is to ask people to act differently even if they disagree with you, that’s why we have civilization and laws and society.
I think Anthropic leadership should feel free to propose a plan to do something that is not “ship SOTA tech like every other lab”. In the absence of such a plan, seems like “stop shipping SOTA tech” is the obvious alternative plan.
Clearly in-aggregate the behavior of the labs is causing the risk here, so I think it’s reasonable to assume that it’s Anthropic’s job to make an argument for a plan that differs from the other labs. At the moment, I know of no such plan. I have some vague hopes, but nothing concrete, and Anthropic has not been very forthcoming with any specific plans, and does not seem on track to have one.
I think Anthropic leadership should feel free to propose a plan to do something that is not “ship SOTA tech like every other lab”. In the absence of such a plan, seems like “stop shipping SOTA tech” is the obvious alternative plan.
Note that Anthropic, for the early years, did have a plan to not ship SOTA tech like every other lab, and changed their minds. (Maybe they needed the revenue to get the investment to keep up; maybe they needed the data for training; maybe they thought the first mover effects would be large and getting lots of enterprise clients or w/e was a critical step in some of their mid-game plans.) But I think many plans here fail once considered in enough detail.
Anthropic’s responsible scaling policy does mention pausing scaling if the capabilities of their models exceed their best safety methods:
“We have designed the ASL system to strike a balance between effectively targeting catastrophic risk and incentivising beneficial applications and safety progress. On the one hand, the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures. But it does so in a way that directly incentivizes us to solve the necessary safety issues as a way to unlock further scaling, and allows us to use the most powerful models from the previous ASL level as a tool for developing safety features for the next level.”
I think OP and others in the thread are wondering why Anthropic doesn’t stop scaling now given the risks. I think the reason why is that in practice doing so would create a lot of problems:
How would Anthropic fund their safety research if Claude is no longer SOTA and becomes less popular?
Is Anthropic supposed to learn from and test only models at current levels of capability, and if so, how does it learn about future advanced model behaviors? I haven’t heard a compelling argument for how we could solve superalignment by studying much less advanced models. Imagine trying to align GPT-4 or o3 by only studying and testing GPT-2 from 2019. In reality, future models will probably have lots of unknown unknowns and emergent properties that are difficult or impossible to predict in advance. And then there are all the social consequences of AI, like misuse, which are difficult to predict in advance.
Although I’m skeptical that alignment can be solved without a lot of empirical work on frontier models, I still think it would be better if AI progress were slower.
I don’t expect Anthropic to stick to any of their policies when competitive pressure means they have to train, deploy, and release or be left behind. None of their commitments are of a kind they couldn’t walk back.
Anthropic accelerates capabilities more than safety; they don’t even support regulation, with many people internally being misled about Anthropic’s efforts. None of their safety efforts meaningfully contributed to solving any of the problems you’d have to solve to have a chance of having something much smarter than you that doesn’t kill you.
I’d be mildly surprised if there’s a consensus at Anthropic that they can solve superalignment. The evidence they’re getting shows, according to them, that we live in an alignment-is-hard world.
If any of these arguments are Anthropic’s, I would love for them to say that out loud.
I’ve generally been aware of, or can come up with, some arguments; I haven’t heard them in detail from anyone at Anthropic, and would love for Anthropic to write up the plan, including the reasoning for why shipping SOTA models helps humanity survive rather than doing the opposite.
The last time I saw Anthropic’s claimed reason for existing, it later became an inspiration for
I’m confused about why you’re pointing to Anthropic in particular here. Are they being overoptimistic in a way that other scaling labs are not, in your view?
Unlike other labs, Anthropic is full of people who care and might leave capabilities work or push for the leadership to be better. It’s a tricky place to be in: if you’re responsible enough, you’ll hear more criticism than less responsible actors do, because criticism can still change what you’re doing.
Other labs are much less responsible, to be clear. There’s not a lot (I think) my words here can do about that, though.
Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
I understand your argument and it has merit, but I think the reality of the situation is more nuanced.
Humanity has long built buildings and bridges without access to formal engineering methods for predicting the risk of collapse. We might regard it as unethical to build such a structure now without using the best practically available engineering knowledge, but we do not regard it as having been unethical to build buildings and bridges historically, given the lack of modern engineering materials and methods. They did their best, more or less, with the resources they had access to at the time.
AI is a domain where the current state of the art safety methods are in fact being applied by the major companies, as far as I know (and I’m completely open to being corrected on this). In this respect, safety standards in the AI field are comparable to those of other fields. The case for existential risk is approximately as qualitative and handwavey as the case for safety, and I think that both of these arguments need to be taken seriously, because they are the best we currently have. It is disappointing to see the cavalier attitude with which pro-AI pundits dismiss safety concerns, and obnoxious to see the overly confident rhetoric deployed by some in the safety world when they tweet about their p(doom). It is a weird and important time in technology, and I would like to see greater open-mindedness and thoughtfulness about the ways to make progress on all of these important issues.
No other engineering field would accept “I hope we magically pass the hardest test on the first try, with the highest stakes” as an answer.
Perhaps the answer is right there, in the name. The future Everett branches where we still exist will indeed be the ones where we have magically passed the hardest test on the first try.
Branches like that don’t have a lot of reality-fluid and lost most of the value of our lightcone; you’re much more likely to find yourself somewhere before that.
Does “winning the race” actually give you a lever to stop disaster, or does it just make Anthropic the lab responsible for the last training run?
Does access to more compute and more model scaling, with today’s field understanding, truly give you more control—or just put you closer to launching something you can’t steer? Do you know how to solve alignment given even infinite compute?
Is there any sign, from inside your lab, that safety is catching up faster than capabilities? If not, every generation of SOTA increases the gap, not closes it.
“Build the bomb, because if we don’t, someone worse will.”
Once you’re at the threshold where nobody knows how to make these systems steerable or obedient, it doesn’t matter who is first—you still get a world-ending outcome.
If Anthropic, or any lab, ever wants to really make things go well, the only winning move is not to play, and to try hard to get everyone else not to play either.
If Anthropic were what it imagines itself to be, it would build robust field-wide coordination and support regulation that would be effective globally, even if that means watching over your shoulder for colleagues and competitors across the world.
If everyone justifies escalation as “safety”, there is no safety.
In the end, if the race leads off a cliff, the team that runs fastest doesn’t “win”: they just get there first. That’s not leadership. It’s tragedy.
If you truly care about not killing everyone, there will have to be a point, maybe now, where some leaders stop, even if it costs them, and demand a solution that doesn’t sacrifice the long term for the financial gain of having a model slightly better than your competitors’.
Anthropic is in a tricky place. Unlike other labs, it is full of people who care. The leadership has to adjust for that.
That makes you one of the few people in history who have the chance to say “no” to the spiral toward the end of the world and demand that your company behave responsibly.
(note: many of these points are AI-generated by a model with 200k tokens of Arbital in its context; though heavily edited.)
I have great empathy and deep respect for the courage of the people currently on hunger strikes to stop the AI race. Yet, I wish they hadn’t started them: these hunger strikes will not work.
Hunger strikes can be incredibly powerful when there’s a just demand, a target who would either give in to the demand or be seen as a villain for not doing so, a wise strategy, and a group of supporters.
I don’t think these hunger strikes pass the bar. Their political demands are not what AI companies would realistically give in to because of a hunger strike by a small number of outsiders.
A hunger strike can bring attention to how seriously you perceive an issue, but only if you know how to make it go viral; in the US, hunger strikes are rarely widely covered by the media. And even then, you are more likely to marginalize your views than to push them into the mainstream: if people don’t currently think halting frontier general AI development requires hunger strikes, a hunger strike won’t explain to them why your views are correct. That is not self-evident just from the description of the hunger strike, so the hunger strike is not the right approach here and now.
Also, our movement does not need martyrs. You can be a lot more helpful if you eat well, sleep well, and are able to think well and hard. Your life is also very valuable: it is part of what we’re fighting for, and saving a world without you is slightly sadder than saving a world with you. Perhaps more importantly to you, it will not help. For a hunger strike to have any meaning at all, it needs to already be seen by the public as legitimate, so that it makes people more sympathetic towards your cause and exerts pressure; and it needs to target decision makers who have the means to give in and advance your cause by doing so.
At the moment, these hunger strikes are people vibe-protesting. They feel like some awful people are going to kill everyone, they feel powerless, and so they find a way to do something that they perceive as having a chance of changing the situation.
Please don’t risk your life; especially, please don’t risk your life in this particular way that won’t change anything.
Action is better than inaction; but please stop and think of your theory of change for more than five minutes, if you’re planning to risk your life, and then don’t risk your life[1]; please pick actions thoughtfully and wisely and not because of the vibes[2].
You can do much more if you’re alive and well and use your brain.
Not to say that you shouldn’t be allowed to risk your life for a large positive impact. I would sacrifice my life for some small chance of preventing AI risk. But most people who think they’re facing a choice to sacrifice their life for some chance of making a positive impact are wrong and don’t actually face it; so I think the bar for risking one’s life should be very high. In particular, when people have time to carefully do the math, I really want them to carefully do the math before deciding to risk their lives, and in this specific case, some of my frustration is from the people clearly getting their math wrong.
I think as a community, we also would really want to make people err on the side of safety, and to have a strong norm of assuming that most people who decide to sacrifice their lives got their math wrong, especially if a community that shares their values disagrees with them about the consequences of the sacrifice. People really shouldn’t be risking their lives without having carefully thought through the theory of change (when they have the ability to do so).
I’d bet that if we ask people competent in how movements achieve their goals, they will say that these particular hunger strikes are not great; and I expect that to be the case most of the time when individuals who share values with a larger movement decide to go on a hunger strike even as the larger movement thinks it would not be effective.
My strong impression is that the person on the hunger strike in front of Anthropic is doing this primarily because he feels like it is the proper thing to do in this situation, like it’s the action someone should be taking here.
Hi Mikhail, thanks for offering your thoughts on this. I think having more public discussion on this is useful and I appreciate you taking the time to write this up.
I think your comment mostly applies to Guido in front of Anthropic, and not our hunger strike in front of Google DeepMind in London.
Hunger strikes can be incredibly powerful when there’s a just demand, a target who would either give in to the demand or be seen as a villain for not doing so, a wise strategy, and a group of supporters.
I don’t think these hunger strikes pass the bar. Their political demands are not what AI companies would realistically give in to because of a hunger strike by a small number of outsiders.
I don’t think I have been framing Demis Hassabis as a villain, and if you think I did, it would be helpful to add a source for why you believe this.
I’m asking Demis Hassabis to “publicly state that DeepMind will halt the development of frontier AI models if all the other major AI companies agree to do so,” which I think is a reasonable thing to state given all the public statements he has made regarding AI Safety. I think that is indeed something that a company such as Google DeepMind would give in to.
A hunger strike can bring attention to how seriously you perceive an issue. If you know how to make it go viral, that is; in the US, hunger strikes are rarely widely covered by the media.
I’m currently in the UK, and I can tell you that there have already been two pieces published on Business Insider. I’ve also given three interviews in the past 24 hours to journalists contributing to major publications. I’ll try to add links later if / once these get published.
At the moment, these hunger strikes are people vibe-protesting. They feel like some awful people are going to kill everyone, they feel powerless, and so they find a way to do something that they perceive as having a chance of changing the situation.
Again, I’m pretty sure I haven’t framed people as “awful”, and it would be great if you could provide sources for that statement. I also don’t feel powerless. My motivation for doing this was in part to provide support to Guido’s strike in front of Anthropic, which feels more like helping an ally, joining forces.
I find it actually empowering to be able to be completely honest about what I actually think DeepMind should do to help stop the AI race and receive so much support from all kinds of people on the street, including employees from Google, Google DeepMind, Meta and Sony. I am also grateful to have Denys with me, who flew from Amsterdam to join the hunger strike, and all the journalists who have taken the time to talk to us, both in person and remotely.
Action is better than inaction; but please stop and think of your theory of change for more than five minutes, if you’re planning to risk your life, and then don’t risk your life[1]; please pick actions thoughtfully and wisely and not because of the vibes[2].
I agree with the general point that making decisions based on an actual theory of change is a much more effective way to have an impact on the world. I’ve personally thought quite a lot about why doing this hunger strike in front of DeepMind is net good, and I believe it’s having the intended impact, so I disagree with your implication that I’m basing my decisions on vibes. If you’d like to know more, I’d be happy to talk to you in person in front of the DeepMind office or remotely.
Now, taking a step back and considering Guido’s strike, I want to say that even if you think that his actions were reckless and based on vibes, it’s worth evaluating whether his actions (and their consequences) will eventually turn out to be net negative. For one, I don’t think I would have been out in front of DeepMind as I type this if it were not for Guido’s action, and I believe what we’re doing here in London is net good. But most importantly, we’re still at the start of the strikes, so it’s hard to tell what will happen as this continues. I’d be happy to have this discussion again at the end of the year, looking back.
Finally, I’d like to acknowledge the health risks involved. I’m personally looking after my health, and there are some medics at King’s Cross who would be willing to help quickly if anything extreme were to happen. And given the length of the strikes so far I think what we’re doing is relatively safe, though I’m happy to be proven otherwise.
your comment mostly applies to Guido in front of Anthropic
Yep!
I don’t think I have been framing Demis Hassabis as a villain
A hunger strike is not a good tool if you don’t want to paint someone as a villain in the eyes of the public when they don’t give in to your demand.
“publicly state that DeepMind will halt the development of frontier AI models if all the other major AI companies agree to do so,” which I think is a reasonable thing to state
It is vanishingly unlikely that all other major AI companies would agree to do so without the US government telling them to; this statement would be helpful, but only to communicate their position and not because of the commitment itself. Why not ask them to ask the government to stop everyone (maybe conditional on China agreeing to stop everyone in China)?
I’ve also given three interviews in the past 24 hours to journalists to contribute to major publications
If any of them go viral in the US with a good message, I’ll (somewhat) change my mind!
I disagree with your implication that I’m basing my decisions on vibes
This was mainly my impression after talking to Guido; but do you want to say more about the impact you think you’ll have?
I’d be happy to have this discussion again at the end of the year, looking back
(Can come back to it at the end of the year; if you have any advance predictions, they might be helpful to have posted!)
And given the length of the strikes so far I think what we’re doing is relatively safe, though I’m happy to be proven otherwise
I hope you remain safe and are not proven otherwise! Hunger strikes do carry real health risks, though. Do you have particular plans for how long to stay on the hunger strike?
A hunger strike is not a good tool if you don’t want to paint someone as a villain in the eyes of the public when they don’t give in to your demand.
Is there any form of protest that doesn’t implicitly suggest that the person you’re protesting is doing something wrong? When the wrong thing is “causing human extinction”, it seems to me kind of hard for that not to automatically be read as ‘villainous’.
(Asking genuinely, I think it quite probably the answer is ‘yes’.)
Something like: hunger strikes are optimized hard specifically for painting someone as a villain, because someone is made to suffer or die (or be inhumanely fed); this is different from other forms of protest that are more focused on, e.g., arguing that specific decisions are bad and should be revoked, but don’t necessarily try to make people perceive the other side as evil.
I don’t really see the problem with painting people as evil in principle, given that some people are evil. You can argue against it in specific cases, but I think the case for AI CEOs being evil is strong enough that it can’t be dismissed out of hand.
The case in question is “AI CEOs are optimising for their short-term status/profits, and for believing things about the world which maximise their comfort, rather than doing the due diligence required of someone in their position, which is to seriously check whether their company is building something which kills everyone”
Whether this is a useful frame for one’s own thinking—or a good frame to deploy onto the public—I’m not fully sure, but I think it does need addressing. Of course it might also differ between CEOs. I think Demis and Dario are two of the CEOs who it’s relatively less likely to apply to, but also I don’t think it applies weakly enough for them to be dismissed out of hand even in their cases.
“People are on hunger strikes” is not really a lot of evidence for “AI CEOs are optimizing for their short-term status/profits and are not doing the due diligence” in the eyes of the public.
I don’t think there’s any problem with painting people and institutions as evil in principle; I’m just not sure why you would want to do it here, as compared to other things. I would want people to have answers to how they imagine a hunger strike would paint AI companies/CEOs and what the impact of that would be, because I expect little here that could move the needle.
It is vanishingly unlikely that all other major AI companies would agree to do so without the US government telling them to; this statement would be helpful, but only to communicate their position and not because of the commitment itself. Why not ask them to ask the government to stop everyone (maybe conditional on China agreeing to stop everyone in China)?
This seems to be exactly the point of the demand? This is a demand that would be cheap (perhaps even of negative cost) for DeepMind to accept (because the other AI companies wouldn’t agree to that), and would also be a major publicity win for the Pause AI crowd. Even counting myself skeptical of the hunger strikes, I think this is a very smart move.
The demand is that a specific company agrees to halt if everyone halts; this does not help in reality, because in fact it won’t be the case that everyone halts (absent government intervention).
Action is better than inaction; but please stop and think of your theory of change for more than five minutes,
I think there’s a very reasonable theory of change—X-risk from AI needs to enter the Overton window. I see no justification here for going to the meta-level and claiming they did not think for 5 minutes, which is why I have weak downvoted in addition to strong disagree.
This tactic might not work, but I am not persuaded by your supposed downsides. The strikers should not risk their lives, but I don’t get the impression that they are. The movement does need people who are eating, and therefore able to work on AI safety research, governance, and other forms of advocacy. But why not this too? It seems very plausibly a comparative advantage for some concerned people, and particularly high leverage when very few are taking this step. If you think they should be doing something else instead, say specifically what it is and why these particular individuals are better suited to that particular task.
I see no justification here for going to the meta-level and claiming they did not think for 5 minutes
Michaël Trazzi’s comment, which he wrote a few hours before he started his hunger strike, isn’t directly about hunger striking but it does indicate to me that he put more than 5 minutes of thought into the decision, and his comment gestures at a theory of change.
I spoke to Michaël in person before he started. I told him I didn’t think the game theory worked out (if he’s not willing to die, GDM should ignore him; if he does die, then he’s worsening the world, since he can definitely contribute better by being alive, and GDM should still ignore him). I don’t think he’s going to starve himself to death or to serious harm, but that does make the threat empty. I don’t think that matters too much game-theoretically or reputationally, though, since nobody seems to be expecting him to do that.
His theory of change was basically “If I do this, other people might” which seems to be true: he did get another person involved. That other person has said they’ll do it for “1-3 weeks” which I would say is unambiguously not a threat to starve oneself to death.
As a publicity stunt it has kinda worked in the basic sense of getting publicity. I think it might change the texture and vibe of the AI protest movement in a direction I would prefer it to not go in. It certainly moves the salience-weighted average of public AI advocacy towards Stop AI-ish things.
As Mikhail said, I feel great empathy and respect for these people. My first instinct was similar to yours, though - if you’re not willing to die, it won’t work, and you probably shouldn’t be willing to die (because that also won’t work / there are more reliable ways to contribute / timelines uncertainty).
I think ‘I’m doing this to get others to join in’ is a pretty weak response to this rebuttal. If they’re also not willing to die, then it still won’t work, and if they are, you’ve wrangled them in at more risk than you’re willing to take on yourself, which is pretty bad (and again, it probably still won’t work even if a dozen people are willing to die on the steps of the DeepMind office, because the government will intervene, or they’ll be painted as loons, or the attention will never materialize and their ardor will wane).
I’m pretty confused about how, under any reasonable analysis, this could come out looking positive EV. Most of these extreme forms of protest just don’t work in America (e.g. the soldier who self-immolated a few years ago). And if it’s not intended to be extreme, they’ve (I presume accidentally) misbranded their actions.
Fair enough. I think these actions are +ev under a coarse grained model where some version of “Attention on AI risk” is the main currency (or a slight refinement to “Not-totally-hostile attention on AI risk”). For a domain like public opinion and comms, I think that deploying a set of simple heuristics like “Am I getting attention?” “Is that attention generally positive?” “Am I lying or doing something illegal?” can be pretty useful.
Michael said on twitter here that he’s had conversations with two sympathetic DeepMind employees, plus David Silver, who was also vaguely sympathetic. This itself is more +ev than I expected already, so I’m updating in favour of Michael here.
It’s also occurred to me that if any of the CEOs cracks and at least publicly responds to the hunger strikers, then the CEOs who don’t do so will look villainous, so you actually only need one of them to respond to get a wedge in.
“Attention on AI risk” is a pretty bad proxy to optimize for, since the available tactics include ones that attract the kind of attention paid to luddites, lunatics, and crackpots who care about some issue.
The actions that we can take can:
Use what separates us from people everyone considers crazy: that our arguments check out and our predictions hold; communicate those;
Spark and mobilize existing public support;
Be designed to optimize for positive attention, not for any attention.
I don’t think DeepMind employees really changed their minds? Like, there are people at DeepMind with p(doom) higher than Eliezer’s; they would be sympathetic; would they change anything they’re doing? (I can imagine it prompting them to talk to others at DeepMind, talking about the hunger strike to validate the reasons for it.)
I don’t think Demis responding to the strike would make Dario look particularly villainous, happy to make conditional bets. How villainous someone looks here should be pretty independent, outside of eg Demis responding, prompting a journalist to ask Dario, which takes plausible deniability away from him.
I’m also not sure how effective it would be to use this to paint the companies (or the CEOs—are they even the explicit targets of the hunger strikes?) as villainous.
To clarify, “think for five minutes” was an appeal to people who might want to do these kinds of things in the future, not a claim about Guido or Michael.
That said, I do in fact claim they have not thought carefully about their theory of change, and the linked comment from Michael lists very obvious surface-level reasons for why to do this in front of Anthropic and not OpenAI; I really would not consider this on the level of demonstrating careful thought about the theory of change.
I think there’s a very reasonable theory of change—X-risk from AI needs to enter the Overton window
While in principle, as I mentioned, a hunger strike can bring attention, this is not an effective way to do this for the particular issue that AI will kill everyone by default. The diff to communicate isn’t “someone is really scared of AI ending the world”; it’s “scientists think AI might literally kill everyone and also here are the reasons why”.
claiming they did not think for 5 minutes
This was not a claim about these people but an appeal to potential future people to maybe do research on this stuff before making decisions like this one.
That said, I talked to Guido prior to the start of the hunger strike, tried to understand his logic, and was not convinced he had any kind of reasonable theory of change guiding his actions; my understanding is that he perceives it as the proper action to take in a situation like this, which is why I called this vibe-protesting.
I don’t get the impression that they are
(It’s not very clear what would be the conditions for them to stop the hunger strikes.)
But why not this too?
Hunger strikes can be very effective and powerful if executed wisely. My comment expresses my strong opinion that this did not happen here, not that it can’t happen in general.
At the moment, these hunger strikes are people vibe-protesting.
I think I somewhat agree, but also I think this is a more accurate vibe than “yay tech progress”. It seems like a step in the right direction to me.
Please don’t risk your life; especially, please don’t risk your life in this particular way that won’t change anything.
Action is better than inaction; but please stop and think of your theory of change for more than five minutes, if you’re planning to risk your life, and then don’t risk your life; please pick actions thoughtfully and wisely and not because of the vibes.
You repeat a recommendation not to risk your life. Um, I’m willing to die to prevent human extinction. The math is trivial. I’m willing to die to reduce the risk by a pretty small percentage. I don’t think a single life here is particularly valuable on consequentialist terms.
There’s important deontology about not unilaterally risking other people’s lives, but this mostly goes away in the case of risking your own life. This is why many medical ethics guidelines treat self-experimentation as a special case, separate from the rules for experimenting on others (and that’s been used very well in many cases and aligns incentives). I think one should have dignity and respect oneself, but I think there are many self-respecting situations where one should make major personal sacrifices and risk one’s whole life. (Somewhat similarly, there are many situations in which one should risk being prosecuted unjustly by the state and spending a great deal of one’s life in prison.)
There’s important deontology about not unilaterally risking other people’s lives, but this mostly goes away in the case of risking your own life.
I don’t think so. I agree we shouldn’t have laws around this, but insofar as we have deontologies to correct for circumstances where historically our naive utility-maximizing calculations have been consistently biased, I think there have been enough cases of people uselessly martyring themselves for their causes to justify a deontological rule against sacrificing your own actual life.
Edit: Basically, I don’t want suicidal people to back-justify batshit insane reasons why they should die to decrease x-risk instead of getting help. And I expect these are the only people who would actually be at risk for a plan which ends with “and then I die, and there is 1% increased probability everyone else gets the good ending”.
At the time, South Vietnam was led by President Ngo Dinh Diem, a devout Catholic who had taken power in 1955, and then instigated oppressive actions against the Buddhist majority population of South Vietnam. This began with measures like filling civil service and army posts with Catholics, and giving them preferential treatment on loans, land distribution, and taxes. Over time, Diem escalated his measures, and in 1963 he banned flying the Buddhist flag during Vesak, the festival in honour of the Buddha’s birthday. On May 8, during Vesak celebrations, government forces opened fire on unarmed Buddhists who were protesting the ban, killing nine people, including two children, and injured many more.
[...]
Unfortunately, standard measures for negotiation – petitions, street fasting, protests, and demands for concessions – were ignored by the Diem government, or met with force, as in the Vesak shooting.
[...]
Since conventional measures were failing, the Inter-Sect Committee decided to consider more extreme measures, including the idea of a voluntary self-immolation. While extreme, they hoped it would create an international media incident, to draw attention to the suffering of Buddhists in South Vietnam. They noted in their meeting minutes the power of photographs to focus international attention: “one body can reach where ten thousand leaflets cannot.” It was to be a Bodhisattva deed to help awaken the world.
[...]
On June 10, the Inter-Sect Committee contacted at least four Saigon-based members of the international media, telling them to be present for a “major event” that would occur the next morning. One of them was a photographer from the Associated Press, Malcolm Browne, who said he had “no idea” what he’d see, beyond expecting some kind of protest. When Thich Quang Duc and his attendants exited the car, Browne was 15 meters away, just outside the ring of chanting monks. Browne took more than 100 photos, fighting off nausea from the smell of burning gasoline and human flesh, and struggling with the horror, as he created a permanent visual record of Thich Quang Duc’s sacrifice.
The sacrifice was not in vain. The next day, Browne’s photos made the front page of newspapers around the world. They shocked people everywhere, and galvanized mass protests in South Vietnam. US President John F. Kennedy reportedly exclaimed “Jesus Christ!” upon first seeing the photo. The US government, which had been instrumental in installing and supporting the anti-communist Diem, withdrew its support, and just a few months later supported a coup that led to Diem’s death, a change in government, and the end of anti-Buddhist policy2.
Nielsen also includes unsuccessful or actively repugnant examples of it.
The sociologist Michael Biggs6 has identified more than 500 self-immolations as protest in the four decades after Thich Quang Duc, most or all of which appear to have been inspired in part by Thich Quang Duc.
I’ve discussed Thich Quang Duc’s sacrifice in tacitly positive terms. But I don’t want to uncritically venerate this kind of sacrifice. As with Kravinsky’s kidney donation, while it had admirable qualities, it also had many downsides, and the value may be contested. Among the 500 self-immolations identified by Biggs, many seem pointless, even evil. For example: more than 200 people in India self-immolated in protest over government plans to reserve university places for lower castes. This doesn’t seem like self-sacrifice in service of a greater good. Rather, it seems likely many of these people lacked meaning in their own lives, and confused the grand gesture of the sacrifice for true meaning. Moral invention is often difficult to judge, in part because it hinges on redefining our relationship to the rest of the universe.
I also think this paragraph about Quang Duc is quite relevant:
Quang Duc was not depressed nor suicidal. He was active in his community, and well respected. Another monk, Thich Nhat Hanh, who had lived with him for the prior year, wrote that Thich Quang Duc was “a very kind and lucid person… calm and in full possession of his mental faculties when he burned himself.” Nor was he isolated and acting alone or impulsively. As we’ll see, the decision was one he made carefully, with the blessing of and as part of his community.
I’m not certain if there’s a particular point you want me to take away from this, but thanks for the information, and including an unbiased sample from the article you linked. I don’t think I changed my mind so much from reading this though.
Do you also believe there is a deontological rule against suicide? I have heard rumor that most people who attempt suicide and fail, regret it. At the same time, I think some lives are worse than death (for example, see Amanda Luce’s Book Review: Two Arms And A Head that won the ACX book review prize), and so I believe it should be legal and sometimes supported, even if it were the case that most attempted suicides have been regretted.
I have heard rumor that most people who attempt suicide and fail, regret it.
After doing some research on this, I think this is unlikely to be true. The only quantitative study I found says that among its sample of suicide attempt survivors, 35.6% are glad to have survived, while 42.7% feel ambivalent, and 21.6% regret having survived. I also found a couple of sources agreeing with your “rumor”, but one cited just a suicide awareness trainer as its source, while the other cited the above study as the only evidence for its claim, somehow interpreting it as “Previous research has found that more than half of suicidal attempters regret their suicidal actions.” (Gemini 2.5 Pro says “It appears the authors of the 2023 paper misinterpreted or misremembered the findings of the 2005 study they cited.”)
If this “rumor” were true, I would expect to see a lot of studies supporting it, because such studies are easy to do and the result would be highly useful for people trying to prevent suicides (i.e., they could use it to convince potential suicide attempters that they’re likely to regret it). Evidence to the contrary is likely to be suppressed or not gathered in the first place, as almost nobody wants to encourage suicides. (The above study gathered the data incidentally, for a different purpose.) So everything seems consistent with the “rumor” being false.
Interesting, thanks. I think I had heard the rumor before and believed it.
In the linked study, it looks like they asked the people about regret very shortly after the suicide attempt. This could both bias the results towards less regret to have survived (little time to change their mind) or more regret to have survived (people might be scared to signal intent to retry suicide, for fear of being committed, which I think sometimes happens soon after failed attempts).
I think very very many people are not making an informed decision when they decide to commit suicide.
For example, I think quantum immortality is quite plausibly a thing. Very few people know about quantum immortality and even fewer have seriously thought about it. This means that almost everyone on the planet might have a very mistaken model of what suicide actually does to their anticipated experience.[1] Also, many people are religious and believe in a pleasant afterlife. Many people considering suicide are mentally ill in a way that compromises their decision making. Many people think transhumanism is impossible and won’t arrange for their brain to be frozen for that reason.
I agree that there is some threshold on the fraction of ill-considered suicides relative to total suicides such that suicide should be legal if we were below that threshold. I used to think we were maybe below that threshold. After I began studying physics at uni and so started taking quantum immortality more seriously, I switched to thinking we are maybe above the threshold.
You might find yourself in a branch where your suicide attempt failed, but a lot of your body and mind were still destroyed. If you keep exponentially decreasing the amplitude of your anticipated future experience in the universal wave function further, you might eventually find that it is now dominated by contributions from weird places and branches far-off in spacetime or configuration space that were formerly negligible, like aliens simulating you for some negotiation or other purpose.
I don’t really know yet how to reason well about what exactly the most likely observed outcome would be here. I do expect that by default, without understanding and careful engineering our civilisation doesn’t remotely have the capability for yet, it’d tend to be very Not Good.
This all feels galaxy-brained to me and like it proves too much. By analogy I feel like if you thought about population ethics for a while and came to counterintuitive conclusions, you might argue that people who haven’t done that shouldn’t be allowed to have children; or if they haven’t thought about timeless decision theory for a while they aren’t allowed to get a carry license.
I don’t think it proves too much. Informed decision-making comes in degrees, and some domains are just harder? Like, I think my threshold for leaving people free to make their own mistakes if they are the only ones harmed by them is very low, compared to where the human population average seems to be at the moment. But my threshold is, in fact, greater than zero.
For example, there are a bunch of things I think bystanders should generally prevent four-year-old human children from doing, even if the children insist that they want to do them. I know that stopping four-year-old children from doing these things will be detrimental in some cases, and that having such policies is degrading to the children’s agency. I remember what it was like being four years old and feeling miserable because of kindergarten teachers who controlled my day and thought they knew what was best for me. I still think the tradeoff is worth it on net in some cases.
I just think that the suicide thing happens to be a case where doing informed decision-making is maybe just too tough for way too many humans and thus some form of ban could plausibly be worth it on net. Sports betting is another case where I was eventually convinced that maybe a legal ban of some form could be worth it.
(I agree with Lucious in that I think it is important that people have the option of getting cryopreserved and also are aware of all the reality-fluid stuff before they decide to kill themselves.)
“Important” is ambiguous: I agree that it matters, but it does not follow that this civilization should ban whole life options from people until they have heard about niche philosophy. Most people will never hear about niche philosophy.
I don’t think quantum immortality changes anything. You can reframe this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus.
However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are many many more in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility.
I don’t think quantum immortality changes anything. You can reframe this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus.
I agree that quantum mechanics is not really central for this on a philosophical level. You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you. It’s just that it seems at present very unclear and arguable whether the physical universe is in fact anywhere near that large, whereas I would claim that a universal wavefunction which constantly decoheres into different branches containing different versions of us is pretty strongly implied to be a thing by the laws of physics as we currently understand them.
However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are many many more in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility.
It is very late here and I should really sleep instead of discussing this, so I won’t be able to reply as in-depth as this probably merits. But, basically, I would claim that this is not the right way to do expected utility calculations when it comes to ensembles of identical or almost-identical minds.
A series of thought experiments might help illustrate part of where my position comes from:
Imagine someone tells you that they will put you to sleep and then make two copies of you, identical down to the molecular level. They will place you in a room with blue walls. They will place one copy of you in a room with red walls, and the other copy in another room with blue walls. Then they will wake all three of you up.
What color do you anticipate seeing after you wake up, and with what probability?
I’d say 2⁄3 blue, 1⁄3 red. Because there will now be three versions of me, and until I look at the walls I won’t know which one I am.
Imagine someone tells you that they will put you to sleep and then make two copies of you. One copy will not include a brain. It’s just a dead body with an empty skull. Another copy will be identical to you down to the molecular level. Then they will place you in a room with blue walls, and the living copy in a room with red walls. Then they will wake you and the living copy up.
What color do you anticipate seeing after you wake up, and with what probability? Is there a 1⁄3 probability that you ‘die’ and don’t experience waking up because you might end up ‘being’ the corpse-copy?
I’d say 1⁄2 blue, 1⁄2 red, and there is clearly no probability of me ‘dying’ and not experiencing waking up. It’s just a bunch of biomass that happens to be shaped like me.
As 2, but instead of creating the corpse-copy without a brain, it is created fully intact, then its brain is destroyed while it is still unconscious. Should that change our anticipated experience? Do we now have a 1⁄3 chance of dying in the sense that we might not experience waking up? Is there some other relevant sense in which we die, even if it does not affect our anticipated experience?
I’d say no and no. This scenario is identical to 2 in terms of the relevant information processing that is actually occurring. The corpse-copy will have a brain, but it will never get to use it, so it won’t affect my anticipated experience in any way. Adding more dead copies doesn’t change my anticipated experience either. My best-scoring prediction will be that I have 1⁄2 chance of waking up to see red walls, and 1⁄2 chance of waking up to see blue walls.
In real life, if you die in the vast majority of branches caused by some event, i.e. that’s where the majority of the amplitude is, but you survive in some, the calculation for your anticipated experience would seem to not include the branches where you die for the same reason it doesn’t include the dead copies in thought experiments 2 and 3.
(I think Eliezer may have written about this somewhere as well using pretty similar arguments, maybe in the quantum physics sequence, but I can’t find it right now.)
You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you.
Again, not sure why a large universe is needed. The expected utility ends up the same either way, whether you have some fraction of branches in which you remain alive or some probability of remaining alive.
Regarding the expected utility calculus: I agree with everything you said, but I don’t see how any of it allows you to disregard the counterfactual suffering from not committing suicide in your expected value calculation.
Maybe the crux is whether we consider the utility of each “you” (i.e., you in each branch) individually and add it up for the total utility, or whether we consider all “you”s to have just one shared utility.
Let’s say that not committing suicide gives you −1 utility in n branches, but committing suicide gives you −100 utility in n/m branches and 0 utility in n−n/m branches.
If we treat all copies of you as having separate utilities and add them all up for a total expected utility calculation, not committing suicide gives −n utility while committing suicide leads to −100n/m utility. Therefore, as long as m>100, it is better to commit suicide.
If, on the other hand, you treat them as having one shared utility, you get either −1 or −100 utility, and −100 is of course worse.
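To make the comparison concrete, here is a minimal restatement of the toy setup above in display form; A and B are just the two options as described in the previous paragraphs, and the numbers n, m, −1, and −100 are the illustrative values already used there.

```latex
% A: n branches, each at -1 utility.
% B: n/m branches at -100 utility, the remaining n - n/m branches at 0.

% Separate utilities, summed over branches:
U_{\mathrm{sum}}(A) = n \cdot (-1) = -n
U_{\mathrm{sum}}(B) = \frac{n}{m}\cdot(-100) + \left(n - \frac{n}{m}\right)\cdot 0 = -\frac{100\,n}{m}
% Under summed utilities, B comes out ahead exactly when 100n/m < n, i.e. when m > 100.

% One shared utility across all copies/branches:
U_{\mathrm{shared}}(A) = -1, \qquad U_{\mathrm{shared}}(B) = -100
% Under a shared utility, B is simply worse.
```

Which of these two accounting conventions is the right one is exactly the crux the replies below argue about.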
Do you agree that this is the crux? If so, why do you think that all the copies share one utility rather than their utilities adding up?
In a large universe, you do not end. It’s not that you expect to see some branch versus another; you just continue; the computation that is you continues. When you open your eyes, you’re not likely to find yourself as a person in a branch computed only relatively rarely; still, that person continues, and does not die.
Attempted suicide reduces your reality-fluid (how much you’re computed and how likely you are to find yourself there), but you will continue to experience the world. If you die in a nuclear explosion, the continuation of you will be somewhere else, sort of isekaied; and mostly you will find yourself not in a strange world that recovers the dead but in a world where the nuclear explosion did not happen; still, in a large world, even after a nuclear explosion, you continue.
You might care about having a lot of reality-fluid, because this makes your actions more impactful, because you can spend your lightcone better, and improve the average experience in the large universe. You might also assign negative utility to others seeing you die; they’ll have a lot of reality-fluid in worlds where you’re dead and they can’t talk to you, even as you continue. But I don’t think it works out to assigning the same negative utility to dying as in branches of small worlds.
Yes, but the number of copies of you still reduces (or the probability that you are alive in standard probability theory, or the number of branches in many worlds). Why are these not equivalent in terms of the expected utility calculus?
Imagine that you’re an agent in the Game of Life. Your world, your laws of physics, are computed on a very large number of independent computers, all performing the same computation.
You exist within the laws of causality of your world, computed as long as at least one server computes your world. If some of them stop performing the computation, it won’t be a death of a copy; you’ll just have one fewer instance of yourself.
You are of course right that there’s no difference between reality-fluid and normal probabilities in a small world: it’s just how much you care about various branches relative to each other, regardless of whether all of them will exist or only some.
I claim that the negative utility due to ceasing to exist is just not there, because you don’t actually cease to exist, in a way you reflectively care about, when you have fewer instances. For normal things (e.g., how much you care about paperclips), the expected utility is the same; but here, it’s the kind of terminal value that I expect would be different for most people; guaranteed continuation in 5% of instances is much better than a 5% chance of continuing in all instances; in the first case, you don’t die!
I claim that the negative utility due to ceasing to exist is just not there
But we are not talking about negative utility due to stopping to exist. We are talking about avoiding counterfactual negative utility by committing suicide, which still exists!
guaranteed continuation in 5% of instances is much better than a 5% chance of continuing in all instances; in the first case, you don’t die!
I think this is an artifact of thinking of all of the copies having a shared utility (i.e. you) rather than separate utilities that add up (i.e. so many yous will suffer if you don’t commit suicide). If they have separate utilities, we should think of them as separate instances of yourself.
it’s the kind of terminal value that I expect would be different for most people; guaranteed continuation in 5% of instances is much better than a 5% chance of continuing in all instances; in the first case, you don’t die!
And even in the case where we are assigning negative utility to death, most people are really considering counterfactual utility from being alive, and 95% of that (expected) counterfactual utility is lost whether 95% of the “instances of you” die or whether there is a 95% chance that “you” die.
I think there is, and I think cultural mores support this well. Separately, I think we shouldn’t legislate morality, and though suicide is bad, it should be legal[1].
At the same time, I think some lives are worse than death (for example, see Amanda Luce’s Book Review: Two Arms And A Head that won the ACX book review prize), and so I believe it should be legal and sometimes supported
There also exist cases where it is in fact correct from a utilitarian perspective to kill, but this doesn’t mean there is no deontological rule against killing. We can argue about the specific circumstances where we need these rule carve-outs (eg war), but I think we’d agree that when it comes to politics and policy, there ought to be no carve-outs, since people are particularly bad at risk-return calculations in that domain.
But also, this would mean we have to deal with certain liability issues, e.g., if ChatGPT convinces a kid to kill themselves, we’d like to say this is manslaughter or homicide iff the kid otherwise would’ve gotten better, but how do we determine that? I don’t know, and probably on net we should choose freedom instead, or this isn’t actually much of a problem in practice.
Makes sense. I don’t hold this stance; I think my stance is that many/most people are kind of insane on this, but that like with many topics we can just be more sane if we try hard and if some of us set up good institutions around it for helping people have wisdom to lean on in thinking about it, rather than having to do all their thinking themselves with their raw brain.
(I weakly propose we leave it here, as I don’t think I have a ton more to say on this subject right now.)
At the moment, these hunger strikes are people vibe-protesting.
To clarify, I meant that the choice of actions was based on the vibes (this seeming like the right thing to do in these circumstances), not on careful consideration.
You repeat a recommendation not to risk your life
I maybe formulated this badly.
I do not disagree with that part of your comment. I did, in fact, risk being prosecuted unjustly by the state and spending a great deal of my life in prison. I was also aware of the kinds of situations I’d want to go for hunger strikes in while in prison, though didn’t think about that often.
And I, too, am willing to die to reduce the risk by a pretty small chance.
Most of the time, though, I think people who think they have this choice don’t actually face it; I think the bar for risking one’s life should be very high. In particular, when people have time to carefully do the math, I really want them to carefully do the math before deciding to risk their lives, and in this particular case, some of my frustration is from the people getting their math wrong.
I think as a community, we also would really want to make people err on the side of safety, and have a strong norm of assuming that most people who decide to sacrifice their lives got their math wrong. People really shouldn’t be risking their lives without having carefully thought through the theory of change when they have the ability to do so.
Like, I’d bet that if we ask people competent in how movements achieve their goals, they will say that these particular hunger strikes are not great; and I expect that to be the case most of the time when individuals who share values with a larger movement decide to go on a hunger strike even as the larger movement thinks it would not be effective.
I think I somewhat agree that these hunger strikes will not shut down the companies or cause major public outcry.
I think that there is definitely something to be said that potentially our society is very poor at doing real protesting, and will just do haphazard things and never do anything goal-directed. That’s potentially a pretty fundamental problem.
But setting that aside (which is a big thing to set aside!) I think the hunger-strike is moving in the direction of taking this seriously. My guess is most projects in the world don’t quite work, but they’re often good steps to help people figure out what does work. Like, I hope this readies people to notice opportunities for hunger strikes, and also readies them to expect people to be willing to make large sacrifices on this issue.
People do in fact try to be very goal-directed about protesting! They have a lot of institutional knowledge on it!
You can study what worked and what didn’t work in the past, and what makes a difference between a movement that succeeds and a movement that doesn’t. You can see how movements organize, how they grow local leaders, how they come up with ideas that would mobilize people.
A group doesn’t have to attempt a hunger strike to figure out what the consequences would be; it can study and think, and I expect that to be a much more valuable use of time than doing hunger strikes.
I’d be interested to read a quick post from you that argued “Hunger-strikes are not the right tool for this situation; here is what they work for and what they don’t work for. Here is my model of this situation and the kind of protests that do make sense.”
I don’t know much about protesting. Most of the recent ones that get big enough that I hear about them have been essentially ineffectual as far as I can recall (Occupy Wall Street, the Women’s March, No Kings). I am genuinely interested in reading about effective and clearly effective protests led by anyone currently doing protests, or within the last 10 years. Even if on a small scale.
(My thinking isn’t that protests have not worked in the past – I believe they have, MLK, Malcolm X, Women’s Suffrage Movement, Vietnam War Protest, surely more – but that the current protesting culture has lost its way and is no longer effective.)
I am genuinely interested in reading about effective and clearly effective protests led by anyone currently doing protests, or within the last 10 years. Even if on a small scale.
“Protest movements could be more effective than the best charities”—SSIR
About two weeks ago, I published an article in Stanford Social Innovation Review (SSIR), a magazine for those interested in philanthropy, social science and non-profits. … Although my article is reasonably brief (and I obviously recommend reading it in full!) here’s a quick summary of what I spoke about, plus some nuances I forgot or wasn’t able to add:
...
There is a reasonable amount of evidence that shows that protest movements can have significant impacts, across a variety of outcomes from policy, public opinion, public discourse, and voting behaviour to corporate behaviour. I’ll leave this point to be explained in greater detail in our summary of our literature review on protest outcomes!
...
3. A summary of Social Change Lab’s literature reviews, who we are, and our next steps
We’ve recently conducted two literature reviews, looking over 60+ academic studies across political science, sociology and economics, to tackle some key questions around protest movements. Specifically, we had two main questions:
What are the outcomes of protest and protest movements? - Literature Review
What factors make some protest movements more likely to succeed relative to others? - Literature Review
(Would be interested in someone going through this paper and writing a post or comment highlighting some examples and why they’re considered successful.)
Not quite responding to your main point here, but I’ll say that this position would seem valid to me and good to say if you believed it.
Some people make major self-sacrifices wisely, and others for poor reasons and due to misaligned social pressures. I feel that this is an example of the latter, so I do not endorse it, I think people who care about this issue should not endorse it, and I hope someone helps them and they stop.
I don’t know what personal life tradeoffs any of them are making, so I have a hard time speaking to that. I just found out that Michael Trazzi is one of the people doing a hunger strike; I don’t think it’s true of him that he hasn’t thought seriously about the issues given how he’s been intellectually engaged for 5+ years.
(Social movements (and comms and politics) are not easy to reason about well from first principles. I think Michael is wrong to be making this particular self-sacrifice, not because he hasn’t thought carefully about AI but because he hasn’t thought carefully about hunger strikes.)
Relevantly, if any of them actually die, and if also it does not cause major change and outcry, I will probably think they made a foolish choice (where ‘foolish’ means ‘should have known in advance this was the wrong call on a majorly important decision’).
My modal guess is that they will all make a real sacrifice, stick it out for 10-20 days, and then wrap up.
Follow-up: Michael Trazzi wrapped up after 7 days due to fainting twice and two doctors saying he was getting close to being in a life-threatening situation.
(Slightly below my modal guess, but also his blood glucose level dropped unusually fast.)
Sky News (#3 News Channel in the UK) ran a 5-minute segment on the Google DeepMind Hunger Strike
The linked video seems to me largely successful at raising awareness of the anti-extinction position – it is not exaggerated, it is not mocked, it is accurately described and taken seriously. I take this as evidence of the strikes being effective at their goals (interested if you disagree).
I think the main negative update about Dennis (in line with your concerns) is that he didn’t tell his family he was doing this. I think that’s quite different from the Duc story I linked above, where he made a major self-sacrifice with the knowledge and support of his community.
Yep, I’ve seen the video. Maybe a small positive update overall, because could’ve been worse?
It seems to me that you probably shouldn’t optimize for publicity for publicity’s sake; and even if you do, hunger strikes are not a good way to get it.
Hunger strikes are very effective tools in some situations; but they’re not effective for this. You can raise awareness a lot more efficiently than this.
“The fears are not backed up with evidence” and “AI might improve billions of lives” are what you get when you communicate fear of something without focusing on the reasons why.
On the object level it’s (also) important to emphasize that these guys don’t seem to be seriously risking their lives. At least one of them noted he’s taking vitamins, hydrating etc. On consequentialist grounds I consider this to be an overdetermined positive.
A hunger strike will eventually kill you even if you take vitamins, electrolytes, and sugar. (A common way to prevent death when the target doesn’t give in is for a group of supporters to publicly beg the person on the hunger strike to stop and not kill themselves, for some plausible reason; but sometimes people ignore that and die.) I’m not entirely sure what Guido’s intention is if Anthropic doesn’t give in.
Sure, I just want to defend that it would also be reasonable if they were doing a more intense and targeted protest. “Here is a specific policy you must change” and “I will literally sacrifice my life if you don’t make this change”. So I’m talking about the stronger principle.
I don’t strongly agree or disagree with your empirical claims but I do disagree with the level of confidence expressed. Quoting a comment I made previously:
I’m undecided on whether things like hunger strikes are useful but I just want to comment to say that I think a lot of people are way too quick to conclude that they’re not useful. I don’t think we have strong (or even moderate) reason to believe that they’re not useful.
When I reviewed the evidence on large-scale nonviolent protests, I concluded that they’re probably effective (~90% credence). But I’ve seen a lot of people claim that those sorts of protests are ineffective (or even harmful) in spite of the evidence in their favor.[1] I think hunger strikes are sufficiently different from the sorts of protests I reviewed that the evidence might not generalize, so I’m very uncertain about the effectiveness of hunger strikes. But what does generalize, I think, is that many peoples’ intuitions on protest effectiveness are miscalibrated.
[1] This may be less relevant for you, Mikhail Samin, because IIRC you’ve previously been supportive of AI pause protests in at least some contexts.
ETA: To be clear, I’m responding to the part of your post that’s about whether hunger strikes are effective. I endorse the positive message of the second half of your post.
ETA 2: I read Ben Pace’s comment and he is making some good points so now I’m not sure I endorse the second half.
To be very clear, I expect large social movements that use protests as one of their forms of action to have the potential to be very successful and impactful if done well. Hunger strikes are significantly different from protests. Hunger strikes can be powerful, but they’re best suited to very different contexts.
I think we should show some solidarity to people committed to their beliefs and making a personal sacrifice, rather than undermining them by critiquing their approach.
Given that they’re both young men and the hunger strikes are occurring in the first world, it seems unlikely anyone will die. But it does seem likely they or their friends will read this thread.
Beyond that, the hunger strike is only on day 2 and has already received a small amount of media coverage. Should they go viral, this one action alone will have a larger differential impact on reducing existential risk than most safety researchers will achieve in their entire careers.
This is surprising to hear on LessWrong, where we value truth without having to think of object-level reasons for why it is good to say true things. But on the object level: it would be very dangerous for a community to avoid saying true things because it is afraid of undermining someone’s sacrifice; this would lead to a lot of needless, and even net-negative, sacrifice, without mechanisms for self-correction. Like, if I ever do something stupid, please tell me (and everyone) that instead of respecting my sacrifice: I would not want others to repeat my mistakes.
(There are lots of ways to get media coverage and it’s not always good in expectation. If they go viral, in a good way/with a good message, I will somewhat change my mind.)
Aside from whether or not the hunger strikes are a good idea, I’m really glad they have emphasized conditional commitments in their demands.
I think that we should be pushing on these much much more: getting groups to say “I’ll do X if abc groups do X as well”
And should be pushing companies/governments to be clear whether their objection is “X policy is net-harmful regardless of whether anyone else does it” vs “X is net-harmful for us if we’re the only ones to do it”
[I recognize that some of this pushing/clarification might make sense privately, and that groups will be reluctant to say stuff like this publicly because of posturing and whatnot.]
(While I like it being directed towards coordination, it would not actually make a difference: it won’t be the case that all AI companies want to stop, so such a commitment would not be of great significance. The thing that works is a government-supported ban on developing ASI anywhere in the world. A commitment to stop if everyone else stops doesn’t actually come into force unless everyone is required to stop anyway.
An ask that works is, e.g., “tell the government they need to stop everyone, including us”.)
An ask that works is, e.g., “tell the government they need to stop everyone, including us”.
For sure, I think that would be a reasonable ask too. FWIW, if multiple leading AI companies did make a statement like the one outlined, I think that would increase the chance of non-complying ones being made to halt by the government, even though they hadn’t made a statement themselves. That is, even one prominent AI company making this statement starts to widen the Overton window.
“There is no justice in the laws of Nature, no term for fairness in the equations of motion. The universe is neither evil, nor good, it simply does not care. The stars don’t care, or the Sun, or the sky. But they don’t have to! We care! There is light in the world, and it is us!”
And someday when the descendants of humanity have spread from star to star they won’t tell the children about the history of Ancient Earth until they’re old enough to bear it and when they learn they’ll weep to hear that such a thing as Death had ever once existed!
We’re sending copies of the book to everyone with >5k followers!
If you have >5k followers on any platform (or know anyone who does), (ask them to) DM me the address for a physical copy of If Anyone Builds It, or an email address for a Kindle copy.
So far, sent 13 copies to people with 428k followers in total.
At the beginning of November, I learned about a startup called Red Queen Bio, that automates the development of viruses and related lab equipment. They work together with OpenAI, and OpenAI is their lead investor.
Today, we are launching Red Queen Bio (http://redqueen.bio), an AI biosecurity company, with a $15M seed led by @OpenAI. Biorisk grows exponentially with AI capabilities. Our mission is to scale biological defenses at the same rate. A thread on who we are + what we do!
[...]
We also need *financial* co-scaling. Governments can’t have exponentially scaling biodefense budgets. But they can create the right market incentives, as they have done for other safety-critical industries. We’re engaging with policymakers on this both in the US and abroad. 7/19
[...]
We are committed to cracking the business model for AI biosecurity. We are borrowing from fields like catastrophic risk insurance, and working directly with the labs to figure out what scales. A successful solution can also serve as a blueprint for other AI risks beyond bio. 9/19
On November 15, I saw that and made a tweet about it: Automated virus-producing equipment is insane. Especially if OpenAI, of all companies, has access to it. (The tweet got 1.8k likes and 497k views.)
In the tweet, I said that there is, potentially, literally a startup, funded by and collaborating with OpenAI, with equipment capable of printing arbitrary RNA sequences, potentially including viruses that could infect humans, connected to the internet or managed by AI systems.
I asked whether we trust OpenAI to have access to this kind of equipment, and said that I’m not sure what to hope for here, except government intervention.
The only inaccuracy that was pointed out to me was that I mentioned that they were working on phages, and they denied working on phages specifically.
At the same time, people close to Red Queen Bio publicly confirmed the equipment they’re automating would be capable of producing viruses (saying that this equipment is a normal thing to have in a bio lab and not too expensive).
A few days later, Hannu Rajaniemi, a Red Queen Bio co-founder and fiction author, responded to me in a quote tweet and in comments:
This inaccurate tweet has been making the rounds so wanted to set the record straight.
We use AI to generate countermeasures and run AI reinforcement loops in safe model systems that help train a defender AI that can generalize to human threats.
The question of whether we can do this without increasing risk was a foundational question for us before starting Red Queen. The answer is yes, with certain boundaries in place. We are also very concerned about AI systems having direct control over automated labs and DNA synthesis in the future.
They did not answer any of the explicitly asked questions, which I repeated several times:
- Do you have equipment capable of producing viruses?
- Are you automating that equipment?
- Are you going to produce any viruses?
- Are you going to design novel viruses (as part of generating countermeasures or otherwise)?
- Are you going to leverage AI for that?
- Are OpenAI or OpenAI’s AI models going to have access to the equipment or software for the development or production of viruses?
It seems pretty bad that this startup is not being transparent about their equipment and the level of possible automation. It’s unclear whether they’re doing gain-of-function research. It’s unclear what security measures they have or are going to have in place.
I would really prefer for AIs, and especially for the models of OpenAI (a company known for prioritizing convenience over security), not to have ready access to equipment that can synthesize viruses or to software that can aid virus development.
I’m a little confused about what’s going on since apparently the explicit goal of the company is to defend against biorisk and make sure that biodefense capabilities keep up with AI developments, and when I first saw this thread I was like “I’m not sure of what exactly they’ll do, but better biodefense is definitely something we need so this sounds like good news and I’m glad that Hannu is working on this”.
I do also feel that the risk of rogue AI makes it much more important to invest in biodefense! I’d very much like it if we had a degree of automated defenses such that the “rogue AI creates a new pandemic” threat vector was eliminated entirely. Of course there’s the risk of the AI taking over those labs, but in the best case we’ll also have deployed more narrow AI to identify and eliminate all cybersecurity vulnerabilities before that.
And I don’t really see a way to defend against biothreats if we don’t do something like this (which isn’t to say one couldn’t exist; I also haven’t thought about this extensively, so maybe there is something). The human body wouldn’t survive for very long if it didn’t have an active immune system.
Today, we are launching Red Queen Bio (http://redqueen.bio), an AI biosecurity company, with a $15M seed led by OpenAI. Biorisk grows exponentially with AI capabilities. Our mission is to scale biological defenses at the same rate.
Since 2016, I have been building HelixNano, a clinical stage biotech (and still my main gig), with Nikolai Eroshenko. Recently, HelixNano teamed up with OpenAI to push AI bio’s limits. To our surprise, we saw models invent genuinely new wet lab methods (publication soon).
We got super excited. There was a path to superhuman drug designers. But we couldn’t ignore the shadow of superhuman virus designers. A world with breakthrough AI drugs can’t exist without new biological defenses. We spun out Red Queen Bio to build them.
AI biosecurity is a different game from traditional biodefense, with relatively static threats and flat budgets. What do you do when the attack surface grows at the rate of AI progress, driven by trillions of dollars of compute?
Red Queen Bio’s core thesis is **defensive co-scaling.** You have to couple defensive capabilities and funding to the same technological and financial forces that drive the AGI race, otherwise they can’t keep up.
We work with frontier labs to map AI biothreats and pre-build medical countermeasures against them. For co-scaling to work, this needs to improve as models do, and scale with compute. So our pipeline is built upon the leading models themselves, lab automation and RL.
We also need *financial* co-scaling. Governments can’t have exponentially scaling biodefense budgets. But they can create the right market incentives, as they have done for other safety-critical industries. We’re engaging with policymakers on this both in the US and abroad.
RQB’s work is driven by a civilizational need. But the economic incentives are ultimately on our side too. The capital behind what may be the biggest industrial transformation in human history is not going to tolerate unpriced tail risk on the scale of COVID or bigger.
We are committed to cracking the business model for AI biosecurity. We are borrowing from fields like catastrophic risk insurance, and working directly with the labs to figure out what scales. A successful solution can also serve as a blueprint for other AI risks beyond bio.
This is bigger than us. No company, AI lab or government is going to solve defensive co-scaling alone. Accordingly, we are committed to open collaboration with them all. Red Queen Bio is a Public Benefit Corporation, with governance to ensure mission takes precedence over any individual partnership.
In case it’s not obvious, Red Queen Bio and defensive co-scaling are very much inspired by @VitalikButerin’s d/acc philosophy. We find it inspiring, but differ in a couple of important ways.
First, we are skeptical that the d/acc approach of building purely defensive capabilities first is possible: in our view, they have to piggyback on general capabilities.
In contrast to d/acc, we also believe it’s hard to maintain defender advantage through de-centralization alone. For the sci-fi fans, writing DARKOME (a near-future biotech thriller) in part changed my mind on this!
But we heartily agree with @VitalikButerin on the brightness and centrality of human kindness and agency.
In the face of fast AI timelines and the enormity of the stakes, it’s easy to feel trapped in the AGI race dynamic. But the incentive structures driving it are not physical laws. They are no more real than others we can create.
By launching Red Queen Bio, we are choosing a different race. One where defense keeps up with offense and economics spurs safety.
The starting pistol has gone off. It’s time to run together.
Thanks for sharing, this is extremely important context. I’m way more OK with dual-use threats from a company actively trying to reduce bio risk from AI, which seems to have vaguely reasonable threat models, than from reckless gain-of-function people with insane threat models. It’s much less clear to me how much risk is OK to accept from projects actively doing reasonable things to make the situation better, but it’s clearly non-zero. (I don’t know if this place is actually doing reasonable things, but Mikhail provides no evidence against.)
I think it was pretty misleading for Mikhail not to include this context in the original post.
Uhm, yeah, valid. I guess the issue was illusion of transparency: I mostly copied the original post from my tweet, which was quote-tweeting the announcement, and I didn’t particularly think about adding more context because I had it cached that the tweet is fine (I checked with people closely familiar with RQB before tweeting, and it did include all of the context by virtue of quote-tweeting the original announcement), and when posting to LW I did not realize I’m not directly adding all of the context that was included in the tweet if people don’t click on the link.
Added the context to the original post.
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing and have been somewhat misleading in their responses to my tweets and not answering any of the questions.
Like, I can see a case for doing gain-of-function research responsibly to develop protection against threats (vaccines, proteins that would bind for viruses, etc.), but this should include incredible transparency, strong security (BSL & computer security & strong guardrails around what exactly AI models have automated access to), etc.
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing and have been somewhat misleading in their responses to my tweets and not answering any of the questions.
I can’t really fault them for not answering or being fully honest, from their perspective you’re a random dude who’s attacking them publicly and trying to get them lots of bad PR. I think it’s often very reasonable to just not engage in situations like that. Though I would judge them for outright lying
That’s somewhat reasonable. (They did engage though: made a number of comments and quote-tweeted my tweet, without addressing at all the main questions.)
Sure, but there’s a big difference between engaging in PR damage-control mode and actually seriously engaging. I don’t take them choosing the former as significant evidence of wrongdoing.
Since 2016, I have been building HelixNano, a clinical stage biotech (and still my main gig), with Nikolai Eroshenko. Recently, HelixNano teamed up with OpenAI to push AI bio’s limits. To our surprise, we saw models invent genuinely new wet lab methods (publication soon).
We got super excited. There was a path to superhuman drug designers. But we couldn’t ignore the shadow of superhuman virus designers. A world with breakthrough AI drugs can’t exist without new biological defenses. We spun out Red Queen Bio to build them.
Based on this, they didn’t need to set up a new company. They already had an existing biotech company that was focused on its own research, when they realized that “oh fuck, based on our current research things could get really bad unless someone does something”… and then they went Heroic Responsibility and spun out a whole new company to do something, rather than just pretending that no dangers existed or making vague noises and asking for government intervention or something.
It feels like being hostile toward them is a bit Copenhagen Ethics, in that if they hadn’t tried to do the right thing, it’s possible that nobody would have heard about this and things would have been much easier for them. But since they were thinking about the consequences of their research and decided to do something about it, and said that in public, they’re now getting piled on for not answering every question they’re asked on X. (And if I were them, I might also have concluded that the other side is so hostile that every answer might be interpreted in the worst possible light and that it’s better not to engage.)
equipment they’re automating would be capable of producing viruses (saying that this equipment is a normal thing to have in a bio lab
This seems to fall into the same genre as “that word processor can be used to produce disinformation,” “that image editor can be used to produce ‘CSAM’,” and “the pocket calculator is capable of displaying the number 5318008.”
If a word processor falling into the hands of terrorists could easily generate a memetic virus capable of inducing schizophrenia in hundreds of millions of people, then I believe such concerns are warranted.
“Virus” is doing a lot of work here. It makes a big difference whether they’re capable of making phages or mammalian viruses:
Phages:
- Often have a small genome, 3 kbp, easy to synthesize
- Can be cultured in E. coli or other bacteria, which are easy to grow
- More importantly, E. coli will take up a few kb of naked DNA, so you can just insert the genome directly into them to start the process (you can even do it without living E. coli if you use what’s basically E. coli juice)
- I could order and culture one of these easily

Mammalian viruses (as I understand the situation):
- Much larger genome, 30 kbp, somewhat harder to synthesize
- Have to be cultured in mammalian cell cultures, which are less standard
- More importantly, mammalian cells don’t just take up DNA, so you’d have to first package your viral DNA into an existing adenovirus scaffold, or some other large-capacity vector (maybe you could do it with a +ve sense RNA virus and a lipid-based vector, but that’s a whole other kettle of fish)
- The above might be false because I actually have no idea how to culture a de novo mammalian virus; it’s a much rarer thing to do
If they have the equipment to make phages but not to culture mammalian cells then that’s probably fine. If they’re doing AI-powered GoF research then, well, lmao I guess.
DNA phage genomes have a median size of ~50kb, whereas RNA phage genomes are more around the 4kb mark.
Similarly, mammalian DNA viruses are usually >100kb, but their RNA viruses are usually <20kb.
Oddly enough, the smallest known virus, porcine circovirus, is ssDNA, mammalian, and only 1.7kb.
But yes, mammalian viruses are generally more difficult to culture, probably downstream of mammalian cells being more difficult to culture. Phages also typically only inject their genetic material into the cells, which bootstraps itself into a replication factory. Mammalian viruses, which generally instead sneak their way in and deliver the payload, often deliver their genetic material alongside proteins required to start the replication.
I didn’t particularly present any publicly available evidence in my tweet. Someone close to Red Queen Bio confirmed that they have the equipment and are automating it here.
I thought it’d just be very fun to develop a new sense.
Remember vibrating belts and ankle bracelets that made you have a sense of the direction of north? (1, 2)
I made some LLMs make me an iOS app that does this! Except the sense doesn’t go away the moment you stop the app!
I am pretty happy about it! I can tell where north is, and I became much better at navigating and relating different parts of the (actual) territory to my map. Previously, I would remember my paths as collections of local movements (there, I turn left); now, I generally know where places are, and Google Maps feels much more connected to the territory.
It can vibrate when you face north; even better, if you’re in headphones, it can give you spatial sounds coming from north; better still, a second before playing a sound coming from north, it can play a non-directional cue sound to make you anticipate the north sound and learn very quickly.
None of this interferes with listening to any other kind of audio.
It’s all probably less relevant to the US, as your roads are in a grid anyway; great for London though.
If you know how to make it have more pleasant sounds, or optimize directional sounds (make realistic binaural audio), and want to help, please do! The source code is on GitHub: https://github.com/mihonarium/sonic-compass/
This is really cool! My ADHD makes me rather place-blind; if I’m not intentionally forcing myself to pay attention to a route and my surroundings, I can get lost or disoriented quite easily. I took the same bus route to school for a decade, and I can’t trace the path; I only remember a sequence of stops. Hopefully someone makes an Android version; I’d definitely check it out.
It is a chatbot with 200k tokens of context about AI safety. It is surprisingly good (better than you’d expect current LLMs to be) at answering questions and counterarguments about AI safety. A third of its dialogues contain genuinely great and valid arguments.
You can try the chatbot at https://whycare.aisgf.us (ignore the interface; it hasn’t been optimized yet). Please ask it some hard questions! Especially if you’re not convinced of AI x-risk yourself, or can repeat the kinds of questions others ask you.
Send feedback to ms@contact.ms.
A couple of examples of conversations with users:
I know AI will make jobs obsolete. I’ve read runaway scenarios, but I lack a coherent model of what makes us go from “llms answer our prompts in harmless ways” to “they rebel and annihilate humanity”.
It’s better than Stampy (try asking both some interesting questions!). Stampy is cheaper to run, though.
I wasn’t able to get LLMs to produce valid arguments or answer questions correctly without the context, though that could be scaffolding/skill issue on my part.
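For concreteness, here’s a minimal sketch of the kind of scaffolding this describes: a fixed, curated corpus passed as the system prompt on every turn. The SDK, file name, and model name below are illustrative assumptions, not the bot’s actual setup.

```python
# Minimal sketch only; assumes Anthropic's Python SDK and a local file holding
# the curated AI-safety context. None of these names reflect the real bot.
import anthropic

with open("ai_safety_context.md") as f:
    SAFETY_CONTEXT = f.read()  # the ~200k-token curated context, loaded once

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def answer(question: str, history: list[dict] | None = None) -> str:
    """Answer one user turn, grounding the model in the fixed safety context."""
    messages = (history or []) + [{"role": "user", "content": question}]
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=SAFETY_CONTEXT,             # the large fixed context on every call
        messages=messages,
    )
    return response.content[0].text

print(answer("Why would a smarter-than-human AI be dangerous by default?"))
```

The point of the sketch is just that the “200k tokens of context” rides along with every request; the hard part is curating that context, not the plumbing.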
Good job trying and putting this out there. Hope you iterate on it a lot and make it better.
Personally, I utterly despise this current writing style. Maybe you can look at the Void bot on Bluesky, which is based on Gemini pro—it’s one of the rare bots I’ve seen whose writing is actually ok.
Thanks, but, uhm, try not to specify “your mom” as the background and “what the actual fuck is ai alignment” as your question if you want it to have a writing style that’s not full of “we’re toast”.
Maybe the option of not specifying the writing style at all, for impatient people like me?
Unless you see this as more something to be used by advocacy/comms groups to make materials for explaining things to different groups, which makes sense.
If the general public is really the target, then adding some kind of voice mode seems like it would reduce latency a lot
This specific page is not really optimized for any use by anyone whatsoever; there are maybe five bugs, each solvable with one query to Claude, and all not a priority; the cool thing I want people to look at is the chatbot (when you give it some plausible context)!
(Also, non-personalized intros to why you should care about ai safety are still better done by people.)
I really wouldn’t want to give a random member of the US general public a thing that advocates for AI risk while having a gender drop-down like that.[1]
The kinds of interfaces it would have if we get to scale it[2] would be very dependent on where specific people are coming from. I.e., demographic info can be pre-filled and not necessarily displayed if it’s from ads; or maybe we ask one person we’re talking to to share it with two other people, and generate unique links with pre-filled info that was provided by the first person; etc.
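For illustration, a hypothetical sketch of the unique-link idea: encode whatever demographic fields the first person provided into a signed query string, so the page can skip the dropdowns. The secret, field names, and URL parameters here are all made up, not the actual implementation.

```python
# Hypothetical sketch; standard library only, not the real implementation.
import hashlib
import hmac
import json
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-real-secret"  # illustrative
BASE_URL = "https://whycare.aisgf.us/"      # page that would read and verify the params

def make_prefilled_link(profile: dict) -> str:
    """Build a shareable link carrying pre-filled context plus a tamper-check signature."""
    payload = json.dumps(profile, sort_keys=True)
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return BASE_URL + "?" + urlencode({"profile": payload, "sig": signature})

print(make_prefilled_link({"age_range": "25-34", "background": "software engineer"}))
```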
Voice mode would have a huge latency due to the 200k token context and thinking prior to responding.
which I really want to! someone please give us the budget and volunteers!
at the moment, we have only me working full-time (for free), $10k from SFF, and ~$15k from EAs who considered this to be the most effective nonprofit in this field.
reach out if you want to donate your time or money. (donations are tax-deductible in the us.)
Is the 200k context itself available to use anywhere? How different is it from the Stampy.ai dataset? No worries if you don’t know due to not knowing what exactly Stampy’s dataset is.
I get questions a lot from regular ML researchers on what exactly alignment is, and I wish I had an actually good thing to send them. Currently I either give a definition myself or send them to the Alignment Forum.
Nope; I’m somewhat concerned about unethical uses (e.g., talking to a lot of people without disclosing it’s AI), so I won’t publicly share the context.
If the chatbot answers questions well enough, we could in principle embed it into whatever you want if that seems useful. Currently have a couple of requests like that. DM me somewhere?
A big problem with AI safety advocacy is that we aren’t reaching enough people fast enough. This problem doesn’t have the same familiarity among the public as climate change or even factory farming, we don’t have people running around in the streets preaching about the upcoming AI apocalypse, and most LessWrongers can’t even come up with a quick 5-minute sales pitch for lay people even if their life literally depended on it.
This might just be the best advocacy tool I have seen so far; if only we can get it to go viral, it might just make the difference.
Edit:
I take this part back:
most LessWrongers can’t even come up with a quick 5-minute sales pitch for lay people even if their life literally depended on it
I have seen some really bad attempts at explaining AI x-risk in layman’s terms, most of which were from older posts, and just assumed that was the norm.
Now, looking at newer posts, I think the situation has greatly improved: not ideal, but way better than I thought.
I still think this tool would be a great way to reach the wider public, especially if it incorporates a better citation function so people can check the source material (it does sort of point the user to other websites, but not to technical papers).
Thanks! I think we’re close to a point where I’d want to put this in front of a lot of people, though we don’t have the budget for this (which seems ridiculous, given the stats we have for our ads results etc.), and also haven’t yet optimized the interface (as in, half the US public won’t like the gender dropdown).
Also, it’s much better at conversations than at producing 5-minute elevator pitches. (It’s hard to make it good at meeting the user where they are while getting to the point instead of being very sycophantic.)
The end goal is to be able to explain the current situation to people at scale.
Question: does LessWrong have any policies/procedures around accessing user data (e.g., private messages)? E.g., if someone from Lightcone Infrastructure wanted to look at my private DMs or post drafts, would they be able to without approval from others at Lightcone/changes to the codebase?
Expanding on Ruby’s comment with some more detail, after talking to some other Lightcone team members:
Those of us with access to database credentials (which is, in theory, all the core team members) would be physically able to run those queries without getting sign-off from another Lightcone team member. We don’t look at the contents of users’ DMs without their permission unless we get complaints about spam or harassment, and in those cases we also try to take care to only look at the minimum information necessary to determine whether the complaint is valid; this has happened extremely rarely[1]. Similarly, we don’t read the contents or titles of users’ never-published[2] drafts. We also don’t look at users’ votes except when conducting investigations into suspected voting misbehavior like targeted downvoting or brigading; when we do, we’re careful to only look at the minimum amount of information necessary to render a judgment, and we try to minimize the number of moderators who conduct any given investigation.
We do see drafts that were previously published and then redrafted in certain moderation views. Some users will post something that gets downvoted and then redraft it; we consider this reasonable because other users will have seen the post and it could easily have been archived by e.g. archive.org in the meantime.
I occasionally incidentally see drafts by following our automated error-logging to the page where the error occurred, which could be the edit-post page, and in those cases I have looked enough to check things like whether it contains embeds, whether collaborative editing is turned on, etc. In those cases I try not to read the actual content. I don’t think I’ve ever stumbled onto a draft dramapost this way, but if I did I would treat it as confidential until it was published. (I wouldn’t do this with a DM.)
Is there an immutable (or at least “not mutable by the person accessing the database”) access log which will show which queries were run by which users who have database credentials? If there is, I suspect that mentioning that will alleviate many concerns.
No. It turns out after a bit of digging that this might be technically possible even though we’re a ~7-person team, but it’d still be additional overhead and I’m not sure I buy that the concerns it’d be alleviating are that reasonable[1].
Not a confident claim. I personally wouldn’t be that reassured by the mere existence of such a log in this case, compared to my baseline level of trust in the other admins, but obviously my epistemic state is different from that of someone who doesn’t work on the site. Still, I claim that it would not substantially reduce the (annualized) likelihood of an admin illicitly looking at someone’s drafts/DMs/votes; take that as you will. I’d be much more reassured (in terms of relative risk reduction, not absolute) by the actual inability of admins to run such queries without a second admin’s thumbs-up, but that would impose an enormous burden on our ability to do our jobs day-to-day without a pretty impractical level of investment in new tooling (after which I expect the burden would merely be “very large”).
I think it would be feasible to increase the friction on improper access, but it’s basically impossible to do in a way that’s loophole-free. The set of people with database credentials is almost identical to the set of people who do development on the site’s software. So we wouldn’t be capturing a log of only queries typed in manually; we’d mostly be capturing queries run by their modified locally-running webservers, typically connected to a database populated with a mirror snapshot of the prod DB but occasionally connected to the actual prod DB.
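For concreteness, a minimal sketch of what that friction could look like at the application level: a small helper admins would use for ad-hoc production queries, which records who ran what, and why, in an append-only audit table before executing. This is purely illustrative (and in Python rather than the site’s actual stack); as noted above, it wouldn’t capture queries issued by a locally-running webserver, so it raises friction rather than closing the loophole.

```python
# Illustrative only: assumes a Postgres DSN, the psycopg2 driver, and a
# hypothetical append-only audit_log table that admin credentials can INSERT
# into but not UPDATE or DELETE. Not LessWrong's actual tooling.
import getpass
from datetime import datetime, timezone

import psycopg2

def run_logged_query(dsn: str, sql: str, reason: str):
    """Record who ran what and why, then execute the ad-hoc query."""
    operator = getpass.getuser()
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Write the audit record first, in its own transaction, so the
            # query text is preserved even if the query itself fails.
            cur.execute(
                "INSERT INTO audit_log (operator, ran_at, query, reason) "
                "VALUES (%s, %s, %s, %s)",
                (operator, datetime.now(timezone.utc), sql, reason),
            )
            conn.commit()
            cur.execute(sql)
            rows = cur.fetchall() if cur.description else []
            conn.commit()
            return rows
    finally:
        conn.close()
```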
Thanks for the response; my personal concerns[1] would be somewhat alleviated, without any technical changes, by:
Lightcone Infrastructure explicitly promising not to look at private messages unless a counterparty agrees to that (e.g., because a counterparty reports spam);
Everyone with such access explicitly promising to tell others at Lightcone Infrastructure when they access any private content (DMs, drafts).
Clarifying in the first case: If Bob signs up and DMs 20 users, and one reports spam, are you saying that we can only check his DM, or that at this time we can then check a few others (if we wish to)?
TBH the main thing that helps with in practice is that it forces teams to get off the “emailed spreadsheet of shared passwords” model of access management. Which mainly becomes useful if someone is leaving the team in a hurry under less than ideal circumstances.
“That problem is not on the urgent/important pareto frontier” is absolutely a valid answer though, especially since AFAIK LW doesn’t store any data more sensitive than passwords / a few home addresses.
We have policies to not look at user data. Vote data and DM data are the most sacred, though we will look at votes if the patterns suggest fraudulent behavior (e.g. mass downvoting of a person). We tend to inform/consult others on this, but no, there’s nothing technical blocking someone from accessing the data on their own.
I don’t think we currently have one. As far as I know, LessWrong hasn’t had any requests made of it by law enforcement that would trip a warrant canary while I’ve been working here (since July 5th, 2022). I have no information about before then. I’m not sure this is at the top of our priority list; we’d need to stand up some new infrastructure for it to be more helpful than harmful (i.e. because we forgot to update it, or something).
I want to make a thing that talks about why people shouldn’t work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees who they have to deceive.
If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn’t going to work. I think you’d probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.
Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful in practice.
I think it’s relatively common for people I encounter to think both:
Anthropic leadership is engaged in somewhat scummy appeasement of safety-motivated employees in ways that are misleading or based on kinda obviously motivated reasoning. (Which results in safety-motivated employees having a misleading picture of what the organization is doing and why, and of what people expect to happen.)
Anthropic is strongly net positive despite this and working on capabilities there is among the best things you can do.
An underlying part of this view is typically that moderate improvements in effort spent on prosaic safety measures substantially reduce risk. I think you probably strongly disagree with this, and this might be a major crux.
Personally, I agree with what Zach said. I think working on capabilities[1] at Anthropic is probably somewhat net positive, but it would only be the best thing to work on if you had a very strong comparative advantage relative to all the other useful stuff (e.g. safety research). So probably most altruistic people with views similar to mine should do something else. I currently don’t feel very confident that capabilities work at Anthropic is net positive, and I could imagine swinging towards thinking it is net negative based on additional evidence.
fwiw I agree with most but not all details, and I agree that Anthropic’s commitments and policy advocacy have a bad track record, but I think that Anthropic capabilities is nevertheless net positive, because Anthropic has way more capacity and propensity to do safety stuff than other frontier AI companies.
I wonder what you believe about Anthropic’s likelihood of noticing risks from misalignment relative to other companies, or of someday spending >25% of internal compute on (automated) safety work.
If people work for Anthropic because they’re misled about the nature of the company, I don’t think arguments on whether they’re net-positive have any local relevance.
Still, to reply: They are one of the companies in the race to kill everyone.
Spending compute on automated safety work does not help. If the system you’re running is superhuman, it kills you instead of doing your alignment homework; if it’s not superhuman, it can’t solve your alignment homework.
Anthropic is doing some great research; but as a company at the frontier, their main contribution could’ve been making sure that no one builds ASI until it’s safe; that there’s legislation that stops the race to the bottom; that the governments understand the problem and want to regulate; that the public is informed of what’s going on and what legislation proposes.
Instead, Anthropic argues against regulation in private, lies about legislation in public, misleads its employees about its role in various things.
***
If Anthropic had to not stay at the frontier to be able to spend >25% of their compute on safety, do you expect they would?
Do you really have a coherent picture of the company in mind, where it is doing all the things it’s doing now (such as not taking steps that would slow down everyone), and yet would behave responsibly when it matters most and the pressure not to is highest?
I recall a video circulating that showed Dario had changed his position on racing with China, which feels perhaps relevant. People can of course change their minds, but I still dislike it.
Horizon Institute for Public Service is not x-risk-pilled
Someone saw my comment and reached out to say it would be useful for me to make a quick take/post highlighting this: many people in the space have not yet realized that Horizon people are not x-risk-pilled.
Edit: some people reached out to me to say that they’ve had different experiences (with a minority of Horizon people).
My sense is Horizon is intentionally a mixture of people who care about x-risk and people who broadly care about “tech policy going well”. IMO both are laudable goals.
My guess is Horizon Institute has other issues that make me not super excited about it, but I think this one is a reasonable call.
Importantly, AFAICT some Horizon fellows are actively working against x-risk (pulling the rope backwards, not sideways). So Horizon’s sign of impact is unclear to me. For a lot of people, “tech policy going well” means “regulations that don’t impede tech companies’ growth”.
My two cents: People often rely too much on whether someone is “x-risk-pilled” and not enough on evaluating their actual beliefs/skills/knowledge/competence. For example, a lot of people could pass some sort of “I care about existential risks from AI” test without necessarily making it a priority or having particularly thoughtful views on how to reduce such risks.
Here are some other frames:
Suppose a Senator said “Alice, what are some things I need to know about AI or AI policy?” How would Alice respond?
Suppose a staffer said “Hey Alice, I have some questions about [AI2027, superintelligence strategy, some Bengio talk, pick your favorite reading/resource here].” Would Alice be able to have a coherent back-and-forth with the staffer for 15+ mins that goes beyond a surface level discussion?
Suppose a Senator said “Alice, you have free rein to work on anything you want in the technology portfolio—what do you want to work on?” How would Alice respond?
In my opinion, potential funders/supporters of AI policy organizations should be asking these kinds of questions. I don’t mean to suggest it’s never useful to directly assess how much someone “cares” about XYZ risks, but I do think that on-the-margin people tend to overrate that indicator and underrate other indicators.
Relatedly, I think people often do some sort of “is this person an EA” or “is this person an x-risk person” check, and I would generally encourage people to try to use this sort of thinking less. It feels like AI policy discussions are getting sophisticated enough that we can actually Have Nuanced Conversations and evaluate people less on some sort of “do you play for the Right Team” axis and more on “what is your specific constellation of beliefs/skills/priorities/proposals” dimensions.
I would otherwise agree with you, but I think the AI alignment ecosystem has been burnt many times in the past by giving a bunch of money to people who said they cared about safety, but not asking enough questions about whether they actually believed that AI may kill everyone and that this is near or at the top of their priorities.
I’m not sure if we disagree. I think there are better ways to assess this than the way the “is this an x-risk person or not” tribal card often gets applied.
Example: “Among all the topics in AI policy and concerns around AI, what are your biggest priorities?” is a good question IMO.
Counterexample: “Do you think existential risk from advanced AI is important?” is a bad question IMO (especially in isolation).
It is very easy for people to say they care about “AI safety” without giving much indication of where it stands on their priority list, what sorts of ideas/plans they want to aim for, what threat models they are concerned about, if they are the kind of person who can have a 20+ min conversation about interesting readings or topics in the field, etc.
I suspect that people would get “burnt” less if they asked these kinds of questions instead of defaulting to some sort of “does this person care about safety” frame or “is this person Part of My Tribe” thing.
(On that latter point, it is rather often that I hear people say things like “Alice is amazing!” and then when I ask them about Alice’s beliefs or work they say something like “Oh I don’t know much about Alice’s work— I just know other people say Alice is amazing!”. I think it would be better for people to say “I think Alice is well-liked but I personally do not know much about her work or what kinds of things she believes/prioritizes.”)
FWIW this is also my impression but I’m going off weak evidence (I wrote about my evidence here), and Horizon is pretty opaque so it’s hard to tell. A couple weeks ago I tried reaching out to them to talk about it but they haven’t responded.
I long wondered why OpenPhil made so many obvious mistakes in the policy space. That level of incompetence just did not make any sense.
I did not expect this to be the explanation:
THEY SIMPLY DID NOT HAVE ANYONE WITH ANY POLITICAL EXPERIENCE ON THE TEAM until hiring one person in April 2025.
This is, like, insane. Not what I’d expect at all from any org that attempts to be competent.
(openphil, can you please hire some cracked lobbyists to help you evaluate grants? This is, like, not quite an instance of Graham’s Design Paradox, because instead of trying to evaluate grants you know nothing about, you can actually hire people with credentials you can evaluate, who’d then evaluate the grants. thank you <3)
To be clear, I don’t think this is an accurate assessment of what is going on. If anything, I think that, on the margin, people with more “political experience” seemed to mess up more.
In general, takes of the kind “oh, just hire someone with expertise in this” almost never make sense IMO. First of all, identifying actual real expertise is hard. Second, general competence and intelligence is a better predictor of task performance in almost all domains after even just a relatively short acclimation period, and OpenPhil people far exceed on that front. Third, the standard practices in many industries are insane, and most of the time if you hire someone specifically for their expertise in a domain, not just as an advisor but as an active team member, they will push for adopting those standard practices even when they don’t make sense.
I don’t think Mikhail’s saying that hiring an expert is sufficient. I think he’s saying that hiring an expert, in a very high-context and unnatural/counter-intuitive field like American politics, is necessary, or that you shouldn’t expect success trying to re-derive all of politics in a vacuum from first principles. (I’m sure OpenPhil was doing the smarter version of this thing, where they had actual DC contacts they were in touch with, but that they still should have expected this to be insufficient.)
Often the dumb versions of ways of dealing with the political sphere (advocated by people with some experience) just don’t make any sense at all, because they’re directional heuristics that emphasize their most counterintuitive elements. But, in talking to people with decades of experience and getting the whole picture, the things they say actually do make sense, and I can see how the random interns or whatever got their dumb takes (by removing the obvious parts from the good takes, presenting only the non-obvious parts, and then over-indexing on them).
I big agree with Habryka here in the general case and am routinely disappointed by input from ‘experts’; I think politics is just a very unique space with a bunch of local historical contingencies that make navigation without very well-calibrated guidance especially treacherous. In some sense it’s more like navigating a social environment (where it’s useful to have a dossier on everyone in the environment, provided by someone you trust) than it is like navigating a scientific inquiry (where it’s often comparatively cheap to relearn or confirm something yourself rather than deferring).
I mean, it’s not like OpenPhil hasn’t been interfacing with a ton of extremely successful people in politics. For example, OpenPhil approximately co-founded CSET, and talks a ton with people at RAND, and has done like 5 bajillion other projects in DC and works closely with tons of people with policy experience.
The thing that Jason is arguing for here is “OpenPhil needs to hire people with lots of policy experience into their core teams”, but man, that’s just such an incredibly high bar. The relevant teams at OpenPhil are like 10 people in-total. You need to select on so many things. This is like saying that Lightcone “DOESN’T HAVE ANYONE WITH ARCHITECT OR CONSTRUCTION OR ZONING EXPERIENCE DESPITE RUNNING A LARGE REAL ESTATE PROJECT WITH LIGHTHAVEN”. Like yeah, I do have to hire a bunch of people with expertise on that, but it’s really very blatantly obvious from where I am that trying to hire someone like that onto my core teams would be hugely disruptive to the organization.
It seems really clear to me that OpenPhil has lots of contact with people who have lots of policy experience, frequently consults with them on stuff, and that the people working there full-time seem reasonably well selected to me. The only way I see the things Jason is arguing for working out is if OpenPhil were to much more drastically speed up their hiring, but hiring quickly is almost always a mistake.
Part of the distinction I try to draw in my sequence is that the median person at CSET or RAND is not “in politics” at all. They’re mostly researchers at think tanks, writing academic-style papers about what kinds of policies would be theoretically good for someone to adopt. Their work is somewhat more applied/concrete than the work of, e.g., a median political science professor at a state university, but not by a wide margin.
If you want political experts—and you should—you have to go talk to people who have worked on political campaigns, served in the government, or led advocacy organizations whose mission is to convince specific politicians to do specific things. This is not the same thing as a policy expert.
For what it’s worth, I do think OpenPhil and other large EA grantmakers should be hiring many more people. Hiring any one person too quickly is usually a mistake, but making sure that you have several job openings posted at any given time (each of which you vet carefully) is not.
I agree that this is the same type of thing as the construction example for Lighthaven, but I also think that you did leave some value on the table there in certain ways (e.g. commercial-grade furniture vs consumer-grade furniture), and I think policy knowledge should make up a larger share of the domain-specific knowledge I’d hope exists at Open Phil than hospitality/construction knowledge should of the domain-specific knowledge I’d hope exists at Lightcone.
I hear you as saying ‘experts aren’t all that expert’ + ‘hiring is hard’ + ‘OpenPhil does actually have access to quite a few experts when they need them’ = ‘OpenPhil’s strategy here is very reasonable.’
I agree in principle here but think that, on the margin, it just is way more valuable to have the skills in-house than to have external people giving you advice (so that they have both sides of the context, so that you can make demands of them rather than requests, so that they’re filtered for a pretty high degree of value alignment, etc.). This is why Anthropic and OAI have policy teams staffed with former federal government officials. It just doesn’t get much more effective than that.
I don’t share Mikhail’s bolded-all-caps-shock at the state of things; I just don’t think the effects you’re reporting, while elucidatory, are a knockdown defense of OpenPhil being (seemingly) slow to hire for a vital role. But running orgs is hard and I wouldn’t shackle someone to a chair to demand an explanation.
Separately, a lot of people defer to some discursive thing like ‘The OP Worldview’ when defending or explicating their positions, and I can’t for the life of me hammer out who the keeper of that view is. It certainly seems like a knock against this particular kind of appeal when their access to policy experts is on-par with e.g. MIRI and Lightcone (informal connections and advisors), rather than the ultra-professional, ultra-informed thing it’s often floated as being. OP employees have said furtive things like ‘you wouldn’t believe who my boss is talking to’ and, similarly, they wouldn’t believe who my boss is talking to. That’s hardly the level of access to experts you’d want from a central decision-making hub aiming to address an extinction-level threat!
To be clear, I was a lot more surprised when I was told about some of what OpenPhil did in DC, once starting to facepalm really hard after two sentences and continuing to facepalm very hard for most of a ten-minute-long story. It was so obviously dumb, that even me, with basically zero exposure to American politics or local DC norms and only some tangential experience running political campaigns in a very different context (an authoritarian country), immediately recognized it as obviously very stupid. While listening, I couldn’t think of better explanations than stuff like “maybe Dustin wanted x and OpenPhil didn’t have a way to push back on it”. But not having anyone who could point out how this would be very, very stupid, on the team, is a perfect explanation for the previous cringe over their actions; and it’s also incredibly incompetent, on the level I did not expect.
As Jason correctly noted, it’s not about “policy”. This is very different from writing papers and figuring out what a good policy should be. It is about advocacy: getting a small number of relevant people to make decisions that lead to the implementation of your preferred policies. OpenPhil’s goals are not papers; and some of the moves they’ve made, which affect their impact more than any of the papers they’ve funded, are ridiculously bad.
A smart enough person could figure it out from first principles, with no experience, or by looking at stuff like how climate change became polarized; but for most people, it’s a set of intuitions, skills, and knowledge that is very separate from those that make you a good evaluator of research grants.
It is absolutely obvious to me that someone experienced in advocacy should get to give feedback on a lot of decisions that you plan to make, including because some of them can have strategic implications you didn’t think about.
Instead, OpenPhil are a bunch of individuals who apparently often don’t know the right questions to ask even despite their employer’s magic of everyone wanting to answer their questions.
(I disagree with Jason on how transparent grant evaluations ought to be; if you’re bottlenecked by time, it seems fine to make handwavy bets. You just need people who are good at making bets. The issue is that they’re not selected for making good bets in politics, and so they fuck up; the issue is not with the general idea of having people who make bets.)
I’m the author of the LW post being signal-boosted. I sincerely appreciate Oliver’s engagement with these critiques, and I also firmly disagree with his blanket dismissal of the value of “standard practices.”
As I argue in the 7th post in the linked sequence, I think OpenPhil and others are leaving serious value on the table by not adopting some of the standard grant evaluation practices used at other philanthropies, and I don’t think they can reasonably claim to have considered and rejected them—instead the evidence strongly suggests that they’re (a) mostly unaware of these practices due to not having brought in enough people with mainstream expertise, and (b) quickly deciding that anything that seems unfamiliar or uncomfortable “doesn’t make sense” and can therefore be safely ignored.
We have a lot of very smart people in the movement, as Oliver correctly points out, and general intelligence can get you pretty far in life, but Washington, DC is an intensely competitive environment that’s full of other very smart people. If you try to compete here with your wits alone while not understanding how politics works, you’re almost certainly going to lose.
general competence and intelligence is a better predictor of task performance in almost all domains after even just a relatively short acclimation period
Can you say more about this? I’m aware of the research on g predicting performance on many domains, but the quoted claim is much stronger than the claims I can recall reading.
random thought, not related to GP comment: i agree identifying expertise in a domain you don’t know is really hard, but from my experience, identifying generalizable intelligence/agency/competence is less hard. generally it seems like a useful signal to see how fast they can understand and be effective at a new thing that’s related to what they’ve done before but that they’ve not thought much specifically about before. this isn’t perfectly correlated with competence at their primary field, but it’s probably still very useful.
e.g it’s generally pretty obvious if someone is flailing on an ML/CS interview Q because they aren’t very smart, or just not familiar with the tooling. people who are smart will very quickly and systematically figure out how to use the tooling, and people who aren’t will get stuck and sit there being confused. I bet if you took e.g a really smart mathematician with no CS experience and dropped them in a CS interview, it would be very fascinating to watch them figure out things from scratch
disclaimer that my impressions here are not necessarily strictly tied to feedback from reality on e.g job performance (i can see whether people pass the rest of the interview after making a guess at the 10 minute mark, but it’s not like i follow up with managers a year after they get hired to see how well they’re doing)
PSA: if you’re looking for a name for your project, most interesting .ml domains are probably available for $10, because the mainstream registrars don’t support the TLD.
I bought over 170 .ml domains, including anthropic.ml (redirects to the Fooming Shoggoths song), closed.ml & evil.ml (redirect to OpenAI Files), interpretability.ml, lens.ml, evals.ml, and many others (I’m happy to donate them to AI safety projects).
Since this seems to be a crux, I propose a bet to @Zac Hatfield-Dodds (or anyone else at Anthropic): someone shows random people in San Francisco Anthropic’s letter to Newsom on SB-1047. I would bet that among the first 20 who fully read at least one page, over half will say that Anthropic’s response to SB-1047 is closer to presenting the bill as 51% good and 49% bad than to presenting it as 95% good and 5% bad.
Sorry, I’m not sure what proposition this would be a crux for?
More generally, “what fraction good vs bad” seems to me a very strange way to summarize Anthropic’s Support if Amended letter or letter to Governor Newsom. It seems clear to me that both are supportive in principle of new regulation to manage emerging risks, and offer Anthropic’s perspective on how best to achieve that goal. I expect most people who carefully read either letter would agree with the preceding sentence and would be open to bets on such a proposition.
Personally, I’m also concerned about the downside risks discussed in these letters—because I expect they both would have imposed very real costs, and reduced the odds of the bill passing and of similar regulations passing and enduring in other jurisdictions. I nonetheless concluded that the core of the bill was sufficiently important and urgent, and the downsides manageable, that I supported passing it.
I claim that a responsible frontier AI company would’ve behaved very differently from Anthropic. In particular, the letter said basically “we don’t think the bill is that good and don’t really think it should be passed” more than it said “please sign”. This is very different from your personal support for the bill; you indeed communicated “please sign”.
Sam Altman has also been “supportive of new regulation in principle”. These words sadly don’t align with either OpenAI’s or Anthropic’s lobbying efforts, which have been fairly similar. The question is, was Anthropic supportive of SB-1047 specifically? I expect people to not agree Anthropic was after reading the second letter.
I strongly disagree that OpenAI’s and Anthropic’s efforts were similar (maybe there’s a bet there?). OpenAI formally opposed the bill without offering useful feedback; Anthropic offered consistent feedback to improve the bill, pledged to support it if amended, and despite your description of the second letter Senator Wiener describes himself as having Anthropic’s support.
I also disagree that a responsible company would have behaved differently. You say “The question is, was Anthropic supportive of SB-1047 specifically?”—but I think this is the wrong question, implying that lack of support is irresponsible rather than e.g. due to disagreements about the factual question of whether passing the bill in a particular state would be net-helpful for mitigating catastrophic risks. The Support if Amended letter, for example, is very clear:
Anthropic does not support SB 1047 in its current form. However, we believe the bill’s core aims to ensure the safe development of AI technologies are worthy, and that it is possible to achieve these aims while eliminating most of the current bill’s substantial drawbacks, as we will propose here. … We are committed to supporting the bill if all of our proposed amendments are made.
I don’t expect further discussion to be productive though; much of the additional information I have is nonpublic, and we seem to have different views on what constitutes responsible input into a policy process as well as basic questions like “is Anthropic’s engagement in the SB-1047 process well described as ‘support’ when the letter to Governor Newsom did not have the word ‘support’ in the subject line”. This isn’t actually a crux for me, but I and Senator Wiener seem to agree yes, while you seem to think no.
One thing to highlight, which I only learned recently, is that the norm when submitting letters to the governor on any bill in California is to include “Support” or “Oppose” in the subject line to clearly state the company’s position.
Anthropic importantly did NOT include “support” in the subject line of the second letter. I don’t know how to read this as anything other than that Anthropic did not support SB-1047.
Good point! That seems right; advocacy groups seem to think the governor’s staff sorts letters by “support” / “oppose” / “request for signature” / “request for veto”, and they recommend stating one of those in the subject line. Examples: 1, 2.
Anthropic has indeed not included any of that in their letter to Gov. Newsom.
Is there a write up on why the “abundance and growth” cause area is an actually relatively efficient way to spend money (instead of a way for OpenPhil to be(come) friends with everyone who’s into abundance & growth)? (These are good things to work on, but seem many orders of magnitude worse than other ways to spend money.)
Modern economic growth has transformed global living standards, delivering vast improvements in health and well-being while helping to lift billions of people out of poverty.
Where does economic growth come from? Because new ideas — from treating infections with penicillin to designing jet engines — can be shared and productively applied by multiple people at once, mainstream economic theory holds that scientific and technological progress that creates ideas is the main driver of long-run growth. In a recent article, Stanford economist Chad Jones estimates that the growth in ideas can account for around 50% of per-capita GDP growth in the United States over the past half-century. This implies that the benefits of investing in innovation are large: Ben Jones and Larry Summers estimate that each $1 invested in R&D gives a social return of $14.40. Our Open Philanthropy colleagues Tom Davidson and Matt Clancy have done similar calculations that take into account global spillovers (where progress in one country also boosts others through the spread of ideas), and found even larger returns for R&D and scientific research.
But ideas don’t automatically raise living standards; economic growth requires turning them into technologies that can disseminate throughout society. Burdensome government regulations and institutional constraints are increasingly slowing the pace of this progress and creating artificial scarcity. Restrictive zoning and land use regulations have created housing shortages in many major cities, driving up rents and preventing people from making productive moves to centers of economic growth and innovation. Similar constraints hinder scientific and technological innovation — key institutional funders like the NSF or the National Institutes for Health (NIH) burden researchers with excessive paperwork and overly lengthy grant review processes, while preferring low-risk, incremental research over higher-risk but potentially transformative ideas. Meanwhile, environmental review laws slow a wide variety of infrastructure projects, including green energy.
...
For Open Philanthropy as an institution, the timing is also right. Learning from the recent success of our Lead Exposure Action Fund (LEAF), which doubled the total amount of philanthropic funding toward lead exposure reduction in low-income countries, we are increasingly exploring pooled funding models. We talked with a number of like-minded donors who suggested potential appetite for a pooled fund like this, and ultimately received commitments for over $60 million so far from other funders. We’re grateful to Good Ventures, Patrick Collison, and our other donors in this fund for their support, and we’re always excited to hear from other funders who might be interested in collaboration opportunities.
Yes, I’ve read their entire post. $14.4 of “social return” per $1 in the US seems incredibly unlikely to be comparable to the best GiveWell interventions or even GiveDirectly.
Ozzie Gooen asked about this before, here’s (the relevant part of) what Alexander Berger replied:
Our innovation policy work is generally based on the assumption that long-run health and income gains are ultimately attributable to R&D. For example, Matt Clancy estimated in this report that general funding for scientific research ranged from 50-330x in our framework, depending on the model and assumptions about downside risks from scientific research. In practice we currently internally use a value of average scientific research funding of 70x when evaluating our innovation policy work. Of course, 70x is well below our bar (currently ~2,100x), and so the premise of the program is not to directly fund additional scientific research, but instead to make grants that we think are sufficiently likely to increase the effective size of R&D effort by raising its efficiency or productivity or level enough to clear the bar. Moreover, while most of our giving in this program flows to grantees in high-income countries operating on the research frontier, the ultimate case is based on global impact: we assume research like this eventually benefits everyone, though with multi-decade lags (which in practice lead us to discount the benefits substantially, as discussed in Matt’s paper above and this report by Tom Davidson).
Our innovation policy work so far has cleared our internal bar for impact, and one reason we are excited to expand into this space is because we’ve found more opportunities that we think are above the bar than Good Ventures’ previous budget covered.
We also think our housing policy work clears our internal bar for impact. Our current internal valuation on a marginal housing unit in a highly constrained metro area in the US is just over $400k (so a grant would be above the bar if we think it causes a new unit in expectation for $200). A relatively small part of the case here is again based on innovation—there is some research indicating that increasing the density of people in innovative cities increases the rate of innovation. But our internal valuation for new housing units also incorporates a few other paths to impact. For example, increasing the density of productive cities also raises the incomes of movers and other residents, and reduces the overall carbon footprint of the housing stock. Collectively, we think these benefits are large enough to make a lot of grants related to housing policy clear our bar, given the leverage that advocacy can sometimes bring.
Someone else asked him to clarify what he meant by the numbers on the housing policy work and also separately asked
Also, I read the report you linked on R&D, where it didn’t clear the funding bar. That said 45x; you were pushing that up to 76x
“In a highly stylized calculation, the social returns to marginal R&D are high, but typically not as high as the returns in some other areas we’re interested in (e.g. cash transfers to those in absolute poverty). Measured in our units of impact (where “1X” is giving cash to someone earning $50k/year) I estimate the cost effectiveness of funding R&D is 45X. This is 45% the ROI from giving cash to someone earning $500/year, and 4.5% the GHW bar for funding. More.”
I understand that you think you can raise the efficiency of certain types of R&D, but getting from 70x to 2100x means you would have to 30x the efficiency. I struggle to understand how that would be likely. Again, any pointers here?
to which he replied
On the housing piece: we have a long internal report on the valuation question that we didn’t think was particularly relevant to external folks so we haven’t published it, but will see about doing so later this year. Fn 7 and the text around it of this grant writeup explain the basic math of a previous version of that valuation calc, though our recent version is a lot more complex.
If you’re asking about the bar math, the general logic is explained here and the move to a 2,100x bar is mentioned here.
On R&D, the 70x number comes from Matt Clancy’s report (and I think we may have made some modest internal revisions but I don’t think they change the bottom line much). You’re right that that implies we need ~30x leverage to clear our bar. We sometimes think that is possible directly through strategic project selection—e.g., we fund direct R&D on neglected and important global health problems, and sometimes (in the case of this portfolio) through policy/advocacy. I agree 30x leverage presents a high bar and I think it’s totally reasonable to be skeptical about whether we can clear it, but we think we sometimes can.
(I don’t know anything else about this beyond the exchange above, if you’re interested in litigating this further you can try replying to his last comment maybe)
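(For anyone who wants to sanity-check the bar arithmetic in the quoted exchange, here’s a quick back-of-the-envelope sketch; the 70x, 2,100x, and $400k figures are theirs, the rest is just division.)

```python
# Back-of-the-envelope check of the figures quoted above.
# Their unit: "1x" = the value of giving $1 in cash to someone earning ~$50k/year.

rd_value = 70            # their internal value of average scientific research funding
funding_bar = 2100       # their current GHW funding bar, in the same units
leverage_needed = funding_bar / rd_value
print(f"Leverage needed for R&D-related grants to clear the bar: ~{leverage_needed:.0f}x")  # ~30x

housing_unit_value = 400_000  # their internal valuation of one marginal housing unit, in dollars
max_spend_per_unit = housing_unit_value / funding_bar
print(f"Max spend per expected new housing unit: ~${max_spend_per_unit:.0f}")  # ~$190, rounded to $200 above
```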
I mean, I think abundance and growth has much better arguments for improving long-run well-being cost-effectively than reducing global disease burden does. I do think it gets messy because of technological risks, but if you bracket that (which of course is a very risky thing to do), this seems to me like a good reallocation of funds, one that is closer to reasonable.
I’m very confused about how they’re evaluating cost-effectiveness here. Like, no, spending $200 on vaccines in Africa to save lives seems like a much better deal than spending $200 to cause one more $400k apartment to exist.
Do you mean “they” or “me”? I think the latter is very likely better in the long run! Like, the places where $400k apartments exist have enormous positive externalities and enormous per-capita productivity, which is the central driver of technological growth, which is definitely going to determine long-run disease burden and happiness and population levels. The argument here feels pretty straightforward. We can try to put numbers on it, if you want, but if you accept the basic premise it’s kind of hard for the numbers to come out in favor of vaccines.
In RSP, Anthropic committed to define ASL-4 by the time they reach ASL-3.
With Claude 4 released today, they have reached ASL-3. They haven’t yet defined ASL-4.
Turns out, they have quietly walked back on the commitment. The change happened less than two months ago and, to my knowledge, was not announced on LW or other visible places unlike other important changes to the RSP. It’s also not in the changelog on their website; in the description of the relevant update, they say they added a new commitment but don’t mention removing this one.
Anthropic’s behavior is not at all the behavior of a responsible AI company. Trained a new model that reaches ASL-3 before you can define ASL-4? No problem, update the RSP so that you no longer have to, and basically don’t tell anyone. (Did anyone not working for Anthropic know the change happened?)
When their commitments go against their commercial interests, we can’t trust their commitments.
You should not work at Anthropic on AI capabilities.
I don’t think it’s accurate to say that they’ve “reached ASL-3?” In the announcement, they say
To be clear, we have not yet determined whether Claude Opus 4 has definitively passed the Capabilities Threshold that requires ASL-3 protections. Rather, due to continued improvements in CBRN-related knowledge and capabilities, we have determined that clearly ruling out ASL-3 risks is not possible for Claude Opus 4 in the way it was for every previous model, and more detailed study is required to conclusively assess the model’s level of risk.
And it’s also inaccurate to say that they have “quietly walked back on the commitment.” There was no commitment to define ASL-4 by the time they reach ASL-3 in the updated RSP, or in versions 2.0 (released October last year) and 2.1 (see all past RSPs here). I looked at all mentions of ASL-4 in the latest document, and this comes closest to what they have:
If, however, we determine we are unable to make the required showing, we will act as though the model has surpassed the Capability Threshold.9 This means that we will (1) upgrade to the ASL-3 Required Safeguards (see Section 4) and (2) conduct a follow-up capability assessment to confirm that the ASL-4 Standard is not necessary (see Section 5).
Which is what they did with Opus 4. Now they have indeed not provided a ton of details on what exactly they did to determine that the model has not reached ASL-4 (see report), but the comment suggesting that they “basically [didn’t] tell anyone” feels inaccurate.
According to Anthropic’s chief scientist’s interview with Time today, they “work under the ASL-3 standard”. So they have reached the safety level—they’re working under it—and the commitment would’ve applied[1].
There was a commitment in the RSP prior to Oct last year. They did walk back on this commitment quietly: the fact that they walked back on it was not announced in their posts and wasn’t noticed in the posts of others; only a single LessWrong comment in Oct 2024 from someone not affiliated with Anthropic mentions it. I think this is very much “quietly walking back” on a commitment.
According to Midas, the commitment was fully removed in 2.1: “Removed commitment to ‘define ASL-N+1 evaluations by the time we develop ASL-N models’”; a pretty hidden (I couldn’t find it!) revision changelog also attributes the decision to not maintain the commitment to 2.1. At the same time, the very public changelog on the RSP page only mentions new commitments and doesn’t mention the decision to “not maintain” this one.
“they’re not sure whether they’ve reached the level of capabilities which requires ASL-3 and decided to work under ASL-3, to be revised if they find out the model only requires ASL-2” could’ve been more accurate, but isn’t fundamentally different IMO. And Anthropic is taking the view that by the time you develop a model which might be ASL-n, the commitments for ASL-n should trigger until you rule that out. It’s not even clear what a different protocol could be, if you want to release a model that might be at ASL-n. Release it anyway and contain it only after you’ve confirmed it’s at ASL-n?
Meta-level comment now that this has been retracted.
Anthropic’s safety testing for Claude 4 is vastly better than DeepMind’s testing of Gemini. When Gemini 2.5 Pro was released there was no safety testing info, and even the model card that was eventually released is extremely barebones compared to what Anthropic put out.
DeepMind should be embarrassed by this. The upcoming PauseCon protest outside DeepMind’s headquarters in London will focus on this failure.
Btw, since this is a call to participate in a PauseAI protest on my shortform, do your colleagues have plans to do anything about my ban from the PauseAI Discord server—like allowing me to contest it (as I was told there was a discussion of making a procedure for) or at least explaining it?
Because it’s lowkey insane!
For everyone else, who might not know: a year ago I, in context, on the PauseAI Discord server, explained my criticism of PauseAI’s dishonesty and, after being asked to, shared proofs that Holly publicly lied about our personal communications, including sharing screenshots of our messages; a large part of the thread was then deleted by the mods because they were against personal messages getting shared, without warning (I would’ve complied if asked by anyone representing a server to delete something!) or saving/allowing me to save any of the removed messages in the thread, including those clearly not related to the screenshots that you decided were violating the server norms; after a discussion of that, the issue seemed settled and I was asked to maybe run some workshops for PauseAI to improve PauseAI’s comms/proofreading/factchecking; and then, months later, I was banned despite not having interacted with the server at all.
When I reached out after noticing not being able to join the server, there was a surprising combination of being very friendly and excited to chat and scheduling a call and getting my takes on strategy, looking surprised to find out that I was somehow banned, then talking about having “protocols” for notifying of the ban which somehow didn’t work, and mentioning you were discussing creating a way to contest the ban and saying stuff about the importance of allowing the kind of criticism that I did; and at the same time, zero transparency around the actual reasons for the ban, how it happened, why I wasn’t notified, and then giving zero updates.
It’s hard to assume that the PauseAI leadership is following deontology.
Uhh yeah sorry that there hasn’t been a consistent approach. In our defence I believe yours is the only complex moderation case that PauseAI Global has ever had to deal with so far and we’ve kinda dropped the ball on figuring out how to handle it.
For context my take is that you’ve raised some valid points. And also you’ve acted poorly in some parts of this long running drama. And most importantly you’ve often acted in a way that seems almost optimised to turn people off. Especially for people not familiar with LessWrong culture, the inferential distance between you and many people is so vast that they really cannot understand you at all. Your behavior pattern-matches to trolling / nuisance attention seeking in many ways and I often struggle to communicate to more normie types why I don’t think you’re insane or malicious.
I do sincerely hope to iron this out some time and put in place actual systems for dealing with similar disputes in the future. And I did read over the original post + Google doc a few months ago to try to form my own views more robustly. But this probably won’t be a priority for PauseAI Global in the immediate future. Sorry.
This is false. Our ASL-4 thresholds are clearly specified in the current RSP—see “CBRN-4” and “AI R&D-4”. We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.
I wrote the article Mikhail referenced and wanted to clarify some things.
The thresholds are specified, but the original commitment says, “We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth,” and, regarding ASL-4, “Capabilities and warning sign evaluations defined before training ASL-3 models.”
The latest RSP says this of CBRN-4 Required Safeguards, “We expect this threshold will require the ASL-4 Deployment and Security Standards. We plan to add more information about what those entail in a future update.”
Additionally, AI R&D 4 (confusingly) corresponds to ASL-3 and AI R&D 5 corresponds to ASL-4. This is what the latest RSP says about AI R&D 5 Required Safeguards, “At minimum, the ASL-4 Security Standard (which would protect against model-weight theft by state-level adversaries) is required, although we expect a higher security standard may be required. As with AI R&D-4, we also expect an affirmative case will be required.”
I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.
AFAICT, now that ASL-3 has been implemented, the upcoming AI R&D threshold, AI R&D-4, would not mandate any further security or deployment protections. It only requires ASL-3. However, it would require an affirmative safety case concerning misalignment.
I assume this is what you meant by “further protections” but I just wanted to point this fact out for others, because I do think one might read this comment and expect AI R&D 4 to require ASL-4. It doesn’t.
I am quite worried about misuse when we hit AI R&D 4 (perhaps even more so than I’m worried about misalignment) — and if I understand the policy correctly, there are no further protections against misuse mandated at this point.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance that Anthropic takes toward its commitments—things which can be changed later if they see fit—seems to cheapen the term, and makes me skeptical that the policy, as a whole, will be upheld. If people want to orient to the RSP as a provisional statement of intent to act responsibly, then this seems appropriate. But it should not be mistaken for, or conflated with, a real promise to do what was said.
FYI, I was (and remain to this day) confused by AI R&D 4 being called an “ASL-4” threshold. AFAICT as an outsider, ASL-4 refers to a set of deployment and security standards that are now triggered by dangerous capability thresholds, and confusingly, AI R&D 4 corresponds to the ASL-3 standard.
AI R&D 5, on the other hand, corresponds to ASL-4, but only on the security side (nothing is said about the deployment side, which matters quite a bit given that Anthropic includes internal deployment here and AI R&D 5 will be very tempting to deploy internally)
I’m also confused because the content of both AI R&D 4 and AI R&D 5 is seemingly identical to the content of the nearest upcoming threshold in the October 2024 policy (which I took to be the ASL-3 threshold). A rough sketch of what I think happened:
A rough sketch of my understanding of the current policy:
When I squint hard enough at this for a while, I think I can kind of see the logic: the model likely to trigger the CBRN threshold requiring ASL-3 seems quite close, whereas we might be further from the very-high threshold that was the October AI R&D threshold (now AI R&D 4), so the October AI R&D threshold was just bumped to the next level (and the one after that, since causing dramatic scaling of effective compute is even harder than being an entry-level remote worker… maybe) with some confidence that we were still somewhat far away from it, and thus it can be treated effectively as today’s upcoming + to-be-defined (what would have been called n+1) threshold.
I just get lost when we call it an ASL-4 threshold (it’s not, it’s an ASL-3 threshold), and also it mostly makes me sad that these thresholds are so high because I want Anthropic to get some practice reps in implementing the RSP before it’s suddenly hit with an endless supply of fully automated remote workers (plausibly the next threshold, AI R&D 4, requiring nothing more than the deployment + security standards Anthropic already put in place as of today).
I wish today’s AI R&D 4 threshold had been set at what, in the October policy, was called a “checkpoint” on the way to ASL-3: completing 2-8 hour SWE tasks. It looks like we’re about there, and it also looks like we’re about at CBRN-4, and ASL-3 seems like a reasonable set of precautions for both milestones. I do not think ASL-3 will be appropriate when we truly get endless parallelized drop-in Anthropic researchers, even if they have not yet been shown to dramatically increase the rate of effective scaling.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t mean you can’t be pretty smart about deciding exactly what to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
I think the first question to think about is how to use them to make CDT decisions. You can create a market about a causal effect if you have control over the decision and you can randomise it to break any correlations with the rest of the world, assuming the fact that you’re going to randomise it doesn’t otherwise affect the outcome (or bettors don’t think it will).
Committing to doing that does render the market useless for choosing policy, but you could randomly decide whether to randomise or to make the decision via whatever process you actually want to use, and have the market be conditional on the former. You probably don’t want to be randomising your policy decisions too often, but if liquidity wasn’t an issue you could set the probability of randomisation arbitrarily low.
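To make the randomisation trick concrete, here is a minimal toy simulation (my own sketch; the numbers and the decision rule are made up). It just shows that outcomes from the occasional randomised branch recover the causal effect of the decision, while outcomes from the ordinary, correlated decision process don’t; a market conditional on the randomised branch would be pricing the former.

```python
import random

random.seed(0)
TRUE_EFFECT = 1.0    # causal effect of taking action a=1 on the outcome (made up)
CONFOUND = 3.0       # how strongly a hidden state drives both the usual decision and the outcome
P_RANDOMISE = 0.1    # fraction of decisions handed to a coin flip

def run_trial():
    u = random.random()                    # hidden state the usual decision process reacts to
    randomised = random.random() < P_RANDOMISE
    if randomised:
        a = random.random() < 0.5          # coin flip, independent of u
    else:
        a = u > 0.5                        # "whatever process you actually want to use"
    y = TRUE_EFFECT * a + CONFOUND * u + random.gauss(0, 0.1)
    return randomised, a, y

trials = [run_trial() for _ in range(200_000)]

def effect_estimate(randomised_only):
    subset = [(a, y) for r, a, y in trials if r or not randomised_only]
    y1 = [y for a, y in subset if a]
    y0 = [y for a, y in subset if not a]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

print("naive estimate, all trials:      ", round(effect_estimate(False), 2))  # ~2.3, badly confounded
print("estimate, randomised branch only:", round(effect_estimate(True), 2))   # ~1.0, the true effect
```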
I’m accumulating a small collection of spicy, previously unreported deets about Anthropic for an upcoming post. Some of them I sadly cannot publish because they might identify the sources. Others I can! Some of those will be surprising to staff.
If you can share anything that’s wrong with Anthropic, that has not previously been public, DM me, preferably on Signal (@ misha.09)
The IMO organizers asked AI labs not to share their IMO results until a week later to not steal the spotlight from the kids. IMO organizers consider OpenAI’s actions “rude and inappropriate”.
Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure & preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what is the best number to output. Sleeping Beauty is about definitions.
And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)
This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.
Probability is something defined in math by necessity. There’s only one way to do it without getting exploited in the natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.
You all should just call these two probabilities two different words instead of arguing which one is the correct definition for “probability”.
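To make the payout-structure point concrete, here is a toy calculation of my own (squared-error scoring is just one arbitrary choice of reward structure): score Beauty once per awakening and the best number to output is 1/3; score her once per experiment and it’s 1/2.

```python
# Toy Sleeping Beauty payoffs: fair coin, heads -> 1 awakening, tails -> 2 awakenings.
# Beauty reports p = "probability of heads" and is penalised by squared error.

def per_awakening_loss(p):
    # tails gets scored at both awakenings
    return 0.5 * (1 - p) ** 2 + 0.5 * 2 * (0 - p) ** 2

def per_experiment_loss(p):
    # each branch scored exactly once, however many awakenings it contains
    return 0.5 * (1 - p) ** 2 + 0.5 * (0 - p) ** 2

grid = [i / 1000 for i in range(1001)]
print(min(grid, key=per_awakening_loss))   # ~0.333, the "thirder" number
print(min(grid, key=per_experiment_loss))  # 0.5, the "halfer" number
```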
As the creator of the linked market, I agree it’s definitional. I think it’s still interesting to speculate/predict what definition will eventually be considered most natural.
Say an adventurer wants Keltham to coordinate with a priest of Asmodeus on a shared interest. She goes to Keltham and says some stuff that she expects could enable coordination. She expects that Keltham, due to his status as a priest of Abadar, would not act on that information in ways damaging to the Evil priest (she shared it in the expectation that a priest of Abadar would aspire to be Lawful enough not to do that with information shared to enable coordination, making someone regret dealing with him). Keltham prefers using this information in a way damaging to the priest of Asmodeus over using it to coordinate. Keltham made no explicit promises about the use of the information; the adventurer told him that piece of information first, and only afterwards said that it was shared to enable coordination and shouldn’t be acted upon outside of enabling coordination. Would Keltham say “deal, thanks for telling me”, or would he say “lol no, I didn’t agree to that prior to being told; thanks for telling me”?
It is a predictable consequence of saying “lol no, I didn’t agree to that prior to being told; thanks for telling me” that Keltham (and other people with similar expressed views regarding information and coordination) won’t be told information intended for coordination in the future, including information that Keltham and similar people would have wanted to be able to use to coordinate rather than to use against the interests of those giving them the information.
So the question is: just how strongly does Keltham value using this information against the priest, when weighed against the cost of decreasing future opportunities for coordination for himself and others who are perceived to be similar to him?
There are plenty of other factors, such as whether there are established protocols for receiving such information in a way that binds priests of Abadar to not use it against the interests of those conveying it, whether the priest and the adventurer (and future others) could have been expected to know those protocols, and so on.
Yep. (I think there’s also a sense of honor and not screwing people over that’s not just about the value of getting such information in the future, that Keltham would care about.)
I do not believe Anthropic as a company has a coherent and defensible view on policy. It is known that they said things they didn’t stand behind while hiring people (and they claim to have good internal reasons for changing their minds, but people went to work for them because of impressions that Anthropic created and then decided not to stand behind). It is known among policy circles that Anthropic’s lobbyists are similar to OpenAI’s.
From Jack Clark, a billionaire co-founder of Anthropic and its chief of policy, today:
Dario is talking about countries of geniuses in datacenters in the context of competition with China and a 10-25% chance that everyone will literally die, while Jack Clark is basically saying, “But what if we’re wrong about betting on short AI timelines? Security measures and pre-deployment testing will be very annoying, and we might regret them. We’ll have slower technological progress!”
This is not invalid in isolation, but Anthropic is a company that was built on the idea of not fueling the race.
Do you know what would stop the race? Getting policymakers to clearly understand the threat models that many of Anthropic’s employees share.
It’s ridiculous and insane that, instead, Anthropic is arguing against regulation because it might slow down technological progress.
I’ve only seen this excerpt, but it seems to me like Jack isn’t just arguing against regulation because it might slow progress, and is rather saying something more like:
“there’s some optimal time to have a safety intervention, and if you do it too early because your timeline bet was wrong, you risk having worse practices at the actually critical time because of backlash”
This seems probably correct to me? I think ideally we’d be able to be cautious early and still win the arguments to be appropriately cautious later too. But empirically, I think it’s fair not to take that as a given?
I’m surprised by Scott Aaronson’s approach to alignment. He has mentioned in a talk that a research field needs to have at least one of two: experiments or a rigorous mathematical theory, and so he’s focusing on the experiments that are possible to do with the current AI systems.
The alignment problem is centered around optimization producing powerful consequentialist agents: they appear when you’re searching in spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something you get to experiment with (more than once); and we do indeed need a rigorous mathematical theory that would describe the space and point at the parts of it that are agents aligned with us.
[removed]
I’m disappointed that, currently, only Infra-Bayesianism tries to achieve that[1], that I don’t see dozens of other research directions trying to have a rigorous mathematical theory that would provide desiderata for AGI training setups, and that even actual scientists entering the field [removed].
Infra-Bayesianism is an approach that tries to describe agents in a way that would closely resemble the behaviour of AGIs, starting with a way you can model them having probabilities about the world in a computable way that solves non-realizability in RL (short explanation, a sequence with equations and proofs) and making decisions in a way that optimization processes would select for, and continuing with a formal theory of naturalized induction and, finally, a proposal for alignment protocol.
To be clear, I don’t expect Infra-Bayesianism to produce an answer to what loss functions should be used to train an aligned AGI in the time that we have remaining; but I’d expect that if there were a hundred research directions like that, trying to come up with a rigorous mathematical theory that successfully attacks the problem, with thousands of people working on them, some would succeed.
In my opinion, Project Lawful / planecrash is a terrible reference in addition to being written in a seriously annoying format. Although I have read it, I don’t recommend that anyone else read it. If any of the material in it should become some sort of shared culture that we should assume others in the community have read, it would require completely rewriting the entire thing from beginning to end.
I am not one of the two voters who initially downvoted, but I understand why they might have done so. I have weakly downvoted your comment for having made a load-bearing link between someone not having read Project Lawful and calling them “an NPC” in your sense, which is not the standard of discourse I want to see.
If you were expecting this person to have read this extremely niche and frankly bizarre work of fiction without having confirmed that they have actually read it and understood and fully agree with some relevant decision theory parts in it, then that seems pretty unwise of you and their not having done so does not reflect in any way poorly upon them.
“You didn’t act like I think the fictional character Keltham would have” is not a reasonable criticism of anyone.
There may or may not be some other unspecified actions they may have performed that do reflect poorly upon them, but those do not appear to connect in any way with this post.
I think many people around me would’ve had the same assumption that this particular person had read planecrash. I don’t want to say more, as I probably shouldn’t say that they specifically did that, because I think their goals are still similar to mine, even if they’re very mistaken and are doing some very counterproductive things, and I definitely want to err on the side of not harming someone’s life/social status without a strong reason why it would be good for the community to know a fact about them.
NPC-like behavior was mostly due to them doing the thing they seemed to ascribe to themselves as what they should just be doing in their role, without willingness to really consider arguments; planecrash was just a thing that would’ve given them the argument why you shouldn’t take the specific actions they’ve taken. (Basic human decency and friendship would also suffice, but if someone read planecrash and still did the thing I would not want to deal with them in any way in the future, the way you wouldn’t want to deal with someone who just screws you over for no reason.)
“You didn’t act like I think the fictional character Keltham would have” is not a reasonable criticism of anyone.
I agree; it was largely what they did, which has nothing to do with planecrash. There are just some norms, that I expect it would be good for the community to have, that one implicitly learns from planecrash.
I didn’t down-vote and I think planecrash is amazing. But FYI referring to other humans as NPCs, even if you elaborate and make it clear what you mean, leaves a very bad taste in my mouth. If you were a random person I didn’t know anything about, and this was the first thing I read from you*, I’d think you were a bad person and I’d want nothing to do with you.
Not judging you, just informing you about my intuitive immediate reaction to your choice of words. Plausible other people who did downvote felt similar.
(Yep, it was me ranting about experiencing someone betraying my trust in a fairly sad way, who I really didn’t expect to do that, and who was very non-smart/weirdly scripted about doing it, and it was very surprising until I learned that they’ve not read planecrash. I normally don’t go around viewing anyone this way; and I dislike it when (very rarely! i can’t recall any other situations like that!) I do feel this way about someone.)
EDIT: thank you so much for replying to the strongest part of my argument; no one else tried to address it (despite many downvotes).
I disagree with the position that technical AI alignment research is counterproductive due to increasing capabilities, but I think this is very complicated and worth thinking about in greater depth.
Do you think it’s possible that your intuition that alignment research is counterproductive comes from comparing the plausibility of the two outcomes:
Increasing alignment research causes people to solve AI alignment, and humanity survives.
Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity.
And you decided that outcome 2 felt more likely?
Well, that’s the wrong comparison to make.
The right comparison should be:
Increasing alignment research causes people to improve AI alignment, and humanity survives in a world where we otherwise wouldn’t survive.
Increasing alignment research led to an improvement in AI capabilities, allowing AI labs to build a superintelligence which then kills humanity in a world where we otherwise would survive.
In this case, I think even you would agree that P(1) > P(2).
P(2) is very unlikely because if increasing alignment research really would lead to such a superintelligence, and it really would kill humanity… then let’s be honest, we’re probably doomed in that case anyways, even without increasing alignment research.
If that really was the case, the only surviving civilizations would have had different histories, or different geographies (e.g. only a single continent with enough space for a single country), leading to a single government which could actually enforce an AI pause.
We’re unlikely to live in a world so pessimistic that alignment research is counterproductive, yet so optimistic that we could survive without that alignment research.
I believe we’re probably doomed anyways.
Sorry to disappoint you, but I do not agree.
Although I don’t consider it quite impossible that we will figure out alignment, most of my hope for our survival is in other things, such as a group taking over the world and then using their power to ban AI research. (Note that that is in direct contradiction to your final sentence.) So for example, if Putin or Xi were dictator of the world, my guess is that there is a good chance he would choose to ban all AI research. Why? It has unpredictable consequences. We Westerners (particularly Americans) are comfortable with drastic change, even if that change has drastic unpredictable effects on society; non-Westerners are much more skeptical: there have been too many invasions, revolutions and peasant rebellions that have killed millions in their countries. I tend to think that the main reason Xi supports China’s AI industry is to prevent the US and the West from superseding China, and if that consideration were removed (because, for example, he had gained dictatorial control over the whole world) he’d choose to just shut it down (and he wouldn’t feel the need to have a very strong argument for shutting it down the way Western decision-makers would: non-Western leaders shut important things down all the time, or at least they would if the governments they led had the funding and the administrative capacity to do so).
Of course Xi’s acquiring dictatorial control over the whole world is extremely unlikely, but the magnitude of the technological and societal changes that are coming will tend to present opportunities for certain coalitions to gain and to keep enough power to shut AI research down worldwide. (Having power in all countries hosting leading-edge fabs is probably enough.) I don’t think this ruling coalition necessarily needs to believe that AI presents a potent risk of human extinction in order to choose to shut it down.
I am aware that some reading this will react to “some coalition manages to gain power over the whole world” even more negatively than to “AI research causes the extinction of the entire human race”. I guess my response is that I needed an example of a process that could save us and that would feel plausible—i.e., something that might actually happen. I hasten to add that there might be other processes that save us that don’t elicit such a negative reaction—including processes the nature of which we cannot even currently imagine.
I’m very skeptical of any intervention that reduces the amount of time we have left in the hopes that this AI juggernaut is not really as potent a threat to us as it currently appears. I was much, much less skeptical of alignment research 20 years ago, but since then a research organization has been exploring the solution space, and the leader of that organization (Nate Soares) and its most senior researcher (Eliezer) are reporting that the alignment project is almost completely hopeless. Yes, this organization (MIRI) is kind of small, but it has been funded well enough to keep about a dozen top-notch researchers on the payroll and it has been competently led. Also, for research efforts like this, how many years the team had to work on the problem is more important than the size of the team, and 22 years is a pretty long time to end up with almost no progress other than some initial insights (around the orthogonality thesis, the fragility of value, convergent instrumental values, and CEV) if the problem were in fact solvable by the current generation of human beings.
OK, if I’m being fair and balanced, then I have to concede that it was probably only in 2006 (when Eliezer figured out how to write a long intellectually-dense blog post every day) or even only in 2008 (when Anna Salamon joined the organization—she was very good at recruiting and had a lot of energy to travel and to meet people) that Eliezer’s research organization could start to pick and choose among a broad pool of very talented people, but still, between 2008 and now is 17 years, which again is a long time for a strong team to fail to make even a decent fraction of the progress humanity would seem to need to make on the alignment problem if in fact the alignment problem is solvable by spending more money on it. It does not appear to me to be the sort of problem that can be solved with 1 or 2 additional insights; it seems a lot more like the kind of problem where insight 1 is needed, but before any mere human can find insight 1, all the researchers need to have already known insight 2, and to have any hope of finding insight 2, they all would have had to know insight 3, and so on.
I don’t agree that the probability of alignment research succeeding is that low. 17 years or 22 years of trying and failing is strong evidence against it being easy, but doesn’t prove that it is so hard that increasing alignment research is useless.
People worked on capabilities for decades, and never got anywhere until recently, when the hardware caught up, and it was discovered that scaling works unexpectedly well.
There is a chance that alignment research now might be more useful than alignment research earlier, though there is uncertainty in everything.
We should have uncertainty in the Ten Levels of AI Alignment Difficulty.
The comparison
It’s unlikely that 22 years of alignment research is insufficient but 23 years of alignment research is sufficient.
But what’s even more unlikely is the chance that $200 billion on capabilities research plus $0.1 billion on alignment research is survivable, while $210 billion on capabilities research plus $1 billion on alignment research is deadly.
In the same way adding a little alignment research is unlikely to turn failure into success, adding a little capabilities research is unlikely to turn success into failure.
It’s also unlikely that alignment effort is even deadlier than capabilities effort dollar for dollar. That would mean reallocating alignment effort into capabilities effort paradoxically slows down capabilities and saves everyone.
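To make the arithmetic behind this comparison explicit, here is a minimal back-of-envelope sketch in Python. The dollar amounts are the illustrative figures used in this thread, and the implied additions (+$10B capabilities, +$0.9B alignment) come from the $210B/$1B scenario above; the “relative spending” framing is an assumption of the argument, not an established fact.

```python
# Toy back-of-envelope sketch of the funding comparison above.
# All figures are the illustrative numbers from this thread, not real data.

capabilities_spend = 200e9   # ~$200B/year on capabilities (figure cited above)
alignment_spend    = 0.1e9   # ~$0.1B/year on alignment (figure cited above)

# Hypothetical lab adding a comparable absolute effort to both sides,
# matching the $210B / $1B scenario above:
added_capabilities = 10e9
added_alignment    = 0.9e9

rel_capabilities_increase = added_capabilities / capabilities_spend   # +5%
rel_alignment_increase    = added_alignment / alignment_spend         # +900%

print(f"Capabilities speedup: +{rel_capabilities_increase:.0%}")
print(f"Alignment speedup:    +{rel_alignment_increase:.0%}")
# The claim above is that the relative boost to the much smaller field dwarfs
# the relative boost to the larger one; whether relative spending is the right
# metric at all is exactly what the rest of the thread disputes.
```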
Even if you are right
Even if you are right that delaying AI capabilities is all that matters, Anthropic still might be a good thing.
Even if Anthropic disappeared, or never existed in the first place, the AI investors will continue to pay money for research, and the AI researchers will continue to do research for money. Anthropic was just the middleman.
If Anthropic never existed, the middlemen would consist of only OpenAI, DeepMind, Meta AI, and other labs. These labs would not only act as middlemen but also lobby against regulation far more aggressively than Anthropic, and might discredit the entire “AI Notkilleveryoneism” movement.
To continue existing as one of these middlemen, you cannot simply stop paying the AI researchers for capabilities research; otherwise the AI investors and AI customers will stop paying you in turn. You cannot stem the flow, you can only decide how much goes through you.
It’s the old capitalist dilemma of “doing evil or getting out-competed by those who do.”
For their part, Anthropic redirected some of that flow to alignment research, and took the small amount of precautions which they could afford to take. They were also less willing to publish capabilities research than other labs. That may be the best one can hope to accomplish against this unstoppable flow from the AI investors to AI researchers.
The small amount of precautions which Anthropic did take may have already cost them their first-mover advantage. Had Anthropic raced ahead before OpenAI released ChatGPT, Anthropic may have stolen the limelight, gotten the early customers and investors, and been bigger than OpenAI.
This assumes that alignment success is the most likely avenue to safety for humankind, whereas like I said, I consider other avenues more likely. Actually there needs to be a qualifier on that: I consider other avenues more likely than the alignment project’s succeeding while the current generation of AI researchers remain free to push capabilities: if the AI capabilities juggernaut could be stopped for 150 years, giving the human population time to get smarter and wiser, then alignment is likely (say p = .7) to succeed in my estimation. I am informed by Eliezer in his latest interview that such a success would probably use some technology other than deep learning to create the AI’s capabilities; i.e., deep learning is particularly hard to align.
Central to my thinking is my belief that alignment is just a significantly harder problem than the problem of creating an AI capable of killing us all. Does any of the reasoning you do in your section “the comparison” change if you started believing that alignment is much, much harder than creating a superhuman (unaligned) AI?
It will probably come as no great surprise that I am unmoved by the arguments I have seen (including your argument) that Anthropic is so much better than OpenAI that it helps the global situation for me to support Anthropic. (If it were up to me, and I had to decide now, with no way to delegate the decision and no time to gather more information, both would be shut down today.) But I’m not very certain and would pay attention to future arguments for supporting Anthropic or some other lab.
Thanks for engaging with my comments.
Thank you, I’ve always been curious about this point of view because a lot of people have a similar view to yours.
I do think that alignment success is the most likely avenue, but my argument doesn’t require this assumption.
Your view isn’t just that “alternative paths are more likely to succeed than alignment,” but that “alternative paths are so much more likely to succeed than alignment, that the marginal capabilities increase caused by alignment research (or at least Anthropic), makes them unworthwhile.”
To believe that alignment is that hopeless, there should be stronger proof than “we tried it for 22 years, and the prior probability of the threshold being between 22 years and 23 years is low.” That argument can easily be turned around to argue why more alignment research is equally unlikely to cause harm (and why Anthropic is unlikely to cause harm). I also think multiplying funding can multiply progress (e.g. 4x funding ≈ 2x duration).
If you really want a singleton controlling the whole world (which I don’t agree with), your most plausible path would be for most people to see AI risk as a “desperate” problem, and for governments under desperation to agree on a worldwide military which swears to preserve civilian power structures within each country.[1]
Otherwise, the fact that no country took over the world during the last centuries strongly suggests that no country will in the next few years, and this feels more solid than your argument that “no one figured out alignment in the last 22 years, so no one will in the next few years.”
Out of curiosity, would you agree with this being the most plausible path, even if you disagree with the rest of my argument?
The most plausible story I can imagine quickly right now is the US and China fight a war and the US wins and uses some of the political capital from that win to slow down the AI project, perhaps through control over the world’s leading-edge semiconductor fabs plus pressuring Beijing to ban teaching and publishing about deep learning (to go with a ban on the same things in the West). I believe that basically all the leading-edge fabs in existence or that will be built in the next 10 years are in the countries the US has a lot of influence over or in China. Another story: the technology for “measuring loyalty in humans” gets really good fast, giving the first group to adopt the technology so great an advantage that over a few years the group gets control over the territories where all the world’s leading-edge fabs and most of the trained AI researchers are.
I want to remind people of the context of this conversation: I’m trying to persuade people to refrain from actions that on expectation make human extinction arrive a little quicker because most of our (sadly slim) hope for survival IMHO flows from possibilities other than our solving (super-)alignment in time.
I would go one step further and argue that you don’t need to take over territory to shut down the semiconductor supply chain: if enough large countries believed AI risk was a desperate problem, they could negotiate a shutdown of the supply chain.
Shutting down the supply chain (and thus all leading-edge semiconductor fabs) could slow the AI project by a long time, but probably not “150 years” since the uncooperative countries will eventually build their own supply chain and fabs.
The ruling coalition can disincentivize the development of a semiconductor supply chain outside the territories it controls by selling, worldwide, semiconductors that use “verified boot” technology to make it really hard to run AI workloads on them, similar to how it is really hard even for the best jailbreakers to jailbreak a modern iPhone.
That’s a good idea! Even today it may be useful for export controls (depending on how reliable it can be made).
The most powerful chips might be banned from export, and have “verified boot” technology inside in case they are smuggled out.
The second most powerful chips might be only exported to trusted countries, and also have this verified boot technology in case these trusted countries end up selling them to less trusted countries who sell them yet again.
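For readers unfamiliar with the idea, here is a toy sketch of the core of “verified boot”: a vendor key burned into the chip is used to refuse unsigned firmware, and the signed firmware is what would enforce any workload policy. Real secure-boot chains involve hardware roots of trust, measured boot, and remote attestation; the code below is only a conceptual illustration, not a description of any actual chip or proposal.

```python
# Conceptual sketch of verified boot: refuse firmware the vendor has not signed.
# This is an illustration of the idea only, not how any real chip works.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# In reality the vendor's public key would be fused into silicon at manufacture time.
vendor_key = Ed25519PrivateKey.generate()
BURNED_IN_PUBLIC_KEY = vendor_key.public_key()

def boot(firmware_image: bytes, signature: bytes) -> bool:
    """Boot only firmware images carrying a valid vendor signature."""
    try:
        BURNED_IN_PUBLIC_KEY.verify(signature, firmware_image)
    except InvalidSignature:
        return False  # unsigned or modified firmware: refuse to boot
    return True       # signed firmware is what would enforce workload limits

approved = b"firmware that enforces export-policy limits on AI training workloads"
good_sig = vendor_key.sign(approved)
print(boot(approved, good_sig))                                   # True
print(boot(b"patched firmware with limits removed", good_sig))    # False
```

The practical question, of course, is how tamper-resistant such a scheme can be made against a well-resourced state, which is why the comparison to iPhone jailbreaking is only suggestive.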
If I believed that, then maybe I’d believe (like you seem to do) that there is no strong reason to believe that the alignment project cannot be finished successfully before the capabilities project creates an unaligned super-human AI. I’m not saying scaling and hardware improvement have not been important: I’m saying they were not sufficient: algorithmic improvements were quite necessary for the field to arrive at anything like ChatGPT, and at least as early as 2006, there were algorithmic improvements that almost everyone in the machine-learning field recognized as breakthroughs or important insights. (Someone more knowledgeable about the topic might be able to push the date back into the 1990s or earlier.)
After the publication 19 years ago by Hinton et al of “A Fast Learning Algorithm for Deep Belief Nets”, basically all AI researchers recognized it as a breakthrough. Building on it was AlexNet in 2012, again recognized as an important breakthrough by essentially everyone in the field (and if some people missed it, then certainly generative adversarial networks, ResNets and AlphaGo convinced them). AlexNet was one of the first prominent deep models trained on GPUs, a technique essential for the major breakthrough in 2017 reported in the paper “Attention is all you need”.
In contrast, we’ve seen nothing yet in the field of alignment that is as unambiguously a breakthrough as the 2006 paper by Hinton et al or 2012’s AlexNet or (emphatically) the 2017 paper “Attention is all you need”. In fact I suspect that some researchers could tell that the attention mechanism reported by Bahdanau et al in 2015 or the Seq2Seq models reported on by Sutskever et al in 2014 were evidence that deep-learning language models were making solid progress and that a blockbuster insight like “attention is all you need” was probably only a few years away.
The reason I believe it is very unlikely for the alignment research project to succeed before AI kills us all is that in machine learning, or the deep-learning subfield of machine learning, something recognized by essentially everyone in the field as a minor or major breakthrough has occurred every few years. Many of these breakthroughs rely on earlier breakthroughs (i.e., it is very unlikely for the successive breakthrough to have occurred if the earlier breakthrough had not been disseminated to the community of researchers). During this time, despite very talented people working on it, there have been zero results in alignment research that the entire field of alignment researchers would consider a breakthrough. That does not mean it is impossible for the alignment project to be finished in time, but it does IMO make it critical for the alignment project to be prosecuted in such a way that it does not inadvertently assist the capabilities project.
Yes, much more money has been spent on capability research over the last 20 years than on alignment research, but money doesn’t help all that much to speed up research in which, to have any hope of solving the problem, the researchers need insight X or X2; to have any hope of arriving at insight X, they need insights Y and Y2; and to have much hope at all of arriving at Y, they need insight Z.
Even if building intelligence requires solving many many problems, preventing that intelligence from killing you may just require solving a single very hard problem. We may go from having no idea to having a very good idea.
I don’t know. My view is that we can’t be sure of these things.
There are certain kinds of things that it’s essentially impossible for any institution to effectively care about.
What is this referring to?
People representing Anthropic argued against government-required RSPs. I don’t think I can share the details of the specific room where that happened, because it will be clear who I know this from.
Ask Jack Clark whether that happened or not.
Anthropic people have also said approximately this publicly: that it’s too soon to make the rules, since we’d end up misspecifying them due to ignorance of tomorrow’s models.
There’s a big difference between regulation which says roughly “you must have something like an RSP”, and regulation which says “you must follow these specific RSP-like requirements”, and I think Mikhail is talking about the latter.
I personally think the former is a good idea, and thus supported SB-1047 along with many other lab employees. It’s also pretty clear to me that locking in circa-2023 thinking about RSPs would have been a serious mistake, and so I (along with many others) am generally against very specific regulations because we expect they would on net increase catastrophic risk.
When do you think would be a good time to lock in regulation? I personally doubt RSP-style regulation would even help, but the notion that now is too soon or risks locking in early sketches strikes me as in some tension with e.g. Anthropic trying to automate AI research ASAP, Dario expecting ASL-4 systems between 2025 (the current year!) and 2028, etc.
Here I am on record supporting SB-1047, along with many of my colleagues. I will continue to support specific proposed regulations if I think they would help, and oppose them if I think they would be harmful; asking “when” independent of “what” doesn’t make much sense to me and doesn’t seem to follow from anything I’ve said.
My claim is not “this is a bad time”, but rather “given the current state of the art, I tend to support framework/liability/etc regulations, and tend to oppose more-specific/exact-evals/etc regulations”. Obviously if the state of the art advanced enough that I thought the latter would be better for overall safety, I’d support them, and I’m glad that people are working on that.
AFAIK Anthropic has not unequivocally supported the idea of “you must have something like an RSP” or even SB-1047 despite many employees, indeed, doing so.
To quote from Anthropic’s letter to Governor Newsom,
“we believe its benefits likely outweigh its costs” amounts to “it was a bad bill and now it’s likely net-positive”, which is not exactly unequivocal support. Compare that even to the language in calltolead.org.
Edit: AFAIK Anthropic lobbied against SSP-like requirements in private.
My guess is it’s referring to Anthropic’s position on SB 1047, or Dario’s and Jack Clark’s statements that it’s too early for strong regulation, or how Anthropic’s policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).
SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.
Our worldviews do not match, and I fail to see how yours makes sense. Even when I relax my predictions about the future to take in a wider set of possible paths… I still don’t get it.
AI is here. AGI is coming whether you like it or not. ASI will probably doom us.
Anthropic, as an org, seems to believe that there is a threshold of power beyond which creating an AGI more powerful than that would kill us all. OpenAI may believe this also, in part, but it seems like their estimate of where that threshold lies is further away than mine. Thus, I think there is a good chance they will get us all killed. There is substantial uncertainty and risk around these predictions.
Now, consider that, before AGI becomes so powerful that utilizing it for practical purposes becomes suicide, there is a regime where the AI product gives its wielder substantial power. We are currently in that regime. The further AI advances, the more power it grants.
Anthropic might get us all killed. OpenAI is likely to get us all killed. If you trust the employees of Anthropic not to want to be killed by OpenAI… then you should realize that supporting them while hindering OpenAI is at least potentially a good bet.
Then we must consider probabilities, expected values, etc. Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
It seems to me that other possibilities exist, besides “has model with numbers” or “confused.” For example, that there are relevant ethical considerations here which are hard to crisply, quantitatively operationalize!
One such consideration which feels especially salient to me is the heuristic that before doing things, one should ideally try to imagine how people would react, upon learning what you did. In this case the action in question involves creating new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth, so my guess is that the reaction would entail things like e.g. literal worldwide riots. If so, this strikes me as the sort of consideration one should generally weight more highly than their idiosyncratic utilitarian BOTEC.
Does your model predict literal worldwide riots against the creators of nuclear weapons? They posed a single-digit risk of killing everyone on Earth (total, not yearly).
It would be interesting to live in a world where people reacted with scale sensitivity to extinction risks, but that’s not this world.
Nuclear weapons have different game theory: if your adversary has one, you want to have one too, so as not to be wiped out; once both of you have nukes, you don’t want to use them.
Also, people were not aware of real close calls until much later.
With AI, there are economic incentives to develop it further than other labs, but as a result, you risk everyone’s lives for money and also create a race to the bottom where everyone’s lives will be lost.
I think you (or @Adam Scholl) need to argue why people won’t be angry at you if you developed nuclear weapons, in a way which doesn’t sound like “yes, what I built could have killed you, but it has an even higher chance of saving you!”
Otherwise, it’s hard to criticize Anthropic for working on AI capabilities without considering whether their work is a net positive. It’s hard to dismiss the net positive arguments as “idiosyncratic utilitarian BOTEC,” when you accept “net positive” arguments regarding nuclear weapons.
Allegedly, people at Anthropic have compared themselves to Robert Oppenheimer. Maybe they know that one could argue they have blood on their hands, the same way one can argue that about Oppenheimer. But people aren’t “rioting” against Oppenheimer.
I feel it’s more useful to debate whether it is a net positive, since that at least has a small chance of convincing Anthropic or their employees.
My argument isn’t “nuclear weapons have a higher chance of saving you than killing you”. People didn’t know about Oppenheimer when rioting about him could help. And they didn’t watch The Day After until decades later. Nuclear weapons were built to not be used.
With AI, companies don’t build nukes to not use them; they build larger and larger weapons because if your latest nuclear explosion is the largest so far, the universe awards you with gold. The first explosion past some unknown threshold will ignite the atmosphere and kill everyone, but some hope that it’ll instead just award them with infinite gold.
Anthropic could’ve been a force of good. It’s very easy, really: lobby for regulation instead of against it so that no one uses the kind of nukes that might kill everyone.
In a world where Anthropic actually tries to be net-positive, they don’t lobby against regulation and instead try to increase the chance of a moratorium on generally smarter-than-human AI systems until alignment is solved.
We’re not in that world, so I don’t think it makes as much sense to talk about Anthropic’s chances of aligning ASI on first try.
(If regulation solves the problem, it doesn’t matter how much it damaged your business interests (which maybe reduced how much alignment research you were able to do). If you really care first and foremost about getting to aligned AGI, then regulation doesn’t make the problem worse. If you’re lobbying against it, you really need to have a better justification than completely unrelated “if I get to the nuclear banana first, we’re more likely to survive”.)
Hi,
I’ve just read this post, and the arguments Anthropic made about how the US needs to be ahead of China are disturbing.
I hadn’t really caught up on this news, and I think I know where the anti-Anthropic sentiment is coming from now.
I do think that Anthropic only made those arguments in the context of GPU export controls, and trying to convince the Trump administration to do export controls if nothing else. It’s still very concerning, and could undermine their ability to argue for strong regulation in the future.
That said, I don’t agree with the nuclear weapon explanation.
Suppose Alice and Bob were each building a bomb. Alice’s bomb has a 10% chance of exploding and killing everyone, and a 90% chance of exploding into rainbows and lollipops and curing cancer. Bob’s bomb has a 10% chance of exploding and killing everyone, and a 90% chance of “never being used” and having a bunch of good effects via “game theory.”
I think people with ordinary moral views will not be very angry at Alice, but forgive Bob because “Bob’s bomb was built to not be used.”
(Dario’s post did not impact the sentiment of my shortform post.)
I don’t believe the nuclear bomb was truly built to not be used from the point of view of the US gov. I think that was just a lie to manipulate scientists who might otherwise have been unwilling to help.
I don’t think any of the AI builders are anywhere close to “building AI not to be used”. This seems even more clear than with nuclear, since AI has clear beneficial peacetime economically valuable uses.
Regulation does make things worse if you believe the regulation will fail to work as intended for one reason or another. For example, I have argued that putting compute limits on training runs (temporarily or permanently) would hasten progress to AGI by focusing research efforts on efficiency and on exploring algorithmic improvements.
It has been pretty clearly announced to the world by various tech leaders that they are explicitly spending billions of dollars to produce “new minds vastly smarter than any person, which pose double-digit risk of killing everyone on Earth”. This pronouncement has not yet incited riots. I feel like discussing whether Anthropic should be on the riot-target-list is a conversation that should happen after the OpenAI/Microsoft, DeepMind/Google, and Chinese datacenters have been burnt to the ground.
Once those datacenters have been reduced to rubble, and the chip fabs also, then you can ask things like, “Now, with the pressure to race gone, will Anthropic proceed in a sufficiently safe way? Should we allow them to continue to exist?” I think that, at this point, one might very well decide that the company should continue to exist with some minimal amount of compute, while the majority of the compute is destroyed. I’m not sure it makes sense to have this conversation while OpenAI and DeepMind remain operational.
That’s a very good heuristic. I bet even Anthropic agrees with it. Anthropic did not release their newer models until OpenAI released ChatGPT and the race had already started.
That’s not a small sacrifice. Maybe if they released it sooner, they would be bigger than OpenAI right now due to the first mover advantage.
I believe they want the best for humanity, but they are in a no-win situation, and it’s a very tough choice what they should do. If they stop trying to compete, the other AI labs will build AGI just as fast, and they will lose all their funds. If they compete, they can make things better.
AI safety spending is only $0.1 billion while AI capabilities spending is $200 billion. A company which adds a comparable amount of effort on both AI alignment and AI capabilities should speed up the former more than the latter.
Even if they don’t support all the regulations you believe in, they’re the big AI company supporting relatively much more regulation than all the others.
I don’t know, I may be wrong. Sadly it is so very hard to figure out what’s good or bad for humanity in this uncertain time.
I don’t think that most people, upon learning that Anthropic’s justification was “other companies were already putting everyone’s lives at risk, so our relative contribution to the omnicide was low” would then want to abstain from rioting. Common ethical intuitions are often more deontological than that, more like “it’s not okay to risk extinction, period.” That Anthropic aims to reduce the risk of omnicide on the margin is not, I suspect, the point people would focus on if they truly grokked the stakes; I think they’d overwhelmingly focus on the threat to their lives that all AGI companies (including Anthropic) are imposing.
Regarding common ethical intuitions, I think people in the post singularity world (or afterlife, for the sake of argument) will be far more forgiving of Anthropic. They will understand, even if Anthropic (and people like me) turned out wrong, and actually were a net negative for humanity.
Many ordinary people (maybe most) would have done the same thing in their shoes.
Ordinary people do not follow the utilitarianism that the awkward people here follow. Ordinary people also do not follow deontology or anything that’s the opposite of utilitarianism. Ordinary people just follow their direct moral feelings. If Anthropic was honestly trying to make the future better, they won’t feel that outraged at its “consequentialism.” They may be outraged at perceived incompetence, but Anthropic definitely won’t be the only one accused of incompetence.
In your mind, is there a difference between being killed by AI developed by OpenAI and by AI developed by Anthropic? What positive difference does it make, if Anthropic develops a system that kills everyone a bit earlier than OpenAI would develop such a system? Why do you call it a good bet?
Nope.
You’re right that the local incentives are not great: having a more powerful model is hugely economically beneficial, unless it kills everyone.
But if 8 billion humans knew what many LessWrong users know, OpenAI, Anthropic, DeepMind, and others could not develop what they want to develop, and AGI wouldn’t come for a while.
Off the top of my head, it actually likely could be sufficient to either (1) inform some fairly small subset of 8 billion people of what the situation is or (2) convince that subset that the situation as we know it is likely enough to be the case that some measures to figure out the risks and not be killed by AI in the meantime are justified. It’s also helpful to (3) suggest/introduce/support policies that change the incentives to race or increase the chance of (1) or (2).
A theory of change some have for Anthropic is that Anthropic might get in position to successfully do one of these two things.
My shortform post says that the real Anthropic is very different from the kind of imagined Anthropic that would attempt to do these things. Real Anthropic opposes these things.
Are there good models that support that Anthropic is a good bet? I’m genuinely curious.
I assume that naively, if any side had more of the burden of proof, it would be Anthropic. They have many more resources, and are the ones doing the highly-impactful (and potentially negative) work.
My impression was that there was very little probabilistic risk modeling here, but I’d love to be wrong.
I don’t feel free to share my model, unfortunately. Hopefully someone else will chime in. I agree with your point and that this is a good question!
I am not trying to say I am certain that Anthropic is going to be net positive, just that that’s the outcome I view as more probable.
I think it’s totally fine to think that Anthropic is a net positive. Personally, right now, I broadly also think it’s a net positive. I have friends on both sides of this.
I’d flag though that your previous comment suggested more to me than “this is just you giving your probability”
> Give me your model, with numbers, that shows supporting Anthropic to be a bad bet, or admit you are confused and that you don’t actually have good advice to give anyone.
I feel like there are much nicer ways to phrase that last bit. I suspect that this is much of the reason you got disagreement points.
Fair enough. I’m frustrated and worried, and should have phrased that more neutrally. I wanted to make stronger arguments for my point, and then partway through my comment realized I didn’t feel good about sharing my thoughts.
I think the best I can do is gesture at strategy games that involve private information and strategic deception like Diplomacy and Stratego and MtG and Poker, and say that in situations with high stakes and politics and hidden information, perhaps don’t take all moves made by all players at literally face value. Think a bit to yourself about what each player might have in their hands, what their incentives look like, what their private goals might be. Maybe someone whose mind is clearer on this could help lay out a set of alternative hypotheses which all fit the available public data?
The private data is, pretty consistently, Anthropic being very similar to OpenAI where it matters the most and failing to mention in private policy-related settings its publicly stated belief on the risk that smarter-than-human AI will kill everyone.
I wonder if this is due to
funding: the company needs money to perform research on safety alignment (x-risks, and assuming they do want to do this), and to get there they need to publish models so that they can 1) make profits from them, 2) attract more funding. A quick look at the funding sources shows Amazon, Google, some other ventures, and some other tech companies
empirical approach: they want to take an empirical approach to AI safety and would need some limited, capable models
But both points above are my own speculation.
The book is now a NYT bestseller: #7 in Combined Print & E-Book Nonfiction, #8 in Hardcover Nonfiction.
I want to thank everyone here who contributed to that. You’re an awesome community, and you’ve earned a huge amount of dignity points.
https://www.nytimes.com/books/best-sellers/combined-print-and-e-book-nonfiction/
Nobody at Anthropic can point to a credible technical plan for actually controlling a generally superhuman model. If it’s smarter than you, knows about its situation, and can reason about the people training it, this is a zero-shot regime.
The world, including Anthropic, is acting as if “surely, we’ll figure something out before anything catastrophic happens.”
That is unearned optimism. No other engineering field would accept “I hope we magically pass the hardest test on the first try, with the highest stakes” as an answer. Just imagine if flight or nuclear technology were deployed this way. Now add having no idea what parts the technology is made of. We’ve not developed fundamental science about how any of this works.
As much as I enjoy Claude, it’s ordinary professional ethics in any safety-critical domain: you shouldn’t keep shipping SOTA tech if your own colleagues, including the CEO, put double-digit chances on that tech causing human extinction.
You’re smart enough to know how deep the gap is between current safety methods and the problem ahead. Absent dramatic change, this story doesn’t end well.
In the next few years, the choices of a technical leader in this field could literally determine not just what the future looks like, but whether we have a future at all.
If you care about doing the right thing, now is the time to get more honest and serious than the prevailing groupthink wants you to be.
I think it’s accurate to say that most Anthropic employees are abhorrently reckless about risks from AI (though my guess is that this isn’t true of most people who are senior leadership or who work on Alignment Science, and I think that a bigger fraction of staff are thoughtful about these risks at Anthropic than other frontier AI companies). This is mostly because they’re tech people, who are generally pretty irresponsible. I agree that Anthropic sort of acts like “surely we’ll figure something out before anything catastrophic happens”, and this is pretty scary.
I don’t think that “AI will eventually pose grave risks that we currently don’t know how to avert, and it’s not obvious we’ll ever know how to avert them” immediately implies “it is repugnant to ship SOTA tech”, and I wish you spelled out that argument more.
I agree that it would be good if Anthropic staff (including those who identify as concerned about AI x-risk) were more honest and serious than the prevailing Anthropic groupthink wants them to be.
What if someone at Anthropic thinks P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30%? Then the obvious alternatives are to do their best to get governments / international agreements to make everyone pause or to make everyone’s AI development safer, but it’s not completely obvious that this is a better strategy because it might not be very tractable. Additionally, they might think these things are more tractable if Anthropic is on the frontier (e.g. because it does political advocacy, AI safety research, and deploys some safety measures in a way competitors might want to imitate to not look comparatively unsafe). And they might think these doom-reducing effects are bigger than the doom-increasing effects of speeding up the race.
You probably disagree with P(doom|some other company builds AGI) - P(doom|Anthropic builds AGI) and with the effectiveness of Anthropic advocacy/safety research/safety deployments, but I feel like this is a very different discussion from “obviously you should never build something that has a big chance of killing everyone”.
(I don’t think most people at Anthropic think like that, but I believe at least some of the most influential employees do.)
Also my understanding is that technology is often built this way during deadly races where at least one side believes that them building it faster is net good despite the risks (e.g. deciding to fire the first nuke despite thinking it might ignite the atmosphere, …).
If this is their belief, they should state it and advocate for the US government to prevent everyone in the world, including them, from building what has a double-digit chance of killing everyone. They’re not doing that.
P(doom|Anthropic builds AGI) is 15% and P(doom|some other company builds AGI) is 30% --> You need to factor in the probability that Anthropic is first, and that the other companies will not go on to create AGI once Anthropic has already created it; by default that is not the case.
I agree, the net impact is definitely not the difference between these numbers.
Also I meant something more like P(doom|Anthropic builds AGI first). I don’t think people are imagining that the first AI company to achieve AGI will have an AGI monopoly forever. Instead some think it may have a large impact on what this technology is first used for and what expectations/regulations are built around it.
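To make the weighting point concrete, here is a minimal sketch with placeholder numbers: the 15%/30% figures are the hypothetical ones quoted above, and the 25% “Anthropic is first” probability is invented purely for illustration; none of these are anyone’s actual estimates.

```python
# Toy expected-value sketch of the disagreement above. All numbers are
# hypothetical placeholders, not real estimates.

p_doom_if_anthropic_first = 0.15   # hypothetical figure from the comment above
p_doom_if_other_first     = 0.30   # hypothetical figure from the comment above
p_anthropic_first         = 0.25   # pure assumption for illustration

# Naive difference, ignoring who actually gets there first:
naive_delta = p_doom_if_other_first - p_doom_if_anthropic_first    # 0.15

# Weighted by the chance Anthropic is actually first (and ignoring any effect
# of Anthropic's existence on overall speed, advocacy, safety research, etc.):
weighted_delta = p_anthropic_first * naive_delta                   # ~0.04

print(f"naive reduction in P(doom):    {naive_delta:.2f}")
print(f"weighted reduction in P(doom): {weighted_delta:.2f}")
# The point in the thread: the net impact is smaller than the naive 15-point
# gap, and the acceleration / advocacy side effects are not captured here.
```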
It would be easier to argue with you if you proposed a specific alternative to the status quo and argued for it. Maybe “[stop] shipping SOTA tech” is your alternative. If so: surely you’re aware of the basic arguments for why Anthropic should make powerful models; maybe you should try to identify cruxes.
Separately from my other comment: It is not the case that the only appropriate thing to do when someone is going around killing your friends and your family and everyone you know is to “try to identify cruxes”.
It’s eminently reasonable for people to just try to stop whatever is happening, which includes calling for social censure, convincing others, and coordinating social action. It is not my job to convince Anthropic staff they are doing something wrong. Indeed, the economic incentives point extremely strongly towards Anthropic staff being the hardest to convince of true beliefs here. The standard you invoke here seems pretty crazy to me.
It is not clear to me that Anthropic “unilaterally stopping” will result in meaningfully better outcomes than the status quo, let alone that it would be anywhere near the best way for Anthropic to leverage its situation.
I do think there’s a Virtue of Silence problem here.
Like—I was a ML expert who, roughly ten years ago, decided to not advance capabilities and instead work on safety-related things, and when the returns to that seemed too dismal stopped doing that also. How much did my ‘unilateral stopping’ change things? It’s really hard to estimate the counterfactual of how much I would have actually shifted progress; on the capabilities front I had several ‘good ideas’ years early but maybe my execution would’ve sucked, or I would’ve been focused on my bad ideas instead. (Or maybe me being at the OpenAI lunch table and asking people good questions would have sped the company up by 2%, or w/e, independent of my direct work.)
How many people are there like me? Also not obvious, but probably not that many. (I would guess most of them ended up in the MIRI orbit and I know them, but maybe there are lurkers—one of my friends in SF works for generic tech companies but is highly suspicious of working for AI companies, for reasons roughly downstream of MIRI, and there might easily be hundreds of people in that boat. But maybe the AI companies would only actually have wanted to hire ten of them, and the others objecting to AI work didn’t actually matter.)
I think that just Anthropic, OpenAI, and DeepMind stopping would plausibly result in meaningfully better outcomes than the status quo. I still see no strong evidence that anyone outside these labs is actually pursuing AGI with anything like their level of effectiveness. I think it’s very plausible that everyone else is either LARPing (random LLM startups), or largely following their lead (DeepSeek/China), or pursuing dead ends (Meta’s LeCun), or some combination.
The o1 release is a good example. Yes, everyone and their grandmother was absent-mindedly thinking about RL-on-CoTs and tinkering with relevant experiments. But it took OpenAI deploying a flashy proof-of-concept for everyone to pour vast resources into this paradigm. In the counterfactual where the three major labs weren’t there, how long would it have taken the rest to get there?
I think it’s plausible that if only those three actors stopped, we’d get +5-10 years to the timelines just from that. Which I expect does meaningfully improve the outcomes, particularly in AI-2027-style short-timeline worlds.
So I think getting any one of them to individually stop would be pretty significant, actually (inasmuch as it’s a step towards “make all three stop”).
I think more than this, when you look at the labs you will often see that the breakthrough work was done by a small handful of people or a small team, whose direction was not popular before their success. If just those people had decided to retire to the tropics, and everyone else had stayed, I think that would have made a huge difference to the trajectory. (What would it have looked like if Alec Radford had decided not to pursue GPT? Maybe the idea was ‘obvious’ and someone else gets it a month later, but I don’t think so.)
I see no principle by which I should allow Anthropic to build existentially dangerous technology, but disallow other people from building it. I think the right choice is for no lab to build it. I am here not calling for particularly much censure of Anthropic compared to all labs, and my guess is we can agree that in aggregate building existentially dangerous AIs is bad and should face censure.
If you are killing me and my friends because you think it better that you do the killing than someone else, then actually I will still ask you to stop, because I draw a hard line around killing me and my friends. Naturally, I have a similar line around developing tech that will likely kill me and my friends.
I think this would fail Anthropic’s ideological Turing test. For example, they might make arguments like: by being a frontier lab, they can push for impactful regulation in a way they couldn’t if they weren’t; they can set better norms and demonstrate good safety practices that get adopted by others; or they can conduct better safety research that they could not do without access to frontier models. It’s totally reasonable to disagree with this, or argue that their actions so far (e.g., lukewarm support and initial opposition to SB 1047) show that they are not doing this, but I don’t think these arguments are, in principle, ridiculous.
Yeah, sorry, I think it’s just very tricky for me to pass Anthropic’s ITT, because to imitate Anthropic, I would need to be concurrently saying stuff like “by being a frontier lab, we can push for impactful regulation”, typing stuff like “this bill will impose multi-million dollar fines for minor, technical violations, representing a risk to smaller companies” about a NY bill with requirements only for $100m+ training runs that would not impose multi-million dollar fines for minor violations, and misleading a part of me about Dario’s role (he is Anthropic’s politics and policy lead and was a lot more involved in SB 1047 than many at Anthropic think).
It’s generally harder to pass ITT of an entity that lies to itself and others than to point out why it is incoherent and ridiculous.
In my mind, a good predictor of Anthropic’s actions is something in the direction of “a bunch of Sam Altmans stuck with potentially unaligned employees (who care about x-risk), going hard on trying to win the race”.
I disagree, but this doesn’t feel like a productive discussion, so I’ll leave things there
Do you have a source for Anthropic comments on the NY bill? I couldn’t find them and that one is news to me
A bill passed two chambers of New York State legislature. It incorporated a lot of feedback from this community. This bill’s author actually talked about it as a keynote speaker at an event organized by FAR at the end of May.
There’s no good theory of change for Anthropic compatible with them opposing and misrepresenting this bill. If you work at Anthropic on AI capabilities, you should stop.
From Jack Clark:
(Many such cases!)
Here’s what the bill’s author says in response:
I’m not saying that it’s implausible that the consequences might seem better. I’m stating it’s still morally wrong to race toward causing a likely extinction-level event, as that’s a pretty Schelling place for a deontological line against action.
Ah. In that case we just disagree about morality. I am strongly in favour of judging actions by their consequences, especially for incredibly high stakes actions like potential extinction level events. If an action decreases the probability of extinction I am very strongly in favour of people taking it.
I’m very open to arguments that the consequences would be worse, that this is the wrong decision theory, etc, but you don’t seem to be making those?
I too believe we should ultimately judge things based on their consequences. I believe that having deontological lines against certain actions is something that leads humans to make decisions with better consequences, partly because we are bounded agents that cannot well-compute the consequences of all of our actions.
For instance, I think you would agree that it would be wrong to kill someone in order to prevent more deaths, today here in the Western world. Like, if an assassin is going to kill two people, but says if you kill one then he won’t kill the other, if you kill that person you should still be prosecuted for murder. It is actually good to not cross these lines even if the local consequentialist argument seems to check out. I make the same sort of argument for being first in the race toward an extinction-level event. Building an extinction-machine is wrong, and arguing you’ll be slightly more likely to pull back first does not stop it from being something you should not do.
I think when you look back at a civilization that raced to the precipice and committed auto-genocide, and ask where the lines in the sand should’ve been drawn, the most natural one will be “building the extinction machine, and competing to be first to do so”. So it is wrong to cross this line, even for locally net positive tradeoffs.
I think this just takes it up one level of meta. We are arguing about the consequences of a ruleset. You are arguing that your ruleset has better consequences, while others disagree. And so you try to censure these people—this is your prerogative, but I don’t think this really gets you out of the regress of people disagreeing about what the best actions are.
Engaging with the object level of whether your proposed ruleset is a good one, I feel torn.
For your analogy of murder, I am very pro-not-murdering people, but I would argue this is convergent because it is broadly agreed upon by society. We all benefit from it being part of the social contract, and breaking that erodes the social contract in a way that harms all involved. If Anthropic unilaterally stopped trying to build AGI, I do not think this would significantly affect other labs, who would continue their work, so this feels disanalogous.
And it is reasonable in extreme conditions (e.g. when those prohibitions are violated by others acting against you) to abandon standard ethical prohibitions. For example, I think it was just for Allied soldiers to kill Nazi soldiers in World War II. I think having nuclear weapons is terrible and questionable but I generally don’t support countries unilaterally abandoning their nuclear weapons, leaving them vulnerable to other nuclear-armed nations. Obviously, there are many disanalogies, but my point is that you need to establish how much a given deontological prohibition makes sense in unusual situations, rather than just appealing to moral intuition.
I’m not here to defend Anthropic’s actions on the object level—they are not acting as I would in their situation, but they may have sound reasons. But they are not acting badly enough that I confidently assume bad faith. They have had positive effects, like their technical research and helping RSPs become established, though I disagree with some of their policy positions.
Another disanalogy between this and murder is that there are multiple AGI labs, and only one needs to cause human extinction. If Anthropic ceased to exist, other labs would continue this work. I’d argue that Anthropic is accelerating development by researching capabilities and intensifying commercial pressure, and this is bad. But when arguing about acceleration’s harm, we must weigh it against Anthropic’s potential good—this becomes more of an apples-to-apples comparison rather than a clear deontological violation.
Not a crux for either of us, but I disagree. When is the last time that someone shut down a multi-billion dollar profit arm of a company due to ethics, and especially due to the threat of extinction? If Anthropic announced they were ceasing development / shutting down because they did not want to cause an extinction-level event, this would have massive ramifications through society as people started to take this consequence more seriously, and many people would become more scared, including friends of employees at the other companies and more of the employees themselves. This would have massive positive effects.
This implies one should never draw lines in the sand about good/bad behavior if society has not reached consensus on it. In contrast, I think it is good to not do many behaviors even if your society has not yet reached consensus on it. For instance, if a government has not yet regulated that language-models shouldn’t encourage people to kill themselves, and then language models do and 1000s of ppl die (NB: this is a fictional example), this isn’t ethically fine just because it wasn’t illegal. I think we should act in ways that we believe will make sense as policies even before they have achieved consensus, and this is part of what makes someone engaged in ethics rather than in simply “doing what you are told”.
You bring up Nazism. I think that it was wrong to go along with Nazism even though the government endorsed it. Surely there are ethical lines against causing an extinction-level event even if your society has not come to a consensus on where those lines are yet. And even if we never achieve consensus, everyone should still attempt to figure out the answer and live by it, rather than give up on having such ethical lines.
Habryka wrote about how the bad-faith comment was a non sequitur in another thread. I will here say that the “I’m not here to defend Anthropic’s actions on the object level” doesn’t make sense to me. I am saying they should stop racing, and you are saying they should not, and we are exchanging arguments for this, currently coming down to the ethics of racing toward an extinction-level event and whether there are deontological lines against doing that. I agree that you are not attempting to endorse all the details of what they are doing beyond that, but I believe you are broadly defending their IMO key object-level action of doing multi-billion dollar AI capabilities research and building massive industry momentum.
It reads to me that you’re just talking around the point here. I said that people shouldn’t race toward extinction-level threats for deontological reasons, you said we should evaluate the direct consequences, I said deontological reasons are endorsed by a consequentialist framework so we should analyze it deontologically, and now you’re saying that I’m conceding the initial point that we should be doing the consequentialist analysis. No, I’m saying we should do a deontological analysis, and this is in conflict with you saying we should just judge based on the direct consequences that we know how to estimate.
You keep trying to engage me in this consequentialist analysis, and say that sometimes (e.g. during times of war) the deontological rules can have exceptions, but you have not argued for why this is an exception. If people around you in society start committing murder, would you then start murdering? If people around you started lying, would you then start lying? I don’t think so. Why then, if people around you are racing to an extinction-level event, does the obvious rule of “do not race toward an extinction-level event” get an exception? Other people doing things that are wrong (even if they get away with it!) doesn’t make those things right.
The point I was trying to make is that, if I understood you correctly, you were trying to appeal to common sense morality that deontological rules like this are good on consequentialist grounds. I was trying to give examples why I don’t think this immediately follows and you need to actually make object level arguments about this and engage with the counter arguments. If you want to argue for deontological rules, you need to justify why those rules hold in this situation.
I am not trying to defend the claim that I am highly confident that what Anthropic is doing is ethical and net good for the world, but I am trying to defend the claim that there are vaguely similar plans to Anthropic’s that I would predict are net good in expectation, e.g., becoming a prominent actor and then leveraging your influence to push for good norms and good regulations. Your arguments would also imply that plans like that should be deontologically prohibited and I disagree.
I don’t think this follows from naive moral intuition. A crucial disanalogy with murder is that if you don’t kill someone, the counterfactual is that the person is alive. While if you don’t race towards AGI, the counterfactual is that maybe someone else makes it and we die anyway. This means that we need to be engaging in discussion about the consequences of there being another actor pushing for this, the consequences of other actions this actor may take, and how this all nets out, which I don’t feel like you’re doing.
I expect AGI to be either the best or worst thing that has ever happened, and this means that important actions will typically be high variance, with major positive or negative consequences. Declining to engage in things with the potential for high negative consequences severely restricts your action space. And given that it’s plausible that there’s a terrible outcome even if we do nothing, I don’t think the act-omission distinction applies.
Thank you for clarifying, I think I understand now. I’m hearing you’re not arguing in defense of Anthropic’s specific plan but in defense of there being some part of the space of plans being good that involve racing to build something that has a (say) >20% chance of causing an extinction-level event, that Anthropic may or may not fall into.
This isn’t disanalogous. As I have already said in this thread, you are not allowed to murder someone even if someone else is planning to murder them. If you find out multiple parties are going to murder Bob, you are not now allowed to murder Bob in a way that is slightly less likely to be successful.
Crucially, it is not to be assumed that we will build AGI in the next 1-2 decades. If the countries of the world decided to ban training runs of a particular size, because we don’t want to take this sort of extinction-level risk, then it would not happen. Assuming this out of the hypothesis space will get you into bad ethical territory. Suppose a military general says “War is inevitable, the only question is how fast it’s over when it starts and how few deaths there are.” This general would never take responsibility for instigating a war. Similarly, if you assume with certainty that AGI will be developed (with its attendant risk) in the next few decades, you absolve yourself of all responsibility for being the one who does so.
I think you are failing to understand the concept of deontology by replacing “breaks deontological rules” with “highly negative consequences”. Deontology doesn’t say “you can tell a lie if it saves you from telling two lies later” or “lying is wrong unless you get a lot of money for it”. It says “don’t tell lies”. There are exceptional circumstances for all rules, but unless you’re in an exceptional circumstance, you treat them as rules, and don’t treat violations as integers to be traded against each other.
When the stakes get high it is not time to start lying, cheating, killing, or unilaterally betting the extinction of the human race. If it is for someone, then they simply can’t be trusted to follow these ethical principles when it matters.
Yes that is correct
I disagree. If a patient has a deadly illness then I think it is fine for a surgeon to perform a dangerous operation to try to save their life. I think the word murder is obfuscating things and suggest we instead talk in terms of “taking actions that may lead to death”, which I think is more analogous—hopefully we can agree Anthropic won’t intentionally cause human extinction. I think it is totally reasonable to take actions that net decrease someone’s probability of dying, while introducing some novel risks.
I think we’re talking past each other. I understood you as arguing “deontological rules against X will systematically lead to better consequences than trying to evaluate each situation carefully, because humans are fallible”. I am trying to argue that your proposed deontological rule does not obviously lead to better consequences as an absolute rule. Please correct me if I have misunderstood.
I am arguing that “things to do with human extinction from AI, when there’s already a meaningful likelihood” are not a domain where ethical prohibitions like “never do things that could lead to human extinction” are productive. For example, you help run LessWrong, which I’d argue has helped raise the salience of AI x-risk, which plausibly has accelerated timelines. I personally think this is outweighed by other effects, but that’s via reasoning about the consequences. Your actions and Anthropic’s feel more like a difference in scale than a difference in kind.
I am not arguing that AI x-risk is inevitable, in fact I’m arguing the opposite. AI x-risk is both plausible and not inevitable. Actions to reduce this seem very valuable. Actions that do this will often have side effects that increase risk in other ways. In my opinion, this is not sufficient cause to immediately rule them out.
Meanwhile, I would consider anyone pushing hard to make frontier AI to be highly reckless if they were the only one who could cause extinction, and they could unilaterally stop—this is a way to unilaterally bring risk to zero, which is better than any other action. But Anthropic has no such action available, and so I want them to take the actions that reduce risk as much as possible. And there are arguments for proceeding and arguments for stopping.
This is simplifying away key details.
If you go up to a person with a deadly illness and non-consensually do a dangerous surgery on them, this is wrong. If you kill them via this, their family has a right to sue you / prosecute you for murder. Once again, simply because some bad outcome is likely, you do not have ethical mandate to now go and cause it yourself. Deontology is typically about forbidding classes of action that on net make the world worse even when locally you have a good reason. Talking about “taking actions that lead to death” explicitly obfuscates the mechanism. I know you won’t endorse this once I point it out, but under this strictly-consequentialist framework “blogging on LessWrong about extinction-risk from AI” and “committing murder” are just two different “actions that lead to death” and neither can be thought of as having different deontological lines drawn. On the contrary, “don’t commit murder” and “don’t build a doomsday machine” are simple and natural deontological rules, whereas “don’t build a blogging platform with unusually high standards for truthseeking” is not.
I am not trying to argue for an especially novel deontological rule… “building a doomsday machine” is wrong. It’s a far greater sin than murder. I think you’d do better to think of the AI companies as more like competing political factions, each of whose base is very motivated toward committing a genocide against their neighbors. If your political faction commits a genocide, and you were merely a top-200 ranked official who didn’t particularly want a genocide, you still bear moral responsibility for it even though you only did paperwork and took meetings and maybe worked in a different department. Just because there are two political factions whose bases are uncomfortably attracted to the idea of committing genocide does not now make it ethically clear for you to make a third one that hungers for genocide but has wiser people in charge.
I am not advocating for some new interesting deontological rule. I am arguing that the obvious rule against building a doomsday machine applies here straightforwardly. Deontological violations don’t stop being bad just because other people are committing them. You cannot commit murder just because other people do, and you cannot build a doomsday machine just because other people are. You generally shouldn’t build doomsday machines even if you have a good reason. To argue against this you should show why deontological rules break down and then apply that to this case, but the doctor example you gave doesn’t show that, because by default you aren’t actually allowed to non-consensually perform risky surgeries on people even if it makes sense on the consequentialist calculus.
I continue to feel like we’re talking past each other, so let me start again. We both agree that causing human extinction is extremely bad. If I understand you correctly, you are arguing that it makes sense to follow deontological rules, even if there’s a really good reason breaking them seems locally beneficial, because on average, the decision theory that’s willing to do harmful things for complex reasons performs badly.
The goal of my various analogies was to point out that this is not actually a fully correct statement about common sense morality. Common sense morality has several exceptions for things like having someone’s consent to take on a risk, someone doing bad things to you, and innocent people being forced to do terrible things.
Given that exceptions exist for times when we believe the general policy is bad, I am arguing that there should be an additional exception stating that: if there is a realistic chance that a bad outcome happens anyway, and you believe you can reduce the probability of this bad outcome happening (even after accounting for cognitive biases, sources of overconfidence, etc.), it can be ethically permissible to take actions whose side effects increase the probability of the bad outcome in other ways.
When I analyse the reasons I broadly buy the deontological framework for “don’t commit murder”, they rest on some clear lines in the sand, such as maintaining a valuable social contract, and on the fact that if you do nothing, the outcomes will be broadly good. Further, society has never really had to deal with something as extreme as doomsday machines, which makes me hesitant to appeal to common sense morality at all. To me, the point where things break down with standard deontological reasoning is that this is just very outside the context where such priors were developed and have proven to be robust. I am not comfortable naively assuming they will generalize, and I think this is an incredibly high stakes situation where far and away the only thing I care about is taking the actions that will actually, in practice, lead to a lower probability of extinction.
Regarding your examples, I’m completely ethically comfortable with someone making a third political party in a country where the population has two groups who both strongly want to commit genocide against the other. I think there are many ways that such a third political party could reduce the probability of genocide, even if its political base ultimately wants bad outcomes.
Another example is nuclear weapons. From a certain perspective, holding nuclear weapons is highly unethical as it risks nuclear winter, whether from provoking someone else or from a false alarm on your side. While I’m strongly in favour of countries unilaterally switching to a no-first-use policy and pursuing mutual disarmament, I am not in favour of countries unilaterally disarming themselves. By my interpretation of your proposed ethical rules, this suggests countries should unilaterally disarm. Do you agree with that? If not, what’s disanalogous?
COVID-19 would be another example. Biology is not my area of expertise, but as I understand it, governments took actions that were probably good but risked some negative effects that could have made things worse. For example, widespread use of vaccines or antivirals, especially via the first-doses-first approach, plausibly made it more likely that resistant strains would spread, potentially affecting everyone else. In my opinion, these were clearly net-positive actions because the good done far outweighed the potential harm.
You could raise the objection that governments are democratically elected while Anthropic is not, but there were many other actors in these scenarios, like uranium miners, vaccine manufacturers, etc., who were also complicit.
Again, I’m purely defending the abstract point of “plans that could result in increased human extinction, even if by building the doomsday machine yourself, are not automatically ethically forbidden”. You’re welcome to critique Anthropic’s actual actions as much as you like. But you seem to be making a much more general claim.
Hm… I would say that one should follow deontological rules like “don’t lie” and “don’t steal” and so on because we fail to understand or predict the knock-on consequences. For instance, they can get the world into a much worse equilibrium of mutual liars/stealers in ways that are hard to predict, and being a good person can get the world into a much better equilibrium of mutually honorable people in ways that are hard to predict. And also because, if things do screw up in some hard-to-predict way, then when you look back, it will often be the easiest line in the sand to draw.
For instance, if SBF is wondering at what point he could have most reliably intervened on his whole company collapsing and ruining the reputation of things associated with it, he might talk about certain deals he made or strategic plays with Binance or the US Govt, for he is not a very ethical person; I would talk about not taking customer deposits.
If and when we get to an endgame where tons of AI systems are sociopathically lying and stealing money and ultimately killing the humans, I suspect people of SBF’s mindset will again talk about how the US and China should’ve played things, or how Musk should’ve played OpenAI, and how Amodei should’ve played things with DC. And I will talk about not racing to develop the unaligned AI systems in the first place.
I don’t really know why you think that this generalization can’t be made to things we’ve not seen before. So many things I experience haven’t been seen before in history. How many centuries have we had to develop ethical intuitions for how to write on web forums? There are still answers to these questions, and I can identify ethical and unethical behaviors, as can you (e.g. sockpuppeting, doxing, brigading, etc). There can be ethical lines in novel situations, not only historically common ones.
I am not sure what I would propose if I believed Nuclear Winter was a serious existential threat; it seems plausible to me that the ethical thing would be to unilaterally disarm. I suspect that at the very least if I were a country I would openly and aggressively campaign for mutual disarmament. (If any AI capabilities company openly campaigned for making it illegal to develop AI then I suspect I would consider that plausibly quite ethical).
To be clear, I think you’re defending a somewhat stronger claim. You write further up thread:
My current stance is that all actors currently in this space are doing things prohibited by basic deontology. This is not merely an unfortunate outcome, but a grave sin, for they are building doomsday machines, likely the greatest evil that we will ever experience in our history (regardless of whether they are successful). So I want to emphasize that the boundary here is not between “better and worse plans” but between “morally murky and morally evil plans”. Insofar as you commit a genocide or worse, history should remember your names as people of shame whose example we must take pains never to repeat. Insofar as you played with the idea, thought you could control it, and failed, then history should still think of you this way.
I believe we disagree over where the deontological lines are, given you are defending “vaguely similar plans to Anthropic’s”. Perhaps you could point to where you think they are? Presumably you think that a Larry Page style “this is just the next stage in evolution” indifference to human extinction AI-project would be morally wrong?
Here’s two lines that I think might cross into being acceptable [edit: or rather, “only morally murky”] from my perspective.
I think it might be appropriate to risk building a doomsday machine if, loudly and in-public, you told everyone “I AM BUILDING A POTENTIAL DOOMSDAY MACHINE, AND YOU SHOULD SHUT MY INDUSTRY DOWN. IF YOU DON’T THEN I WILL RIDE THIS WAVE AND ATTEMPT TO IMPROVE IT, BUT YOU REALLY SHOULD NOT LET ANYONE DO WHAT I AM DOING.” And was engaged in serious lobbying and advertising efforts to this effect.
I think it could possibly be acceptable to build an AI capabilities company if you committed to never releasing or developing any frontier capabilities AND if all employees also committed not to leave and release frontier capabilities elsewhere AND you were attempting to use this to differentially improve society’s epistemics and awareness of AI’s extinction-level threat. Though this might still cause too much economic investment into AI as an industry, I’m not sure.
I of course do not think any current project looks superficially like these.
Okay, after reading this it seems to me that we broadly do agree and are just arguing over price. I’m arguing that it is permissible to try to build a doomsday machine if there are really good reasons to believe it is net good for the probability of doomsday. It sounds like you agree, and give two examples of what “really good reasons” could be. I’m sure we disagree on the boundaries of where the really good reasons lie, but I’m trying to defend the point that you actually need to think about the consequences.
What am I missing? Is it that you think these two are really good reasons, not because of the impact on the consequences, but because of the attitude/framing involved?
I’m not Ben, but I think you don’t understand. I think explaining what you are doing loudly in public isn’t like “having a really good reason to believe it is net good”; it is instead more like asking for consent.
Like you are saying “please stop me by shutting down this industry” and if you don’t get shut down, that it is analogous to consent: you’ve informed society about what you’re doing and why and tried to ensure that if everyone else followed a similar sort of policy we’d be in a better position.
(Not claiming I agree with Ben’s perspective here, just trying to explain it as I understand it.)
Ah! Thanks a lot for the explanation, that makes way more sense, and is much weaker than what I thought Ben was arguing for. Yeah, this seems like a pretty reasonable position, especially “take actions where if everyone else took them we would be much better off”, and I am completely fine with holding Anthropic to that bar. I’m not fully sold on the asking-for-consent framing, but mostly for practical reasons—I think there are many ways that society is not able to act constantly, and the actions of governments on many issues are not a reflection of the true informed will of the people, but I expect there’s some reframe here that I would agree with.
I don’t think Ryan (or I) was intending to imply a measure of degree, so my guess is that unfortunately communication somehow still failed. Like, I don’t think Ryan (or Ben) is saying “it’s OK to do these things, you just have to ask for consent”. Ryan was just trying to point out a specific way in which things don’t bottom out in consequentialist analysis.
If you end up walking away with thinking that Ben believes “the key thing to get right for AI companies is to ask for consent before building the doomsday machine”, which I feel like is the only interpretation of what you could mean by “weaker” that I currently have, then I think that would be a pretty deep misunderstanding.
OK, I’m going to bow out of the conversation at this point, I’d guess further back and forth won’t be too productive. Thanks all!
There is something important to me in this conversation about not trusting one’s consequentialist analysis when evaluating proposals to violate deontological lines, and from my perspective you still haven’t managed to paraphrase this basic ethical idea or shown you’ve understood it, which I feel a little frustrated over. Ah well. I still have been glad of this opportunity to argue it through, and I feel grateful to Neel for that.
I actually agree with Neel that, in principle, an AI lab could race for AGI while acting responsibly and IMO not violating deontology.
That would mean: releasing models exactly at the level of their top competitor, immediately after the competitor’s release and a bit cheaper; talking to governments and lobbying for regulation; having an actually robust governance structure; and not doing things that increase the chance of everyone dying.
This doesn’t describe any of the existing labs, though.
I like a lot of your comment, but this feels like a total non-sequitur. Did anyone involved in this conversation say that Anthropic was acting under false pretenses? I don’t think anyone brought up concerns that rest on assumptions of bad faith (though to be clear, Anthropic employees have mostly told me I should assume something like bad faith from Anthropic as an institution, that people should try to hold it accountable the same way as any other AI lab, and that they should not straightforwardly trust statements Anthropic makes without associated commitments, so I do think I would assume bad faith, but it mostly just feels beside the point in this discussion).
Ah, sorry, I was thinking of Mikhail’s reply here, not anything you or Ben said in this conversation https://www.lesswrong.com/posts/BqwXYFtpetFxqkxip/mikhail-samin-s-shortform?commentId=w2doi6TzjB5HMMfmx
But yeah, I’m happy to leave that aside, I don’t think it’s cruxy
Makes sense! I hadn’t read that subthread, so was additionally confused.
Killing anyone who hasn’t done anything to lose deontological protection is wrong and clearly violates deontology.
As a Nazi soldier, you lose deontological protection.
There are many humans who are not even customers of any of the AI labs; they clearly have not lost deontological protection, and it’s not okay to risk killing them without their consent.
I disagree with this as a statement about war. I’m sure a bunch of Nazi soldiers were conscripted, did not particularly support the regime, and were participating out of fear. Similarly, malicious governments have conscripted innocent civilians and kept them in line through fear in many unjust wars throughout history. And even people who volunteered may have done so due to being brainwashed by extensive propaganda that led them to believe they were doing the right thing. The real world is messy, and strict deontological prohibitions break down in complex and high-stakes situations where inaction also has terrible consequences—I strongly disagree with a deontological rule that says countries are not allowed to defend themselves against innocent people forced to do terrible things.
My deontology prescribes not to join a Nazi army regardless of how much fear you’re in. It’s impossible to demand of people to be HPMOR!Hermione, but I think this standard works fine for real-world situations.
(While I do not wish death on any Nazi soldiers, regardless of their views or reasons for their actions. There’s a sense in which Nazi soldiers are innocent regardless of what they’ve done; none of them are grown up enough to be truly responsible for their actions. Every single death is very sad, and I’m not sure there has ever been even a single non-innocent human. At the same time, I think it’s okay to kill Nazi soldiers (unless they’re in the process of surrendering, etc.) or lie to them, and they don’t have deontological protection.)
You’re arguing it’s okay to defend yourself against innocent people forced to do terrible things. I agree with that, and my deontology agrees with that.
At the same time, killing everyone because otherwise someone else could’ve killed them with a higher chance = killing many people who aren’t ever going to contribute to any terrible things. I think, and my deontology thinks, that this is not okay. Random civilians are not innocent Nazi soldiers; they’re simply random innocent people. I ask of Anthropic to please stop working towards killing them.
And do you feel this way because you believe that the general policy of obeying such deontological prohibitions will on net result in better outcomes? Or because you think that even if there were good reason to believe that following a different policy would lead to better empirical outcomes, your ethics say that you should be deontologically opposed regardless?
I think the general policy of obeying such deontological rules leads to better outcomes; this is the reason for having deontology in the first place. (I agree with that old post on what to do when it feels like there’s a good reason to believe that following a different policy would lead to better outcomes.)
(Just as a datapoint, while largely agreeing with Ben here, I really don’t buy this concept of deontological protection of individuals. I think there are principles we have about when it’s OK to kill someone, but I don’t think the lines we have here route through individuals losing deontological protection.
Killing a mass murderer while he is waiting for trial is IMO worse than killing a civilian in collateral damage as part of taking out an active combatant, because it violates and messes with different processes, which don’t generally route through individuals “losing deontological protection” but instead are more sensitive to the context the individuals are in)
Locally: can you give an example of when it’s okay to kill someone who didn’t lose deontological protection, where you want to kill them because of the causal impact of their death?
To me the issue goes the other way. The idea of “losing deontological protection” suggests I’m allowed to ignore deontological rules when interacting with someone. But that seems obviously crazy to me. For instance I think there’s a deontological injunction against lying, but just because someone lies doesn’t now mean I’m allowed to kill them. It doesn’t even mean I’m allowed to lie to them. I think lying to them would still be about as wrong as it was before, not a free action I can take whenever I feel like it.
I mean, a very classical example that I’ve seen a few times in media is shooting a civilian who is about to walk into a minefield in which multiple other civilians or military members are located. It seems tragic but obviously the right choice to shoot them if they don’t heed your warning.
IDK, I also think it’s the right choice to pull the lever in the trolley problem, though the choice becomes less obvious the more it involves active killing as opposed to literally pulling a lever.
Sorry for replying to a dead thread but,
Murder implies an intent to kill someone.
Suppose I hire a hitman to kill you. But suppose there already are 3 hitmen trying to kill you, and I’m hoping my hitman would reach you first, and I know that my hitman has really bad aim. Once the first hitman reaches you and starts shooting, the other hitmen will freak out and run away, so I’m hoping you’re more likely to survive.
I have no other options for saving you, since the only contact I have is a hitman, and he’s very bad at English and doesn’t understand any instructions except trying to kill someone.
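As a rough sketch of the probabilities I have in mind (all numbers here are made up purely for illustration, not anything established above):

\[
P(\text{you die} \mid \text{I do nothing}) \approx 1, \qquad
P(\text{you die} \mid \text{my hitman reaches you first}) \approx P(\text{he hits}) \ll 1,
\]

so, under these assumptions, hiring the incompetent hitman lowers your overall chance of dying even though it introduces a new way for you to die.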
In this case, you can argue to the court that my plan to save you was foolish. But you cannot claim that my plan was a good idea consequentially yet deontologically unethical, since I didn’t intend to kill anyone.
Deontology only kicks in when your plan involves making someone die, or greatly increasing the chance someone dies.
I feel like this is actually a great analogy! The only difference is that if your hitman starts shooting and doesn’t kill anyone, you get infinite gold.
You know that in real life you go to police instead of hiring a hitman, right?
And I claim that it’s really not okay to hire a hitman who might lower the chance of the person ending up dead, especially when your brain is aware of the infinite gold part.
The good strategy for anyone in that situation to follow is to go to the police or go public and not hire any additional hitmen.
Yeah, it’s less deontologically bad than murder but I admit it’s still not completely okay.
PS: Part of the reason I used the unflattering hitman analogy is because I’m no longer as optimistic about Anthropic’s influence.
They routinely describe other problems (e.g. winning the race against China to defend democracy) with the same urgency as AI Notkilleveryoneism.
The only way to believe that AI Notkilleveryoneism is still Anthropic’s main purpose is to hope that:
They describe a ton of other problems with the same urgency as AI Notkilleveryoneism, but that is only due to political necessity.
At the same time, their apparent concern for AI Notkilleveryoneism is not just a political maneuver, but significantly more genuine.
This “hope” is plausible since the people in charge of Anthropic prefer to live, and consistently claimed to have high P(doom).
But it’s not certain, and there is circumstantial evidence suggesting this isn’t the case (e.g. their lobbying direction, and how they’re choosing people for their board of directors).
Maybe (~50%) this hope is just cope :(
I don’t agree that deontology is about intent. Deontology is about action. Deontology is about not hiring hitmen to kill someone even if you have a really good reason, and even if your intent is good. Deontology is substantially about Schelling lines of action, where everything gets hard to predict and goes bad after you commit the act.
I imagine that your incompetent hitman has only like a 50% chance of succeeding, whereas the others have ~100%, that seems deontologically wrong to me.
It seems plausible that what you mean to say by the hypothetical is that he has 0% chance.
I admit this is more confusing and I’m not fully resolved on this.
I notice I am confused about how you can get that epistemic state in real life.
I observe that society will still prosecute you for attempted murder if you buy a hitman off the dark web, even one with a clearly incompetent reputation of 0/10 kills or whatever.
I think society’s ability to police this line is not as fine grained as you’re imagining, and so you should not buy incompetent hitmen in order to not kill your friend, unless you’re willing to face the consequences.
To be honest I couldn’t resist writing the comment because I just wanted to share the silly thought :/
Now that I think about it, it’s much more complicated. Mikhail Samin is right that the personal incentive of reaching AGI first really complicates the good intentions. And while a lot of deontology is about intent, it’s hyperbole to say that deontology is just intent.
I think if your main intent is to save someone (and not personal gain), and your plan doesn’t require or seek anyone’s death, then it is deontologically much less bad than evil things like murder. But it may still be too bad for you to do, if you strongly lean towards deontology rather than consequentialism. Even if the court doesn’t find you guilty of first degree murder, it may still find you guilty of… some… things.
One might argue that the enormous scale (risking everyone’s death instead of only one person), makes it deontologically worse. But I think the balance does not shift in favor of deontology and against consequentialism as we increase the scale (it might even shift a little in favor of consequentialism?).
That’s fair, but the deontological argument doesn’t work for anyone building the extinction machine who is unconvinced by x-risk arguments, or deludes themselves that it’s not actually an extinction machine, or that extinction is extremely unlikely, or that the extinction machine is the only thing that can prevent extinction (as in all the alignment via AI proposals) etc. etc.
This is not the case for many at Anthropic.
True; in general, many people who behave poorly do not know that they do so.
Plugging that I wrote a post which quotes Anthropic execs at length describing their views on race to the top: https://open.substack.com/pub/stevenadler/p/dont-rely-on-a-race-to-the-top (and yes agreed with Neel’s summary)
I suppose if you think it’s less likely there will be killing involved if you’re the one holding the overheating gun than if someone else is holding it, that hard line probably goes away.
Just because someone else is going to kill me, doesn’t mean we don’t have an important societal norm against murder. You’re not allowed to kill old people just because they’ve only got a few years left, or kill people with terminal diseases.
I don’t see how that at all addresses the analogy I made.
I am not quite sure what an overheating gun refers to; I am guessing the idea is that it has some chance of going off without being fired.
Anyhow, if that’s accurate, it’s acceptable to decide to be the person holding an overheating gun, but it’s not acceptable to (for example) accept a contract to assassinate someone so that you get to have the overheating gun, or to promise to kill slightly fewer people with the gun than the next guy. Like, I understand consequentially fewer deaths happen, but our society has deontological lines against committing murder even given consequentialist arguments, which are good. You’re not allowed to commit murder even if you have a good reason.
I fully expect we’re doomed, but I don’t find this attitude persuasive. If you don’t want to be killed, you advocate for actions that hopefully result in you not being killed, whereas this action looks like it just results in you being killed by someone else. Like you’re facing a firing squad and pleading specifically with just one of the executioners.
I just want to clarify that Anthropic doesn’t have the social authority of a governmental firing squad to kill people.
For me the missing argument in this comment thread is the following: Has anyone spelled out the arguments for how it’s supposed to help us, even incrementally, if one AI lab (rather than all of them) drops out of the AI race? Suppose whichever AI lab is most receptive to social censure could actually be persuaded to drop out; don’t we then just end in an Evaporative Cooling of Group Beliefs situation where the remaining participants in the race are all the more intransigent?
An AI lab dropping out helps in two ways:
timelines get longer because the smart and accomplished AI capabilities engineers formerly employed by this lab are no longer working on pushing for SOTA models/no longer have access to tons of compute/are no longer developing new algorithms to improve performance even holding compute constant. So there is less aggregate brainpower, money, and compute dedicated to making AI more powerful, meaning the rate of AI capability increase is slowed. With longer timelines, there is more time for AI safety research to develop past its pre-paradigmatic stage, for outreach efforts to mainstream institutions to start paying dividends in terms of shifting public opinion at the highest echelons, for AI governance strategies to be employed by top international actors, and for moonshots like uploading or intelligence augmentation to become more realistic targets.
race dynamics become less problematic because there is one less competitor other top labs have to worry about, so they don’t need to pump out top models quite as quickly to remain relevant/retain tons of funding from investors/ensure they are the ones who personally end up with a ton of power when more capable AI is developed.
I believe these arguments, frequently employed by LW users and alignment researchers, are indeed valid. But I believe their impact will be quite small, or at the very least meaningfully smaller than what other people on this site likely envision.
And since I believe the evaporative cooling effects you’re mentioning are also real (and quite important), I indeed conclude pushing Anthropic to shut down is bad and counterproductive.
For that to be the case, we would need to suggest other tasks for these engineers rather than having them simply join another company. There are indeed very questionable technologies being shipped (for example, social media with automatic recommendation algorithms), but someone would have to connect the engineers to those tasks.
I agree with sunwillrise but I think there is an even stronger argument for why it would be good for an AI company to drop out of the race. It is a strong jolt that has a good chance of waking up the world to AI risk. It sends a clear message:
I don’t know exactly what effect that would have on public discourse, but the effect would be large.
Larger than the OpenAI board fiasco? I doubt it.
A board firing a CEO is a pretty normal thing to happen, and it was very unclear that the firing had anything to do with safety concerns because the board communicated so little.
A big company voluntarily shutting down because its product is too dangerous is (1) a much clearer message and (2) completely unprecedented, as far as I know.
In my ideal world, the company would be very explicit that they are shutting down specifically because they are worried about AGI killing everyone.
I make the case here for stopping based on deontological rather than consequentialist reasons.
My understanding was that LessWrong, specifically, was a place where bad arguments are (aspirationally) met with counterarguments, not with attempts to suppress them through coordinated social action. Is this no longer the case, even aspirationally?
I think it would be bad to suppress arguments! But I don’t see any arguments being suppressed here. Indeed, I see Zack as trying to create a standard where (for some reason) arguments about AI labs being reckless must be made directly to the people who are working at those labs, and other arguments should not be made, which seems weird to me. The OP seems to me like it’s making fine arguments.
I don’t think it was ever a requirement for participation on LessWrong to only ever engage in arguments that could change the minds of the specific people who you would like to do something else, as opposed to arguments that are generally compelling and might affect those people in indirect ways. It’s nice when it works out, but it really doesn’t seem like a tenet of LessWrong.
Ah, I had (incorrectly) interpreted “It’s eminently reasonable for people to just try to stop whatever is happening, which includes intention for social censure, convincing others, and coordinating social action” as being an alternative to engaging at all with the arguments of people who disagree with your positions here, rather than an alternative to having that standard in the outside world with people who are not operating under those norms.
Sure, censure among people who agree with you is a fine thing for a comment to do. I didn’t read Mikhail’s comment that way because it seemed to be asking Anthropic people to act differently (but without engaging with their views).
It’s OK to ask people to act differently without engaging with your views! If you are stabbing my friends and family I would like you to please stop, and I don’t really care about engaging with your views. The whole point of social censure is to ask people to act differently even if they disagree with you, that’s why we have civilization and laws and society.
I think Anthropic leadership should feel free to propose a plan to do something that is not “ship SOTA tech like every other lab”. In the absence of such a plan, seems like “stop shipping SOTA tech” is the obvious alternative plan.
Clearly in-aggregate the behavior of the labs is causing the risk here, so I think it’s reasonable to assume that it’s Anthropic’s job to make an argument for a plan that differs from the other labs. At the moment, I know of no such plan. I have some vague hopes, but nothing concrete, and Anthropic has not been very forthcoming with any specific plans, and does not seem on track to have one.
Note that Anthropic, for the early years, did have a plan to not ship SOTA tech like every other lab, and changed their minds. (Maybe they needed the revenue to get the investment to keep up; maybe they needed the data for training; maybe they thought the first mover effects would be large and getting lots of enterprise clients or w/e was a critical step in some of their mid-game plans.) But I think many plans here fail once considered in enough detail.
Anthropic’s responsible scaling policy does mention pausing scaling if the capabilities of their models exceed their best safety methods.
I think OP and others in the thread are wondering why Anthropic doesn’t stop scaling now given the risks. I think the reason why is that in practice doing so would create a lot of problems:
How would Anthropic fund their safety research if Claude is no longer SOTA and becomes less popular?
Is Anthropic supposed to learn from and test only models at current levels of capability and how does it learn about future advanced model behaviors? I haven’t heard a compelling argument for how we could solve superalignment by studying much less advanced models. Imagine trying to align GPT-4 or o3 by only studying and testing GPT-2 from 2019. In reality, future models will probably have lots of unknown unknowns and emergent properties that are difficult or impossible to predict in advance. And then there’s all the social consequences of AI like misuse which are difficult to predict in advance.
Although I’m skeptical that alignment can be solved without a lot of empirical work on frontier models, I still think it would be better if AI progress were slower.
I don’t expect Anthropic to stick to any of their policies when competitive pressure means they have to train and deploy and release or be left behind. None of their commitments are of a kind they wouldn’t be able to walk back on.
Anthropic accelerates capabilities more than safety; they don’t even support regulation, with many people internally being misled about Anthropic’s efforts. None of their safety efforts meaningfully contributed to solving any of the problems you’d have to solve to have a chance of having something much smarter than you that doesn’t kill you.
I’d be mildly surprised if there’s a consensus at Anthropic that they can solve superalignment. The evidence they’re getting shows, according to them, that we live in an alignment-is-hard world.
If any of these arguments are Anthropic’s, I would love for them to say that out loud.
I’ve generally been aware of / can come up with some arguments; I haven’t heard them in detail from anyone at Anthropic, and I would love for Anthropic to write up the plan that includes reasoning for why shipping SOTA models helps humanity survive instead of doing the opposite.
The last time I saw Anthropic’s claimed reason for existing, it later became an inspiration for
I’m confused about why you’re pointing to Anthropic in particular here. Are they being overoptimistic in a way that other scaling labs are not, in your view?
Unlike other labs, Anthropic is full of people who care and might leave capabilities work or push for the leadership to be better. It’s a tricky place to be in: if you’re responsible enough, you’ll hear more criticism than less responsible actors, because criticism can still change what you’re doing.
Other labs are much less responsible, to be clear. There’s not a lot (I think) my words here can do about that, though.
Got it. It might be worth adding something like that to the post, which in my opinion reads as if it’s singling out Anthropic as especially deserving of criticism.
I understand your argument and it has merit, but I think the reality of the situation is more nuanced.
Humanity has long built buildings and bridges without access to formal engineering methods for predicting the risk of collapse. We might regard it as unethical to build such a structure now without using the best practically available engineering knowledge, but we do not regard it as having been unethical to build buildings and bridges historically, given the lack of modern engineering materials and methods. They did their best, more or less, with the resources they had access to at the time.
AI is a domain where the current state of the art safety methods are in fact being applied by the major companies, as far as I know (and I’m completely open to being corrected on this). In this respect, safety standards in the AI field are comparable to those of other fields. The case for existential risk is approximately as qualitative and handwavey as the case for safety, and I think that both of these arguments need to be taken seriously, because they are the best we currently have. It is disappointing to see the cavalier attitude with which pro-AI pundits dismiss safety concerns, and obnoxious to see the overly confident rhetoric deployed by some in the safety world when they tweet about their p(doom). It is a weird and important time in technology, and I would like to see greater open-mindedness and thoughtfulness about the ways to make progress on all of these important issues.
Perhaps the answer is right there, in the name. The future Everett branches where we still exist will indeed be the ones where we have magically passed the hardest test on the first try.
Branches like that don’t have a lot of reality-fluid and lost most of the value of our lightcone; you’re much more likely to find yourself somewhere before that.
Does “winning the race” actually give you a lever to stop disaster, or does it just make Anthropic the lab responsible for the last training run?
Does access to more compute and more model scaling, with today’s field understanding, truly give you more control—or just put you closer to launching something you can’t steer? Do you know how to solve alignment given even infinite compute?
Is there any sign, from inside your lab, that safety is catching up faster than capabilities? If not, every generation of SOTA increases the gap, not closes it.
“Build the bomb, because if we don’t, someone worse will.”
Once you’re at the threshold where nobody knows how to make these systems steerable or obedient, it doesn’t matter who is first—you still get a world-ending outcome.
If Anthropic, or any lab, ever wants to really make things go well, the only winning move is not to play, and try hard to make everyone not play.
If Anthropic were what it imagines itself to be, it would build robust field-wide coordination and support regulation that would be effective globally, even if it means watching over your shoulder for colleagues and competitors across the world.
If everyone justifies escalation as “safety”, there is no safety.
In the end, if the race leads off a cliff, the team that runs fastest doesn’t “win”: they just get there first. That’s not leadership. It’s tragedy.
If you truly care about not killing everyone, there will have to be a point—maybe now—where some leaders stop, even if it costs them, and demand a solution that doesn’t sacrifice the long term for a financial gain from having a model slightly better than your competitors’.
Anthropic is in a tricky place. Unlike other labs, it is full of people who care. The leadership has to adjust for that.
That makes you one of the few people in history who has the chance to say “no” to the spiral to the end of the world and demand of your company to behave responsibly.
(note: many of these points are AI-generated by a model with 200k tokens of Arbital in its context; though heavily edited.)
I have great empathy and deep respect for the courage of the people currently on hunger strikes to stop the AI race. Yet, I wish they hadn’t started them: these hunger strikes will not work.
Hunger strikes can be incredibly powerful when there’s a just demand, a target who would either give in to the demand or be seen as a villain for not doing so, a wise strategy, and a group of supporters.
I don’t think these hunger strikes pass the bar. Their political demands are not what AI companies would realistically give in to because of a hunger strike by a small number of outsiders.
A hunger strike can bring attention to how seriously you perceive an issue. If you know how to make it go viral, that is; in the US, hunger strikes are rarely widely covered by the media. And even then, you are more likely to marginalize your views than to make them go more mainstream: if people don’t currently think halting frontier general AI development requires hunger strikes, a hunger strike won’t explain to them why your views are correct: this is not self-evident just from the description of the hunger strike, and so the hunger strike is not the right approach here and now.
Also, our movement does not need martyrs. You can be a lot more helpful if you eat well, sleep well, and are able to think well and hard. Your life is also very valuable; it is a part of what we’re fighting for; saving a world without you is slightly sadder than saving a world with you; and perhaps more importantly to you, it will not help. A hunger strike needs to already be seen by the public as legitimate to make them more sympathetic towards your cause and exert pressure. It needs to target decision makers who have the means to give in and advance your cause by doing so, for it to have any meaning at all.
At the moment, these hunger strikes are people vibe-protesting. They feel like some awful people are going to kill everyone, they feel powerless, and so they find a way to do something that they perceive as having a chance of changing the situation.
Please don’t risk your life; especially, please don’t risk your life in this particular way that won’t change anything.
Action is better than inaction; but please stop and think of your theory of change for more than five minutes, if you’re planning to risk your life, and then don’t risk your life[1]; please pick actions thoughtfully and wisely and not because of the vibes[2].
You can do much more if you’re alive and well and use your brain.
Not to say that you shouldn’t be allowed to risk your life for a large positive impact. I would sacrifice my life for some small chance of preventing AI risk. But most people who think they’re facing a choice to sacrifice their life for some chance of making a positive impact are wrong and don’t actually face it; so I think the bar for risking one’s life should be very high. In particular, when people have time to carefully do the math, I really want them to carefully do the math before deciding to risk their lives, and in this specific case, some of my frustration is from the people clearly getting their math wrong.
I think as a community, we also would really want to make people err on the side of safety, and have a strong norm of assumption that most people who decide to sacrifice their lives got their math wrong, especially if a community that shares their values disagrees with them on the consequences of the sacrifice. People really shouldn’t be risking their lives without having carefully thought of the theory of change (when they have the ability to do so).
I’d bet if we find people competent in how movements achieve their goals, they will say that these particular hunger strikes are not great; and I expect it to be the case most of the time when individuals who share values with a larger movement decide to go on a hunger strike even as the larger movement thinks that would not be effective.
My strong impression is that the person on the hunger strike in front of Anthropic is doing this primarily because he feels like it is the proper thing to do in this situation, like it’s the action someone should be taking here.
Hi Mikhail, thanks for offering your thoughts on this. I think having more public discussion on this is useful and I appreciate you taking the time to write this up.
I think your comment mostly applies to Guido in front of Anthropic, and not our hunger strike in front of Google DeepMind in London.
I don’t think I have been framing Demis Hassabis as a villain and if you think I did it would be helpful to add a source for why you believe this.
I’m asking Demis Hassabis to “publicly state that DeepMind will halt the development of frontier AI models if all the other major AI companies agree to do so”, which I think is a reasonable thing to state given all the public statements he has made regarding AI Safety. I think that is indeed something that a company such as Google DeepMind would give in to.
I’m currently in the UK, and I can tell you that there have already been two pieces published on Business Insider. I’ve also given three interviews in the past 24 hours to journalists contributing to major publications. I’ll try to add links later if / once these get published.
Again, I’m pretty sure I haven’t framed people as “awful”, and it would be great if you could provide sources for that statement. I also don’t feel powerless. My motivation for doing this was in part to provide support to Guido’s strike in front of Anthropic, which feels more like helping an ally, joining forces.
I find it actually empowering to be able to be completely honest about what I actually think DeepMind should do to help stop the AI race and receive so much support from all kinds of people on the street, including employees from Google, Google DeepMind, Meta and Sony. I am also grateful to have Denys with me, who flew from Amsterdam to join the hunger strike, and all the journalists who have taken the time to talk to us, both in person and remotely.
I agree with the general point that taking decisions based on an actual theory of change is a much more effective way to have an impact on the world. I’ve personally thought quite a lot about why doing this hunger strike in front of DeepMind is net good, and I believe it’s having the intended impact, so I disagree with your implication that I’m basing my decisions on vibes. If you’d like to know more I’d be happy to talk to you in person in front of the DeepMind office or remotely.
Now, taking a step back and considering Guido’s strike, I want to say that even if you think that his actions were reckless and based on vibes, it’s worth evaluating whether his actions (and their consequences) will eventually turn out to be net negative. For one I don’t think I would have been out in front of DeepMind as I type this if it was not for Guido’s action, and I believe what we’re doing here in London is net good. But most importantly we’re still at the start of the strikes so it’s hard to tell what will happen as this continues. I’d be happy to have this discussion again at the end of the year, looking back.
Finally, I’d like to acknowledge the health risks involved. I’m personally looking after my health, and there are some medics at King’s Cross who would be willing to help quickly if anything extreme were to happen. And given the length of the strikes so far I think what we’re doing is relatively safe, though I’m happy to be proven otherwise.
Thanks for responding!
Yep!
A hunger strike is not a good tool if you don’t want to paint someone as a villain in the eyes of the public when they don’t give in to your demand.
It is vanishingly unlikely that all other major AI companies would agree to do so without the US government telling them to; this statement would be helpful, but only to communicate their position and not because of the commitment itself. Why not ask them to ask the government to stop everyone (maybe conditional on China agreeing to stop everyone in China)?
If any of them go viral in the US with a good message, I’ll (somewhat) change my mind!
This was mainly my impression after talking to Guido; but do you want to say more about the impact you think you’ll have?
(Can come back to it at the end of the year; if you have any advance predictions, they might be helpful to have posted!)
I hope you remain safe and are not proven otherwise! Hunger strikes do carry real health risks, though. Do you have particular plans for how long to stay on the hunger strike?
I have sent myself an email to arrive on December 20th to send you both a reminder about this thread.
Is there any form of protest that doesn’t implicitly imply that the person you’re protesting is doing something wrong? When the thing wrong is “causing human extinction” it seems to me kind of hard for that to not automatically be assumed ‘villainous’.
(Asking genuinely, I think it quite probably the answer is ‘yes’.)
Something like: hunger strikes are optimized hard specifically for painting someone as a villain, because the target is framed as deciding to let someone suffer or die (or be inhumanely force-fed). This is different from other forms of protest that are more focused on, e.g., arguing that specific decisions are bad and should be revoked, but don’t necessarily try to make people perceive the other side as evil.
I don’t really see the problem with painting people as evil in principle, given that some people are evil. You can argue against it in specific cases, but I think the case for AI CEOs being evil is strong enough that it can’t be dismissed out of hand.
The case in question is “AI CEOs are optimising for their short-term status/profits, and for believing things about the world which maximise their comfort, rather than doing the due diligence required of someone in their position, which is to seriously check whether their company is building something which kills everyone”
Whether this is a useful frame for one’s own thinking—or a good frame to deploy onto the public—I’m not fully sure, but I think it does need addressing. Of course it might also differ between CEOs. I think Demis and Dario are two of the CEOs who it’s relatively less likely to apply to, but also I don’t think it applies weakly enough for them to be dismissed out of hand even in their cases.
“People are on hunger strikes” is not really a lot of evidence for “AI CEOs are optimizing for their short-term status/profits and are not doing the due diligence” in the eyes of the public.
I don’t think there’s any problem with painting people and institutions as evil; I’m just not sure why you would want to do this here, as compared to other things, and I would want people to have answers to how they imagine a hunger strike would paint AI companies/CEOs and what the impact of that would be, because I expect it would do little to move the needle.
That is true. “People are on hunger strikes and the CEOs haven’t even commented” is (some) public evidence of “AI CEOs are unempathetic”
I misunderstood your point, I thought you were arguing against painting individuals as evil in general.
This seems to be exactly the point of the demand? This is a demand that would be cheap (perhaps even of negative cost) for DeepMind to accept (because the other AI companies wouldn’t agree to that), and would also be a major publicity win for the Pause AI crowd. Even counting myself skeptical of the hunger strikes, I think this is a very smart move.
The demand is that a specific company agrees to halt if everyone halts; this does not help in reality, because in fact it won’t be the case that everyone halts (absent government intervention).
I don’t think the point of hunger strikes is to achieve immediate material goals, but publicity/symbolic ones.
I think there’s a very reasonable theory of change—X-risk from AI needs to enter the Overton window. I see no justification here for going to the meta-level and claiming they did not think for 5 minutes, which is why I have weak downvoted in addition to strong disagree.
This tactic might not work, but I am not persuaded by your supposed downsides. The strikers should not risk their lives, but I don’t get the impression that they are. The movement does need people who are eating and therefore able to work on AI safety research, governance, and other forms of advocacy. But why not this too? It seems very plausibly a comparative advantage for some concerned people, and particularly high leverage when very few are taking this step. If you think they should be doing something else instead, say specifically what it is and why these particular individuals are better suited to that particular task.
Michaël Trazzi’s comment, which he wrote a few hours before he started his hunger strike, isn’t directly about hunger striking but it does indicate to me that he put more than 5 minutes of thought into the decision, and his comment gestures at a theory of change.
I spoke to Michaël in person before he started. I told him I didn’t think the game theory worked out (if he’s not willing to die, GDM should ignore him; if he does die, then he’s worsening the world, since he can definitely contribute better by being alive, and GDM should still ignore him). I don’t think he’s going to starve himself to death or to serious harm, but that does make the threat empty. I don’t really think that matters too much on a game-theoretic reputation level, since nobody seems to be expecting him to do that.
His theory of change was basically “If I do this, other people might” which seems to be true: he did get another person involved. That other person has said they’ll do it for “1-3 weeks” which I would say is unambiguously not a threat to starve oneself to death.
As a publicity stunt it has kinda worked in the basic sense of getting publicity. I think it might change the texture and vibe of the AI protest movement in a direction I would prefer it to not go in. It certainly moves the salience-weighted average of public AI advocacy towards Stop AI-ish things.
As Mikhail said, I feel great empathy and respect for these people. My first instinct was similar to yours, though - if you’re not willing to die, it won’t work, and you probably shouldn’t be willing to die (because that also won’t work / there are more reliable ways to contribute / timelines uncertainty).
I think ‘I’m doing this to get others to join in’ is a pretty weak response to this rebuttal. If they’re also not willing to die, then it still won’t work, and if they are, you’ve wrangled them in at more risk than you’re willing to take on yourself, which is pretty bad (and again, it probably still won’t work even if a dozen people are willing to die on the steps of the DeepMind office, because the government will intervene, or they’ll be painted as loons, or the attention will never materialize and their ardor will wane).
I’m pretty confused about how, under any reasonable analysis, this could come out looking positive EV. Most of these extreme forms of protest just don’t work in America (e.g. the soldier who self-immolated a few years ago). And if it’s not intended to be extreme, they’ve (I presume accidentally) misbranded their actions.
Fair enough. I think these actions are +ev under a coarse grained model where some version of “Attention on AI risk” is the main currency (or a slight refinement to “Not-totally-hostile attention on AI risk”). For a domain like public opinion and comms, I think that deploying a set of simple heuristics like “Am I getting attention?” “Is that attention generally positive?” “Am I lying or doing something illegal?” can be pretty useful.
Michael said on twitter here that he’s had conversations with two sympathetic DeepMind employees, plus David Silver, who was also vaguely sympathetic. This itself is more +ev than I expected already, so I’m updating in favour of Michael here.
It’s also occurred to me that if any of the CEOs cracks and at least publicly responds to the hunger strikers, then the CEOs who don’t do so will look villainous, so you actually only need one of them to respond to get a wedge in.
“Attention on AI risk” is a pretty bad proxy to optimize for, since the available tactics include getting the kind of attention that would be paid to luddites, lunatics, and crackpots caring about some issue.
The actions that we can take can:
Use what separates us from people everyone considers crazy: that our arguments check out and our predictions hold; communicate those;
Spark and mobilize existing public support;
Be designed to optimize for positive attention, not for any attention.
I don’t think DeepMind employees really changed their minds? Like, there are people at DeepMind with p(doom) higher than Eliezer’s; they would be sympathetic; would they change anything they’re doing? (I can imagine it prompting them to talk to others at DeepMind, talking about the hunger strike to validate the reasons for it.)
I don’t think Demis responding to the strike would make Dario look particularly villainous; happy to make conditional bets. How villainous each of them looks should be pretty independent, outside of, e.g., Demis responding and prompting a journalist to ask Dario, which would take plausible deniability away from him.
I’m also not sure how effective it would be to use this to paint the companies (or the CEOs—are they even the explicit targets of the hunger strikes?) as villainous.
To clarify, “think for five minutes” was an appeal to people who might want to do these kinds of things in the future, not a claim about Guido or Michael.
That said, I do in fact claim they have not thought carefully about their theory of change, and the linked comment from Michael lists very obvious surface-level reasons for why to do this in front of Anthropic and not OpenAI; I really would not consider this on the level of demonstrating having thought carefully about the theory of change.
While in principle, as I mentioned, a hunger strike can bring attention, this is not an effective way to do this for the particular issue that AI will kill everyone by default. The diff to communicate isn’t “someone is really scared of AI ending the world”; it’s “scientists think AI might literally kill everyone and also here are the reasons why”.
This was not a claim about these people but an appeal to potential future people to maybe do research on this stuff before making decisions like this one.
That said, I talked to Guido prior to the start of the hunger strike, tried to understand his logic, and was not convinced he had any kind of reasonable theory of change guiding his actions, and my understanding is that he perceives it as the proper action to take, in a situation like that, which is why I called this vibe-protesting.
(It’s not very clear what would be the conditions for them to stop the hunger strikes.)
Hunger strikes can be very effective and powerful if executed wisely. My comment expresses my strong opinion that this did not happen here, not that it can’t happen in general.
I think I somewhat agree, but also I think this is a more accurate vibe than “yay tech progress”. It seems like a step in the right direction to me.
You repeat a recommendation not to risk your life. Um, I’m willing to die to prevent human extinction. The math is trivial. I’m willing to die to reduce the risk by a pretty small percentage. I don’t think a single life here is particularly valuable on consequentialist terms.
There’s important deontology about not unilaterally risking other people’s lives, but this mostly goes away in the case of risking your own life. This is why many medical ethics guidelines separate self-experimentation as a special case from the rules for experimenting on others (and that’s been used very well in many cases and aligns incentives). I think one should have dignity and respect oneself, but there are many self-respecting situations where one should make major personal sacrifices and risk one’s whole life. (Somewhat similarly, there are many situations where one should risk being prosecuted unjustly by the state and spending a great deal of one’s life in prison.)
I don’t think so. I agree we shouldn’t have laws around this, but insofar as we have deontologies to correct for circumstances where our naive utility-maximizing calculations have historically been consistently biased, I think there have been enough cases of people uselessly martyring themselves for their causes to justify a deontological rule against sacrificing your own actual life.
Edit: Basically, I don’t want suicidal people to back-justify batshit insane reasons why they should die to decrease x-risk instead of getting help. And I expect these are the only people who would actually be at risk for a plan which ends with “and then I die, and there is 1% increased probability everyone else gets the good ending”.
I recently read The Sacrifices We Choose to Make by Michael Nielsen, which was a good read. Here are some relevant extracts.
Nielsen also includes unsuccessful or actively repugnant examples of it.
I also think this paragraph about Quang Duc is quite relevant:
I’m not certain if there’s a particular point you want me to take away from this, but thanks for the information, and for including an unbiased sample from the article you linked. I don’t think I changed my mind much from reading this, though.
Do you also believe there is a deontological rule against suicide? I have heard rumor that most people who attempt suicide and fail, regret it. At the same time, I think some lives are worse than death (for example, see Amanda Luce’s Book Review: Two Arms And A Head that won the ACX book review prize), and so I believe it should be legal and sometimes supported, even if it were the case that most attempted suicides have been regretted.
After doing some research on this, I think this is unlikely to be true. The only quantitative study I found says that among its sample of suicide attempt survivors, 35.6% are glad to have survived, while 42.7% feel ambivalent, and 21.6% regret having survived. I also found a couple of sources agreeing with your “rumor”, but one cited just a suicide awareness trainer as its source, while the other cited the above study as the only evidence for its claim, somehow interpreting it as “Previous research has found that more than half of suicidal attempters regret their suicidal actions.” (Gemini 2.5 Pro says “It appears the authors of the 2023 paper misinterpreted or misremembered the findings of the 2005 study they cited.”)
If this “rumor” were true, I would expect to see a lot of studies supporting it, because such studies are easy to do and the result would be highly useful for people trying to prevent suicides (i.e., they could use it to convince potential suicide attempters that they’re likely to regret it). Evidence to the contrary is likely to be suppressed or not gathered in the first place, as almost nobody wants to encourage suicides. (The above study gathered the data incidentally, for a different purpose.) So everything seems consistent with the “rumor” being false.
Interesting, thanks. I think I had heard the rumor before and believed it.
In the linked study, it looks like they asked the people about regret very shortly after the suicide attempt. This could both bias the results towards less regret to have survived (little time to change their mind) or more regret to have survived (people might be scared to signal intent to retry suicide, for fear of being committed, which I think sometimes happens soon after failed attempts).
I think very very many people are not making an informed decision when they decide to commit suicide.
For example, I think quantum immortality is quite plausibly a thing. Very few people know about quantum immortality and even fewer have seriously thought about it. This means that almost everyone on the planet might have a very mistaken model of what suicide actually does to their anticipated experience.[1] Also, many people are religious and believe in a pleasant afterlife. Many people considering suicide are mentally ill in a way that compromises their decision making. Many people think transhumanism is impossible and won’t arrange for their brain to be frozen for that reason.
I agree that there is some threshold on the fraction of ill-considered suicides relative to total suicides such that suicide should be legal if we were below that threshold. I used to think we were maybe below that threshold. After I began studying physics at uni and so started taking quantum immortality more seriously, I switched to thinking we are maybe above the threshold.
You might find yourself in a branch where your suicide attempt failed, but a lot of your body and mind were still destroyed. If you keep exponentially decreasing the amplitude of your anticipated future experience in the universal wave function further, you might eventually find that it is now dominated by contributions from weird places and branches far-off in spacetime or configuration space that were formerly negligible, like aliens simulating you for some negotiation or other purpose.
I don’t really know yet how to reason well about what exactly the most likely observed outcome would be here. I do expect that by default, without understanding and careful engineering that our civilisation doesn’t remotely have the capability for yet, it’d tend to be very Not Good.
This all feels galaxy-brained to me and like it proves too much. By analogy I feel like if you thought about population ethics for a while and came to counterintuitive conclusions, you might argue that people who haven’t done that shouldn’t be allowed to have children; or if they haven’t thought about timeless decision theory for a while they aren’t allowed to get a carry license.
I don’t think it proves too much. Informed decision-making comes in degrees, and some domains are just harder? Like, I think my threshold for leaving people free to make their own mistakes if they are the only ones harmed by them is very low, compared to where the human population average seems to be at the moment. But my threshold is, in fact, greater than zero.
For example, there are a bunch of things I think bystanders should generally prevent four-year-old human children from doing, even if the children insist that they want to do them. I know that stopping four-year-old children from doing these things will be detrimental in some cases, and that having such policies is degrading to the children’s agency. I remember what it was like being four years old and feeling miserable because of kindergarten teachers who controlled my day and thought they knew what was best for me. I still think the tradeoff is worth it on net in some cases.
I just think that the suicide thing happens to be a case where doing informed decision-making is maybe just too tough for way too many humans and thus some form of ban could plausibly be worth it on net. Sports betting is another case where I was eventually convinced that maybe a legal ban of some form could be worth it.
(I agree with Lucious in that I think it is important that people have the option of getting cryopreserved and also are aware of all the reality-fluid stuff before they decide to kill themselves.)
“Important” is ambiguous: I agree it matters, but it’s quite another thing for this civilization to ban whole life options from people until they have heard about niche philosophy. Most people will never hear about niche philosophy.
I don’t think quantum immortality changes anything. You can reframe this in terms of standard probability theory and condition on them continuing to have subjective experience, and still get to the same calculus.
However, only considering the branches in which you survive, or conditioning on having subjective experience after the suicide attempt, ignores the counterfactual suffering prevented in all the branches (or probability mass) in which you did die, which may be less unpleasant than the branches in which you survived, but are many many more in number! Ignoring those branches biases the reasoning toward rare survival tails that don’t dominate the actual expected utility.
I agree that quantum mechanics is not really central for this on a philosophical level. You get a pretty similar dynamic just from having a universe that is large enough to contain many almost-identical copies of you. It’s just that it seems at present very unclear and arguable whether the physical universe is in fact anywhere near that large, whereas I would claim that a universal wavefunction which constantly decoheres into different branches containing different versions of us is pretty strongly implied to be a thing by the laws of physics as we currently understand them.
It is very late here and I should really sleep instead of discussing this, so I won’t be able to reply as in-depth as this probably merits. But, basically, I would claim that this is not the right way to do expected utility calculations when it comes to ensembles of identical or almost-identical minds.
A series of thought experiments might help illustrate part of where my position comes from:
Imagine someone tells you that they will put you to sleep and then make two copies of you, identical down to the molecular level. They will place you in a room with blue walls. They will place one copy of you in a room with red walls, and the other copy in another room with blue walls. Then they will wake all three of you up.
What color do you anticipate seeing after you wake up, and with what probability?
I’d say 2⁄3 blue, 1⁄3 red. Because there will now be three versions of me, and until I look at the walls I won’t know which one I am.
Imagine someone tells you that they will put you to sleep and then make two copies of you. One copy will not include a brain. It’s just a dead body with an empty skull. Another copy will be identical to you down to the molecular level. Then they will place you in a room with blue walls, and the living copy in a room with red walls. Then they will wake you and the living copy up.
What color do you anticipate seeing after you wake up, and with what probability? Is there a 1⁄3 probability that you ‘die’ and don’t experience waking up because you might end up ‘being’ the corpse-copy?
I’d say 1⁄2 blue, 1⁄2 red, and there is clearly no probability of me ‘dying’ and not experiencing waking up. It’s just a bunch of biomass that happens to be shaped like me.
As 2, but instead of creating the corpse-copy without a brain, it is created fully intact, then its brain is destroyed while it is still unconscious. Should that change our anticipated experience? Do we now have a 1⁄3 chance of dying in the sense that we might not experience waking up? Is there some other relevant sense in which we die, even if it does not affect our anticipated experience?
I’d say no and no. This scenario is identical to 2 in terms of the relevant information processing that is actually occurring. The corpse-copy will have a brain, but it will never get to use it, so it won’t affect my anticipated experience in any way. Adding more dead copies doesn’t change my anticipated experience either. My best-scoring prediction will be that I have a 1⁄2 chance of waking up to see red walls, and a 1⁄2 chance of waking up to see blue walls.
In real life, if you die in the vast majority of branches caused by some event, i.e. that’s where the majority of the amplitude is, but you survive in some, the calculation for your anticipated experience would seem to not include the branches where you die for the same reason it doesn’t include the dead copies in thought experiments 2 and 3.
(I think Eliezer may have written about this somewhere as well using pretty similar arguments, maybe in the quantum physics sequence, but I can’t find it right now.)
Again, not sure why a large universe is needed. The expected utility ends up the same either way, whether you have some fraction of branches in which you remain alive or some probability of remaining alive.
Regarding the expected utility calculus: I agree with everything you said, but I don’t see how any of it allows you to disregard the counterfactual suffering from not committing suicide in your expected value calculation.
Maybe the crux is whether we consider the utility of each “you” (i.e. you in each branch) individually, and add it up for the total utility, or whether we consider all “you”s to have just one shared utility.
Let’s say that not committing suicide gives you −1 utility in n branches, but committing suicide gives you −100 utility in n/m branches and 0 utility in n−n/m branches.
If we treat all copies of you as having separate utilities and add them all up for a total expected utility calculation, not committing suicide gives −n utility while committing suicide leads to −100n/m utility. Therefore, as long as m>100, it is better to commit suicide.
If, on the other hand you treat them as having one shared utility, you get either −1 or −100 utility, and −100 is of course worse.
Do you agree that this is the crux? If so, why do you think that all the copies share one utility rather than their utilities adding up?
In a large universe, you do not end. It’s not that you expect to see some branch versus another; you just continue, the computation that is you continues. When you open your eyes, you’re not likely to find yourself as a person in a branch computed only relatively rarely; still, that person continues, and does not die.
Attempted suicide reduces your reality-fluid (how much you’re computed and how likely you are to find yourself there), but you will continue to experience the world. If you die in a nuclear explosion, the continuation of you will be somewhere else, sort-of isekaied; and mostly you will find yourself not in a strange world that recovers the dead but in a world where the nuclear explosion did not happen; still, in a large world, even after a nuclear explosion, you continue.
You might care about having a lot of reality-fluid, because this makes your actions more impactful, because you can spend your lightcone better, and improve the average experience in the large universe. You might also assign negative utility to others seeing you die; they’ll have a lot of reality-fluid in worlds where you’re dead and they can’t talk to you, even as you continue. But I don’t think it works out to assigning the same negative utility to dying as in branches of small worlds.
Yes, but the number of copies of you still reduces (or the probability that you are alive in standard probability theory, or the number of branches in many worlds). Why are these not equivalent in terms of the expected utility calculus?
Imagine that you’re an agent in the Game of Life. Your world, your laws of physics, are computed on a very large number of independent computers, all performing the same computation.
You exist within the laws of causality of your world, computed as long as at least one server computes your world. If some of them stop performing the computation, it won’t be a death of a copy; you’ll just have one fewer instance of yourself.
What’s the difference between fewer instances and fewer copies, and why is that load-bearing for the expected utility calculation?
You are of course right that there’s no difference between reality-fluid and normal probabilities in a small world: it’s just how much you care about various branches relative to each other, regardless of whether all of them will exist or only some.
I claim that the negative utility due to ceasing to exist is just not there, because you don’t actually cease to exist in a way you reflectively care about when you have fewer instances. For normal things (e.g., how much you care about paperclips), the expected utility is the same; but here, it’s the kind of terminal value that I expect would be different for most people; guaranteed continuation in 5% of instances is much better than a 5% chance of continuing in all instances; in the first case, you don’t die!
But we are not talking about negative utility due to ceasing to exist. We are talking about avoiding counterfactual negative utility by committing suicide, which still exists!
I think this is an artifact of thinking of all of the copies having a shared utility (i.e. you) rather than separate utilities that add up (i.e. so many yous will suffer if you don’t commit suicide). If they have separate utilities, we should think of them as separate instances of yourself.
And even in the case where we are assigning negative utility to death, most people are really considering counterfactual utility from being alive, and 95% of that (expected) counterfactual utility is lost whether 95% of the “instances of you” die or whether there is a 95% chance that “you” die.
I think there is, and I think cultural mores well support this. Separately, I think we shouldn’t legislate morality and though suicide is bad, it should be legal[1].
There also exist cases where it is in fact correct from a utilitarian perspective to kill, but this doesn’t mean there is no deontological rule against killing. We can argue about the specific circumstances where we need these rule carve-outs (eg war), but I think we’d agree that when it comes to politics and policy, there ought to be no carve-outs, since people are particularly bad at risk-return calculations in that domain.
But also this would mean we have to deal with certain liability issues, e.g. if ChatGPT convinces a kid to kill themselves, we’d like to say this is manslaughter or homicide iff the kid otherwise would’ve gotten better, but how do we determine that? I don’t know, and probably on net we should choose freedom instead, or this isn’t actually much of a problem in practice.
Makes sense. I don’t hold this stance; I think my stance is that many/most people are kind of insane on this, but that like with many topics we can just be more sane if we try hard and if some of us set up good institutions around it for helping people have wisdom to lean on in thinking about it, rather than having to do all their thinking themselves with their raw brain.
(I weakly propose we leave it here, as I don’t think I have a ton more to say on this subject right now.)
To clarify, I meant that the choice of actions was based on the vibes, not on careful consideration: this seeming like the right thing to do in these circumstances.
I maybe formulated this badly.
I do not disagree with that part of your comment. I did, in fact, risk being prosecuted unjustly by the state and spending a great deal of my life in prison. I was also aware of the kinds of situations I’d want to go for hunger strikes in while in prison, though didn’t think about that often.
And I, too, am willing to die to reduce the risk by a pretty small chance.
Most of the time, though, I think people who think they have this choice don’t actually face it; I think the bar for risking one’s life should be very high. In particular, when people have time to carefully do the math, I really want them to carefully do the math before deciding to risk their lives, and in this particular case, some of my frustration is from the people getting their math wrong.
I think as a community, we also would really want to make people err on the side of safety, and have a strong norm of assuming that most people who decide to sacrifice their lives got their math wrong. People really shouldn’t be risking their lives without having carefully thought through the theory of change, when they have the ability to do so.
Like, I’d bet that if we find people competent in how movements achieve their goals, they will say that these particular hunger strikes are not great; and I expect that to be the case most of the time when individuals who share values with a larger movement decide to go on a hunger strike even as the larger movement thinks it would not be effective.
I think I somewhat agree that these hunger strikes will not shut down the companies or cause major public outcry.
I think that there is definitely something to be said that potentially our society is very poor at doing real protesting, and will just do haphazard things and never do anything goal-directed. That’s potentially a pretty fundamental problem.
But setting that aside (which is a big thing to set aside!) I think the hunger-strike is moving in the direction of taking this seriously. My guess is most projects in the world don’t quite work, but they’re often good steps to help people figure out what does work. Like, I hope this readies people to notice opportunities for hunger strikes, and also readies them to expect people to be willing to make large sacrifices on this issue.
People do in fact try to be very goal-directed about protesting! They have a lot of institutional knowledge on it!
You can study what worked and what didn’t work in the past, and what makes a difference between a movement that succeeds and a movement that doesn’t. You can see how movements organize, how they grow local leaders, how they come up with ideas that would mobilize people.
A group doesn’t have to attempt a hunger strike to figure out what the consequences would be; it can study and think, and I expect that to be a much more valuable use of time than doing hunger strikes.
I’d be interested to read a quick post from you that argued “Hunger-strikes are not the right tool for this situation; here is what they work for and what they don’t work for. Here is my model of this situation and the kind of protests that do make sense.”
I don’t know much about protesting. Most of the recent ones that get big enough that I hear about them have been essentially ineffectual as far as I can recall (Occupy Wall Street, the Women’s March, No Kings). I am genuinely interested in reading about clearly effective protests led by anyone currently doing protests, or within the last 10 years. Even if on a small scale.
(My thinking isn’t that protests have not worked in the past – I believe they have, MLK, Malcolm X, Women’s Suffrage Movement, Vietnam War Protest, surely more – but that the current protesting culture has lost its way and is no longer effective.)
Caveat that I don’t know much more than this, but I’m reminded of James Ozden’s lit reviews, e.g. How effective are protests? Some research and some nuance. Ostensibly relevant bits:
(Would be interested in someone going through this paper and writing a post or comment highlighting some examples and why they’re considered successful.)
Not quite responding to your main point here, but I’ll say that this position would seem valid to me and good to say if you believed it.
I don’t know what personal life tradeoffs any of them are making, so I have a hard time speaking to that. I just found out that Michael Trazzi is one of the people doing a hunger strike; I don’t think it’s true of him that he hasn’t thought seriously about the issues given how he’s been intellectually engaged for 5+ years.
Yep, I basically believe this.
(Social movements (and comms and politics) are not easy to reason about well from first principles. I think Michael is wrong to be making this particular self-sacrifice, not because he hasn’t thought carefully about AI but because he hasn’t thought carefully about hunger strikes.)
Relevantly, if any of them actually die, and if also it does not cause major change and outcry, I will probably think they made a foolish choice (where ‘foolish’ means ‘should have known in advance this was the wrong call on a majorly important decision’).
My modal guess is that they will all make real sacrifice, and stick it out for 10-20 days, then wrap up.
Follow-up: Michael Trazzi wrapped up after 7 days due to fainting twice and two doctors saying he was getting close to being in a life-threatening situation.
(Slightly below my modal guess, but also his blood glucose level dropped unusually fast.)
FAO @Mikhail Samin.
Yep. Good that he stopped. Likely bad that he started.
Trazzi shared this on Twitter:
The linked video seems to me largely successful at raising awareness of the anti-extinction position – it is not exaggerated, it is not mocked, it is accurately described and taken seriously. I take this as evidence of the strikes being effective at their goals (interested if you disagree).
I think the main negative update about Dennis (in line with your concerns) is that he didn’t tell his family he was doing this. I think that’s quite different from the Duc story I linked above, where he made a major self-sacrifice with the knowledge and support of his community.
Yep, I’ve seen the video. Maybe a small positive update overall, because could’ve been worse?
It seems to me that you probably shouldn’t optimize for publicity for publicity’s sake, and even then, hunger strikes are not a good way to get it.
Hunger strikes are very effective tools in some situations; but they’re not effective for this. You can raise awareness a lot more efficiently than this.
“The fears are not backed up with evidence” and “AI might improve billions of lives” is what you get when you communicate being in fear of something without focusing on the reasons why.
Further follow-up: Guido Reichstadter wraps up after 30 days. Impressively long! And a bit longer than I’d guessed.
On the object level it’s (also) important to emphasize that these guys don’t seem to be seriously risking their lives. At least one of them noted he’s taking vitamins, hydrating etc. On consequentialist grounds I consider this to be an overdetermined positive.
A hunger strike will eventually kill you even if you take vitamins, electrolytes, and sugar. (A way to prevent death despite the target not giving in is often a group of supporters publicly begging the person on the hunger strike to stop and not kill themselves for some plausible reasons, but sometimes people ignore that and die.) I’m not entirely sure what Guido’s intention is if Anthropic doesn’t give in.
Sure, I just want to defend that it would also be reasonable if they were doing a more intense and targeted protest. “Here is a specific policy you must change” and “I will literally sacrifice my life if you don’t make this change”. So I’m talking about the stronger principle.
Isn’t suicide already legal in most places?
I think in a lot of places the government will try to stop you, including using violence.
I don’t strongly agree or disagree with your empirical claims but I do disagree with the level of confidence expressed. Quoting a comment I made previously:
I’m undecided on whether things like hunger strikes are useful but I just want to comment to say that I think a lot of people are way too quick to conclude that they’re not useful. I don’t think we have strong (or even moderate) reason to believe that they’re not useful.
When I reviewed the evidence on large-scale nonviolent protests, I concluded that they’re probably effective (~90% credence). But I’ve seen a lot of people claim that those sorts of protests are ineffective (or even harmful) in spite of the evidence in their favor.[1] I think hunger strikes are sufficiently different from the sorts of protests I reviewed that the evidence might not generalize, so I’m very uncertain about the effectiveness of hunger strikes. But what does generalize, I think, is that many peoples’ intuitions on protest effectiveness are miscalibrated.
[1] This may be less relevant for you, Mikhail Samin, because IIRC you’ve previously been supportive of AI pause protests in at least some contexts.
ETA: To be clear, I’m responding to the part of your post that’s about whether hunger strikes are effective. I endorse the positive message of the second half of your post.
ETA 2: I read Ben Pace’s comment and he is making some good points so now I’m not sure I endorse the second half.
To be very clear, I expect large social movements that use protests as one of its forms of action to have the potential to be very successful and impactful if done well. Hunger strikes are significantly different from protests. Hunger strikes can be powerful, but they’re best for very different contexts.
I think we should show some solidarity to people committed to their beliefs and making a personal sacrifice, rather than undermining them by critiquing their approach.
Given that they’re both young men and the hunger strikes are occurring in the first world, it seems unlikely anyone will die. But it does seem likely they or their friends will read this thread.
Beyond that, the hunger strike is only on day 2 and has already received a small amount of media coverage. Should they go viral, then this one action alone will have a larger differential impact on reducing existential risk than most safety researchers will achieve in their entire careers.
https://www.businessinsider.com/hunger-strike-deepmind-ai-threat-fears-agi-demis-hassabis-2025-9
This is surprising to hear on LessWrong, where we value truth without having to think of object-level reasons for why it is good to say true things. But on the object level: it would be very dangerous for a community to avoid saying true things because it is afraid of undermining someone’s sacrifice; this would lead to a lot of needless, and even net-negative, sacrifice, without mechanisms for self-correction. Like, if I ever do something stupid, please tell me (and everyone) that instead of respecting my sacrifice: I would not want others to repeat my mistakes.
(There are lots of ways to get media coverage and it’s not always good in expectation. If they go viral, in a good way/with a good message, I will somewhat change my mind.)
Aside from whether or not the hunger strikes are a good idea, I’m really glad they have emphasized conditional commitments in their demands
I think that we should be pushing on these much much more: getting groups to say “I’ll do X if abc groups do X as well”
And should be pushing companies/governments to be clear whether their objection is “X policy is net-harmful regardless of whether anyone else does it” vs “X is net-harmful for us if we’re the only ones to do it”
[I recognize that some of this pushing/clarification might make sense privately, and that groups will be reluctant to say stuff like this publicly because of posturing and whatnot.]
(While I like it being directed towards coordination, it would not actually make a difference, as it won’t be the case that all AI companies want to stop, and so it would still not be of great significance. The thing that works is a gov-supported ban on developing ASI anywhere in the world. A commitment to stop if everyone else stops doesn’t actually come into force unless everyone is required to stop anyway.
An ask that works is, e.g., “tell the government they need to stop everyone, including us”.)
For sure, I think that would be a reasonable ask too. FWIW, if multiple leading AI companies did make a statement like the one outlined, I think that would increase the chance of non-complying ones being made to halt by the government, even though they hadn’t made a statement themselves. That is, even one prominent AI company making this statement starts to widen the Overton window.
“There is no justice in the laws of Nature, no term for fairness in the equations of motion. The universe is neither evil, nor good, it simply does not care. The stars don’t care, or the Sun, or the sky. But they don’t have to! We care! There is light in the world, and it is us!”
And someday when the descendants of humanity have spread from star to star they won’t tell the children about the history of Ancient Earth until they’re old enough to bear it and when they learn they’ll weep to hear that such a thing as Death had ever once existed!
Credit for this quote goes to Eliezer Yudkowsky, for those who don’t know
We’re sending copies of the book to everyone with >5k followers!
If you have >5k followers on any platform (or know anyone who does), (ask them to) DM me the address for a physical copy of If Anyone Builds It, or an email address for a Kindle copy.
So far, sent 13 copies to people with 428k followers in total.
At the beginning of November, I learned about a startup called Red Queen Bio, which automates the development of viruses and related lab equipment. They work together with OpenAI, and OpenAI is their lead investor.
On November 13, they publicly announced their launch:
On November 15, I saw that and made a tweet about it: Automated virus-producing equipment is insane. Especially if OpenAI, of all companies, has access to it. (The tweet got 1.8k likes and 497k views.)
In the tweet, I said that there is, potentially, literally a startup, funded by and collaborating with OpenAI, with equipment capable of printing arbitrary RNA sequences, potentially including viruses that could infect humans, connected to the internet or managed by AI systems.
I asked whether we trust OpenAI to have access to this kind of equipment, and said that I’m not sure what to hope for here, except government intervention.
The only inaccuracy that was pointed out to me was that I mentioned that they were working on phages, and they denied working on phages specifically.
At the same time, people close to Red Queen Bio publicly confirmed the equipment they’re automating would be capable of producing viruses (saying that this equipment is a normal thing to have in a bio lab and not too expensive).
A few days later, Hannu Rajaniemi, a Red Queen Bio co-founder and fiction author, responded to me in a quote tweet and in comments:
They did not answer any of the explicitly asked questions, which I repeated several times:
It seems pretty bad that this startup is not being transparent about their equipment and the level of possible automation. It’s unclear whether they’re doing gain-of-function research. It’s unclear what security measures they have or are going to have in place.
I would really prefer for AIs, and for OpenAI (known for prioritizing convenience over security)’s models especially, to not have ready access to equipment that can synthesize viruses or software that can aid virus development.
I’m a little confused about what’s going on since apparently the explicit goal of the company is to defend against biorisk and make sure that biodefense capabilities keep up with AI developments, and when I first saw this thread I was like “I’m not sure of what exactly they’ll do, but better biodefense is definitely something we need so this sounds like good news and I’m glad that Hannu is working on this”.
I do also feel that the risk of rogue AI makes it much more important to invest in biodefense! I’d very much like it if we had the degree of automated defenses that the “rogue AI creates a new pandemic” threat vector was eliminated entirely. Of course there’s the risk of the AI taking over those labs but in the best case we’ll also have deployed more narrow AI to identify and eliminate all cybersecurity vulnerabilities before that.
And I don’t really see a way to defend against biothreats if we don’t do something like this (which isn’t to say one couldn’t exist; I haven’t thought about this extensively, so maybe there is something). The human body wouldn’t survive for very long if it didn’t have an active immune system.
Thanks for sharing, this is extremely important context. I’m way more OK with dual-use threats from a company actively trying to reduce bio risk from AI, who seem to have vaguely reasonable threat models, than from reckless gain-of-function people with insane threat models. It’s much less clear to me how much risk is OK to accept from projects actively doing reasonable things to make the situation better, but it’s clearly non-zero. (I don’t know if this place is actually doing reasonable things, but Mikhail provides no evidence against.)
I think it was pretty misleading for Mikhail not to include this context in the original post.
Uhm, yeah, valid. I guess the issue was illusion of transparency: I mostly copied the original post from my tweet, which was quote-tweeting the announcement, and I didn’t particularly think about adding more context because I had it cached that the tweet was fine (I checked with people closely familiar with RQB before tweeting, and it did include all of the context by virtue of quote-tweeting the original announcement). When posting to LW, I didn’t realize I wasn’t directly including all of the context that was in the tweet for people who don’t click on the link.
Added the context to the original post.
Separately, I think an issue is that they’re incredibly non-transparent about what they’re doing and have been somewhat misleading in their responses to my tweets and not answering any of the questions.
Like, I can see a case for doing gain-of-function research responsibly to develop protection against threats (vaccines, proteins that would bind for viruses, etc.), but this should include incredible transparency, strong security (BSL & computer security & strong guardrails around what exactly AI models have automated access to), etc.
Thanks for adding the context!
I can’t really fault them for not answering or being fully honest; from their perspective, you’re a random dude who’s attacking them publicly and trying to get them lots of bad PR. I think it’s often very reasonable to just not engage in situations like that. Though I would judge them for outright lying.
That’s somewhat reasonable. (They did engage though: made a number of comments and quote-tweeted my tweet, without addressing at all the main questions.)
Sure, but there’s a big difference between engaging in PR damage control mode and actually seriously engaging. I don’t take them choosing the former as significant evidence of wrongdoing.
Agree; I’d also like to emphasize this part:
Based on this, they didn’t need to set up a new company. They already had an existing biotech company that was focused on its own research, when they realized that “oh fuck, based on our current research things could get really bad unless someone does something”… and then they went Heroic Responsibility and spun out a whole new company to do something, rather than just pretending that no dangers existed or making vague noises and asking for government intervention or something.
It feels like being hostile toward them is a bit Copenhagen Ethics, in that if they hadn’t tried to do the right thing, it’s possible that nobody would have heard about this and things would have been much easier for them. But since they were thinking about the consequences of their research and decided to do something about it and said that in public, they’re now getting piled on for not answering every question they’re asked on X. (And if I were them, I might also have concluded that the other side is so hostile that every answer might be interpreted in the worst possible light and that it’s better not to engage.)
This seems to fall into the same genre as “that word processor can be used to produce disinformation,” “that image editor can be used to produce ‘CSAM’,” and “the pocket calculator is capable of displaying the number 5318008.”
If a word processor falling into the hands of terrorists could easily generate a memetic virus capable of inducing schizophrenia in hundreds of millions of people, then I believe such concerns are warranted.
“Virus” is doing a lot of work here. It makes a big difference whether they’re capable of making phages or mammalian viruses:
Phages:
Often have a small genome, 3 kbp, easy to synthesize
Can be cultured in E. coli or other bacteria, which are easy to grow
More importantly, E. coli will take up a few kb of naked DNA, so you can just insert the genome directly into them to start the process (you can even do it without living E. coli if you use what’s basically E. coli juice)
I could order and culture one of these easily
Mammalian viruses (as I understand the situation)
Much larger genome, 30 kbp, somewhat harder to synthesize
Have to be cultured in mammalian cell cultures, which are less standard
More importantly, mammalian cells don’t just take up DNA, so you’d have to first package your viral DNA into an existing adenovirus scaffold, or some other large-capacity vector (maybe you could do it with a +ve sense RNA virus and a lipid based vector, but that’s a whole other kettle of fish)
The above might be false because I actually have no idea how to culture a de novo mammalian virus because it’s a much rarer thing to do
If they have the equipment to make phages but not to culture mammalian cells then that’s probably fine. If they’re doing AI-powered GoF research then, well, lmao I guess.
Minor correction on genome sizes:
DNA phage genomes have a median size of ~50kb, whereas RNA phage genomes are more around the 4kb mark.
Similarly, mammalian DNA viruses are usually >100kb, but their RNA viruses are usually <20kb.
Oddly enough the smallest known virus, porcine circovirus, is ssDNA, mammalian, and only 1.7kb
But yes, mammalian viruses are generally more difficult to culture, probably downstream of mammalian cells being more difficult to culture. Phages also typically only inject their genetic material into the cells, which bootstraps itself into a replication factory. Mammalian viruses, which generally instead sneak their way in and deliver the payload, often deliver their genetic material alongside proteins required to start the replication.
I was corrected on this: according to them, they’re not working on phages specifically.
I skimmed your tweet and didn’t see what evidence you used to support your assertions in it.
I didn’t particularly present any publicly available evidence in my tweet. Someone close to Red Queen Bio confirmed that they have the equipment and are automating it here.
Everyone should do more fun stuff![1]
I thought it’d just be very fun to develop a new sense.
Remember vibrating belts and ankle bracelets that made you have a sense of the direction of north? (1, 2)
I made some LLMs make me an iOS app that does this! Except the sense doesn’t go away the moment you stop the app!
I am pretty happy about it! I can tell where north is and have become much better at navigating and relating different parts of the (actual) territory in my map. Previously, I would remember my paths as collections of local movements (there, I turn left); now, I generally know where places are, and Google Maps feels much more connected to the territory.
If you want to try it, it’s on the App Store: https://apps.apple.com/us/app/sonic-compass/id6746952992
It can vibrate when you face north; even better, if you’re in headphones, it can give you spatial sounds coming from north; better still, a second before playing a sound coming from north, it can play a non-directional cue sound to make you anticipate the north sound and learn very quickly.
None of this interferes with listening to any other kind of audio.
It’s all probably less relevant to the US, as your roads are in a grid anyway; great for London though.
If you know how to make it have more pleasant sounds, or optimize directional sounds (make realistic binaural audio), and want to help, please do! The source code is on GitHub: https://github.com/mihonarium/sonic-compass/
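If you’re curious about the core trigger logic, here’s a rough illustrative sketch in Python with simulated hardware and made-up names (the real app is an iOS app; the code in the repo above is what actually runs): watch the compass heading, buzz when the user is roughly facing north, and periodically play a neutral cue followed by a sound spatialised to come from north.

```python
import random
import time

# Rough illustrative sketch with simulated hardware; all names are made up and
# this is not the app's actual code.

NORTH_TOLERANCE_DEG = 10   # how close to north counts as "facing north"
SOUND_INTERVAL_S = 30      # how often to play the spatialised north sound
CUE_LEAD_TIME_S = 1.0      # neutral cue plays this long before the north sound

class FakeCompass:
    """Stand-in for the phone's heading sensor (degrees clockwise from north)."""
    def heading(self):
        return random.uniform(0, 360)

def angle_to_north(heading_deg):
    """Smallest angular distance between the current heading and north."""
    h = heading_deg % 360
    return min(h, 360 - h)

def run(compass, vibrate, play_cue, play_spatial, steps=100):
    last_sound = float("-inf")
    for _ in range(steps):
        heading = compass.heading()
        # Haptic mode: buzz whenever the user is roughly facing north.
        if angle_to_north(heading) < NORTH_TOLERANCE_DEG:
            vibrate()
        # Audio mode: periodically play a non-directional cue, then a sound
        # spatialised to come from north relative to the current heading,
        # so the listener learns to anticipate and associate the direction.
        now = time.monotonic()
        if now - last_sound > SOUND_INTERVAL_S:
            play_cue()
            time.sleep(CUE_LEAD_TIME_S)
            play_spatial(direction_deg=-heading)  # north, relative to the user
            last_sound = time.monotonic()
        time.sleep(0.1)

if __name__ == "__main__":
    run(FakeCompass(),
        vibrate=lambda: print("bzzt"),
        play_cue=lambda: print("cue"),
        play_spatial=lambda direction_deg: print(f"north at {direction_deg:.0f} deg"))
```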
unless it would take too much time, especially given the short timelines
This is really cool! My ADHD makes me rather place-blind, if I’m not intentionally forcing myself to pay attention to a route and my surroundings, I can get lost or disoriented quite easily. I took the same bus route to school for a decade, and I can’t trace the path, I only remember a sequence of stops. Hopefully someone makes an Android version, I’d definitely check it out.
Trying it out now, this is pretty fun! I think I’d use it more if it had an Apple Watch version that I could keep constantly running.
i made a thing!
it is a chatbot with 200k tokens of context about AI safety. it is surprisingly good (better than you expect current LLMs to be) at answering questions and counterarguments about AI safety. A third of its dialogues contain genuinely great and valid arguments.
You can try the chatbot at https://whycare.aisgf.us (ignore the interface; it hasn’t been optimized yet). Please ask it some hard questions! Especially if you’re not convinced of AI x-risk yourself, or can repeat the kinds of questions others ask you.
Send feedback to ms@contact.ms.
A couple of examples of conversations with users:
Confused about the disagreements. Is it because of the AI output or just the general idea of an AI risk chatbot?
how does your tool compare to stampy or just, say, asking these questions without the 200k tokens?
It’s better than stampy (try asking both some interesting questions!). Stampy is cheaper to run though.
I wasn’t able to get LLMs to produce valid arguments or answer questions correctly without the context, though that could be scaffolding/skill issue on my part.
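To make concrete what “200k tokens of context” can mean mechanically (as opposed to Stampy-style retrieval), here’s a minimal sketch of the context-stuffing approach. It assumes the Anthropic Python SDK; the file name, model ID, and wiring are placeholders, not necessarily how the actual bot is built.

```python
import anthropic

# Minimal sketch of the context-stuffing approach (as opposed to RAG): the
# whole curated AI-safety document is sent as the system prompt on every
# request. "context.md" and the model name are placeholders; this is not
# necessarily how the actual chatbot is wired up.

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("context.md") as f:
    SAFETY_CONTEXT = f.read()  # the curated ~200k-token context

def answer(question: str, history: list[dict] | None = None) -> str:
    messages = (history or []) + [{"role": "user", "content": question}]
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=1024,
        system=SAFETY_CONTEXT,             # full context, no retrieval step
        messages=messages,
    )
    return response.content[0].text

print(answer("Why would an AI want anything at all?"))
```

A fixed prefix this large is most of why it costs more to run than a retrieval-based setup like Stampy.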
Another example:
Good job trying and putting this out there. Hope you iterate on it a lot and make it better.
Personally, I utterly despise this current writing style. Maybe you can look at the Void bot on Bluesky, which is based on Gemini pro—it’s one of the rare bots I’ve seen whose writing is actually ok.
Thanks, but, uhm, try to not specify “your mom” as the background and “what the actual fuck is ai alignment” as your question if you want it to have a writing style that’s not full of “we’re toast”
Maybe the option of not specifying the writing style at all, for impatient people like me?
Unless you see this as more something to be used by advocacy/comms groups to make materials for explaining things to different groups, which makes sense.
If the general public is really the target, then adding some kind of voice mode seems like it would reduce latency a lot
This specific page is not really optimized for any use by anyone whatsoever; there are maybe five bugs, each solvable with one query to Claude, and all not a priority; the cool thing I want people to look at is the chatbot (when you give it some plausible context)!
(Also, non-personalized intros to why you should care about ai safety are still better done by people.)
I really wouldn’t want to give a random member of the US general public a thing that advocates for AI risk while having a gender drop-down like that.[1]
The kinds of interfaces it would have if we get to scale it[2] would be very dependent on where specific people are coming from. I.e., demographic info can be pre-filled and not necessarily displayed if it’s from ads; or maybe we ask one person we’re talking to to share it with two other people, and generate unique links with pre-filled info that was provided by the first person; etc.
Voice mode would have a huge latency due to the 200k token context and thinking prior to responding.
Non-binary people are people, but the dropdown creates unnecessary negative halo effect for a significant portion of the general public.
Also, dropdowns = unnecessary clicks = bad.
which I really want to! someone please give us the budget and volunteers!
at the moment, we have only me working full-time (for free), $10k from SFF, and ~$15k from EAs who considered this to be the most effective nonprofit in this field.
reach out if you want to donate your time or money. (donations are tax-deductible in the us.)
Is the 200k context itself available to use anywhere? How different is it from the Stampy.ai dataset? Nw if you don’t know due to not knowing what exactly stampy’s dataset is.
I get questions a lot from regular ML researchers on what exactly alignment is, and I wish I had an actually good thing to send them. Currently I either give a definition myself or send them to the Alignment Forum.
Nope, I’m somewhat concerned about unethical uses (eg talking to a lot of people without disclosing it’s ai), so won’t publicly share the context.
If the chatbot answers questions well enough, we could in principle embed it into whatever you want if that seems useful. Currently have a couple of requests like that. DM me somewhere?
Stampy uses RAG & is worse.
this deserves way more attention.
a big problem with AI safety advocacy is that we aren’t reaching enough people fast enough. this problem doesn’t have the same familiarity amongst the public as climate change or even factory farming, and we don’t have people running around in the streets preaching about the upcoming AI apocalypse; most lesswrongers can’t even come up with a quick 5-minute sales pitch for lay people even if their lives literally depended on it.
this might just be the best advocacy tool i have seen so far, if only we can get it to go viral it might just make the difference.
edit:
i take this part back
i have seen some really bad attempts at explaining AI-x risk in laymen terms and just assumed it was the norm, most of which were from older posts.
now looking at newer posts i think the situation has greatly improved, not ideal but way better than i thought.
i still think this tool would be a great way to reach the wider public especially if it incorporates a better citation function so people can check the source material (it does sort of point the user to other websites but not technical papers).
Thanks! I think we’re close to a point where I’d want to put this in front of a lot of people, though we don’t have the budget for this (which seems ridiculous, given the stats we have for our ads results etc.), and also haven’t yet optimized the interface (as in, half the US public won’t like the gender dropdown).
Also, it’s much better at conversations than at producing 5min elevator pitches. (Hard to make it good at being where the user is while getting to a point instead of being very sycophantic).
The end goal is to be able to explain the current situation to people at scale.
Question: does LessWrong have any policies/procedures around accessing user data (e.g., private messages)? E.g., if someone from Lightcone Infrastructure wanted to look at my private DMs or post drafts, would they be able to without approval from others at Lightcone/changes to the codebase?
Expanding on Ruby’s comment with some more detail, after talking to some other Lightcone team members:
Those of us with access to database credentials (which is all the core team members, in theory) would be physically able to run those queries without getting sign-off from another Lightcone team member. We don’t look at the contents of users’ DMs without their permission unless we get complaints about spam or harassment, and in those cases we also try to take care to only look at the minimum information necessary to determine whether the complaint is valid; this has happened extremely rarely[1]. Similarly, we don’t read the contents or titles of users’ never-published[2] drafts. We also don’t look at users’ votes except when conducting investigations into suspected voting misbehavior like targeted downvoting or brigading, and when we do we’re careful to only look at the minimum amount of information necessary to render a judgment, and we try to minimize the number of moderators who conduct any given investigation.
I don’t recall ever having done it, Habryka remembers having done it once.
We do see drafts that were previously published and then redrafted in certain moderation views. Some users will post something that gets downvoted and then redraft it; we consider this reasonable because other users will have seen the post and it could easily have been archived by e.g. archive.org in the meantime.
I occasionally incidentally see drafts by following our automated error-logging to the page where the error occurred, which could be the edit-post page, and in those cases I have looked enough to check things like whether it contains embeds, whether collaborative editing is turned on, etc. In those cases I try not to read the actual content. I don’t think I’ve ever stumbled onto a draft dramapost this way, but if I did I would treat it as confidential until it was published. (I wouldn’t do this with a DM.)
Is there an immutable (or at least “not mutable by the person accessing the database”) access log which will show which queries were run by which users who have database credentials? If there is, I suspect that mentioning that will alleviate many concerns.
No. It turns out after a bit of digging that this might be technically possible even though we’re a ~7-person team, but it’d still be additional overhead and I’m not sure I buy that the concerns it’d be alleviating are that reasonable[1].
Not a confident claim. I personally wouldn’t be that reassured by the mere existence of such a log in this case, compared to my baseline level of trust in the other admins, but obviously my epistemic state is different from that of someone who doesn’t work on the site. Still, I claim that it would not substantially reduce the (annualized) likelihood of an admin illicitly looking at someone’s drafts/DMs/votes; take that as you will. I’d be much more reassured (in terms of relative risk reduction, not absolute) by the actual inability of admins to run such queries without a second admin’s thumbs-up, but that would impose an enormous burden on our ability to do our jobs day-to-day without a pretty impractical level of investment in new tooling (after which I expect the burden would merely be “very large”).
I think it would be feasible to increase the friction on improper access, but it’s basically impossible to do in a way that’s loophole-free. The set of people with database credentials is almost identical to the set of people who do development on the site’s software. So we wouldn’t be capturing a log of only queries typed in manually; we’d mostly be capturing queries run by their modified locally-running webservers, typically connected to a database populated with a mirror snapshot of the prod DB but occasionally connected to the actual prod DB.
Thanks for the response; my personal concerns[1] would be somewhat alleviated, without any technical changes, by:
Lightcone Infrastructure explicitly promising not to look at private messages unless a counterparty agrees to that (e.g., because a counterparty reports spam);
Everyone with such access explicitly promising to tell others at Lightcone Infrastructure when they access any private content (DMs, drafts).
Talking to a friend about an incident made me lose trust in LW’s privacy unless it explicitly promises that privacy.
Second one seems reasonable.
Clarifying the first case: if Bob signs up and DMs 20 users, and one reports spam, are you saying that we can only check the reported DM, or that at that point we can also check a few of the others (if we wish to)?
TBH the main thing that helps with in practice is that it forces teams to get off the “emailed spreadsheet of shared passwords” model of access management. Which mainly becomes useful if someone is leaving the team in a hurry under less than ideal circumstances.
“That problem is not on the urgent/important pareto frontier” is absolutely a valid answer though, especially since AFAIK LW doesn’t store any data more sensitive than passwords / a few home addresses.
We have policies to not look at user data. Vote data and DM data are the most sacred, though we will look at votes if the patterns suggest fraudulent behavior (e.g. mass downvoting of a person). We tend to inform/consult others on this, but no, there’s nothing technical blocking someone from accessing the data on their own.
Specifically, this is the privacy policy inherited from when LessWrong was a MIRI project; to the best of my knowledge, it hasn’t been updated.
I don’t think I’ve seen the team here post a law enforcement canary annually or anything.
I don’t think we currently have one. As far as I know, LessWrong hasn’t had any requests made of it by law enforcement that would trip a warrant canary while I’ve been working here (since July 5th, 2022). I have no information about before then. I’m not sure this is at the top of our priority list; we’d need to stand up some new infrastructure for it to be more helpful than harmful (e.g. harmful because we forgot to update it, or something).
I want to make a thing that talks about why people shouldn’t work at Anthropic on capabilities and all the evidence that points in the direction of them being a bad actor in the space, bound by employees whom they have to deceive.
A very early version of what it might look like: https://anthropic.ml
Help needed! Email me (or DM on Signal) ms@contact.ms (@misha.09)
If your theory of change is convincing Anthropic employees or prospective Anthropic employees they should do something else, I think your current approach isn’t going to work. I think you’d probably need to much more seriously engage with people who think that Anthropic is net-positive and argue against their perspective.
Possibly, you should just try to have less of a thesis and just document bad things you think Anthropic has done and ways that Anthropic/Anthropic leadership has misled employees (to appease them). This might make your output more useful in practice.
I think it’s relatively common for people I encounter to think both:
Anthropic leadership is engaged in somewhat scummy appeasement of safety-motivated employees in ways that are misleading or based on kinda obviously motivated reasoning. (Which results in safety-motivated employees having a misleading picture of what the organization is doing and why, and of what people expect to happen.)
Anthropic is strongly net positive despite this and working on capabilities there is among the best things you can do.
An underlying part of this view is typically that moderate improvements in effort spent on prosaic safety measures substantially reduce risk. I think you probably strongly disagree with this, and this might be a major crux.
Personally, I agree with what Zach said. I think working on capabilities[1] at Anthropic is probably somewhat net positive, but would only be the best thing to work on if you had a very strong comparative advantage relative to all the other useful stuff (e.g. safety research). So probably most altruistic people with views similar to mine should do something else. I currently don’t feel very confident that capabilities at Anthropic is net positive, and could imagine swinging towards thinking it is net negative based on additional evidence.
Putting aside strongly differential specific capabilities work.
fwiw I agree with most but not all details, and I agree that Anthropic’s commitments and policy advocacy have a bad track record, but I think that Anthropic capabilities is nevertheless net positive, because Anthropic has way more capacity and propensity to do safety stuff than other frontier AI companies.
I wonder what you believe about Anthropic’s likelihood of noticing risks from misalignment relative to other companies, or of someday spending >25% of internal compute on (automated) safety work.
If people work for Anthropic because they’re misled about the nature of the company, I don’t think arguments on whether they’re net-positive have any local relevance.
Still, to reply: They are one of the companies in the race to kill everyone.
Spending compute on automated safety work does not help. If the system you’re running is superhuman, it kills you instead of doing your alignment homework; if it’s not superhuman, it can’t solve your alignment homework.
Anthropic is doing some great research; but as a company at the frontier, their main contribution could’ve been making sure that no one builds ASI until it’s safe; that there’s legislation that stops the race to the bottom; that the governments understand the problem and want to regulate; that the public is informed of what’s going on and what legislation proposes.
Instead, Anthropic argues against regulation in private, lies about legislation in public, misleads its employees about its role in various things.
***
If Anthropic had to not stay at the frontier to be able to spend >25% of their compute on safety, do you expect they would?
Do you really have a coherent picture of the company in mind, where it is doing all the things it’s doing now (such as not taking steps that would slow down everyone), and yet would behave responsibly when it matters most and the pressure not to is at its highest?
I recall a video circulating that showed Dario had changed his position on racing with China, which feels relevant here. People can of course change their minds, but I still dislike it.
Do you think the world would be a better place if Anthropic didn’t exist?
It’s probably better in the short-term while also making the short-term shorter (which is way worse).
Horizon Institute for Public Service is not x-risk-pilled
Someone saw my comment and reached out to say it would be useful for me to make a quick take/post highlighting this: many people in the space have not yet realized that Horizon people are not x-risk-pilled.
Edit: some people reached out to me to say that they’ve had different experiences (with a minority of Horizon people).
My sense is Horizon is intentionally a mixture of people who care about x-risk and people who broadly care about “tech policy going well”. IMO both are laudable goals.
My guess is Horizon Institute has other issues that make me not super excited about it, but I think this one is a reasonable call.
Importantly, AFAICT some Horizon fellows are actively working against x-risk (pulling the rope backwards, not sideways). So Horizon’s sign of impact is unclear to me. For a lot of people, “tech policy going well” means “regulations that don’t impede tech companies’ growth”.
My two cents: people often rely too much on whether someone is “x-risk-pilled” and not enough on evaluating their actual beliefs/skills/knowledge/competence. For example, a lot of people could pass some sort of “I care about existential risks from AI” test without necessarily making it a priority or having particularly thoughtful views on how to reduce such risks.
Here are some other frames:
Suppose a Senator said “Alice, what are some things I need to know about AI or AI policy?” How would Alice respond?
Suppose a staffer said “Hey Alice, I have some questions about [AI2027, superintelligence strategy, some Bengio talk, pick your favorite reading/resource here].” Would Alice be able to have a coherent back-and-forth with the staffer for 15+ mins that goes beyond a surface level discussion?
Suppose a Senator said “Alice, you have free rein to work on anything you want in the technology portfolio—what do you want to work on?” How would Alice respond?
In my opinion, potential funders/supporters of AI policy organizations should be asking these kinds of questions. I don’t mean to suggest it’s never useful to directly assess how much someone “cares” about XYZ risks, but I do think that on-the-margin people tend to overrate that indicator and underrate other indicators.
Relatedly, I think people often do some sort of “is this person an EA” or “is this person an x-risk person” check, and I would generally encourage people to use this sort of thinking less. It feels like AI policy discussions are getting sophisticated enough that we can actually Have Nuanced Conversations and evaluate people less on some sort of “do you play for the Right Team” axis and more on “what is your specific constellation of beliefs/skills/priorities/proposals” dimensions.
I would otherwise agree with you, but I think the AI alignment ecosystem has been burnt many times in the past by giving a bunch of money to people who said they cared about safety without asking enough questions about whether they actually believed “AI may kill everyone, and that is at or near the top of my priorities”.
I’m not sure if we disagree— I think there are better ways to assess this than the way the “is this an xrisk person or not” tribal card often gets applied.
Example: “Among all the topics in AI policy and concerns around AI, what are your biggest priorities?” is a good question IMO.
Counterexample: “Do you think existential risk from advanced AI is important?” is a bad question IMO (especially in isolation).
It is very easy for people to say they care about “AI safety” without giving much indication of where it stands on their priority list, what sorts of ideas/plans they want to aim for, what threat models they are concerned about, if they are the kind of person who can have a 20+ min conversation about interesting readings or topics in the field, etc.
I suspect that people would get “burnt” less if they asked these kinds of questions instead of defaulting to some sort of “does this person care about safety” frame or “is this person Part of My Tribe” thing.
(On that latter point, it is rather often that I hear people say things like “Alice is amazing!” and then when I ask them about Alice’s beliefs or work they say something like “Oh I don’t know much about Alice’s work— I just know other people say Alice is amazing!”. I think it would be better for people to say “I think Alice is well-liked but I personally do not know much about her work or what kinds of things she believes/prioritizes.”)
This seems like the opposite of a disagreement to me? Am I missing something?
Well Orpheus apparently agrees with me, so you probably understood the original comment better than I did!
What leads you to believe this?
FWIW this is also my impression but I’m going off weak evidence (I wrote about my evidence here), and Horizon is pretty opaque so it’s hard to tell. A couple weeks ago I tried reaching out to them to talk about it but they haven’t responded.
Datapoint: I spoke to one Horizon fellow a couple of years ago and they did not care about x-risk.
Talking to many people.
As in, Horizon fellows / people who work at Horizon?
Some of those; and some people who talk to those.
I want to signal-boost this LW post.
I long wondered why OpenPhil made so many obvious mistakes in the policy space. That level of incompetence just did not make any sense.
I did not expect this to be the explanation:
THEY SIMPLY DID NOT HAVE ANYONE WITH ANY POLITICAL EXPERIENCE ON THE TEAM until hiring one person in April 2025.
This is, like, insane. Not what I’d expect at all from any org that attempts to be competent.
(openphil, can you please hire some cracked lobbyists to help you evaluate grants? This is, like, not quite an instance of Graham’s Design Paradox, because instead of trying to evaluate grants you know nothing about, you can actually hire people with credentials you can evaluate, who’d then evaluate the grants. thank you <3)
To be clear, I don’t think this is an accurate assessment of what is going on. If anything, people with more “political experience” seemed to me, on the margin, to mess up more.
In general, takes of the kind “oh, just hire someone with expertise in this” almost never make sense IMO. First of all, identifying actual real expertise is hard. Second, general competence and intelligence are a better predictor of task performance in almost all domains after even just a relatively short acclimation period, and OpenPhil people far exceed the bar there. Third, the standard practices in many industries are insane, and most of the time if you hire someone specifically for their expertise in a domain, not just as an advisor but as an active team member, they will push for adopting those standard practices even when they don’t make sense.
I don’t think Mikhail’s saying that hiring an expert is sufficient. I think he’s saying that hiring an expert, in a very high-context and unnatural/counter-intuitive field like American politics, is necessary, or that you shouldn’t expect success trying to re-derive all of politics in a vacuum from first principles. (I’m sure OpenPhil was doing the smarter version of this thing, where they had actual DC contacts they were in touch with, but that they still should have expected this to be insufficient.)
Often the dumb versions of ways of dealing with the political sphere (advocated by people with some experience) just don’t make any sense at all, because they’re directional heuristics that emphasize their most counterintuitive elements. But, in talking to people with decades of experience and getting the whole picture, the things they say actually do make sense, and I can see how the random interns or whatever got their dumb takes (by removing the obvious parts from the good takes, presenting only the non-obvious parts, and then over-indexing on them).
I big agree with Habryka here in the general case and am routinely disappointed by input from ‘experts’; I think politics is just a very unique space with a bunch of local historical contingencies that make navigation without very well-calibrated guidance especially treacherous. In some sense it’s more like navigating a social environment (where it’s useful to have a dossier on everyone in the environment, provided by someone you trust) than it is like navigating a scientific inquiry (where it’s often comparatively cheap to relearn or confirm something yourself rather than deferring).
I mean, it’s not like OpenPhil hasn’t been interfacing with a ton of extremely successful people in politics. For example, OpenPhil approximately co-founded CSET, and talks a ton with people at RAND, and has done like 5 bajillion other projects in DC and works closely with tons of people with policy experience.
The thing that Jason is arguing for here is “OpenPhil needs to hire people with lots of policy experience into their core teams”, but man, that’s just such an incredibly high bar. The relevant teams at OpenPhil are like 10 people in-total. You need to select on so many things. This is like saying that Lightcone “DOESN’T HAVE ANYONE WITH ARCHITECT OR CONSTRUCTION OR ZONING EXPERIENCE DESPITE RUNNING A LARGE REAL ESTATE PROJECT WITH LIGHTHAVEN”. Like yeah, I do have to hire a bunch of people with expertise on that, but it’s really very blatantly obvious from where I am that trying to hire someone like that onto my core teams would be hugely disruptive to the organization.
It seems really clear to me that OpenPhil has lots of contact with people who have lots of policy experience, frequently consults with them on stuff, and that the people working there full-time seem reasonably selected to me. The only way I see the things Jason is arguing for working out is if OpenPhil were to much more drastically speed up their hiring, but hiring quickly is almost always a mistake.
Part of the distinction I try to draw in my sequence is that the median person at CSET or RAND is not “in politics” at all. They’re mostly researchers at think tanks, writing academic-style papers about what kinds of policies would be theoretically good for someone to adopt. Their work is somewhat more applied/concrete than the work of, e.g., a median political science professor at a state university, but not by a wide margin.
If you want political experts—and you should—you have to go talk to people who have worked on political campaigns, served in the government, or led advocacy organizations whose mission is to convince specific politicians to do specific things. This is not the same thing as a policy expert.
For what it’s worth, I do think OpenPhil and other large EA grantmakers should be hiring many more people. Hiring any one person too quickly is usually a mistake, but making sure that you have several job openings posted at any given time (each of which you vet carefully) is not.
I agree that this is the same type of thing as the construction example for Lighthaven, but I also think that you did leave some value on the table there in certain ways (e.g. commercial-grade furniture vs consumer-grade furniture), and I think policy knowledge should make up a larger share of the domain-specific knowledge at Open Phil than hospitality/construction knowledge should at Lightcone.
I hear you as saying ‘experts aren’t all that expert’ * ‘hiring is hard’ + ‘OpenPhil does actually have access to quite a few experts when they need them’ = ‘OpenPhil’s strategy here is very reasonable.’
I agree in principle here but think that, on the margin, it just is way more valuable to have the skills in-house than to have external people giving you advice (so that they have both sides of the context, so that you can make demands of them rather than requests, so that they’re filtered for a pretty high degree of value alignment, etc.). This is why Anthropic and OAI have policy teams staffed with former federal government officials. It just doesn’t get much more effective than that.
I don’t share Mikhail’s bolded-all-caps-shock at the state of things; I just don’t think the effects you’re reporting, while elucidatory, are a knockdown defense of OpenPhil being (seemingly) slow to hire for a vital role. But running orgs is hard and I wouldn’t shackle someone to a chair to demand an explanation.
Separately, a lot of people defer to some discursive thing like ‘The OP Worldview’ when defending or explicating their positions, and I can’t for the life of me hammer out who the keeper of that view is. It certainly seems like a knock against this particular kind of appeal when their access to policy experts is on-par with e.g. MIRI and Lightcone (informal connections and advisors), rather than the ultra-professional, ultra-informed thing it’s often floated as being. OP employees have said furtive things like ‘you wouldn’t believe who my boss is talking to’ and, similarly, they wouldn’t believe who my boss is talking to. That’s hardly the level of access to experts you’d want from a central decision-making hub aiming to address an extinction-level threat!
To be clear, I was a lot more surprised when I was told about some of what OpenPhil did in DC, at one point starting to facepalm really hard after two sentences and continuing to facepalm very hard for most of a ten-minute-long story. It was so obviously dumb that even I, with basically zero exposure to American politics or local DC norms and only some tangential experience running political campaigns in a very different context (an authoritarian country), immediately recognized it as obviously very stupid. While listening, I couldn’t think of better explanations than stuff like “maybe Dustin wanted x and OpenPhil didn’t have a way to push back on it”. But not having anyone on the team who could point out how this would be very, very stupid is a perfect explanation for the previous cringe over their actions; and it’s also incredibly incompetent, on a level I did not expect.
As Jason correctly noted, it’s not about “policy”. This is very different from writing papers and figuring out what a good policy should be. It is about advocacy: getting a small number of relevant people to make decisions that lead to the implementation of your preferred policies. OpenPhil’s goals are not papers; and some of the moves they’ve made, which impact their utility more than any of the papers they’ve funded, are ridiculously bad.
A smart enough person could figure it out from first principles, with no experience, or by looking at stuff like how climate change became polarized; but for most people, it’s a set of intuitions, skills, and knowledge that are very separate from those that make you a good evaluator of research grants.
It is absolutely obvious to me that someone experienced in advocacy should get to give feedback on a lot of decisions that you plan to make, including because some of them can have strategic implications you didn’t think about.
Instead, OpenPhil are a bunch of individuals who apparently often don’t know the right questions to ask even despite their employer’s magic of everyone wanting to answer their questions.
(I disagree with Jason on how transparent grant evaluations ought to be; if you’re bottlenecked by time, it seems fine to make handwavy bets. You just need people who are good at making bets. The issue is that they’re not selected for making good bets in politics, and so they fuck up; the issue is not with the general idea of having people who make bets.)
I’m the author of the LW post being signal-boosted. I sincerely appreciate Oliver’s engagement with these critiques, and I also firmly disagree with his blanket dismissal of the value of “standard practices.”
As I argue in the 7th post in the linked sequence, I think OpenPhil and others are leaving serious value on the table by not adopting some of the standard grant evaluation practices used at other philanthropies, and I don’t think they can reasonably claim to have considered and rejected them—instead the evidence strongly suggests that they’re (a) mostly unaware of these practices due to not having brought in enough people with mainstream expertise, and (b) quickly deciding that anything that seems unfamiliar or uncomfortable “doesn’t make sense” and can therefore be safely ignored.
We have a lot of very smart people in the movement, as Oliver correctly points out, and general intelligence can get you pretty far in life, but Washington, DC is an intensely competitive environment that’s full of other very smart people. If you try to compete here with your wits alone while not understanding how politics works, you’re almost certainly going to lose.
Can you say more about this? I’m aware of the research on g predicting performance on many domains, but the quoted claim is much stronger than the claims I can recall reading.
random thought, not related to GP comment: i agree identifying expertise in a domain you don’t know is really hard, but from my experience, identifying generalizable intelligence/agency/competence is less hard. generally it seems like a useful signal to see how fast they can understand and be effective at a new thing that’s related to what they’ve done before but that they’ve not thought much specifically about before. this isn’t perfectly correlated with competence at their primary field, but it’s probably still very useful.
e.g. it’s generally pretty obvious if someone is flailing on an ML/CS interview Q because they aren’t very smart, or just not familiar with the tooling. people who are smart will very quickly and systematically figure out how to use the tooling, and people who aren’t will get stuck and sit there being confused. I bet if you took e.g. a really smart mathematician with no CS experience and dropped them in a CS interview, it would be very fascinating to watch them figure out things from scratch
disclaimer that my impressions here are not necessarily strictly tied to feedback from reality on e.g job performance (i can see whether people pass the rest of the interview after making a guess at the 10 minute mark, but it’s not like i follow up with managers a year after they get hired to see how well they’re doing)
PSA: if you’re looking for a name for your project, most interesting .ml domains are probably available for $10, because the mainstream registrars don’t support the TLD.
I bought over 170 .ml domains, including anthropic.ml (redirects to the Fooming Shoggoths song), closed.ml & evil.ml (redirect to OpenAI Files), interpretability.ml, lens.ml, evals.ml, and many others (I’m happy to donate them to AI safety projects).
Since this seems to be a crux, I propose a bet to @Zac Hatfield-Dodds (or anyone else at Anthropic): someone shows random people in San Francisco Anthropic’s letter to Newsom on SB-1047. I would bet that among the first 20 who fully read at least one page, over half will say that Anthropic’s response to SB-1047 is closer to presenting the bill as 51% good and 49% bad than presenting it as 95% good and 5% bad.
Zac, at what odds would you take the bet?
(I would be happy to discuss the details.)
Sorry, I’m not sure what proposition this would be a crux for?
More generally, “what fraction good vs bad” seems to me a very strange way to summarize Anthropic’s Support if Amended letter or letter to Governor Newsom. It seems clear to me that both are supportive in principle of new regulation to manage emerging risks, and offering Anthropic’s perspective on how best to achieve that goal. I expect most people who carefully read either letter would agree with the preceding sentence and would be open to bets on such a proposition.
Personally, I’m also concerned about the downside risks discussed in these letters—because I expect they both would have imposed very real costs, and reduced the odds of the bill passing and of similar regulations passing and enduring in other jurisdictions. I nonetheless concluded that the core of the bill was sufficiently important and urgent, and the downsides manageable, that I supported passing it.
I refer to the second letter.
I claim that a responsible frontier AI company would’ve behaved very differently from Anthropic. In particular, the letter said basically “we don’t think the bill is that good and don’t really think it should be passed” more than it said “please sign”. This is very different from your personal support for the bill; you indeed communicated “please sign”.
Sam Altman has also been “supportive of new regulation in principle”. These words sadly don’t align with either OpenAI’s or Anthropic’s lobbying efforts, which have been fairly similar. The question is, was Anthropic supportive of SB-1047 specifically? I expect people to not agree Anthropic was after reading the second letter.
I strongly disagree that OpenAI’s and Anthropic’s efforts were similar (maybe there’s a bet there?). OpenAI formally opposed the bill without offering useful feedback; Anthropic offered consistent feedback to improve the bill, pledged to support it if amended, and despite your description of the second letter Senator Wiener describes himself as having Anthropic’s support.
I also disagree that a responsible company would have behaved differently. You say “The question is, was Anthropic supportive of SB-1047 specifically?”—but I think this is the wrong question, implying that lack of support is irresponsible rather than e.g. due to disagreements about the factual question of whether passing the bill in a particular state would be net-helpful for mitigating catastrophic risks. The Support if Amended letter, for example, is very clear:
I don’t expect further discussion to be productive though; much of the additional information I have is nonpublic, and we seem to have different views on what constitutes responsible input into a policy process as well as basic questions like “is Anthropic’s engagement in the SB-1047 process well described as ‘support’ when the letter to Governor Newsom did not have the word ‘support’ in the subject line”. This isn’t actually a crux for me, but I and Senator Wiener seem to agree yes, while you seem to think no.
One thing to highlight, which I only learned recently, is that the norm when submitting letters to the governor on any bill in California is to include: “Support” or “Oppose” in the subject line to clearly state the company’s position.
Anthropic importantly did NOT include “support” in the subject line of the second letter. I don’t know how to read this as anything other than that Anthropic did not support SB-1047.
Good point! That seems right; advocacy groups seem to think staff sorts letters by support/oppose/request for signature/request for veto in the subject line and recommend adding those to the subject line. Examples: 1, 2.
Anthropic has indeed not included any of that in their letter to Gov. Newsom.
(Could you link to the context?)
I noticed I have no clue how different positions of the tongue, the jaw, and the lips lead to different sounds.
So after talking to LLMs and a couple of friends who are into linguistics, I vibecoded https://contact.ms/fun/vowels.
I have no clue how valid any of it is. Would love for someone with a background in physics(/physiology/phonetics?) to fact-check it.
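In case it helps anyone else fact-check the basic source-filter story numerically, here’s a rough sketch I put together (not taken from the linked page; the formant values are textbook-ish averages and the filter design is just my own illustration, so treat every number as an assumption):

```python
# Minimal source-filter sketch: a buzzy glottal source filtered through two
# formant resonances (F1 roughly tracks jaw openness, F2 tongue frontness).
# Formant values are rough averages from the phonetics literature; illustrative only.
import numpy as np
from scipy.signal import butter, lfilter
from scipy.io import wavfile

SR = 16000   # sample rate, Hz
F0 = 120     # pitch of the glottal source, Hz
DUR = 1.0    # seconds

# Approximate (F1, F2) pairs for three vowels (assumed values).
VOWELS = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}

def glottal_source():
    """Impulse train at F0: a crude stand-in for vocal-fold pulses."""
    n = int(SR * DUR)
    src = np.zeros(n)
    src[:: SR // F0] = 1.0
    return src

def formant_filter(signal, freq, bandwidth=100):
    """Band-pass the signal around one formant frequency."""
    low = max(freq - bandwidth, 1) / (SR / 2)
    high = min(freq + bandwidth, SR / 2 - 1) / (SR / 2)
    b, a = butter(2, [low, high], btype="band")
    return lfilter(b, a, signal)

for name, (f1, f2) in VOWELS.items():
    src = glottal_source()
    out = formant_filter(src, f1) + formant_filter(src, f2)
    out = out / np.max(np.abs(out))
    wavfile.write(f"vowel_{name}.wav", SR, (out * 32767).astype(np.int16))
    print(f"/{name}/: F1={f1} Hz, F2={f2} Hz -> vowel_{name}.wav")
```

Played back to back, the three files should sound vaguely like /i/, /a/, /u/; a real model would also need realistic bandwidths, amplitudes, and higher formants, which is part of what I’d like fact-checked.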
Is there a write-up on why the “abundance and growth” cause area is actually a relatively efficient way to spend money (instead of a way for OpenPhil to be(come) friends with everyone who’s into abundance & growth)? (These are good things to work on, but seem many orders of magnitude worse than other ways to spend money.)
You’ve seen the blog post?
Yes, I’ve read their entire post. $14.4 of “social return” per $1 in the US seems incredibly unlikely to be comparable to the best GiveWell interventions or even GiveDirectly.
It isn’t 14.4x, it’s 2,000x they’re aiming for.
Ozzie Gooen asked about this before, here’s (the relevant part of) what Alexander Berger replied:
Someone else asked him to clarify what he meant by the numbers on the housing policy work and also separately asked
to which he replied
(I don’t know anything else about this beyond the exchange above, if you’re interested in litigating this further you can try replying to his last comment maybe)
I mean, I think abundance and growth has much better arguments for improving long-run well-being cost-effectively than reducing global disease burden does. I do think it gets messy because of technological risks, but if you bracket those (which of course is a very risky thing to do), this seems like a reasonable reallocation of funds to me.
I’m very confused about how they’re evaluating cost-effectiveness here. Like, no, spending $200 on vaccines in Africa to save lives seems like a much better deal than spending $200 to cause one more $400k apartment to exist.
Do you mean “they” or “me”? I think the latter is very likely better in the long run! Like, the places where $400k apartments exist have enormous positive externalities and enormous per-capita productivity, which is the central driver of technological growth, which is definitely going to determine long-run disease burden and happiness and population levels. The argument here feels pretty straightforward. We can try to put numbers on it, if you want, but if you accept the basic premise it’s kind of hard for the numbers to come out in favor of vaccines.
In its RSP, Anthropic committed to defining ASL-4 by the time they reach ASL-3.
With Claude 4 released today, they have reached ASL-3. They haven’t yet defined ASL-4.
Turns out, they have quietly walked back on the commitment. The change happened less than two months ago and, to my knowledge, was not announced on LW or other visible places unlike other important changes to the RSP. It’s also not in the changelog on their website; in the description of the relevant update, they say they added a new commitment but don’t mention removing this one.
Anthropic’s behavior is not at all the behavior of a responsible AI company. Trained a new model that reaches ASL-3 before you can define ASL-4? No problem, update the RSP so that you no longer have to, and basically don’t tell anyone. (Did anyone not working for Anthropic know the change happened?)
When their commitments go against their commercial interests, we can’t trust their commitments.
You should not work at Anthropic on AI capabilities.
The Midas Project is a good place to keep track of AI company policy changes. Here is their note on the Anthropic change:
https://www.themidasproject.com/watchtower/anthropic-033125
I don’t think it’s accurate to say that they’ve “reached ASL-3”. In the announcement, they say
And it’s also inaccurate to say that they have “quietly walked back on the commitment.” There was no commitment to define ASL-4 by the time they reach ASL-3 in the updated RSP, or in versions 2.0 (released October last year) and 2.1 (see all past RSPs here). I looked at all mentions of ASL-4 in the latest document, and this comes closest to what they have:
Which is what they did with Opus 4. Now they have indeed not provided a ton of details on what exactly they did to determine that the model has not reached ASL-4 (see report), but the comment suggesting that they “basically [didn’t] tell anyone” feels inaccurate.
According to Anthropic’s chief scientist’s interview with Time today, they “work under the ASL-3 standard”. So they have reached the safety level—they’re working under it—and the commitment would’ve applied[1].
There was a commitment in the RSP prior to Oct last year. They did walk back on this commitment quietly: the fact that they walked back on it was not announced in their posts and wasn’t noticed in the posts of others; only a single LessWrong comment in Oct 2024, from someone not affiliated with Anthropic, mentions it. I think this is very much “quietly walking back” on a commitment.
According to Midas, the commitment was fully removed in 2.1: “Removed commitment to “define ASL-N+1 evaluations by the time we develop ASL-N models””; a pretty hidden (I couldn’t find it!) revision changelog also attributes the decision to not maintain the commitment to 2.1. At the same time, the very public changelog on the RSP page only mentions new commitments and doesn’t mention the decision to “not maintain” this one.
“they’re not sure whether they’ve reached the level of capabilities which requires ASL-3 and decided to work under ASL-3, to be revised if they find out the model only requires ASL-2” could’ve been more accurate, but isn’t fundamentally different IMO. And Anthropic is taking the view that by the time you develop a model which might be ASL-n, the commitments for ASL-n should trigger until you rule that out. It’s not even clear what a different protocol could be, if you want to release a model that might be at ASL-n. Release it anyway and contain it only after you’ve confirmed it’s at ASL-n?
Meta-level comment now that this has been retracted.
Anthropic’s safety testing for Claude 4 is vastly better than DeepMind’s testing of Gemini. When Gemini 2.5 Pro was released, there was no safety testing info, and even the model card that was eventually released is extremely barebones compared to what Anthropic put out.
DeepMind should be embarrassed by this. The upcoming PauseCon protest outside DeepMind’s headquarters in London will focus on this failure.
I directionally agree!
Btw, since this is a call to participate in a PauseAI protest on my shortform, do your colleagues have plans to do anything about my ban from the PauseAI Discord server—like allowing me to contest it (as I was told there was a discussion of making a procedure for) or at least explaining it?
Because it’s lowkey insane!
For everyone else, who might not know: a year ago I, in context, on the PauseAI Discord server, explained my criticism of PauseAI’s dishonesty and, after being asked to, shared proof that Holly publicly lied about our personal communications, including sharing screenshots of our messages. A large part of the thread was then deleted by the mods, because they were against personal messages getting shared, without warning (I would’ve complied if anyone representing the server had asked me to delete something!) and without saving or letting me save any of the removed messages in the thread, including those clearly not related to the screenshots that you decided were violating the server norms. After a discussion of that, the issue seemed settled, and I was asked to maybe run some workshops for PauseAI to improve PauseAI’s comms/proofreading/factchecking. And then, months later, I was banned despite not having interacted with the server at all.
When I reached out after noticing I wasn’t able to join the server, there was a surprising combination of being very friendly and excited to chat, scheduling a call and getting my takes on strategy, looking surprised to find out that I was somehow banned, then talking about having “protocols” for notifying people of bans which somehow didn’t work, mentioning you were discussing creating a way to contest the ban, and saying stuff about the importance of allowing the kind of criticism that I made; and at the same time, zero transparency around the actual reasons for the ban, how it happened, why I wasn’t notified, and then zero updates.
It’s hard to assume that the PauseAI leadership is following deontology.
I reached out to Joep asking for the record; he said “Holly wanted you banned” and that it was a divisive topic in the team.
Uhh yeah, sorry that there hasn’t been a consistent approach. In our defence, I believe yours is the only complex moderation case that PauseAI Global has ever had to deal with so far, and we’ve kinda dropped the ball on figuring out how to handle it.
For context, my take is that you’ve raised some valid points. And also you’ve acted poorly in some parts of this long-running drama. And most importantly, you’ve often acted in a way that seems almost optimised to turn people off. Especially for people not familiar with LessWrong culture, the inferential distance between you and many people is so vast that they really cannot understand you at all. Your behavior pattern-matches to trolling / nuisance attention-seeking in many ways, and I often struggle to communicate to more normie types why I don’t think you’re insane or malicious.
I do sincerely hope to iron this out some time and put in place actual systems for dealing with similar disputes in the future. And I did read over the original post + Google doc a few months ago to try to form my own views more robustly. But this probably won’t be a priority for PauseAI Global in the immediate future. Sorry.
This is false. Our ASL-4 thresholds are clearly specified in the current RSP—see “CBRN-4” and “AI R&D-4”. We evaluated Claude Opus 4 for both of these thresholds prior to release and found that the model was not ASL-4. All of these evaluations are detailed in the Claude 4 system card.
I wrote the article Mikhail referenced and wanted to clarify some things.
The thresholds are specified, but the original commitment says, “We commit to define ASL-4 evaluations before we first train ASL-3 models (i.e. before continuing training beyond when ASL-3 evaluations are triggered). Similarly, we commit to define ASL-5 evaluations before training ASL-4 models, and so forth,” and, regarding ASL-4, “Capabilities and warning sign evaluations defined before training ASL-3 models.”
The latest RSP says this of CBRN-4 Required Safeguards, “We expect this threshold will require the ASL-4 Deployment and Security Standards. We plan to add more information about what those entail in a future update.”
Additionally, AI R&D 4 (confusingly) corresponds to ASL-3 and AI R&D 5 corresponds to ASL-4. This is what the latest RSP says about AI R&D 5 Required Safeguards, “At minimum, the ASL-4 Security Standard (which would protect against model-weight theft by state-level adversaries) is required, although we expect a higher security standard may be required. As with AI R&D-4, we also expect an affirmative case will be required.”
I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.
AFAICT, now that ASL-3 has been implemented, the upcoming AI R&D threshold, AI R&D-4, would not mandate any further security or deployment protections. It only requires ASL-3. However, it would require an affirmative safety case concerning misalignment.
I assume this is what you meant by “further protections” but I just wanted to point this fact out for others, because I do think one might read this comment and expect AI R&D 4 to require ASL-4. It doesn’t.
I am quite worried about misuse when we hit AI R&D 4 (perhaps even more so than I’m worried about misalignment) — and if I understand the policy correctly, there are no further protections against misuse mandated at this point.
Not meaning to imply that Anthropic has dropped ASL-4! Just wanted to call out that this does represent a change from the Sept. 2023 RSP.
Regardless, it seems like Anthropic is walking back its previous promise: “We have decided not to maintain a commitment to define ASL-N+1 evaluations by the time we develop ASL-N models.” The stance that Anthropic takes toward its commitments—things which can be changed later if they see fit—seems to cheapen the term, and makes me skeptical that the policy as a whole will be upheld. If people want to orient to the RSP as a provisional intent to act responsibly, then this seems appropriate. But it should not be mistaken for, nor conflated with, a real promise to do what was said.
Oops. Thank you and apologies.
FYI, I was (and remain to this day) confused by AI R&D 4 being called an “ASL-4” threshold. AFAICT as an outsider, ASL-4 refers to a set of deployment and security standards that are now triggered by dangerous capability thresholds, and confusingly, AI R&D 4 corresponds to the ASL-3 standard.
AI R&D 5, on the other hand, corresponds to ASL-4, but only on the security side (nothing is said about the deployment side, which matters quite a bit given that Anthropic includes internal deployment here and AI R&D 5 will be very tempting to deploy internally)
I’m also confused because the content of both AI R&D 4 and AI R&D 5 is seemingly identical to the content of the nearest upcoming threshold in the October 2024 policy (which I took to be the ASL-3 threshold). A rough sketch of what I think happened:
A rough sketch of my understanding of the current policy:
When I squint hard enough at this for a while, I think I can kind of see the logic: the model likely to trigger the CBRN threshold requiring ASL-3 seems quite close, whereas we might be further from the very-high threshold that was the October AI R&D threshold (now AI R&D 4), so the October AI R&D threshold was just bumped to the next level (and the one after that, since causing dramatic scaling of effective compute is even harder than being an entry-level remote worker… maybe), with some confidence that we were still somewhat far away from it and thus it can be treated effectively as today’s upcoming + to-be-defined (what would have been called n+1) threshold.
I just get lost when we call it an ASL-4 threshold (it’s not, it’s an ASL-3 threshold), and also it mostly makes me sad that these thresholds are so high because I want Anthropic to get some practice reps in implementing the RSP before it’s suddenly hit with an endless supply of fully automated remote workers (plausibly the next threshold, AI R&D 4, requiring nothing more than the deployment + security standards Anthropic already put in place as of today).
I wish today’s AI R&D 4 threshold had been set at what, in the October policy, was called a “checkpoint” on the way to ASL-3: completing 2-8 hour SWE tasks. It looks like we’re about there, and it also looks like we’re about at CBRN-4, and ASL-3 seems like a reasonable set of precautions for both milestones. I do not think ASL-3 will be appropriate when we truly get endless parallelized drop-in Anthropic researchers, even if they have not yet been shown to dramatically increase the rate of effective scaling.
Is there a way to use policy markets to make FDT decisions instead of EDT decisions?
Worked on this with Demski. Video, report.
Any update to the market is (equivalent to) updating on some kind of information. So all you can do is dynamically choose what to update on and what not to update on.* Unfortunately, whenever you choose not to update on something, you are giving up on the asymptotic learning guarantees of policy market setups. So the strategic gains from updatelessness (like not falling into traps) are in a fundamental sense irreconcilable with the learning gains from updatefulness. That doesn’t prevent you from being pretty smart about deciding what exactly to update on… but due to embeddedness problems and the complexity of the world, it seems to be the norm (rather than the exception) that you cannot be sure a priori of what to update on (you just have to make some arbitrary choices).
*For avoidance of doubt, what matters for whether you have updated on X is not “whether you have heard about X”, but rather “whether you let X factor into your decisions”. Or at least, this is the case for a sophisticated enough external observer (assessing whether you’ve updated on X), not necessarily all observers.
I think the first question to think about is how to use them to make CDT decisions. You can create a market about a causal effect if you have control over the decision and you can randomise it to break any correlations with the rest of the world, assuming the fact that you’re going to randomise it doesn’t otherwise affect the outcome (or bettors don’t think it will).
Committing to doing that does render the market useless for choosing policy, but you could randomly decide whether to randomise or to make the decision via whatever the process you actually want to use, and have the market be conditional on the former. You probably don’t want to be randomising your policy decisions too often, but if liquidity wasn’t an issue you could set the probability of randomisation arbitrarily low.
Then FDT… I dunno, seems hard.
Yep!
“If I randomize the pick, and pick A, will I be happy about the result?” “If I randomize the pick, and pick B, will I be happy about the result?”
Randomizing 1% of the time and adding a large liquidity subsidy works to produce CDT.
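For concreteness, here’s a toy Monte Carlo sketch of why scoring only the randomized rounds works (the world model and every number are made up purely for illustration, not part of the original proposal):

```python
# Toy illustration: scoring a decision market only on randomized rounds
# recovers the causal effect of the action, even when the usual decision
# process is confounded with the outcome.
import random

random.seed(0)
EPSILON = 0.01   # probability of randomizing the decision this round

def outcome(action, mood):
    # True causal structure: action "A" helps a bit; good mood helps a lot.
    p = 0.3 + (0.2 if action == "A" else 0.0) + (0.4 if mood == "good" else 0.0)
    return random.random() < p

naive = {"A": [0, 0], "B": [0, 0]}       # [wins, trials] conditioned on action
randomized = {"A": [0, 0], "B": [0, 0]}  # counts from randomized rounds only

for _ in range(500_000):
    mood = "good" if random.random() < 0.5 else "bad"
    if random.random() < EPSILON:
        action = random.choice(["A", "B"])  # the market is scored on these rounds
        bucket = randomized
    else:
        # Deliberative policy: confounded, picks "A" exactly when mood is good.
        action = "A" if mood == "good" else "B"
        bucket = naive
    won = outcome(action, mood)
    bucket[action][0] += won
    bucket[action][1] += 1

for name, bucket in [("naive (confounded)", naive), ("randomized only", randomized)]:
    print(name, {a: round(w / n, 3) for a, (w, n) in bucket.items()})
# Naive frequencies make "A" look ~0.6 better (about 0.9 vs 0.3) because the
# deliberative policy only picks "A" in good moods; the randomized rounds
# recover the true ~0.2 causal advantage (about 0.7 vs 0.5).
```

The point of conditioning the market on “the decision was randomized” is exactly this: it breaks the correlation between the decision process and everything else, so the prices can be read causally.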
I agree with all of this! A related shortform here.
An interesting development in the time since your shortform was written is that we can now try these ideas out without too much effort via Manifold.
Anyone know of any examples?
I’m accumulating a small collection of spicy, previously unreported deets about Anthropic for an upcoming post. Some of them I sadly cannot publish because they might identify the sources. Others I can! Some of those will be surprising to staff.
If you can share anything that’s wrong with Anthropic, that has not previously been public, DM me, preferably on Signal (@ misha.09)
The IMO organizers asked AI labs not to share their IMO results until a week later to not steal the spotlight from the kids. IMO organizers consider OpenAI’s actions “rude and inappropriate”.
https://x.com/Mihonarium/status/1946880931723194389
Based on the last paragraph it doesn’t sound like OpenAI specifically was asked to do this?
The screenshot is not the source for “The IMO organizers asked OpenAI not to share their IMO results until a week later”.
People are arguing about the answer to the Sleeping Beauty problem! I thought this was pretty much dissolved with this post’s title! But there are lengthy posts and even a prediction market!
Sleeping Beauty is an edge case where different reward structures are intuitively possible, and so people imagine different game payout structures behind the definition of “probability”. Once the payout structure is fixed, the confusion is gone. With a fixed payout structure & preference framework rewarding the number you output as “probability”, people don’t have a disagreement about what is the best number to output. Sleeping Beauty is about definitions.
And still, I see posts arguing that if a tree falls on a deaf Sleeping Beauty, in a forest with no one to hear it, it surely doesn’t produce a sound, because here’s how humans perceive sounds, which is the definition of a sound, and there are demonstrably no humans around the tree. (Or maybe that it surely produces the sound because here’s the physics of the sound waves, and the tree surely abides by the laws of physics, and there are demonstrably sound waves.)
This is arguing about definitions. You feel strongly that “probability” is that thing that triggers the “probability” concept neuron in your brain. If people have a different concept triggering “this is probability”, you feel like they must be wrong, because they’re pointing at something they say is a sound and you say isn’t.
Probability is something defined in math by necessity. There’s only one way to do it to not get exploited in natural betting schemes/reward structures that everyone accepts when there are no anthropics involved. But if there are multiple copies of the agent, there’s no longer a single possible betting scheme defining a single possible “probability”, and people draw the boundary/generalise differently in this situation.
You all should just call these two probabilities two different words instead of arguing which one is the correct definition for “probability”.
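As a toy illustration of the point (my own little simulation, not from any of the linked posts), the two standard answers are just two different scoring rules over the same experiment:

```python
# Toy Sleeping Beauty simulation: the "answer" depends entirely on whether you
# score one bet per awakening or one bet per experiment.
import random

random.seed(0)
N = 1_000_000

heads_awakenings = 0
total_awakenings = 0
heads_experiments = 0

for _ in range(N):
    heads = random.random() < 0.5
    awakenings = 1 if heads else 2       # tails -> Beauty is woken twice
    total_awakenings += awakenings
    heads_awakenings += awakenings if heads else 0
    heads_experiments += heads

print("per-awakening frequency of heads:", heads_awakenings / total_awakenings)  # ~1/3
print("per-experiment frequency of heads:", heads_experiments / N)               # ~1/2
```

Both numbers are correct answers to well-posed questions; the argument is only over which question gets to wear the word “probability”.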
As the creator of the linked market, I agree it’s definitional. I think it’s still interesting to speculate/predict what definition will eventually be considered most natural.
Has anyone tried to do refusal training with early layers frozen/only on the last layers? I wonder if the result would be harder to jailbreak.
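To be clear about the mechanical setup I have in mind, here’s a minimal PyTorch sketch of the “freeze the early layers, train only the last ones” idea (toy model, purely illustrative; a real run would freeze the first k transformer blocks of an actual LLM and use the usual refusal/SFT objective):

```python
# Minimal sketch: freeze the early blocks so only the later ones (and the head)
# receive gradient updates. Toy model for illustration only.
import torch
import torch.nn as nn

blocks = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(8)])
head = nn.Linear(64, 2)  # stand-in for the refusal/compliance objective
model = nn.Sequential(*blocks, head)

FREEZE_FIRST = 6  # freeze the first 6 of 8 blocks
for i, block in enumerate(blocks):
    for p in block.parameters():
        p.requires_grad = i >= FREEZE_FIRST

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One dummy training step: gradients flow through the frozen blocks but only
# the unfrozen parameters get updated.
x, y = torch.randn(32, 64), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
print("trainable params:", sum(p.numel() for p in model.parameters() if p.requires_grad))
```

The empirical question remains whether restricting refusal training to the later layers actually makes the resulting model harder to jailbreak.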
Say an adventurer wants Keltham to coordinate with a priest of Asmodeus on a shared interest. She goes to Keltham and says some stuff that she expects could enable coordination. She expects that Keltham, due to his status as a priest of Abadar, would not act on that information in ways that would be damaging to the Evil priest (as it was shared in her expectation that a priest of Abadar would aspire to be Lawful enough not to do that with information shared to enable coordination, making someone regret dealing with him). Keltham prefers using this information in a way damaging to the priest of Asmodeus to using it to coordinate. Keltham made no explicit promises about the use of the information; the adventurer told him that piece of information first, and only afterwards said it was shared to enable coordination and shouldn’t be acted upon outside of enabling coordination.
Would Keltham say “deal, thanks for telling me”, or would he say “lol no I didn’t agree to that prior to being told thanks for telling me”?
It is a predictable consequence of saying “lol no I didn’t agree to that prior to being told thanks for telling me” that Keltham (and other people with similar expressed views regarding information and coordination) won’t be told information intended for coordination in future, including information that Keltham and similar people would have wanted to be able to use in order to coordinate instead of using it against the interests of those giving them the information.
So the question is: just how strongly does Keltham value using this information against the priest, when weighed against the cost of decreasing future opportunities for coordination for himself and others who are perceived to be similar to him?
There are plenty of other factors, such as whether there are established protocols for receiving such information in a way that binds priests of Abadar to not use it against the interests of those conveying it, whether the priest and the adventurer (and future others) could have been expected to know those protocols, and so on.
Yep. (I think there’s also a sense of honor and not screwing people over that’s not just about the value of getting such information in the future, that Keltham would care about.)
I do not believe Anthropic as a company has a coherent and defensible view on policy. It is known that they said words they didn’t hold to while hiring people (they claim to have had good internal reasons for changing their minds, but people did work for them because of impressions that Anthropic created and then decided not to live up to). It is known among policy circles that Anthropic’s lobbyists are similar to OpenAI’s.
From Jack Clark, a billionaire co-founder of Anthropic and its chief of policy, today:
Dario is talking about countries of geniuses in datacenters in the context of competition with China and a 10-25% chance that everyone will literally die, while Jack Clark is basically saying, “But what if we’re wrong about betting on short AI timelines? Security measures and pre-deployment testing will be very annoying, and we might regret them. We’ll have slower technological progress!”
This is not invalid in isolation, but Anthropic is a company that was built on the idea of not fueling the race.
Do you know what would stop the race? Getting policymakers to clearly understand the threat models that many of Anthropic’s employees share.
It’s ridiculous and insane that, instead, Anthropic is arguing against regulation because it might slow down technological progress.
I’ve only seen this excerpt, but it seems to me like Jack isn’t just arguing against regulation because it might slow progress—and rather something more like:
“there’s some optimal time to have a safety intervention, and if you do it too early because your timeline bet was wrong, you risk having worse practices at the actually critical time because of backlash”
This seems probably correct to me? I think ideally we’d be able to be cautious early and still win the arguments to be appropriately cautious later too. But empirically, I think it’s fair not to take that as a given?
kudos to LW for making a homepage theme advertising the book!
Yeah! This makes me want LW darkmode.
You’re one of today’s lucky 10 – we already have a dark mode! It’s in the menu in the top right, under ‘theme’.
There I was, looking under Account Settings → Site Customizations like a fool
[RETRACTED after Scott Aaronson’s reply by email]
I’m surprised by Scott Aaronson’s approach to alignment. He has mentioned in a talk that a research field needs to have at least one of two: experiments or a rigorous mathematical theory, and so he’s focusing on the experiments that are possible to do with the current AI systems.
The alignment problem is centered around powerful consequentialist agents appearing when optimization searches through spaces that contain capable agents. The dynamics at the level of superhuman general agents are not something you get to experiment with (more than once); and we do indeed need a rigorous mathematical theory that would describe the space and point at the parts of it that are agents aligned with us.
[removed]
I’m disappointed that, currently, only Infra-Bayesianism tries to achieve that[1], that I don’t see dozens of other research directions trying to have a rigorous mathematical theory that would provide desiderata for AGI training setups, and that even actual scientists entering the field [removed].
Infra-Bayesianism is an approach that tries to describe agents in a way that would closely resemble the behaviour of AGIs, starting with a way you can model them having probabilities about the world in a computable way that solves non-realizability in RL (short explanation, a sequence with equations and proofs) and making decisions in a way that optimization processes would select for, and continuing with a formal theory of naturalized induction and, finally, a proposal for alignment protocol.
To be clear, I don’t expect Infra-Bayesianism to produce an answer to what loss functions should be used to train an aligned AGI in the time that we have remaining; but I’d expect that if there were a hundred research directions like that, trying to come up with a rigorous mathematical theory that successfully attacks the problem, with thousands of people working on them, some would succeed.
(Removed)
In my opinion, Project Lawful / planecrash is a terrible reference in addition to being written in a seriously annoying format. Although I have read it, I don’t recommend that anyone else read it. If any of the material in it should become some sort of shared culture that we should assume others in the community have read, it would require completely rewriting the entire thing from beginning to end.
I am not one of the two voters who initially downvoted, but I understand why they might have done so. I have weakly downvoted your comment for having made a load-bearing link between someone not having read Project Lawful and calling them “an NPC” in your sense, which is not the standard of discourse I want to see.
If you were expecting this person to have read this extremely niche and frankly bizarre work of fiction, without having confirmed that they had actually read it, understood it, and fully agreed with the relevant decision-theory parts of it, then that seems pretty unwise of you, and their not having done so does not reflect poorly on them in any way.
“You didn’t act like I think the fictional character Keltham would have” is not a reasonable criticism of anyone.
There may be other, unspecified actions of theirs that do reflect poorly on them, but those do not appear to connect in any way with this post.
I think many people around me would’ve made the same assumption that this particular person had read planecrash. I don’t want to say more: I probably shouldn’t confirm that they specifically did that, because I think their goals are still similar to mine, even if they’re very mistaken and doing some very counterproductive things, and I definitely want to err on the side of not harming someone’s life/social status without a strong reason why it would be good for the community to know a fact about them.
The NPC-like behavior was mostly them doing whatever they seemed to have ascribed to themselves as simply what their role requires, without any willingness to really consider arguments; planecrash was just something that would’ve given them the argument for why you shouldn’t take the specific actions they took. (Basic human decency and friendship would also suffice; but if someone had read planecrash and still did the thing, I would not want to deal with them in any way in the future, the same way you wouldn’t want to deal with someone who just screws you over for no reason.)
I agree; it was largely about what they did, which has nothing to do with planecrash. There are just some norms that I expect it would be good for the community to have, and that one implicitly learns from planecrash.
I didn’t downvote, and I think planecrash is amazing. But FYI, referring to other humans as NPCs, even if you elaborate and make it clear what you mean, leaves a very bad taste in my mouth. If you were a random person I didn’t know anything about, and this was the first thing I read from you*, I’d think you were a bad person and want nothing to do with you.
Not judging you, just informing you about my intuitive immediate reaction to your choice of words. Plausibly, other people who did downvote felt similarly.
*referring to your first comment
Thanks, that’s helpful!
(Yep, it was me ranting about someone betraying my trust in a fairly sad way, someone I really didn’t expect that from, and who was very non-smart/weirdly scripted about doing it; it was very surprising until I learned that they hadn’t read planecrash. I normally don’t go around viewing anyone this way, and I dislike it when, very rarely (I can’t recall any other situation like this!), I do feel this way about someone.)
(I’m curious what caused two people to downvote this to −18.)