If you want a curated, high-agency group of people, introducing a small amount of unnecessary friction sounds like a fantastic idea.
Jay Bailey
Bridge Thinking and Wall Thinking
First off, thanks for providing that list. I appreciate it. I do disagree with your last sentence, and I’ll write out why.
There are a couple of assumptions laid out in my stance here which I ought to make explicit. These assumptions are:
If the race to ASI is not stopped, there is an unacceptable chance that it gets us all killed. Anything that cannot and will not stop the race is insufficient.
Anthropic will not voluntarily decide to do this in the absence of a binding requirement, and will not actively advocate for this to be done either.
Thus, my TL;DR is: Having a bunch of voluntary ways of gathering information about the risk of AI systems is not actually going to stop these companies from rushing headlong into danger. I don’t think that Anthropic is facilitating a world where AI companies have meaningful checks on their behavior, because I don’t think Anthropic views any of these requirements as “meaningful checks”. None of it stops them from doing the one thing they most want to do—continue to train and deploy ever more powerful models that bring us closer to truly dangerous territory.
None of these criticisms are unique to Anthropic—they apply to all the frontier AI companies. But I don’t think Anthropic is doing meaningfully better at addressing this than anyone else, in the sense of being considerably more likely than anyone else to break my assumptions above. There are several ways Anthropic is unusually responsible in this space, such as Claude’s constitution, but I do not consider them significant relative to the above assumptions, which are by far the most important. I know I’m hammering on about this a lot, but my views probably don’t make a lot of sense without keeping this in mind.
TL;DR ends here. With that in mind, we can now take a look at the above items through this lens:
Promote third-party auditing systems / solicit third-party evaluators: This still leaves the decision up to Anthropic. Having this seems better than not having it, but the way I read Holden’s statement is “If the AI companies get to make the call, they are unacceptably likely to get it wrong”. I agree with that statement. Having third-party auditors doesn’t take the call away from the companies. In my view of the world, it doesn’t even really give us a saving throw—I do not imagine a situation where METR / Apollo / UK AISI tell Anthropic “This model is dangerous, do not deploy it under any circumstances” and Anthropic actually listens and avoids deploying it. Having third-party evaluators is great for Anthropic: they get useful information about model capabilities and appear to take safety seriously, but they are never actually compelled to make costly decisions at any point.
Having a bunch of transparency about the risks: Similar to above, except this time it’s not even a third-party auditor so you have an additional step of Anthropic needing to say out loud that something is unacceptably dangerous before you reach the step of them choosing whether or not to act. It’s in the same arena. Supporting SB 53 falls under this category.
Outline what kind of mitigations would be needed at an industry-wide level: Outlining it is not the same as doing it. I think that a mitigation that involves delaying a new model on the order of months (or, God forbid, not training a new one at all) will be prohibitively expensive and promptly abandoned when the reality sets in that this is the choice. And a mitigation that never leads to this choice at all is not going to be enough.
Funding PACs to support regulation: What does Anthropic itself say this does? Here is a direct quote: “In circumstances like these, we need good policy: flexible regulation that allows us to reap the benefits of AI, keep the risks in check, and keep America ahead in the AI race.”
Under my own assumptions, which I’ve mentioned above, this can be read as:
Flexible: Please don’t bind us in advance to making costly decisions.
Reap the benefits of AI: Let us have market share.
Keep America ahead in the AI race: Let us have market share and more chips. There is also very much the worry of authoritarian governments in there, but certainly “Keep America ahead in the AI race” is not the kind of rhetoric that helps stop the AI race.
Keep the risks in check: Let’s look at the next sentence for that one. What do they say this means?
That means keeping critical AI technology out of the hands of America’s adversaries, maintaining meaningful safeguards, promoting job growth, protecting children, and demanding real transparency from the companies building the most powerful AI models.
Maintaining meaningful safeguards: The framing here is that human misuse is the problem, which implicitly dodges the idea that it might be the AI system itself that is inherently unsafe.
Promoting job growth: I assume this means using AI for productivity, i.e. “help advance adoption of our products”.
Protecting children: Avoid CSAM. Straightforwardly good, but doesn’t meaningfully impact the race to ASI.
Demand real transparency: See the above section on transparency.
Adding this all up, I don’t think this makes any ask that would risk binding them to the kind of costly decisions they want to avoid, which are the same costly decisions that could actually prevent rushing to ASI as fast as possible. (Maybe these actions slow things down a little on the margin—after all, non-zero resources are spent on them! But I don’t see it as making a meaningful difference.)

Opposing state moratoriums is a straightforwardly positive action and I think Anthropic is doing the right thing by doing this. I appreciate this, but I do not think it is enough to prevent the outcomes I’m most worried about. From my point of view, approximately none of this is useful for the core problem of “Humanity is racing to unacceptably dangerous ASI as fast as possible”. And if it doesn’t address the core problem, it’s not a meaningful check. Thus, I don’t think Anthropic is doing quite a lot on the one axis that really matters, and this is why I disagree.
“We can’t leave this up to the companies” is also true.
I really wish this wasn’t a single sentence buried two-thirds of the way into this document. This seems extraordinarily important to repeat as loudly and often as possible. If we cannot leave this up to the companies, what actions are you taking (and, as a separate question, what actions are Anthropic taking) to improve the chances that we don’t do this?
I currently live somewhere with just a microwave and hotplate, so this is intriguing to me. One question—when Marie says “100% power” in a cookbook from 1985, do you need to tone it down somewhat for modern microwaves?
And we have made it happen! Thanks to both Aloekine and Lightcone :)
I am indeed—shall PM you and we can make it happen
The problem with MPI is that it feels like “anyone can trivially spend a small amount of money and no effort to make a larger amount of money” is the kind of thing that quickly gets saturated. If you don’t have an MPI only because all the other MPIs out there have already eaten up the free profits, it’s not a great measure of capabilities.
I like the other three, but I also wonder how close to TUI we already are. It wouldn’t shock me that much if we were already most of the way to TUI and the only reason this hasn’t led to robotics being solved is that the AI itself has limitations—i.e. it can build an interface to control the robot, but the AI itself (not the interface) ends up being too slow, too high-latency, and too unable to plan things properly to actually perform at the level it needs to. (And I expect that slowness to continue, such that creating/distilling small models is better for robotics use.)
Interesting. You have convinced me that I need a better definition for this approximate level of capabilities. I do expect AI to advance faster than legacy organisations will adapt, such that it would be possible to have a world of “10% of jobs can be done by AI”, but the AI capabilities would need to be higher than “can replace 10% of the jobs that existed in 2022”.
So, my understanding of ASI is that it’s supposed to mean “a system that is vastly more capable than the best humans at essentially all important cognitive tasks.” Currently, AIs are indeed more capable, possibly even vastly more capable, than humans at a bunch of tasks, but they are not more capable at all important cognitive tasks. If they were, they could easily do my job, which they currently cannot.
Two terms I use in my own head, that largely correlate with my understanding of what people meant by the old AGI/ASI:
“Drop-in remote worker”—A system with the capabilities to automate a large chunk of remote workers (I’ve used 50% before, but even 10% would be enough to change a lot) by doing those workers’ jobs with oversight and context similar to what a human contractor would get. In this definition, the model likely gets a lot of help to set up, but then can work autonomously. E.g. if Claude Opus 4.5 could do this, but couldn’t have built Claude Code for itself, that’s fine.
This AI is sufficient to cause severe economic disruption and likely to advance AI R&D considerably.
“Minimum viable extinction”—A system with the capabilities to destroy all humanity, if it desires to. (The system is not itself required to survive this) This is when we get to the point of sufficiently bad alignment failures not giving us a second try. Unfortunately, this one is quite hard to measure, especially if the AI itself doesn’t want to be measured.
The Australian financial year starts and ends in the middle of the year, so it makes no difference to me if we do it in 2026. Let’s make it happen :)
I live in Australia, so I lack tax advantage for this. I am likely to still donate 1k or so if I can’t get tax advantages, but before doing so I wanted to check if anyone wanted to do a donation swap where I donate to any of these Australian tax-advantaged charities largely in global health in exchange for you donating the same amount to charities that are tax-advantaged in the US.
I am willing to donate up to 3k USD to MIRI, and 1.5k USD to Lightcone if I’m doing so tax-advantaged. If nobody takes me up on this I’ll still probably donate 2k USD to MIRI and 1k USD to Lightcone. I will accept offers to match only one of these two donations.
Also open to any alternate ways to gain tax advantage from Australia that are currently unknown to me in order to achieve this same outcome.
Fair point. If you add that you can’t assess it at less than you paid for it, this problem goes away.
Wouldn’t the equilibrium here trend towards a bunch of wasted labor, where I deliberately lowball the value of the land, and then if someone offers a larger amount, I just refuse the sale and start paying tax on the larger amount, thus having the potential to pay less tax while losing nothing if I’m called out for it? There’s no downside to me personally, and if this became common, it’d be harder to legitimately buy things. It seems like you’d need to pay some sort of fee to the entity credibly offering the larger amount to make challenging lowball assessments worth their while.
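To make the incentive concrete, here is a toy payoff comparison. The tax rate and land values below are made up purely for illustration:

```python
# Toy payoff comparison for a self-assessed land value tax.
# All numbers are illustrative assumptions, not taken from the discussion above.

TAX_RATE = 0.02        # annual tax as a fraction of the self-assessed value (assumed)
TRUE_VALUE = 200_000   # what the land is actually worth to me (assumed)
LOWBALL = 100_000      # deliberately understated self-assessment (assumed)

def annual_tax(assessment: float) -> float:
    return assessment * TAX_RATE

honest_tax = annual_tax(TRUE_VALUE)
lowball_tax = annual_tax(LOWBALL)

print(f"Honest tax:   {honest_tax:,.0f}/yr")
print(f"Lowball tax:  {lowball_tax:,.0f}/yr")
print(f"Savings while unchallenged: {honest_tax - lowball_tax:,.0f}/yr")

# If someone credibly offers the lowball price, I simply refuse the sale and
# raise my assessment back to TRUE_VALUE. Without a penalty or a fee paid to
# the challenger, my worst case is just the honest tax, so lowballing
# weakly dominates honest assessment.
print(f"Worst case after a challenge: {honest_tax:,.0f}/yr (no penalty paid)")
```

In other words, unless getting caught costs something, the lowballer never ends up worse off than the honest assessor, which is exactly the equilibrium problem above.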
This is the kind of content I keep coming back to this site for:
- Obviously correct
- Immediately useful as a day-to-day habit of thought
- “Why didn’t I think of that!?”
I also like that it’s practical and practicable in day-to-day life while also being important for bigger questions.
Another example to add to the post. A few years ago I was learning to play Go, and thus playing against superior players. In Go, typically what happens is one player will make a move that signals something, like “I can take this territory” or “I can attack this group of pieces successfully” and the other player needs to decide whether to directly oppose this move, or to accept it and play elsewhere instead.
So I asked myself—if I disagree with a superior player’s assessment of the situation (i.e. they make a move I think I can punish), what should I do? Statistically I’m probably wrong about my assessment. But then I realized that if I accepted their move, I would never find out why they were right. If I opposed the move, and was quickly punished (my group was destroyed, or the player successfully defended their “too aggressive” move), I would get to find out why I was wrong immediately. So I began playing according to my instincts even though they would lose me more exchanges in the short term, since that was the direction of maximum learning.
Outside of the office, I generally find it difficult to get appreciable amounts of work done. It feels like it takes a significant exertion of willpower to go from not doing a work-related task to doing one, at which point it generally becomes easier to continue with that task for a while. Performing this mental motion a few times per workday is enough for me to get close to full time hours in, but doesn’t feel enough for sixty hours. If I don’t perform this mental motion successfully, I wind up in a state of internal tension where I’m not actually putting in consistent mental effort towards solving the next problem in front of me. I do have a fix for this already—I work better in an office, where everyone around is also working.
So, the natural solution here would be “Find yourself an office type environment doing valuable work where it is entirely normal and expected for people to work these kind of hours.” This runs into the second issue, which is that on the rare occasions I have worked 10+ hours in a day, or generally pushed myself harder to try and stretch the edge of my conscientiousness, I tend to get headaches. Which are both unpleasant and also reduce my productivity, which ruins the entire point of the exercise. (The headaches are a known problem—I’m on a headache preventer that minimizes them. I could try upping the dosage, but I’ve already been told by one doctor that I should probably try not to be on this medication indefinitely.)
This also means that it wouldn’t be a good idea to go to an environment where everyone’s expected to work 60 hours, if I then don’t end up being able to do that even with the social and logistical environment set up in my favor. So it’d have to be an environment where it was both normal and expected to work 60 hours AND to work 40, AND the work was object-level valuable in my opinion, to be worth trying this experiment. This could be possible—I haven’t actually tried to seek out such an environment. But I do notice that I still do anticipate failing the original 60-hour goal due to the problem in the second paragraph.
On thinking about them:
“Having both the motivation and the mental stamina to work 60-hour weeks reliably.” Actually this probably would be “hard” rather than “impossible”. There are things I can try here that I haven’t tried that might work, so I have not yet tried enough things to declare it impossible. It’s more like I anticipate the possibility of this being impossible, as opposed to actually considering it impossible. Not a good example.
“Gaining 15+ points of IQ or the thing that IQ is measuring.”—So it seems like there are two ways this problem could go. In World A, there exists some stack of exercises and nootropics that can already let me do this without sacrificing something I’m not willing to sacrifice. In that case the problem is one of finding it—people can bullshit or lie, supplements are not a field where I expect complete honesty, and it wouldn’t surprise me if there’s a high amount of individual variation, such that someone could truthfully say they gained 15+ IQ points on X, but I would need Y instead to achieve the same outcome. If I had to try each combination myself I’d run out of time very quickly. It’s impossible in the sense that winning the lottery is impossible—not something I can reliably make happen, as opposed to something that literally can’t be done. This is close enough that I consider it equivalent.
Alternatively, maybe this stack just doesn’t exist at all. Nootropics and exercises will not get you there. In that case, solving the problem means making advances in cognitive science that our culture hasn’t yet figured out how to make. It’s not clear to me how I would succeed where lots of others have failed, which leads into my third point: I don’t see how I have a comparative advantage in this area, and if I have to work full-time for years to get to this point, it is no longer worth it, when the point is to make me more effective at solving my current chosen problem.
“Taking something that isn’t my comparative advantage and making it that way”—I can think of a bunch of actions I can take here that would let me do better than I currently am. Getting tutoring, improving my ability to learn, talking it over with more experienced practitioners, etc. The key impossibility here is that people who are better than me at this can just do this stuff too and probably already are.
Like, imagine I want to be a top mathematician since I’m convinced that’s the only skill worth knowing for alignment. I can hire a tutor (and have done so), but better mathematicians can also do this and probably have. How do you get good enough to meaningfully contribute when people exist in this field who did the IMO at 17 and can make any of the same improvements I can come up with? So, I would have to find some method that is A) incredibly effective, enough to bring me up to par with more talented people, and B) something other people are unable or unwilling to do, even really talented people in the field.
I don’t know how much context you need for the more personal examples, so I figure I’ll give them without context and then if you need more you can ask:
- Having both the motivation and the mental stamina to work 60-hour weeks reliably.
- Gaining 15+ points of IQ or the thing that IQ is measuring.
- Becoming good enough at a field (abstract mathematics, mechanistic interpretability) that I’ve previously tried and found myself to be not that good at or interested in, such that it would then be worth my pursuing it as part of a research direction. (Another way you could write this is “Taking something that isn’t my comparative advantage and making it that way”)
- Convincing an arbitrary person my values are good and worth adopting within a sixty-minute conversation. (I don’t actually WANT to do this in full generality! But it sure does seem impossible and it sure would be nice to do things that are closer to that area, sometimes)
- Becoming a professional tennis player at the age of 33 with little tennis experience. (Again not something I actually want, but it sure does seem impossible. I figure if this is a bad example, you can just not include it.)
I think the biggest difference between wall-thinking and bridge-thinking here isn’t actually about the size of $p$ (the probability of catastrophe), but rather how easy $p$ is to alter. From a more mathematical standpoint, what matters is the rate of change of $p$ with respect to the effort put in. In AI safety, believing $p$ is very high is also correlated with a bunch of bridge thinking like “Alignment is incredibly difficult and current techniques have essentially zero chances of working”—i.e. $\frac{dp}{dE} \approx 0$ at current levels of effort $E$.
If I were to sum it up—bridge thinking assumes you need a lot of effort before you start to meaningfully reduce $p$ from where we currently are. Wall thinking assumes there are marginal gains available from the current position.
As an example, let’s take the following hypothetical belief: “If we had really good interpretability tools, then there would be a lot of low-hanging fruit we could pick with those tools. But without those tools we’re operating blindly, and can’t make much progress at all”. Under this belief, we are currently in the bridge regime—small improvements to current interpretability techniques will yield almost nothing. But if we did develop those good tools, we would transition to wall thinking—there would be lots of marginal effort that leads to a reduction in $p$ by using those tools.
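To state the two regimes in symbols (my own notation, not from the original post): write $p(E)$ for the probability of catastrophe as a function of cumulative safety effort $E$, with $E_0$ the effort being applied today and $E^*$ some much larger threshold.

$$\text{Wall thinking:}\quad \frac{dp}{dE}\bigg|_{E_0} \ll 0 \qquad\qquad \text{Bridge thinking:}\quad \frac{dp}{dE} \approx 0 \text{ for } E < E^*,\quad \frac{dp}{dE} \ll 0 \text{ for } E \ge E^*$$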
Here is a Claude-generated visualisation of what that would look like, demonstrating what the curves look like in each regime in my mind. This works whether $p$ initially starts high or low. Which frame is appropriate would depend on where you think we currently are on this graph, and is invariant with respect to the initial value of $p$, provided $p$ is at least large enough to be concerning.
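For what it’s worth, here is a minimal sketch of the kind of curves I have in mind. The functional forms and constants below are invented purely to show the shapes in each regime, not derived from any model:

```python
# Illustrative sketch: "wall thinking" (marginal effort reduces risk immediately)
# vs "bridge thinking" (little changes until a large threshold of effort is crossed).
# Shapes and numbers are made up for illustration only.
import numpy as np
import matplotlib.pyplot as plt

effort = np.linspace(0, 10, 200)
p0 = 0.8  # assumed starting probability of catastrophe

# Wall regime: risk falls smoothly from the very first unit of effort.
p_wall = p0 * np.exp(-0.3 * effort)

# Bridge regime: essentially flat until a threshold, then a steep drop.
p_bridge = p0 / (1 + np.exp(3 * (effort - 7)))

plt.plot(effort, p_wall, label="Wall thinking: dp/dE < 0 from the start")
plt.plot(effort, p_bridge, label="Bridge thinking: dp/dE ≈ 0 until a threshold")
plt.xlabel("Cumulative safety effort E")
plt.ylabel("Probability of catastrophe p(E)")
plt.ylim(0, 1)
plt.legend()
plt.show()
```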