The best simple argument for Pausing AI?

Not saying we should pause AI, but consider the following argument:

  1. Alignment without the capacity to follow rules is hopeless. You can’t possibly follow laws like Asimov’s Laws (or better alternatives to them) if you can’t reliably learn to abide by simple constraints like the rules of chess.

  2. LLMs can’t reliably follow rules. As discussed in Marcus on AI yesterday, drawing on data from Mathieu Acher, even reasoning models like o3 empirically struggle with the rules of chess, and they do so even though they can explicitly explain those rules (see the same article). Such compliance can be checked mechanically, as the sketch after this list illustrates. The Apple “thinking” paper, which I have discussed extensively in three recent articles on my Substack, gives another example: an LLM can’t reliably solve Tower of Hanoi with 9 discs. (This is not a token-related artifact.) In the last month, four other papers have shown related failures to comply with moderately complex rules.

  3. If a system can’t follow rules such as Asimov’s Laws, it has no business running the world’s military, infrastructure, etc. Perhaps we should pause the widespread rollout of Generative AI in safety-critical domains unless and until it can be relied on to follow rules with significantly greater reliability.
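For concreteness, here is a minimal sketch (in Python) of the kind of rule-compliance harness point 2 alludes to. It is a sketch under assumptions, not anyone’s actual test code: `ask_model` is a hypothetical stub for whatever LLM API is under test, and the python-chess library does the legality checking; the Hanoi validator applies the same idea to the Apple paper’s task.

```python
# A minimal sketch of a rule-compliance harness. ask_model() is a
# hypothetical stub for the LLM being tested. The point is that legality
# is mechanically checkable, so "follows the rules" is a measurable
# property, not a matter of opinion.

import chess  # pip install python-chess

def ask_model(fen: str) -> str:
    """Hypothetical stub: return the model's proposed move (in SAN) for a position."""
    raise NotImplementedError("wire this to the LLM under test")

def chess_legality_rate(positions: list[str]) -> float:
    """Fraction of model-proposed moves that are legal in the given FEN positions."""
    legal = 0
    for fen in positions:
        board = chess.Board(fen)
        try:
            board.parse_san(ask_model(fen))  # raises ValueError if illegal or unparseable
            legal += 1
        except ValueError:
            pass  # illegal move, ambiguous notation, or malformed output
    return legal / len(positions)

def hanoi_trace_is_valid(moves: list[tuple[int, int]], n_discs: int) -> bool:
    """Check a proposed Tower of Hanoi solution for n_discs discs on pegs 0-2.

    Rules enforced: never move from an empty peg, never place a larger disc
    on a smaller one, and the trace must end with every disc on peg 2.
    """
    pegs = [list(range(n_discs, 0, -1)), [], []]  # peg 0 starts with discs n..1
    for src, dst in moves:
        if not pegs[src]:
            return False  # moved from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # larger disc placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_discs, 0, -1))
```

With a harness like this, “can follow the rules of chess” becomes a number one can track across models and task sizes, which is the kind of evidence the Acher data and the Apple paper report.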

Corollary: hyping LLMs is bad, because it increases the global deployment of systems that lack a basic prerequisite for alignment, namely the ability to comply with rules. This in turn elevates the risk of AI causing catastrophic harm to humanity.

Non-ad-hominem comments are welcome.