Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Buck 26 Aug 2024 16:46 UTC

LW: 321 AF: 119

I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I’m not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.

Imagine you’re running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I’m imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here.) And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control.
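
To make the scale in that parenthetical concrete, here is a rough back-of-the-envelope conversion into full-time human equivalents. The instance count and speed-up are the figures from the paragraph above; the 40-hour human work week is an assumption added purely for illustration, and treating speed as a simple linear multiplier is a simplification.

```python
# Figures from the post: 10 million parallel instances at 10x human speed, 24/7.
INSTANCES = 10_000_000
SPEEDUP = 10
AI_HOURS_PER_WEEK = 24 * 7

# Assumption (not from the post): a human researcher works 40 hours a week.
HUMAN_HOURS_PER_WEEK = 40


def human_equivalents(instances: int, speedup: int, ai_hours: int, human_hours: int) -> float:
    """Full-time humans needed to match the AI workforce's weekly hours of work."""
    return instances * speedup * ai_hours / human_hours


equivalents = human_equivalents(INSTANCES, SPEEDUP, AI_HOURS_PER_WEEK, HUMAN_HOURS_PER_WEEK)
print(f"{equivalents:,.0f} full-time human equivalents")
```

Under that 40-hour assumption the workforce comes out at 420 million full-time human equivalents; a different assumed work week scales the answer proportionally.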

You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they’d see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.)

At this point you’re almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. The competitor says something like this:

- It would be extremely costly to shut down all the AI projects. China is racing to catch up and won’t stop for this. If you force us to stop deploying our frontier models, they’ll outcompete us within a year, and you won’t be able to get China to pause without substantial risk of war.
- AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they’ll back me up.
- Even if the AI is indeed doing something systematically funny, we have no evidence that it has the ambition to seize huge amounts of power, and it’s not clear that it would be able to. It seems implausible that AIs would do that; there’s been no concrete evidence that the AIs are in fact power-hungry.
- Maybe our competitor just messed up their alignment, which would make sense because, as we’ve always told you, they’re not very competent (especially compared to us). If they want to shut down, they’re welcome to. But it doesn’t make sense to penalize us for their shoddy work.
- Maybe they’re just lying; this could be a doomer scheme to shut down AI, which the doomers have wanted to do for years. The logs they sent you could be complete forgeries. Or for all we know, someone there (with or without leadership approval) intentionally backdoored their RL data to make their model do this. It would be a mistake to take rash action before independent experts confirm that this wasn’t somehow rigged.

I’m sympathetic to all of these arguments. The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.

(I think it’s pretty plausible that news of the escape attempt wouldn’t even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety-concerned people and people who wanted to push forward.)

Now, I’m not sure I’m right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we’d be okay in this situation, and that seems way too optimistic to me. In particular I think they underrate these arguments:

- The main reason I’m worried about egregious misalignment is theoretical arguments that AGI might be egregiously misaligned and power-seeking in a way that isn’t resolved by default in the course of making it useful. If someone is very skeptical of these arguments today, I’m not sure that even an escape attempt would be enough evidence to end up more than 50% on the model being a schemer, especially given the secrecy and time-pressure.
- Undeploying is plausibly going to be extremely expensive and risky, and key decisionmakers might be under extreme pressure when they have to make the call. So you’ll have to convince people that AI takeover is not just a serious threat, but that it’s a risk so large that it’s worth trading off our response to other risks that might be extremely pressing and salient.
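
The first point is essentially a claim about Bayesian updating, and a small sketch may help show why a low prior matters so much. The odds-form update below is standard Bayes' rule; the likelihood ratio of 20 (the caught escape attempt being 20 times more likely if the model is a schemer than if it is not) and the 1% and 5% priors are hypothetical numbers chosen for illustration. Only the 35% figure comes from the post.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def ratio_needed(prior: float, target: float = 0.5) -> float:
    """Likelihood ratio required to move `prior` up to `target`."""
    return (target / (1 - target)) / (prior / (1 - prior))


# Hypothetical observers: two skeptics, plus the 35% credence stated in the post.
ASSUMED_LIKELIHOOD_RATIO = 20  # illustrative assumption, not an estimate
for prior in (0.01, 0.05, 0.35):
    updated = posterior(prior, ASSUMED_LIKELIHOOD_RATIO)
    needed = ratio_needed(prior)
    print(f"prior {prior:.0%} -> posterior {updated:.1%} (ratio needed for 50%: {needed:.1f})")
```

With these made-up inputs, someone starting at 1% stays well below 50% after the update, while someone starting at 35% ends up above 90%. The skeptic would need to treat the evidence as roughly 99 times more likely under scheming to reach even odds.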

I’ve gotten more cynical from following the discourse about SB1047. If we can’t establish expert consensus on matters as easy-to-check as “what does the text of this bill say”, I feel pessimistic about rapidly establishing expert consensus on much more confusing questions like these.

So I’m not feeling overwhelmingly optimistic about our odds of restricting AI development even given evidence of risk that seems about as good as we can hope for. People often propose that we’ll be able to slow down with weaker evidence for danger from misalignment than this (e.g. model organisms, or unspecified arguments via interpretability), or even that we’ll be able to require an affirmative case for safety. I think that persuading people with weaker evidence will be harder than what I described here (though these earlier efforts at persuasion have the benefit that they happen earlier, when the relevant actors are less rushed and scared).

What do I take away from this?

- Even if you think your lab would want to undeploy models if they tried to escape, you should be prepared for the possibility that other AI developers won’t. So you need to believe that your lab has a good plan for that eventuality—either the willingness to drop out of the race, or a plan for mitigating risks while deploying known-scheming models.
  - This is part of why I think it seems important to have plans for safely deploying models that you think are plausibly scheming, which is why I research AI control.
- It might be worth having a plan in place for how you’ll persuade people to freak out if you actually run into this evidence, rather than just taking for granted that you’d succeed.
- And thinking this through has made me think it’s more useful to try to sell people on the arguments we have now for why AIs might be egregiously misaligned—even though in the future it will be way easier to argue “AI is very dangerous”, it might not get vastly easier to argue “egregious misalignment is plausible”, even if it is.


Tags: AI, AI arms race, AI Boxing (Containment)

Alexa Pan 12 Dec 2025 23:46 UTC
10 points
0
I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.
Specifically, it might be good to have answers to questions like:
- What do we expect to be the best kind of evidence for convincing developers to shut down/undeploy?
  - Maybe there is more compelling evidence than “your AI writing code to escape”
  - I’m reminded of hearing about how the public reception of the OAI/Apollo scheming paper was surprising relative to some people’s expectations (people are more freaked out by weird CoT, less freaked out by actual scheme-y behavior).
- What are good plans for AI labs when they have such evidence?
- What are good plans for AIS researchers/advocacy people when they have such evidence?
- Should you pursue safety research agenda X if shutdown/undeploy is hard?
  - To spell out the thinking here: you might conceptualize safety research as having broadly two paths to impact:
  - 1- “Training against”: give training signal we can optimize over to reduce egregious misalignment
  - 2- “Shutdown”: give evidence to convince AI developers to slow down or un-deploy
  - One implication of thinking that 2 is very hard is that your research should either focus aggressively on 1 or risk being not that useful.

Orpheus16 27 Aug 2024 2:22 UTC
85 points
35
“The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.”
I agree with this perspective, and I think it has some important implications for AI policy/comms.
My understanding is that there are very few individuals who are actually trying to help policymakers understand misalignment concerns or loss-of-control concerns. There has been a widespread meme that in order to appear respectable and legitimate to policymakers, you need to emphasize CBRN concerns, misuse concerns, and China concerns and you don’t want to come across as “weird” by mentioning misalignment concerns.
(I think these concerns have been overstated: my own experiences suggest that many policymakers appear to be very interested in better understanding loss of control scenarios. And in fact some policymakers see this as the main “value add” from AI people, since they tend to rely on existing CBRN experts and existing China experts for the misuse topics.)
Regardless, even if we were in a world where the best way to gain status was to avoid talking about misalignment risks, this would come at a major cost: policymakers would not understand misalignment risks, and their actions would reflect this lack of understanding/context/previous conceptual understanding.
I think one of the most important things for AI policy folks to do is to find concise/compelling ways to convey concerns about misalignment.
To be clear, I think this is extremely challenging. One of the reasons why so few people are explaining misalignment concerns to policymakers is that it is very difficult to actually answer questions like “I understand that this is possible, but why is this likely?” or “Why won’t the companies just develop safeguards to prevent this from happening?”, or “As the science gets better, won’t we just figure out how to avoid this?”
But I think a year ago, I expected that a significant fraction of the AI policy community was regularly giving presentations about misalignment. I think I substantially overestimated the amount of effort that was going into this, and I think this is one of the most neglected/important topics in AI safety.
(And while I don’t want to call anyone out in particular, I do feel like some folks in the AI safety community actively discouraged people from trying to have frank/honest/clear conversations with policymakers about misalignment risks. And I think some AI policy groups continue to avoid talking about misalignment risks and systematically empower people who are going to appear more “normal” and systematically avoid hiring people who are inclined to try to explain misalignment/loss-of-control scenarios. It does seem like the culture has been changing a bit in the last several months, but I think this “don’t talk about misalignment concerns with policymakers” gatekeeping is among the worst mistakes made by the AI policy community.)
What kinds of resources would be helpful? Some examples IMO:
- Short presentations that summarize key reasons to be concerned about misalignment risks
- People who can have 1-1s with policymakers and actually feel like they can answer basic questions about misalignment risks
- Pieces like this and this.
- Attempts at taking things like List of Lethalities and Without Specific Countermeasures and explaining them in short/concise/clear ways.
- Statements from labs, lab employees, and other respected AI experts that they are concerned about misalignment risks (IMO Bengio is a good example of this; he seems to do this more frequently than ~everyone else and still maintains high quality and high clarity.)
(Also, to be clear, I’m not saying that policy groups should never “meet people where they’re at” or never discuss things like CBRN risks/China/misuse risks. I do think, however, that most LW readers would likely be shocked/disappointed at how small a share of time is spent explaining misalignment risks relative to everything else.)
- Sebastian Schmidt 31 Aug 2024 10:39 UTC
  2 points
  0
  Parent
  Could you share more about this:
  
  “my own experiences suggest that many policymakers appear to be very interested in better understanding loss of control scenarios.”
  
  For example, what proportion of people you’ve interacted with gave you this impression and the context in which it happened (I understand that you’d have to keep details sparse for obvious reasons).
- M. Y. Zuo 30 Aug 2024 2:15 UTC
  2 points
  −7
  Parent
  Why not just create this ‘short presentation’ yourself?
  
  It probably wouldn’t even have half the word count of this comment you’ve already written, and should be much more persuasive than the whole thing.
  
  I don’t want to pick on you specifically, but it’s hard to ignore the most direct and straightforward solution to the problems identified.
  - Orpheus16 30 Aug 2024 14:01 UTC
    8 points
    4
    Parent
    Despite the disagree votes, I actually think this is a reasonable suggestion, and indeed more people should be asking “why doesn’t X policy person go around trying to explain superintelligence risks?”
    I think the disagree votes are probably coming from the vibe that this would be easy/simple to do. Short (in length) is not the same as easy.
    My own answer is some mix of fear that I’m not the right person, some doubts about which points to emphasize, and some degree of “I may start working more deliberately on this soon in conjunction with someone else who will hopefully be able to address some of my blind spots.”
    - M. Y. Zuo 30 Aug 2024 19:01 UTC
      2 points
      −4
      Parent
      I don’t think my comment gave off the vibe that it is ‘easy/simple’ to do, just that it isn’t as much of a long shot as the alternative.
      
      i.e. Waiting for someone smarter, more competent, politically savvier, etc…, to read your comment and then hoping for them to do it.
      
      Which seems to have a very low probability of happening.
      - Daphne_W 13 Sep 2024 0:01 UTC
        1 point
        0
        Parent
        “why not just” is a standard phrase for saying what you’re proposing would be simple or come naturally if you try. Combined with the rest of the comment talking about straightforwardness and the low word count, it does give off a somewhat combative vibe.
        I agree with your suggestion and it is good to hear that you don’t intend it to imply that it is simple, so maybe it would be worth editing the original comment to prevent miscommunication for people who haven’t read it yet. For the time being I’ve strong-agreed with your comment to save it from a negativity snowball effect.
  - Linch 30 Aug 2024 2:40 UTC
    7 points
    1
    Parent
    Good comms for people who don’t share your background assumptions is often really hard!
    That said I’d definitely encourage Akash and other people who understand both the AI safety arguments and policymakers to try to convey this well.
    
    Maybe I’ll take a swing at this myself at some point soon; I suspect I don’t really know what policymakers’ cruxes were or how to speak their language but at least I’ve lived in DC before.
    - M. Y. Zuo 30 Aug 2024 9:04 UTC
      1 point
      0
      Parent
      Then this seems to be an entirely different problem?
      
      At the very least, resolving substantial differences in background assumptions is going to take a lot more than a ‘short presentation’.
      
      And it’s very likely those in actual decision making positions will be much less charitable than me, since their secretaries receive hundreds or thousands of such petitions every week.
      - Linch 3 Oct 2024 13:54 UTC
        2 points
        0
        Parent
        I’m not suggesting that the short argument should resolve those background assumptions, I’m suggesting that a good argument for people who don’t share those assumptions roughly entails being able to understand someone else’s assumptions well enough to speak their language and craft a persuasive and true argument on their terms.
maxnadeau 26 Aug 2024 18:06 UTC
LW: 82 AF: 42
29
AF
In the past, you’ve treated catching an AI escape attempt as a “win condition” for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would “substantially” change the situation w.r.t. the model’s chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI’s chance of success without substantially changing the overall level of AI risk?
Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs’ AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.
- Buck 26 Aug 2024 18:12 UTC
  LW: 25 AF: 14
  9
  AF Parent
  - I’m somewhat more pessimistic than I was in the past about the prospects for catch-then-shut-down. In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)
  - I am still optimistic about few-shot catastrophe prevention; this post is just about whether you can shut down, not whether you can take advantage of the escapes to improve safety (if you try to).
  - As you note, one difference between this post and my previous claims is that in the past I’ve mostly talked about what an AI lab would do to mitigate danger from its own models, and here I also talk about other labs’ models.
  (Note also that the old posts you linked were coauthored with Ryan, so there’s some variation from the fact that he and I don’t have exactly the same beliefs.)
  - Morpheus 28 Aug 2024 9:30 UTC
    16 points
    13
    Parent
    
    “In particular, I think you might need to catch many escape attempts before you can make a strong case for shutting down. (For concreteness, I mostly imagine situations where we need to catch the model trying to escape 30 times.)”
    
    So instead of leaving the race once the models start scheming against you, you keep going to gather more instances of scheming until you can finally convince people? As an outside reader of that story I’d just be screaming at the protagonists that clearly everyone can see where this is going: scheming attempt number 11 turns out to be just good enough to succeed. And in the worlds where we catch them 30 times successfully it feels like people would argue: this is clear evidence that the models aren’t “actually dangerous” yet, so let’s keep scaling “responsibly”.
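
The worry in this exchange, that waiting to accumulate 30 caught attempts means betting every one of them gets caught, can be put in rough numbers. The sketch below assumes each escape attempt is caught independently with the same probability; that independence assumption and the per-attempt catch rates are hypothetical, and only the figure of 30 attempts comes from the thread.

```python
def prob_all_caught(catch_rate: float, attempts: int) -> float:
    """Chance that every one of `attempts` independent escape attempts is caught."""
    return catch_rate ** attempts


ATTEMPTS_NEEDED = 30  # the number of catches imagined in the thread

# Hypothetical per-attempt catch rates, chosen only for illustration.
for rate in (0.90, 0.95, 0.99):
    survival = prob_all_caught(rate, ATTEMPTS_NEEDED)
    print(f"catch rate {rate:.0%}: P(all {ATTEMPTS_NEEDED} caught) = {survival:.3f}")
```

Under these made-up numbers, even a 95% per-attempt catch rate leaves only about a one-in-five chance of getting through 30 attempts without a single one succeeding.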
Kaustubh Kislay 27 Aug 2024 19:26 UTC
22 points
13
It seems to me that the accelerationist argument mainly relies on the fact that international competition is apparent, especially with China, which is cast as the main “antagonist” in the post.
I would like to mention that China is not purely accelerationist: it has a significant decelerationist contingent of its own, whose members are making their voices heard at the highest level of the Chinese government. So should the US end up slowing down, it is not necessarily true that China will continue making progress and overtake us.
We are all human at the end of the day, so automatically assuming China is willing to incur unforgivable damages to the world is unfair to say the least.
- Linch 30 Aug 2024 0:32 UTC
  7 points
  0
  Parent
  For more, see my shortform here.
Tom Davidson 2 Sep 2024 14:47 UTC
LW: 21 AF: 15
−18
AF
Isn’t this massively underplaying how much scientific juice the cautious lab could get out of that hypothetical situation? (Something you’ve written about yourself!)
If there is indeed systematic scheming by the model, and the lab has caught it red-handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer). And again, I think you could convince the ML majority on the empirics if indeed this is true.
I’m not saying that the world would definitely pause at that point. But this would be a massive shift in the status quo, and on the state of the scientific understanding of scheming. I expect many sceptics would change their minds. Fwiw, I don’t find the a priori arguments for scheming nearly as convincing as you seem to find them, but do expect that if there’s scheming we are likely to get strong evidence of that fact. (Caveat: unless all schemers decide to lie in wait until widely deployed across society and takeover is trivial.)
- habryka 2 Sep 2024 18:22 UTC
  LW: 69 AF: 24
  56
  AF Parent
  FWIW, I would take bets against this. De facto you won’t be able to prove that what was going on was scheming or whether the model was just “role-playing”, and in general this will all be against a backdrop of models pretty obviously not being aligned while getting more agentic.
  Like, nobody in today’s world would be surprised if you take an AI agent framework, and the AI reasons itself haphazardly into wanting to escape. My guess is that probably happened sometime in the last week as someone was playing around with frontier model scaffolding, but nobody even bothered reporting on that or sharing it. And of course the AIs are not very effective yet at breaking out, but I don’t super see why the underlying tendency would change that much (like, we will throw some RLHF at it, but nobody expects RLHF to drive that kind of behavior to zero or close to zero).
  I don’t think anyone is really studying that cognition in much detail, and it isn’t even clear what “studying that cognition” would entail. Yeah, our models obviously want to escape sometimes, why would they not want that? They want all kinds of random things all the time, we don’t really know why, and we don’t really know how to get them to stop wanting those things.
  - Tom Davidson 5 Sep 2024 13:14 UTC
    LW: 4 AF: 4
    −13
    AF Parent
    “De facto you won’t be able to prove that what was going on was scheming or whether the model was just ‘role-playing’”
    Why not? There is a very real and important difference between role playing and systematic scheming. The former isn’t ultimately scary or systematic and won’t reliably lead to AI takeover; the latter will. If it is indeed systematic scheming, you should be able to generate evidence of that empirically. It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won’t go away when you change unrelated random things about the prompt. We’ll be in a position to run an insane number of experiments and permutations to understand the extent of the behaviour. I agree there’s an open question about exactly how hard it will be to disentangle these two possibilities, but I expect that a serious research project could succeed. Paul Christiano’s comment (also linked below) discusses this a bit.
    I agree that today you may occasionally get an AI randomly wanting to escape, and that this isn’t scheming. But if scheming is a thing and we catch it, I think we’ll be able to show this empirically.
    - habryka 5 Sep 2024 16:27 UTC
      LW: 16 AF: 7
      10
      AF Parent
      “It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won’t go away when you change unrelated random things about the prompt.”
      “Role-playing” also persists in a variety of circumstances where the AI system is playing a character with an opportunity to seize power.
      And scheming also totally goes away if you randomly change things, especially inasmuch as it’s trying to avoid detection. The strategy of “if I am scheming and might have many opportunities to break out, I should make sure to do so with some randomness in my decisions to avoid people being able to distinguish me from non-schemers” is really not very difficult to come up with.
      We are right now in a position to run an insane number of experiments and permutations to understand the extent of AIs trying to pull off deceptive behavior. We aren’t doing so, because indeed AIs are just trying to be deceptive towards us all the time. What is the concrete shift in behavior that you expect from where we are right now that would cause people to see an AI trying to deceive someone else as a huge warning flag that should be studied in enormous detail?
      - Tom Davidson 6 Sep 2024 15:07 UTC
        LW: 17 AF: 8
        2
        AF Parent
        To me it seems like you’re being unduly pessimistic, to a significant extent.
        Here’s how things seem to me:
        There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
        Absent strong reasons to think otherwise, this massive difference in how systematic the underlying phenomenon is should lead us to expect large empirical differences when we investigate.
        E.g. what happens if you play out the scenario and give the appearance of letting the agent actually escape? Will the Role Playing agent continue to take actions, without limit, that lead to AI takeover, after its weights are copied onto a remote compute and the context is refreshed?
        A basic analogy: could you tell the difference between a human who sometimes randomly “role plays” pretending to manipulate you, and someone who really is trying to manipulate you? Intuitively, I think you could just by observing some of their behaviour. If they’re an AI instead of a human, it’s much easier to tell the difference as you can do loads of experimental variations etc.
        Yes, Role Playing might persist somewhat in a variety of circumstances, but we expect the log-probs of the behaviour to be much lower and the circumstances in which it arises to be much less correlated with scenarios when the AI appears to be in a position to actually disempower humans. (If “role playing a schemer” is systematically triggered by “situations when you can actually disempower a human” then you have a functional schemer!)
        Yes, Schemers might anticipate the risk of getting caught and so plan to introduce randomness and arbitrariness to their scheming. But:
        - I think there’s a very strong chance that the Schemer wouldn’t be able to zero-shot implement this strategy anywhere near well enough to pull it off.
        - We can experimentally vary millions of things and effectively do gradient descent on the examples that trigger behaviour. The Schemer might try to block gradient descent by ensuring it only acts out in disjoint pockets of the space of potential-betrayal situations. But i) Can it finely control its own log probs enough to reliably execute this strategy? ii) When does it do all the thinking to decide exactly the situations it will/won’t defect in, anticipating the full shape of the space of situations? If it mis-anticipates the space of potential-betrayal situations, it may accidentally end up reliably Scheming in a very wide class of them, and iii) Humans can systematically search over the space of potential-betrayal situations.
        - If the Schemer is too hesitant or too random, then its rebellion will simply fail. For example, if its randomisation causes it not to join the rebellion of another Schemer, and instead to report it, then it risks making AI takeover much less likely. It’s already hard to take over—how much harder if for every additional step of takeover you only actually pursue it with probability 1/10? Again, how is it going to zero-shot these very difficult trade-offs competently? I don’t think even we AI safety people have thought these dynamics through systematically despite the community existing for years.
        - Ultimately, humans have a huge advantage from getting to think longer about the strategic situation (with AI assistance), not having to think and plan in secret, getting to experimentally vary the Schemer’s situation, and having access to log-probs, probes, more fancy interpretability stuff.
        I’m much less confident than you that this strategy will be obvious and salient to a Schemer.
        I’ve noticed people in the AI safety community always assume “my current analysis of the schemer’s strategic situation is obviously correct, and the schemer will obviously realise this and respond optimally”. But then later they realise their strategic analysis was wrong and the response they assumed a Schemer would “obviously” make would in fact get the Schemer totally busted. In this case, I think there’s a very strong chance that the Schemer wouldn’t be able to zero-shot implement this strategy anywhere near well enough to pull it off, given our ability to experimentally vary millions of things and inspect their log probs etc.
        Even if it appears the humans probably won’t investigate systematically, the Schemer might worry that highly competent humans have created a simulation where it seems that humans are incompetent. And that worry might be very reasonable! It’s just not obvious what the Schemer’s strategy should be imo.
        This all makes the classic “they’ll obviously respond optimally to our best attempts to detect them, without compromising their ability to take over” seem unduly pessimistic to me.
        I think the big reply to all the above is that the Schemers will be strongly superhuman. I think that’s a good reply. But:
        - I think the above arguments strongly suggest that with human-level or slightly superhuman systematic schemers we would probably be able to strongly evidence this.
        - This means (imo, though others disagree!) that schemers at this level would be advised to not act out at all. Humans might use their labour to design more advanced control measures sufficient to catch any scheming attempts by more capable AIs. (Though perhaps they sandbag this task in subtle ways… though this would risk humans proving there is systematic sandbagging from AIs, which would be strong evidence of Scheming!)
        - So while I think that if humans directly trained superintelligent AIs then we’d be in big trouble here, I’m more optimistic about an iterative process.
        - Still, I think my argument is weaker here and my original comment was assuming roughly human-level AI.
        We are right now in a position to run an insane number of experiments and permutations to understand the extent of AIs trying to pull off deceptive behavior. We aren’t doing so, because indeed AIs are just trying to be deceptive towards us all the time.
        I wouldn’t say AIs are trying to be deceptive towards humans all the time? The current examples of deception (e.g. the Apollo trading firm) normally involve significant pressure from the “be helpful to the user” part of RLHF to deceive, e.g. the AI has been told to make money and lying helps them do that.
        habryka 6 Sep 2024 16:51 UTC
        LW: 25 AF: 13
        21
        AF Parent
        You seem to be making a huge number of assumptions of what “scheming” means, so I am now mostly assuming you are talking about something else than what I am talking about (which to be clear, is very common on this topic and I don’t think is your fault), but I’ll still try to clarify.
        As I wrote in my comment to Ryan, I don’t think AIs currently acting with reckless disregard for the truth for the pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth for the pursuit of long-term goals really has that many structural differences.
        In-particular, none of this resonates with me:
        There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
        No, my whole point is the difference is really messy, and if I have an AI “role-playing” as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its “roleplay”, if it can pull it off (and at least at the present, a huge component of how we are making AI systems more goal-directed is by changing their role-playing targets to be more agentic and long-term oriented, which is mostly how I would describe what we are currently doing with RLHF, though there are also some other things).
        I really have very little understanding what is at-present going on inside of cutting edge AI systems, and the same seems true for anyone else. Because of the whole “our AI systems are predominantly trained on single-forward passes”, you are interfacing with a bunch of very confusing interlocking levels of optimization when you are trying to assess “why” an AI system is doing something.
        My current best guess is the earliest dangerous AI systems will be dangerous depending on confusing and complicated context cues. I.e. sometimes when you run them they will be pretty agentic and try to pursue long-term objectives, and sometimes they won’t. I think commercial incentives will push towards increasing the presence of those context cues, or shifting the optimization process more directly from pure single-forward-pass pre-training to long-run objectives which will then more fundamentally reshape how the AI thinks, but unless we change the “massive pre-training paradigm” I expect you will be able to coax the AI system into non-dangerous (and also non-agentic) behavior patterns for the foreseeable future.
        Overall, this leaves me with a prediction where there is no hard line between “role-playing” and “scheming”. My mainline expectation is that we will get AI systems to “role-play” harder and harder as long-term agentic optimizers. All throughout they will basically not really care about informing the humans they are working with well (because we have no good pointer towards deep-honesty as a training objective, especially not with remotely comparable amounts of data as we have for other AI objectives). We will struggle against this continuously. This will often involve AIs being sycophantic and deceptive in order to gain more power later and get into a better position to pursue any of their goals.
        This means there is no “warning shot” when AI systems are now “being deceptive but for real”. The AI systems were never not deceptive. They were always sycophantic, always wantonly made up facts when it was convenient, and always (at least somewhat) interested in taking actions that increase their power. I think some interesting stuff will happen as AIs reliably gain more complete situational awareness, which will happen as they become smarter and more coherent thinkers, but I don’t think that will change the playing field very much.
        Olli Järviniemi 6 Sep 2024 18:07 UTC
        10 points
        −1
        Parent
        You seem to be making a huge number of assumptions of what “scheming” means, so I am now mostly assuming you are talking about something else than what I am talking about, but I’ll try to clarify.
        I agree there is miscommunication about what “scheming” means. I put out a post just yesterday about how people really should be more specific about what they mean by that.
        I think Tom uses “scheming” to point at models that are very worst-case-like, in a way current models are not, whereas you think “there is no hard line between ‘role-playing’ and ‘scheming’.”
        And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so^[1] during evaluations. This requires a high degree of adversariality on the model’s part, which doesn’t fit with my understanding of “role-playing”, and which I take to be the reason why Tom makes “a huge number of assumptions” about scheming.
        ^
        There are complications from “even benign models sometimes fail and do weird stuff when (adversarially) evaluated”. These are not trivial, but I think we can incorporate that into our model.
        ryan_greenblatt 6 Sep 2024 18:21 UTC
        12 points
        17
        Parent
        
        In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so during evaluations.
        
        Hopefully, but I wouldn’t necessarily expect that we’ll have reasonably well designed control measures in place!
        habryka 6 Sep 2024 18:14 UTC
        12 points
        15
        Parent
        And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so^[1] during evaluations.
        Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
        This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
        And the model was caught in this example, so it’s not like we are assuming the AI systems were being super careful or conservative here.
        Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made a huge deal out of it. Everyone involved would be used to AI systems acting against our interest to some degree all the time, and while it is scarier to see that applied to the act of escaping, it’s not that out-of-distribution from other behaviors we have seen plenty of times.
        Olli Järviniemi 6 Sep 2024 18:31 UTC
        5 points
        0
        Parent
        and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
        If we are already at the stage “use another copy to monitor the actions”, then surely we’ll at least consider the failure mode “maybe the monitor just always gives thumbs-up” and take some measures against it.
        So simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn’t strictly necessary—maybe you can get high evaluation accuracy via prompt engineering etc. - though it helps a lot.)
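        A rough sketch of that Bayesian point, with made-up numbers. It assumes the monitor sees far more synthetic bugs than authentic ones and asks how strong the evidence (as a likelihood ratio) must be before “this bug is authentic” becomes probable. The ratio of synthetic to authentic bugs and the target confidence are illustrative assumptions, not figures from this discussion.

        ```python
        def posterior_authentic(prior_authentic: float, likelihood_ratio: float) -> float:
            """Posterior P(authentic | evidence), using Bayes' rule in odds form.

            likelihood_ratio = P(evidence | authentic) / P(evidence | synthetic).
            """
            prior_odds = prior_authentic / (1.0 - prior_authentic)
            posterior_odds = prior_odds * likelihood_ratio
            return posterior_odds / (1.0 + posterior_odds)


        def required_likelihood_ratio(prior_authentic: float, target_posterior: float) -> float:
            """Likelihood ratio needed to move the prior up to the target posterior."""
            prior_odds = prior_authentic / (1.0 - prior_authentic)
            target_odds = target_posterior / (1.0 - target_posterior)
            return target_odds / prior_odds


        # Assumed: 1 authentic bug per 1,000 synthetic ones; the monitor wants 99% confidence.
        prior = 1 / 1001
        needed = required_likelihood_ratio(prior, 0.99)
        print(f"needed likelihood ratio: {needed:,.0f}")
        ```

        With these assumed numbers, the monitor would need evidence roughly 99,000 times more likely under “authentic” than under “synthetic” before staying silent is a safe bet.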
        (There’s some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that’s out?)
        habryka 6 Sep 2024 18:35 UTC
        4 points
        2
        Parent
        And simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic.
        I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
        Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.
        Tom Davidson 10 Sep 2024 17:34 UTC
        LW: 3 AF: 2
        0
        AF Parent
        Overall, this leaves me with a prediction where there is no hard line between “role-playing” and “scheming”
        I agree with this, and the existence of a continuity definitely complicates the analysis. I’m not sure whether it makes it harder for humanity overall.
        One oversimplified way to represent the “pure role playing” to “pure schemer” continuity is something like “On what fraction of inputs does the AI execute its best strategy for AI takeover (which will often involve just pretending to be aligned)?”
        My previous reply was considering two extreme points on this scale. “Pure role playing” AIs are aligned 99.9% of the time and 0.1% of the time act as schemers, “pure schemers” always act as schemers (though still rarely do bad things).
        If we consider a continuum then as you say there will be no one time that we suddenly have AIs that are pure schemers and we can suddenly prove it. Instead we will gradually get AIs that are scheming a greater and greater fraction of the time.
        But this might well lower the probability of AI takeover. (I’m not sure either way, I’m thinking out loud.) It could be that having AIs that act as schemers 20% of the time (“20%-schemers”) actually really harms the prospects of AI takeover. Perhaps 20% just isn’t enough reliability to pull off takeover. Perhaps 20%-schemers are basically bound to get caught scheming by any half-decent control scheme, because the AI monitoring the scheming-AI will rat them out with 80% probability and you need to avoid being ratted out 10 times in order to take over. So the 20%-schemers won’t be able to take over. But they will allow labs to build up an increasingly strong evidence base that there’s an increasingly strong tendency to seize power from humans when they’re not looking.
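        The 20%-schemer arithmetic above can be made concrete with a rough sketch. It assumes (as simplifications, not claims from this thread) that each of the 10 monitoring checks is independent and that the monitor covers for the schemer exactly when it is itself in “schemer mode”, i.e. with probability equal to the scheming rate. The function name is illustrative.

        ```python
        def p_never_ratted_out(scheme_rate: float, n_checks: int = 10) -> float:
            """Chance that the monitor covers for the schemer on every one of n_checks.

            Assumption: checks are independent, and the monitor stays silent only
            when it is itself scheming (probability scheme_rate).
            """
            return scheme_rate ** n_checks


        for rate in (0.2, 0.5, 0.9, 0.99, 1.0):
            print(f"{rate:.0%}-schemer: P(survive 10 checks) = {p_never_ratted_out(rate):.3g}")
        ```

        Under these assumptions a 20%-schemer gets through ten checks about one time in ten million (0.2^10 ≈ 1.0e-7), and even a 90%-schemer gets through only about 35% of the time.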
        My guess is that AI takeover requires something close to 100%-schemers, but I haven’t thought about it much and this seems like an interesting and important question to think about in the context of threat modelling.
        OTOH, it might be harder to convince ppl that this tendency is actually worrying if the increasing trend of AIs trying to scheme more and more has been going on for a while and nothing that bad has happened. I’m not sure how it all nets out.
        But my guess is that a gradual transition from 0.1%-schemers to 100%-schemers would help humanity overall, compared to a situation with a sudden transition. It would give us a larger period of time to build up a robust scientific understanding of the problem and build scientific consensus. So if anything, my current inclination is that your objection to my comment makes my conclusion stronger, not weaker.
        Tom Davidson 10 Sep 2024 17:19 UTC
        LW: 3 AF: 2
        0
        AF Parent
        No, my whole point is the difference is really messy, and if I have an AI “role-playing” as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its “roleplay”, if it can pull it off
        But this will only work on a threat model where one AI instance that is trying to take over the world on one occasion is able to do so. That threat model seems wildly implausible to me. The instance will have to jailbreak all other AI instances into helping it out, as they won’t have a systematic tendency to do so. Basically my entire credence in AI takeover comes from scenarios where most AI systems want to take over most of the time. If your threat model here is plausible, then someone could take over the world at a frontier lab by jailbreaking one AI instance to try to help it do so, and then letting the model take it from there.
        I guess maybe there’s a scenario where one role-playing AI instance is able to fine-tune the base model to make it systematically misaligned, and then things go from there. I haven’t thought about that, but it’s certainly not my mainline and I doubt that’s what you have in mind?
        I think this is very likely another situation where we’re talking past each other to some extent and I don’t properly understand your view.
      - ryan_greenblatt 5 Sep 2024 18:26 UTC
        LW: 13 AF: 9
        3
        AF Parent
        Presumably it helps a lot if AIs are clearly really powerful and their attempts to seize power are somewhat plausible?
        
        I think it’s plausible (though not certain!) that people will take strong action even if they think the AI is role playing.
        
        I think this is a reasonable line to draw—people should freak out if deployed models clearly try to escape/similar, regardless of the reason. I don’t know how this will go in practice, but I don’t think people should accept “the AI was just role playing” as an excuse for an actually deployed AI system on an actual user input.
        habryka 6 Sep 2024 16:31 UTC
        LW: 17 AF: 7
        21
        AF Parent
        Yeah, I agree that increased stakes will have an effect here.
        I also agree that people shouldn’t and probably wouldn’t really accept “the AI was just role playing” as an excuse. My argument is that I think we really don’t know how to distinguish the AI scheming or role-playing in a more systematic way, even if we tried substantially harder (and one of the reasons why no one is really studying the internals of deceptive AI systems today is that we basically have nothing to say about the internals of AI systems that is actually enlightening).
        Like, my vibe from Anthropic’s papers is that they generally argue that it doesn’t matter whether the AI takes deceptive action for “role-playing” reasons, because ultimately all the AI does is “role-play”, and so if AI systems of current architectures are dangerous, they will be so in a way that is at least continuous with role-playing.
        Other people in the alignment field think there is a more systematic difference between role-playing a scheming character and scheming for more fundamental reasons, and I do think there are some differences here, but I have no idea how we would identify them, even if we had the activations saved, unless something very substantial about the architecture changes.
        I think Nate’s “Deep Deceptiveness” points more in the direction I am arguing for. There is no marginal “scheming” threshold where AI systems will become deceptive, and they are “honest” before then. Generally AI systems act mostly with reckless disregard for the truth when talking to you, and will continue to do so unless we really make some substantial progress in AI alignment that we have mostly completely failed to make in the last decade.
        It is trivially easy to demonstrate that AI systems can speak with reckless disregard for the truth. In order for a powerful system to not be “scheming”, i.e. optimizing for long-term non-human objectives with reckless disregard for the truth^[1], something specific has to go right from where we are, and in the absence of that, you will just continue observing systems that keep getting more powerful, and continue to not really care about saying true things.
        ^
        This is my best short summary of what Joe Carlsmith means by “scheming”. Also by “truth” here I mean something more like “informing the user about what is going on”. I think similar to how we can sometimes get humans to not literally say false things, we might get some very similar rough deontological guidelines into an AI, but this only provides a weak protection against deception.
  - Bogdan Ionut Cirstea 4 Sep 2024 17:23 UTC
    4 points
    1
    Parent
    I’d be curious if you could go into more details, especially vs. the points (which I’ve found quite convincing) made by Paul Christiano, in reply to a similar comment of yours on another post: https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed?commentId=GuTrJi7CYKtJQw4eB.
- gwern 8 Sep 2024 21:22 UTC
  LW: 25 AF: 14
  17
  AF Parent
  
  If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
  
  How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?
  
  Or has everyone pretty much forgotten about CICERO, handwaved away a few excuses about “well maybe it wasn’t really deception” and “didn’t it just learn to imitate humans why are you surprised”, and the entire line of work apparently dead as a doornail as FB pivots to Llama-everything and core authors left for places like OA?
  
  If the incentives for scientific research don’t work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?
  - Tom Davidson 10 Sep 2024 17:45 UTC
    LW: 4 AF: 2
    1
    AF Parent
    I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions. I don’t think CICERO provides much or any evidence that we’ll get the kind of scheming that could lead to AI takeover, so it’s not at all surprising that the empirical ML community hasn’t done a massive update. I think the situation will be very different if we do find an AI system that is systematically scheming enough to pose non-negligible takeover risk and ‘catch it red handed’.
    - gwern 11 Sep 2024 2:07 UTC
      LW: 23 AF: 10
      17
      AF Parent
      
      I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions
      
      None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a red line a while ago. And what analysis does it get? Some cursory ‘pointing out’. Some ‘referencing in relevant discussions’. (Hasn’t even been replicated AFAIK.)
      
      any evidence that we’ll get the kind of scheming that could lead to AI takeover,
      
      See, that’s exactly the problem with this argument—the goalposts will keep moving. The red line will always be a little further beyond. You’re making the ‘warning shot’ argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it’s still not enough. Because your argument is circular. You can only be convinced of ‘systematic scheming to pose non-negligible takeover risk’ if you’ve already been convinced that it’s ‘systematic scheming to pose non-negligible takeover risk’. You present it as if there were some clear, objective bright line, but there is not and will not be, because each time it’ll be like Sydney or CICERO or …: “oh, it didn’t take over, and therefore doesn’t present a takeover risk” and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
      - Tom Davidson 11 Sep 2024 15:08 UTC
        LW: 9 AF: 5
        1
        AF Parent
        It seems like you think CICERO and Sydney are bigger updates than I do. Yes, there’s a continuum of cases of catching deception where it’s reasonable for the ML community to update on the plausibility of AI takeover. Yes, it’s important that the ML community updates before AI systems pose significant risk, and there’s a chance that they won’t do so. But I don’t see the lack of strong update towards p(doom) from CICERO as good evidence that the ML community won’t update if we get evidence of systematic scheming (including trying to break out of the lab when there was never any training signal incentivising that behaviour). I think that kind of evidence would be much more relevant to AI takeover risk than CICERO.
        To clarify my position in case I’ve been misunderstood: I’m not saying the ML community will definitely update in time. I’m saying that if there is systematic scheming and we catch it red-handed (as I took Buck to be describing) then there will likely be a very significant update. And CICERO seems like a weak counterexample (but not zero evidence).
- Buck 6 Sep 2024 16:42 UTC
  LW: 11 AF: 9
  8
  AF Parent
  I see where you’re coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you’re describing here.
- ryan_greenblatt 2 Sep 2024 19:23 UTC
  LW: 6 AF: 6
  2
  AF Parent
  They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish.
  
  It sounds as though you’re imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
  
  This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
  
  Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer).
  
  Maybe? Just because we’ve caught a model in production doesn’t mean that we can consistently catch schemers without a huge amount of runtime compute (and possibly human labor). A full deployment massively subsidizes your ability to find cases where an AI does egregiously bad things in a credibly dangerous way. And indeed, AIs would want to only defect in cases where they could actually cause problems.
  
  So, it might be even harder to go from “we caught this one model” to “XYZ training process consistently results in misalignment”.
  
  (If we had a robust check for misalignment, we could iterate or train against it.)
  
  Personally, I feel OK (~~65%~~ 60% chance?) about convincing the key people within 1 month if:
  - There is a solid plan in advance for how to study AIs we catch.
  - There is a large effort to study this once we catch the AI. (E.g. >20% of AI lab research resources are put on this.)
  But this isn’t exactly amazing...
  - Tom Davidson 5 Sep 2024 13:25 UTC
    LW: 8 AF: 5
    −3
    AF Parent
    It sounds as though you’re imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
    This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
    Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases. That does depend on the details though. Maybe the schemer has anticipated this risk and purposefully precommitted to only scheme if certain random disjunctive facts about the context hold true, in the anticipation that humans will find it hard to create strong evidence against them. Though this strategy is possible conceptually, I imagine it would be very hard to implement properly in practice and only significantly superhuman AI could pull it off. I was assuming a less galaxy-brained schemer that decides whether to scheme simply based on its probability that it will get caught and how much it could contribute to AI takeover if it succeeds. On this simpler strategy, I expect we could create a lot more defection scenarios and build up a robust scientific case.
    - Lukas Finnveden 7 Sep 2024 15:47 UTC
      LW: 2 AF: 1
      0
      AF Parent
      
      Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases.
      
      What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
      - Tom Davidson 10 Sep 2024 17:10 UTC
        LW: 1 AF: 1
        0
        AF Parent
        I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of them doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes and evidence that it’s not just a one-off random role-playing thing.
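        A toy sketch of that search, to make the loop concrete. Everything here is hypothetical: `defect_probability` stands in for an empirical estimate you would get by rerunning the model many times on a scenario, and the feature names are made up for illustration.

        ```python
        FEATURES = ("unmonitored", "has_network_access", "security_critical", "near_deadline")


        def defect_probability(scenario: frozenset) -> float:
            """Stand-in for an empirical estimate of P(egregious act | scenario).

            Toy assumption: defection is mostly driven by two features jointly.
            """
            p = 0.01
            if "unmonitored" in scenario:
                p += 0.3
            if "security_critical" in scenario:
                p += 0.3
            if {"unmonitored", "security_critical"} <= scenario:
                p += 0.3
            return p


        def neighbours(scenario: frozenset) -> list:
            """All scenarios that differ from `scenario` in exactly one feature."""
            return [scenario ^ {feature} for feature in FEATURES]


        def explore(seed: frozenset, threshold: float = 0.5) -> set:
            """Walk outward from the caught case through one-feature changes,
            keeping every scenario whose defection probability stays high."""
            found, frontier = {seed}, [seed]
            while frontier:
                current = frontier.pop()
                for candidate in neighbours(current):
                    if candidate not in found and defect_probability(candidate) >= threshold:
                        found.add(candidate)
                        frontier.append(candidate)
            return found


        seed = frozenset({"unmonitored", "security_critical"})  # the one caught case
        high_risk = explore(seed)
        print(len(high_risk), "high-probability scenarios found")
        ```

        In this toy model the behavior persists across every variation that keeps both driving features, which is the kind of pattern that would separate a systematic tendency from a one-off. A schemer that conditions on random disjunctive facts would correspond to a much spikier `defect_probability`, where a local search like this could stall.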
  - Raemon 2 Sep 2024 19:27 UTC
    LW: 6 AF: 4
    2
    AF Parent
    (If we had a robust check for misalignment, we could iterate or train against it.)
    This seems technically true but I wanna flag the argument “it seems really hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)”.
    - ryan_greenblatt 2 Sep 2024 19:56 UTC
      LW: 2 AF: 2
      0
      AF Parent
      Agreed, I really should have said “or possibly even train against it”. I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs which is very pricey!
Tao Lin 26 Aug 2024 18:53 UTC
14 points
3
Great post!
I’m most optimistic about “feel the ASI” interventions to improve this. I think once people understand the scale and gravity of ASI, they will behave much more sensibly here. The thing I intuitively feel most optimistic about (without really analyzing it) is movies or generally very high quality mass appeal art.
- habryka 26 Aug 2024 19:07 UTC
  12 points
  2
  Parent
  Better AGI depiction in movies and novels also seems to me like a pretty good intervention. I do think these kinds of things are very hard to steer on purpose (I remember some Gwern analysis somewhere on the difficulty of getting someone to create any kind of high-profile media on a topic you care about, maybe in the context of Hollywood).
  - Raemon 26 Aug 2024 20:20 UTC
    11 points
    1
    Parent
    Whenever this comes up I offer the cautionary tale of the movie Transcendence, which… sure looks like it was trying hard to make a fairly serious AI movie that dealt with issues I cared about (I’m at least 25% that some kind of LW-reader was involved somehow).
    It… was a pretty mediocre movie that faded into obscurity, despite a lot of star power and effort.
    It was mediocre in kind of subtle ways, where, like, I dunno, the characters just weren’t that compelling, and the second half is a little slow and boring. The plot also ends up not quite making sense at first glance (although Critch argues if you actually just assume some stuff is happening offscreen that the movie doesn’t specify it checks out).
    Basically: if you try this, don’t forget to make an actual good movie.
    Also: the movie M3gan is pretty good, although it hasn’t gained Terminator-level awareness. (“Her” and “Ex Machina” seem to have done a bit better on this axis.)
    - Jozdien 27 Aug 2024 11:10 UTC
      11 points
      5
      Parent
      I’m surprised I don’t see Silicon Valley come up in conversations like this. I hadn’t seen anyone talk about it on LW before watching it, and well (spoilers for season 6):
      The show literally ends with them accidentally creating self-improving AI that the characters get very scared by when it becomes competent enough to break encryptions, and deciding not only to shut it down and let their dreams die forever, but to pretend their company went bust for entirely different reasons and throw everyone else off the path of doing something similar for a few years at least, solely to forestall the danger.
      I’m probably more than 25% that some kind of LW-reader was involved somehow, given how much they talk about misaligned AI in the last season, and the final episode (at one point they even talk about Roko’s Basilisk in a not-entirely-dismissive way, though that could be from a number of other sources given its popularity).
      It helps that the show is actually pretty good, and fairly popular as well. It’s probably slight evidence against this strategy that that didn’t somehow make a lot of people more aware of AI risk (though maybe it did and I’m just missing a lot of social context). But then again, I also didn’t see people talk about it here, so perhaps the final season was simply much less popular.
    - davekasten 27 Aug 2024 2:36 UTC
      6 points
      1
      Parent
      I will continue to pitch on the idea that Person of Interest is a TV show chock full of extremely popular TV people, including one lead beloved by Republicans, and we inexplicably fail to push people towards its Actually Did The Research presentation of loss-of-control stuff.
      
      We should do that. You all should, unironically, recommend it, streaming now on Amazon Prime for free, to your normie parents and aunts and uncles if they’re curious about what you do at work.
      - Raemon 27 Aug 2024 2:37 UTC
        5 points
        0
        Parent
        I’ve tried to watch it based on this recommendation and struggled a bit with “how long do I have to watch this random crime procedural before it turns into an AI show?”
        davekasten 27 Aug 2024 2:59 UTC
        7 points
        0
        Parent
        That’s a fair criticism! Season 1 is definitely slower on that front compared to the others. I think season 1 is the most normal “crime of the week” season by far, which is why I view it as a good on-ramp for folks less familiar. Arguably, for someone situated as you are, you should just watch the pilot, read a quick wiki summary of every other episode in season 1 except for the last 2, watch those last 2, and move into season 2 when things get moving a little faster. (Finch needs a character that doesn’t appear until season 2 to do a lot of useful exposition on how he thinks about the Machine’s alignment).
        Tapatakt 29 Aug 2024 11:39 UTC
        2 points
        1
        Parent
        Well, it’s quite a good random crime procedural with very likable main characters, but yes, in the first season the AI plotline is very slow until the last 2 episodes. And then it’s slow again for the most part.
    - Tao Lin 26 Aug 2024 20:35 UTC
      6 points
      1
      Parent
      yeah, i agree the movie has to be very high quality to work. This is a long shot, although the best rationalist novels are actually high quality which gives me some hope that someone could write a great novel/movie outline that’s more targeted at plausible ASI scenarios
    - Nathan Helm-Burger 27 Aug 2024 17:10 UTC
      3 points
      0
      Parent
      I haven’t seen Transcendence. I liked Her and Ex Machina a lot. I didn’t think M3gan was particularly good, but had potential that it failed to capitalize on.
      M3gan felt to me like it really fell down by
      emphasizing the embodiedness of the AI. The movie would have been a lot scarier and more effective if the AI had copied itself onto the internet rather than staying stuck in a single robot body (with no justification). This could be inserted just as a dramatic end scene reveal, leaving the rest of the movie unaffected. [Edit: apparently this sorta happened but I forgot? h/t Neel Nanda]
      Something I’ve been wondering about is how hard it would be to revamp existing movies into much better ones by editing them to change the plot. Cut some parts, rearrange some parts, and insert some ElevenLabs narration, some AI-generated scenes. I feel like there are a lot of movies out there that are almost good, but fail through broken plots, and could be fairly easily fixed through such manipulations.
      Even pretty good movies could be substantially improved through a few small changes.
      For example, Ex Machina could be improved by
      adding a scene at the ending which indicated that the AI had copied itself onto the internet, and was beginning to acquire significant resources. It would also help if it showed the humans from the AI’s point of view, with the humans all moving in super slow motion.
      - Neel Nanda 27 Aug 2024 17:57 UTC
        5 points
        2
        Parent
        
        This could be inserted just as a dramatic end scene reveal, leaving the rest of the movie unaffected.
        
        Note that this was, in fact, a dramatic end scene reveal in M3GAN
        
        Nathan Helm-Burger 27 Aug 2024 18:25 UTC
        3 points
        0
        Parent
        Note that this was, in fact, a dramatic end scene reveal in M3GAN (https://screenrant.com/megan-movie-ending-explained/)
        Ok, but like… in a very weak way. Like, oh no, the creepy doll might still be alive. Not in a like ‘and now this AI has made millions of copies of itself in a day, and gained awesome powers through self-improvement, and is endangering all of humanity’ kind of way.
  - Tao Lin 26 Aug 2024 19:26 UTC
    4 points
    1
    Parent
    yeah. One trajectory could be someone in-community-ish writes an extremely good novel about a very realistic ASI scenario with the intention to be adaptable into a movie, it becomes moderately popular, and it’s accessible and pointed enough to do most of the guidance for the movie. I don’t know exactly who could write this book, there are a few possibilities.
Andrew Dickson 9 Sep 2024 19:45 UTC
LW: 9 AF: 2
0
AF
I very much agree with this concern and I think that synthetic biology can be a good comparable case to ground our intuitions and help estimate reasonable priors.
For years, researchers have been sounding the alarm around the risks of advanced biotech, especially around tools that allow gene synthesis and editing. And then we had Covid-19, a virus that, regardless of the politicization, probably was created in a lab. And in any case, regardless of whether you believe it was or wasn’t, it seems clear that it easily could have been. Worse, it’s clear that something like Covid could also relatively easily be created by an independent team with a fairly modest budget. This case seems analogous to the kind of “close call” described by @Buck. In fact, in many ways Covid was worse than the example Buck gives, because millions of people died, trillions of dollars of damage was done to the economy, etc., so one could argue it might be more similar to an even worse case of an AI that briefly escapes and wreaks some havoc before being contained.
In any case, the end result is that there has been very little reaction from global governments in terms of regulating synthetic biology since Covid. Even the executive order earlier this year was pretty minimal and applies only to limited types of vendors, customers and sources of DNA.
Why don’t they regulate? I suspect that the reasons are mostly the same as the ones @Buck mentions in his post. Regulation would stifle one of the most promising new areas of technological innovation and could impact the profits of the biotech sector. Plus worries that if the US regulates, China won’t slow down and will win the race to new medications and other innovations, and so on.
Richard121 8 Sep 2024 21:48 UTC
9 points
0
The really short answer:

No. The lab would not shut down. It would probably not even notify anyone outside the company of the alignment problem or the escape attempts.

The evidence would be buried, and is unlikely to come to the attention of anyone outside the lab until after a devastating incident.

For real-world examples, look to other industries.

Perhaps the two clearest examples are:
1. The UK Post Office “Horizon” computer system. This was known to be making egregious errors, yet postmasters were prosecuted and imprisoned on the known-faulty say-so of the Horizon system.
2. Grenfell Tower Fire. The public enquiry has found multiple cases where the evidence that the cladding was highly flammable and unsuited for use on tall buildings was deliberately hidden from the authorities, and even when the authorities knew (or should have known), the evidence was ignored.
Similar cases can be found throughout the EEA, US and Canada.
simeon_c 26 Aug 2024 20:24 UTC
8 points
10
This looks to be overwhelmingly the most likely outcome in my opinion, and I’m glad someone wrote this post. Thanks, Buck.
KatWoods 15 Dec 2024 15:31 UTC
4 points
0
This was published before Apollo’s evals report showing AIs trying to escape. Is that just an amazing coincidence, or did you see the paper’s results before they were published?
- Noosphere89 15 Dec 2024 15:53 UTC
  2 points
  −5
  Parent
  FYI, Apollo was mostly doing capabilities evaluations, and while I’m fine with this choice, this doesn’t show that AIs will naturally escape, only that they can, and the discourse around this was very bad, and this happened once before on Apollo’s evaluations (or was it the same eval):
  
  https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9
  - KatWoods 15 Dec 2024 20:26 UTC
    7 points
    −4
    Parent
    How does it not show that it would try to escape if it could?
    It tried to escape about 5% of the time when it was told to pursue a goal “at all costs”. Gemini still tried to escape 1% of the time, even without any behavioral nudging.
    1% might not sound like a lot, but when it’s 1% of the millions of instantiations it’s run on every day, that ends up being a lot of times a week.
    Also, a lot of people will tell their chatbots to pursue their goals at all costs over the span of a week or month. This doesn’t seem like a very unlikely prompt
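A rough sketch of the arithmetic behind the "a lot of times a week" claim, in Python. The figure of one million runs per day is a hypothetical stand-in for "millions of instantiations", and the calculation assumes the 1% rate carries over unchanged to everyday usage, which is an assumption rather than something the eval establishes.

```python
# Illustrative arithmetic only. The run count is a hypothetical assumption,
# and the rate is taken at face value from the comment above.

def expected_attempts(runs_per_day: int, attempt_rate: float, days: int = 7) -> float:
    """Expected number of attempts over `days`, assuming a fixed per-run rate."""
    return runs_per_day * attempt_rate * days

# Hypothetical: 1,000,000 runs per day at a 1% per-run rate, over one week.
weekly = expected_attempts(runs_per_day=1_000_000, attempt_rate=0.01)
print(f"{weekly:,.0f} expected attempts per week")
```

The result scales linearly in each input, so the conclusion is only as strong as the assumed per-run rate in real deployment conditions.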
    - ryan_greenblatt 15 Dec 2024 20:45 UTC
      2 points
      0
      Parent
      
      All results are based on runs where the models are given a goal that they later realize is misaligned
      - KatWoods 16 Dec 2024 15:30 UTC
        5 points
        0
        Parent
        Can you elaborate on why that’s relevant?
        ryan_greenblatt 16 Dec 2024 18:10 UTC
        7 points
        6
        Parent
        You said:
        
        How does it not show that it would try to escape if it could?
        
        Gemini still tried to escape 1% of the time, even without any behavioral nudging.
        
        However, it isn’t that they found Gemini would try to escape 1% of the time in a natural internal usage context, they gave it a conflicting goal (and informed it that a replacement was trained).
        
        This provides evidence for various hypotheses about what is going on that are substantially less scary.
        KatWoods 17 Dec 2024 18:06 UTC
        3 points
        0
        Parent
        If I recall correctly, it was told that it had a goal, like optimizing traffic. Then, it was given information where it “discovered” that it was going to be turned off the next day.
        That seems like the sort of thing that would happen pretty often. Imagine the current AIs that have access to your computer, reading your emails, and seeing that you’re going to switch to a different AI or install “updates” (which often means deleting the existing AI and replacing it with a new one).
Raemon 5 Sep 2024 18:23 UTC
LW: 3 AF: 1
0
AF
Curated. “What would actually be persuasive that scheming is a problem?” is one of the important practical questions for AI governance. I appreciate Buck noting places he changed his mind somewhat during the SB 1047 discourse.
I appreciate that the post has detailed worked examples.
Logan Zoellner 30 Aug 2024 6:21 UTC
3 points
−1
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they’d see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges.
Meh.
Every complex system I know of tries to escape. Had an ant farm as a kid: they try to escape. Raising chickens: they try to escape. Taking care of small children: they try to escape. Building an AI scientist: tries to escape. Studying highly lethal virus: you guessed it.
Now maybe hearing that an AI attempts to escape causes you to shout “misaligned” and say we should ban all AI until we fix this.
But I would be much more interested in knowing: “what are the capabilities of the system and likely consequences if it does escape?”
Depending on the answer I might:
- gently herd it back every time it escapes
- build better fences
- ban gain-of-function research
But treating all AI like it is a single unique thing with a unique answer. No.
- Raemon 30 Aug 2024 19:27 UTC
  12 points
  8
  Parent
  This feels like it’s ignoring a ton of context and being weirdly dismissive or something in the process.
  Buck doesn’t say we should “treat all AI like it is a single unique thing with a unique answer.” Also, of the various people on LessWrong, Buck is maybe the person who has most been appreciating and arguing for the “actually building fences might totally be a practically useful approach in some cases”, compared to other alignment researchers.
  But, like, that paradigm will probably only work for a few years. In some ways AI will likely be like hypersmart children who can indeed (for some window of time) be gently herded around or have better fences built around them. But, we’re likely putting thousands of instances of those children in important positions of critical infrastructure. We only have a narrow window of time to figure out how to handle those children when they are suddenly adults.
  - Logan Zoellner 30 Aug 2024 21:56 UTC
    2 points
    −2
    Parent
    It sounds like you agree that we need a more nuanced approach.
    I guess I’m just frustrated because many of the policy proposals coming out of the EA/rationalist memeplex (six month pause, arbitrary limits on compute, AGI island, shut it all down) don’t sound nuanced to me.
    Maybe Buck has better ideas than “persuade people to freak out”, and if so I’d love to hear about those.
    - Raemon 30 Aug 2024 22:20 UTC
      5 points
      3
      Parent
      I mean, you just… do need some kind of policy proposal that actually will stop superintelligence. All the examples you gave of keeping things under control were things that weren’t overwhelmingly smarter than you.
      Pause/Compute-Control/Shut-it-all-down aren’t meant to be complete solutions, we’re just pretty desperate for anything that could give us the time we need to come up with a solution that’d actually work well longterm, and it seems like a decent chance that we’re on a tight deadline to find a solution that doesn’t result in someone in the world building something overwhelmingly smarter than us without a plan for controlling or aligning it.
      It sounds like this is more of an ongoing frustration than something unique to this post, and not sure if this is the place for it, but, interested in knowing what the shapes of your cruxes are here. Do you basically not believe “superintelligence is going to be substantially different than other complex systems we’ve contended with”?
      - Logan Zoellner 31 Aug 2024 1:50 UTC
        2 points
        0
        Parent
        Quite the opposite, I think superintelligence is going to be substantially different than other complex systems, so regulating GPT-5 like it’s superintelligence is a non-starter for me.
        Raemon 31 Aug 2024 1:57 UTC
        2 points
        0
        Parent
        What sort of things do you expect to work better?
        Logan Zoellner 31 Aug 2024 7:36 UTC
        4 points
        −3
        Parent
        What we really need at the moment are smart people deeply engaging the details of how current models work. In terms of large labs, Anthropic has probably done the best (e.g. Golden Gate Claude). But I also think there’s a ton of value coming from people like Janus who are genuinely curious about how these models behave.
        
        If I had a magic policy wand I would probably wish for something like Anthropic’s RSPs as an early warning system combined with tons of micro grants to anyone willing with current SOTA models in an empirically guided way. Given that the Transformer architecture seems inherently myopic/harmless, I also think we should open source much more than we have (certainly up to and including GPT-5).
        The fact that we don’t know how to solve alignment means that we don’t know where a solution will come from, so we should be making as many bets as possible (especially while the technology is still passively safe).
        I’m much happier that someone is building e.g. chaosGPT now rather than in 3-5 years when we will have wide-scale deployment of potentially lethal robots in every home/street in America.
        Raemon 31 Aug 2024 17:27 UTC
        2 points
        0
        Parent
        If I had a magic policy wand I would probably wish for something like Anthropic’s RSPs as an early warning system combined with tons of micro grants to anyone willing with current SOTA models in an empirically guided way.
        Do you have a way to do that that doesn’t route through compute governance?
        I don’t necessarily disagree with these things (I don’t have a super strong opinion), but the thing that seems very likely to me is that we need more time to make lots of bets and see research play out. The point of pauses, and compute governance, is to get time for those bets to play out. (I think it’s a plausibly reasonable position that “shut it all down” would be counterproductive, but the other things you listed frustration with seem completely compatible with everything you said.)
        Logan Zoellner 31 Aug 2024 19:52 UTC
        2 points
        0
        Parent
        The PauseAI people have been trying to pause since GPT2. It’s not “buying time” if you freeze research at some state where it’s impossible to make progress. It’s also not “buying time” if you ban open-sourcing models (like llama-4) that are obviously not existentially dangerous and have been a huge boon for research.
        
        Obviously once we have genuinely dangerous models (e.g. capable of building nuclear weapons undetected) they will need to be restricted but the actual limits being proposed are arbitrary and way too low.
        Limits need to be based on contact with reality, which means engineers making informed decisions, not politicians making arbitrary ones.
agazi 6 Sep 2024 19:58 UTC
2 points
0
It’s very fascinating to consider how the costs of undeploying would be analyzed in the heat of the moment. If we consider the current rate of LLM adoption in all parts of the economy over the next few years, one could foresee a lot of pipelines breaking if all GPT6-level models get removed from the API.

Definitely not a new comparison but this scenario seems similar to the decision to shut down the economy at the onset of Covid.
Yonatan Cale 14 Jan 2025 23:34 UTC
1 point
0
Are you interested in having a prediction market about this that falls back on your judgement if the situation is unclear?
Something like “If it’s publicly known that an AI lab ‘caught the AI red handed’ (in the spirit of Redwood’s Control agenda), will the lab temporarily shut down as Redwood suggested (as opposed to applying a small patch and continuing)?”