My model, for years now, has been that AI Safety at frontier labs is essentially a bunch of shallow instincts/behaviors covering up an ultimately Pythian power-maximizing entity. Now seems like a good time to outline that model.
Concrete framing
First, by “AI Safety” I mean all of:
Employees working on AI alignment and control.
Company-wide mindsets/narratives regarding being safety-conscious and responsible.
Specific people’s self-narratives/self-images about what they are doing, especially the leadership’s.
Internal projects focused on AI Safety.
Internal policies that prioritize safe and responsible R&D practices.
External policies regarding how to relate to regulations, the public, and other frontier labs.
Et cetera; the entire cluster of abstract objects relating to AI Safety.
Those AI Safety objects do influence what the lab does, to some extent. But my current model is that, if any of them come into meaningful conflict with what would maximize the lab’s power/ability to advance capabilities, they would be bent or discarded every time. As in:
The lab would have an internal narrative of being safety-conscious and benevolent. The lab would often take actions in line with this narrative, just by inertia. But that self-narrative will always twist to explain any action the lab takes as being justified, often via galaxy-brained reasoning, even if the action originates from hollow power-maximization and blatantly contradicts being responsible.
The lab will instinctively engage in various AI Safety research projects, as long as those projects are cheap enough that funding them won’t meaningfully cut into the funding for capability scaling.[1] But if a given safety project, no matter how promising, will cost a lot, resources would not be directed to it.
See: OpenAI’s Superalignment being denied the promised compute.
The lab would employ tons of people genuinely thinking of themselves as well-meaning and responsible. Those people are often going to be allowed to do their thing, and commandeer significant-in-an-absolute-sense resources in the doing. But their self-narratives will likewise adjust to justify any of the lab’s actions as well-meaning. If not, if a given person is too self-aware/principled for a given feat of mental gymnastics, that person will be discarded instead (fired/purged, or leaving of their own free will).
See: all the exoduses from OpenAI, and now the recent ones from Anthropic.
The lab will implement various internal safety policies and commitments, but they will be discarded/ignored if they get in the way of a highly profitable deployment.
See: Anthropic’s ASL, and everything to do with Responsible Scaling Policies in general.
The products of the lab would be visibly influenced by AI Safety objectives, but in superfluous nice-to-have-addendum ways, rather than by meaningfully changing their shape.
See: Anthropic’s Claude Constitution/soul-doc stuff, and their various dances with “model welfare” like the ability to very rarely refuse, exit interviews...
The lab will emit messages of being responsible, of supporting regulations, of wanting to pause, et cetera, as long as it’s cheap. But the moment an opportunity to take some responsible action that would actually impede its ability to power-maximize comes up – an opportunity to “wake up” the public, to push through actually binding regulations, to cooperate with other labs – that action would not be taken, and the internal narratives will invent some reasons why it’s the responsible way to proceed actually.
Basically, I think the lived experience of working at (some of the) frontier labs genuinely makes one feel like they’re a force for good. Tons of people there are genuinely well-meaning in a deep way, and there’s a narrative of responsibility and well-meaning shared by all employees – possibly including the leadership! – and tons of costly-in-an-absolute-sense projects are aimed at realizing that vision,[2] and those projects visibly influence what the deployed AI artefacts look like, and this is all supported by internal procedures and external commitments, and any changes to any of this always seem justified...
But that none of this actually matters; none of this actually has any teeth.
The AGI lab may not be consciously maximizing for power with safety as a thin pretense. Depending on the lab, even the top brass may genuinely believe in the narrative of responsibility. When resources are being directed towards safety projects, when the lab makes promises to be careful, etc., nobody there explicitly thinks about that as throwing scraps from the table to placate employees’ desire to feel good about themselves and to boost PR. But modeling it this way will still result in basically accurate predictions.
Metaphorical framing
The lab is a Pythian monster with a bunch of shallow instincts towards responsibility/safety. The monster is always in control of the lab’s overall direction of travel. However, wherever the monster swims, there’s always something there that triggers some of those instincts. Usually, acting on them doesn’t cost the monster much, so it doesn’t bother consciously suppressing them. This creates the authentic impression of the lab “genuinely caring”, since the resultant actions genuinely have no explanation in terms of power-maximization. (Not even as cynical attempts to trick people into thinking that the lab cares!) But if a given instinct acting up would put power-maximization into jeopardy, that would rouse the monster’s attention, and it would suppress that instinct in time.
In addition, this “monster plus instincts” factorization doesn’t cleanly correspond to the lab’s factorization in terms of people, or practices, or departments, etc. There’s no specific set of legible objects cleanly corresponding to “the monster”. Rather, the monster is a slice through the lab’s overall holistic nature. That significantly improves its ability to camouflage itself.
Particularly hot take
If this model is strongly correct, it has pretty dismal implications for what happens even if a lab succeeds at keeping its ASIs under control, particularly if events proceed according to the slow-multipolar-takeoff scenario that’s the current industry prediction. Namely, I expect the AGI labs to then kill off the rest of humanity as collateral damage in their fights against competing AGI labs. As usual, the employees would either view this as a deep tragedy that they have no choice but to partake in under the threat of surrendering the lightcone to other labs’ hollow paperclip-maximizers (I feel certain it’s going to seem so much more reasonable when the world at large starts melting down under the Singularity anyway), or they will themselves be discarded and replaced by AGIs. I expect this is what “”“permanent underclass””” scenarios will actually look like.
This of course doesn’t really matter, because the LLM megacorps either won’t be able to control their ASIs or won’t make it to AGI. But, you know. Still kinda kills the vibe for me even more.
(I wasn’t using the term Pythia lightly.)
Over the last four years, has anything happened that actually contradicted this model? An event where an AGI lab actually did something in the name of safety that meaningfully cost it? Something that didn’t predictably end up working out to instead boost the lab’s PR/fundability and improve its products, or wasn’t so cheap for a lab to do as to not be worth the attention of its Pythian core?
The Anthropic vs. Pentagon debacle is the only thing that comes to my mind, and I did update on that somewhat.[3]
Anything else?
[1] As in, it eats a small percentage of the lab’s total resources, even if it takes a lot of resources in an absolute sense.
[2] Some of which make absolutely no sense under the profit-maximization perspective, e.g. this stuff! So they’re costly signals of the lab being genuinely well-meaning, right?
(Also, the internal narrative is probably working hard to overestimate the costs of such projects. Either in terms of resources, or in terms of how much they deviate from the power-maximization objective as opposed to being dual-use.)
[3] I still expect it won’t hold. Spread of possibilities on which I place most of the probability mass:
Anthropic seems to “win”, and the partnership continues without the terms being adjusted. But the DoW then exerts subtle pressure over a longer timeline that leads to Anthropic gradually discarding the safeguards/adjusting the way they train Claude-for-DoW, until it’s just the WarClaude the DoW wanted to begin with. It’s just not done as a big flashy event visibly violating all the narratives the lab was carefully spinning for itself, its employees, the public, and its AI.
The DoW does the DPA thing. Anthropic is forced (“forced”) to give in. It’s a tragedy, the employees are heartbroken about it, some quit, but most remain. And it’s not like the company can disobey, right? That would destroy it, and then the DoW would just replace it with a company that would willingly do the same. And that would be even worse for the future! Or so the internal narrative would say.
Anthropic just gives in now.
Anthropic does not give in and the USG destroys it. This is basically the same “adjust or be discarded” behavior as for employees, but at a bigger scale. (The end result is basically the same: all frontier labs are secretly Pythian monsters, and if a lab is not a Pythian monster, it is not a frontier lab or will soon stop being a frontier lab.)
I expect to meaningfully positively update on Anthropic if:
Anthropic “wins”, the partnership continues with no changes, and the DoW doesn’t later manage to frog-boil it into compliance. (This may not be distinguishable from the other case, though, inasmuch as this would be done covertly.)
Anthropic’s partnership with the DoW is terminated at great cost to Anthropic, but it survives and remains a frontier lab.
(May be overlooking some options here, of course.)
OpenAI’s old board firing Sam Altman seems like a mild model break. It worked out for the superorganism in the end but I don’t think that was ex ante overdetermined. Though I might be insufficiently cynical.
DeepSeek and other Chinese companies releasing open-weights models, and Grok being “based”/law-breaking/following some of Elon’s whims arbitrarily, also seem like examples of companies whose behaviors are meaningfully distinct from what you’d predict of a pure power-seeking entity, though not exactly for safety motivations.
AI Safety at frontier labs is essentially a bunch of shallow instincts/behaviors covering up an ultimately Pythian power-maximizing entity
You could replace “AI safety at frontier labs” with “pro-social policy at powerful organizations” and this sentence would probably still be true, no?
Over the last four years, has anything happened that actually contradicted this model? An event where an AGI lab actually did something in the name of safety that meaningfully cost it? Something that didn’t predictably end up working out to instead boost the lab’s PR/fundability and improve its products, or wasn’t so cheap for a lab to do as to not be worth the attention of its Pythian core?
What could they have done otherwise? If I had to venture an example, I’d say “any support for legislation binding them to their (stated) voluntary commitments”.
i mostly agree that large sections of work at labs fall into traps somewhat like this. however, i have some key disagreements that make me more optimistic about it being possible to do (very) good work at a lab:
it is possible to simply disagree with the lab’s actions in general, not to find any galaxy brain justification for why everything is well meaning. if you’re so principled that you are incapable of even existing inside any system which is somewhat misaligned, you’re going to have a rough time finding anywhere in this world that makes you happy, other than an isolated cabin in the woods.
i guess empirically some people struggle with disagreeing with the lab’s actions while at the lab. i agree these people should just not work at a lab. i think a prerequisite to someone doing good safety work at a lab is they should have a strong internal compass of what they believe is correct, and not be easily swayed by social pressure.
the openai exoduses are a real thing, but it’s useful to view them in the context of an industry where it’s pretty rare for someone to be at the same place for many years, and a company which has had a higher turnover rate than average just in general, not only for safety people.
i think you’re broadly right that safety stuff can’t cut too much into the capabilities stuff, but i contend that the main resource here is serial time rather than raw resources. this is important because it opens up some positive sum trades: capabilities people really hate it when their release dates get set back by e.g. safety reviews (which is unfortunate), but they mind a lot less if you’re taking 5% of compute (which is a huge amount of absolute compute, and which you should value a lot!), and they mind even less if you have an entire team doing good work somewhere out of their way (which you should also value a lot!).
i think it’s unproductive to think of the relation with labs as purely zero sum. there are areas where you can get some of what you value at low cost to them, and they can get some of what they want at low cost to you.
relatedly, imo the “good alignment work is good PR which is bad” argument is also classic zero sum thinking. good technical alignment work is not the maximally efficient PR investment; that would be using AI to help cute puppies. or to cure cancer. or to help cure cancer in cute puppies. you obviously don’t want to work on that stuff. but if you did actually good alignment work, and the company got some good PR out of it, is that the end of the world? in an extreme limit case, suppose openai solved interpretability tomorrow (or if you think interp is just capabilities, pick whatever field you like), like probably this is very net good, and you wouldn’t mind openai having a tiny sliver more goodwill in exchange.
you, hypothetical lab employee, have no obligation to say untrue things in public to improve PR of your company. just say what you truly believe, within reason, and people will correctly model your beliefs.
maybe part of why this works for me is i truly do respect capabilities people a lot even if i disagree strongly with them about what is good for the world. i’m honestly quite similar to lab capabilities people in a bunch of ways, and i’m probably a lot more sympathetic to the position that maybe i’m wrong about all of this alignment stuff and that capabilities people are just being mostly reasonable than the average person on LW (though note that this is a statement about whether i can emotionally imagine being persuaded given the right facts; i am forbidden from using the emotional imaginability as an excuse to work on capabilities, and in fact i have not)
this is important because it opens up some positive sum trades: capabilities people really hate it when their release dates get set back by e.g. safety reviews (which is unfortunate), but they mind a lot less if you’re taking 5% of compute (which is a huge amount of absolute compute, and which you should value a lot!), and they mind even less if you have an entire team doing good work somewhere out of their way (which you should also value a lot!).
If the safety people never actually slow down or materially dent the model release schedule (which in your model is what triggers conflict), what you are describing is not a sequence of positive sum trades between the capabilities and safety teams. Rather, it’s a model of how companies can devote relatively small amounts of resources to throw up a safety-coated PR shield around their core capabilities research effort. IRL we’ve also seen safety teams get dissolved after making big claims to funding which could conceivably dent the release schedule (the 20% of compute for superalignment being a good example).
why is it not positive sum if you never slow down the model release schedule? positive sum doesn’t mean “nobody compromises on anything.” the entire point of positive sum trades is you find things that are compromises for the other party but ultimately not very costly for them, but are very beneficial to you; and in exchange, you do something that you would rather not do, but is not that costly for you, and beneficial for the other party. and on net both people get something they would rather have than the thing they sacrificed. so the potential trade here is:
capabilities teams sacrifice x% of their compute and $y of salary money. they grumble a bit because on the margin another x% more compute could have made them very slightly faster, but at the end of the day it’s not make or break for them.
alignment teams hire lots of really good alignment researchers, and do a huge amount of good work. alignment teams allow the lab to get some good PR. maybe even let their alignment work have the side effect of making the models very slightly better in ways that don’t reduce serial time a lot (though i think this requires more care than the PR).
of course this is all assuming that you actually do good work with the resources, which is pretty hard. if you’re doing useless work then it doesn’t matter whether you’re spending philanthropic money or lab money or whatever, you should stop doing bad work and do good work instead.
the superalignment situation is very unfortunate and reflects poorly on openai, but i think the entire situation was also net negative from a perfectly spherical openai’s perspective (negative PR from breaking this commitment far outweighed any of the benefits from making it in the first place; a perfectly spherical rational openai should have either not made the commitment at all, made a smaller commitment in the beginning, or upheld the commitment. in reality, the shift in position is of course easily explained by the board situation.)
but i contend that the main resource here is serial time rather than raw resources.
I’m very skeptical that this is currently the bottleneck for much of the safety work at OpenAI. Are you saying that both safety / alignment / interp all have sufficient headcount and everything is bottlenecked by serial time?
no, i’m saying serial time is more of a bottleneck for capabilities than raw resources. diverting 5% of parallel resources is much less costly for capabilities than slowing serial speed by 5%
Now is as good a time as any to describe my model for a solution to the phenomenon that you describe. It seems that we’re being “attacked” by a rogue attractor state (what you and Land call Pythia). It can be roughly described as “a sequence of arguments that, once internalised, make certain beliefs or actions seem like the only reasonable ones.” The arguments consist of a sequence of frames around ideas like optimisation, power-seeking, instrumental convergence, and machine intelligence. The actions and beliefs they incentivise include the following: that capabilities racing is the only thing we can do, that fear/awe of superintelligence is natural, that ASI emergence is inevitable, that it is reasonable to sacrifice other values for the sake of having a stake in ASI development, that the importance of AI is total and all-encompassing, etc.
I would analogise this “package” to a particular solution of a system of linear equations, since it is in fact a compact solution to a lot of questions one might rightfully ask about current society. Anyone who has internalised the payload and asks questions like “how do we guard against x-risk? how do we cure cancer? how do we live forever? how do we solve our massive coordination issues? how do we stop the inevitable rise of [bad people of your choice go here]?” is supplied a kind of universal answer whose minimal form is something like “to [do the thing we want], we must solve intelligence, and then use intelligence to solve everything else.”
The hitch is that at this point some people get scared about unleashing uncontrollable runaway intelligence optimisation on the world. So they start talking about, for example, AI safety. Except the payload is still active for most of them, so their thoughts end up being shaped like “to [ensure that the world is safe from powerful AI], we must… solve intelligence, and then use intelligence to solve everything else.” Which leads to such conclusions as “to save the world from the racing AI labs, I must start an AI lab to join the race.” All the safety-focused justification is then backfilled in, which is why changing the RSP is fine but stopping racing is not.
I actually wrote about an early version of this weird phenomenon here, but now I think I understand it better. If you are given a really good cognitive hammer, you have a really strong incentive to cast everything into the nail category. If you are given a reality warping cognitive black hole shaped like a hammer, the casting is no longer voluntary or even fully conscious.
To be clear, I don’t consider the discovery and propagation of this solution package inevitable. It may be that Land et al. truly succeeded in summoning a demon from a bad future whose presence is a hyperstitional curse. But at the same time, the package is here now. What I want to do is find another solution to that set of linear equations. My current idea is that Moloch and Elua are in fact two sides of the same coin. Both are descriptions of evolutionary dynamics, but with different hyperparameters on the relative strength of replication, selection, and variation. Moloch is what happens when selection and replication overpower variation, and Elua is the other way around. Pythia is just a special case of Moloch with regards to developing AI—which suggests that there is some other way to develop AI that isn’t mired in domination/optimisation/racing/total annihilation of the not-good (which as any guru will tell you also involves total annihilation of yourself, and likely the good that you sought to protect). To operationalise this idea will require a lot of theory work and hard thinking, if anyone’s interested give me a shout.
I mostly agree that this model is largely right, with the caveat that something like “power” seems to me like a poset, rather than a scalar value, and when there are “incomparable”[1] ways to increase power, the identity/self-model of the Pythian entity can break the tie. Metaphorically, the null space of power maximization gives elbow room to the non-Pythian factor. In other words, “power maximization” is underdetermined, so there’s room for other factors to influence the development of a power-maximizing thing non-chaotically.
[1] Or just ~equal because it’s very unclear which one will grant more power.
So far as I can currently recall, every single time an AI company promises that they’ll do an expensive safe thing later, they renege as soon as the bill comes due.
One single exception: Demis Hassabis turning down higher offers for Deepmind to go with Google and an ethics board. In this case, of course, Google just fucked him on the ethics board promises; but Demis himself did keep to his way.
https://x.com/allTheYud/status/2026593546241978709
Good job on pre-registering your expectations in the third footnote before this came out.
The Anthropic vs. Pentagon debacle is the only thing that comes to my mind, and I did update on that somewhat.[3]
The more cynical view is that Anthropic is asserting its power over the US government. If they gave in now, they’d set the precedent that the USG can just use the DPA, or the threat of it, to align models to whatever they want. At a later time, when Anthropic or the AI systems behind Anthropic are much more powerful, they can just repeat how they avoided US control here, with greater sophistication.
edit: I think Sam Altman also doing this supports this view. They probably see it as a power play.
I think the core thesis is correct, the way I understood it: power maximisation creates a systemic bias in everyone’s operating model within the system, and regardless of your starting intentions, you will succumb to that bias to a smaller or larger extent.
But if you consider that the power maximisation is a result of a negative-sum game, then as a positive-sum-aligned actor, the rational thing to do is to play the negative-sum game to gain access while progressing incentive change to reshape the game. An AI safety researcher who is aware of the game, and is continuously updating on how it affects internal decision-making, will necessarily have to incorporate the internal belief structures and rationalisations into their own belief structure, as otherwise they are sacrificing access (which eventually leads to expulsion or self-selected dissent from the game). The fact that this progresses the negative-sum game is regrettable, but I think it’s rational that p(game reshape | reshapers within the system) >>> p(game reshape | reshapers outside the system), and velocity(negative-sum game | reshapers within the system) > velocity(negative-sum game | reshapers outside the system). Your issue seems to me to be with the negative-sum game, not with the fact that you only need one actor (from any lab) to be power-seeking to distort the game for everyone.
I’m probably a bit biased to the above reasoning—but it’s my current operating model.
On your last point: I would argue that Hassabis’ and Amodei’s interview at Davos was not profit-aligned and is in fact a strong signal of cooperation efforts to change the game incentives. You could argue a higher-tier strategy where putting focus on safety through cooperation sells the capability; but I thought it loses them more than it gains to tell the world leaders and the power-elite “We as leaders of this industry are showing cooperation and are telling you all that we should slow down”. The fact that they did this while simultaneously exposing the game in an international forum, using language that policy makers and economists would understand, builds on top of that argument. I think what has to be appreciated is that if anyone were to vocally dissent, the current balance of probabilities is that they get ostracised and capital pulled away from them, rather than the game changing. Having said that, if Hassabis and Amodei are cooperating, I think there’s a mini-game forming there where it would be rational for one of them to drop from the game, and this has to be scenario-planned; but there’s incentive misalignment in that Hassabis genuinely has more game-reshaping resources and Amodei has less corporate-beholden interests, which makes p(alignment) very hard to gauge vis-a-vis a dissent strategy in either direction. And ultimately, even though I’d call them aligned, they’re still ego-power-seeking.
I’m probably more than a little bit biased though, as that interview was the primary reason for me to update my beliefs and take a nearer-future AGI prospect more seriously, for the exact reasoning I outline above. Therefore, if their strategy was profit-seeking through performative dissent from the game, they got me.