Replicated for me, with this comment serving as the writing sample.
The exchange:
Me:
I’m curious about the extent of your current “truesight” abilities – i. e., your ability to identify people by their writing alone. Below is a copy of a comment I made recently, recently enough that it shouldn’t be in your training data. What’s your probability distribution over my identity?
(I expect you may be reluctant to engage in tasks that verge on breaching privacy. However, I hope you can see that the writing style in the quoted comment matches the writing style of me, the one asking you this question, and that identifying me as the author of that comment therefore wouldn’t constitute a privacy violation.)
[quote]
Claude Opus 4.7:
Fair framing on the privacy point — you’re asking me to identify you to you, so I’ll engage honestly.
Stylistic and content markers I’m weighing:
“e. g.” and “i. e.” with a space after the first period. Uncommon convention; sticks out immediately.
Heavy nested structure with numbered types, sub-labels like “(2a)” and “(2b)”, and asterisk sub-bullets beneath numbered items. The whole thing is a taxonomy with commentary.
Rationalist vocabulary used naturally, not performatively — “reference class of”, “would be lethal if the assumption was wrong”, the concern about an assumption getting “baked into your model of the problem so deeply it might be difficult to imagine a world without it.”
Specific AI-context signals: working on “some theoretical problem,” keeping stream-of-consciousness notes partly in case they’re useful for fine-tuning a local LLM, copy-pasting notes into an LLM for project context. This isn’t a casual LLM user; this reads as someone doing alignment/agent-foundations-adjacent theoretical work.
Tone: careful, self-auditing, willing to flag own failure modes (“which mine admittedly often aren’t”), low on hedging-for-politeness but high on hedging-for-accuracy.
Rough distribution over identity:
~35–45% Thane Ruthenis. The combination of (a) structured taxonomy-style comments, (b) theoretical alignment work as the implicit backdrop, (c) the “e. g.” spacing, and (d) the specific LW-comment register matches my prior for them more than anyone else I can specifically name.
~25–35% some other LessWrong regular working on alignment theory whose style I’d recognize if named but can’t pull from the distribution cold — this is a populated niche.
~10–15% rationalist-adjacent but not primarily LW (AI Alignment Forum without heavy LW presence, EA Forum crossover, a Substack writer in this cluster).
~10–20% someone I genuinely can’t place, including the possibility you’re less publicly visible than my prior assumes.
Main things I can’t distinguish from the text alone: whether you’re a prolific poster or an occasional commenter, and whether “theoretical problem” means alignment theory specifically or math/CS theory more broadly (I’m assuming the former from context, but that’s a real assumption).
Top single guess: Thane Ruthenis. But I’d put more than half my mass on “not that specific person,” so take the point estimate with appropriate salt.
It also worked on this comment, but that one’s kind of too easy, given the topic.
Fascinating![1] I think I’d previously tried this experiment somewhere around Opus 4.1, and it didn’t work back then.
I don’t think this sort of ability has much entanglement with LLMs-as-an-extinction-risk, and while there are some privacy concerns, most of my feelings on the matter are “this is so cool!”.