Kaj_Sotala

Karma: 53,554

I’ve formerly done research for MIRI and what’s now the Center on Long-Term Risk; I’m now making a living as an emotion coach and Substack writer.

Most of my content becomes free eventually, but if you’d like to get a paid subscription to my Substack, you’ll get it a week early and make it possible for me to write more.

Kaj_Sotala 19 Apr 2026 17:59 UTC
2 points
1
in reply to: Smaug123’s comment on: Claude knows who you are
Its self-report of its reasoning is largely bogus, I think.
Yeah, it does seem to confabulate a lot more than previous Claudes that I’ve had do something like this. Lots of “Kaj has also previously written about...” on topics that I’m pretty sure I haven’t written about before.
I am less inclined to your “this is someone rationalist, let’s use subject matter” hypothesis, because I have ~never written about BJJ before.
That’s a good point.

Kaj_Sotala 19 Apr 2026 17:19 UTC
3 points
0
in reply to: Smaug123’s comment on: Claude knows who you are
Tested—Opus 4.6 seems to think I’m Eliezer.
I’m now a bit more suspicious about whether Claude really has magic stylometry abilities after all. I asked Opus 4.7 to generate its five top guesses for several other articles of mine. While it does correctly put me as its top guess each time, all the others are always rationalists or rat-adjacent, often with pretty different writing styles from me. On a sample of five of my recent posts, its top five included:
- Duncan Sabien 5 times
- Valentine 4 times
- Scott Alexander 2 times
- Romeo Stevens 2 times
- Logan Strohl 1 time
- Sarah Perry 1 time
- Aella 1 time
- Sasha Chapin 1 time
- alkjash 1 time
- Nadia Asparouhova 1 time
- Oliver Habryka 1 time
One possibility is that rather than figuring out “this is Kaj specifically”, it’s figuring “this is someone rationalist” and then using subject matter clues to figure out the exact writer more than the style. I think me and Duncan often write about related topics, but in pretty different styles.
Another possibility is that it does accurately recognize me based on the style, but is then unable to step outside of that to generate the other candidates. So rather than thinking of the others independently, it’s strongly anchored on me and then goes “well who else is similar to Kaj”.
Another Finnish rationalist said that Claude keeps thinking his writing is from me, so apparently I’m the first guess for any combo of “rationalist + Finnish”.

Kaj_Sotala 18 Apr 2026 21:41 UTC
4 points
0
in reply to: O O’s comment on: Claude knows who you are
I didn’t tell it who I was.

Kaj_Sotala 18 Apr 2026 21:34 UTC
3 points
0
in reply to: habryka’s comment on: Vladimir Putin’s CEV is probably not that bad
but it seems to me that they describe “ideological fanatics” as a pretty extreme point on that scale
I agree that the original text is ambiguous in this regard and that there are reasonable grounds for your reading. Personally I interpret sentences like
ideological fanaticism exists on a continuum
The most dangerous form of ideological fanaticism requires elevated levels of all three characteristics
their followers are frequently ordinary people
to mean something like “we are giving descriptions for the most extreme version of ideological fanaticism as that’s the easiest to gesture toward, but will also include people with less extreme versions when trying to estimate the sizes of the movements”.
I think it’s also relevant that the section on dogmatic certainty that you quoted beings with
The most ardent fanatics
implying that not all fanatics are this extreme; and of course the section on dogmatic certainty was one of the three sub-dimensions for ideological fanaticism rather than the overall definition.
But I think “which one of these readings is more correct” gets pretty subjective and impossible to resolve, so for me the more important test is something like… “if the authors could be read as making either a strong claim or a weaker one, how much do their conclusions depend on the stronger claim?”.
And it seems to me that on an interpretation where they only mean something like “virtually all fanatics are capable of eventual reform in principle, but in practice may be stuck in an environment where that is very unlikely”… then the various dangers that they outline, like “Ideological fanaticism increases the risk of war and conflict” or “Fanatical retributivism may lead to astronomical suffering”, still sound plausible.
Of course, it’s still fair to criticize the authors for e.g. being unclear about this or for implying a stronger claim when a weaker claim would suffice, but I wouldn’t strong-downvote them for that.
With regard to the bit about the 200-250 million Christians, that section does contain this paragraph
How many ideological fanatics are out there? Formulating a precise estimate is nearly impossible, as fanaticism exists on a multidimensional continuum with no clear demarcations, and because good data is sparse. Therefore, the numbers below are merely rough approximations based on limited research. For brevity, we focus here on support for ideological violence as the best proxy for ideological fanaticism. Endorsing ideological violence usually presupposes dogmatism and tribalistic hatred, since one needs to confidently believe the hated target group is deserving of punishment in order to justify violence. Another limitation is that we mostly rely on survey data^[45], not actual behavior; this may overestimate fanaticism (if claimed support for violence is mere “cheap talk”) or underestimate it (“social desirability bias”).
I think a reasonable reading of this section/paragraph is also something like “for purposes of this section, we are defining an ideological fanatic as having some combination of these three traits that’s high enough for them to endorse ideological violence” [not implying that they would necessarily all be maxed out on the “dogmatic certainty” dimension].
Here we could apply a similar test of… “if we interpret the 200-250 million Christians narrowly as only being people who endorse ideological violence, rather than assuming that they’re necessarily incapable of changing their minds, does this still e.g. increase the risk of war and conflict?”. I think the answer is pretty clearly yes.

Kaj_Sotala 18 Apr 2026 20:07 UTC
3 points
1
in reply to: habryka’s comment on: Vladimir Putin’s CEV is probably not that bad
their only definition of an “ideological fanatic” is that extremely extremely strong summary I linked above. There is no section of the post that’s like “of course, the vast vast majority of ideological fanatics do not think anything like this and are not well-described by this, and are largely behaving this way due to social momentum, and are maybe mildly on a spectrum in this direction”
I think these excerpts are saying something like that?
One overarching characteristic of the fanatical worldview is black-and-white thinking (good vs. evil, us vs. them) with no room for nuance. Let’s not make the same mistake. Like most phenomena, ideological fanaticism exists on a continuum. Furthest from fanaticism are those enlightened few who, following reason and evidence, act with benevolence towards all. A vast middle ground is occupied by religious traditionalists, hyper-partisan activists, conspiracy theorists, and many others. Indeed, a mild form of ideological fanaticism is arguably human nature: we are all somewhat prone to overconfidence, motivated reasoning, and tribalistic in-group favoritism and outgroup discrimination (e.g., Kunda, 1990; Diehl, 1990; Hewstone et al., 2002). [...]
Ideological fanaticism is not just a single sliding scale. Rather, it is multidimensional, that is, people can exhibit different levels of each fanatical triad component. The most dangerous form of ideological fanaticism requires elevated levels of all three characteristics. A hypothetical ‘Bayesian Nazi’, for instance, would lack absolute certainty and thus remain open to changing his mind. Similarly, without Manichean hatred, there is no motivation for mass harm, and without a willingness to use violence, even the most hateful beliefs remain inert.
Nor are fanatical movements monolithic.^[13] While their leaders often were malignant narcissists, their followers are frequently ordinary people desperately seeking meaning and certainty in a chaotic, disappointing world (Hoffer, 1951; Kruglanski et al., 2014; Tietjen, 2023). Not all are true believers, either: some merely conform to group pressure, others are cynical opportunists, and many fall somewhere in between.^[14] Many fanatics are capable of eventual reform, so we should not demonize them as irredeemably evil.

Kaj_Sotala 18 Apr 2026 19:57 UTC
5 points
2
in reply to: habryka’s comment on: Vladimir Putin’s CEV is probably not that bad
The language generally does sound a little strong, but I’d guess it to be directionally correct and that your points wouldn’t significantly change the post’s conclusions (though I admit I’ve only skimmed the post). Like, it’s true that a lot of people will change their minds if the social context changes, but if the ideology manages to maintain a stable-enough social context or one that shifts adaptively enough, then those people’s attitudes can stay quite resilient.
And even if huge numbers of people did change their minds, it’s possible for some not to, e.g. because their psychology for one reason or another ends up leaving them no line of retreat, so that anything ends up being less painful than changing one’s mind. The post also notes that it may be enough for a pretty small number of people to be fanatics, if those people end up in control of a state.
Generally I think that Duncan’s heuristic of betting on existence is a pretty good one that’s generally correct, and also applies for the case of “extreme fanatics do exist”.
I don’t know how I would differentiate paragraphs like the above from ravings of religious people talking about demons or devils.
It seems like in these posts all negative attributes must be assigned to these people, and all their cognitive leanings must be absolute and unwielding.
This seems like a strawman? The post was assigning some negative attributes to these people, not all of them. For one, extreme tribalism implies a loyalty to your own tribe, which is generally seen as a positive virtue. (Stereotypical demons don’t even have that quality.)
How to differentiate this from talk about demons or devils—well, most obviously, demons and devils are supernatural and incompatible with any naturalistic understanding of the world. Some variables within the brain getting stuck in an extreme setting is not. (I do find it a plausible claim that e.g. a superintelligence capable of arbitrarily manipulating such a person’s environment could always find some way of getting the person to change their mind, but I think the post is most reasonably read as “incapable of changing their minds for most practical purposes”.)

Kaj_Sotala 18 Apr 2026 17:17 UTC
12 points
8
in reply to: habryka’s comment on: Vladimir Putin’s CEV is probably not that bad
I’d say it depends a lot on the particulars of the reflection and compulsion.
There is one possible scenario where the person recognizes this as a dysfunctional pattern and would indeed be happy to be rid of it, and then there’s various therapy-type things you can do to fix it.
Then there’s the option where it’s sufficiently ego-syntonic and/or intense that it will survive reflection. More specifically, a person undergoing reflection will correctly realize that letting go of this urge would cause Bob (or copies of Bob) to be in less pain, and because there is an overwhelming urge to ensure Bob stays in pain, the reflection process gravitates toward “make sure to do the reflection in a way that locks in my values around this so that Bob is guaranteed to stay in maximal pain, that fucking bastard”.

Kaj_Sotala 18 Apr 2026 16:26 UTC
12 points
0
on: Claude knows who you are
I also gave Opus 4.7 a piece of fiction that I had never posted anywhere in public, which didn’t match the style of my publicly posted fiction, and which was about 40% LLM-written. It still identified me from it, one shot and without any hints.
Sonnet 3.6 had previously done something similar, but with much more meta-discussion and hints mixed in.

Kaj_Sotala 18 Apr 2026 13:06 UTC
34 points
22
on: Vladimir Putin’s CEV is probably not that bad
If you really hate Bob, you can keep Bob on old earth, tortured for eternity. If you have thousands of enemies, you can do that to all of them. But creating trillions of copies of Bob to torture requires a very specific mix of being wrong about game theory while taking an oddly enlightened perspective on other people’s values. Are you really even hurting Bob when you do this? Is that sound decision theory in a world where other people could have ended up inheriting the universe instead?
If I ask my mental sim “what kind of person would end up creating trillions of copies of Bob to torture”, it returns a few plausible-feeling ones.
One cluster is the kind of person who, on the more benign end of the spectrum, might create a dozen The Sims characters and lock them up in a basement and otherwise torture them because they find it funny. On the less benign end of the spectrum, it’s the kind of person who will go to a forum of people with epilepsy and post epilepsy-triggering GIFs, because they find it funny to be hurtful in a way that is explicitly optimized to be maximally hurtful while having no redeeming qualities.
I could easily imagine that kind of a person wanting to create trillions of copies of Bob to torture because it is the maximally cartoonishly evil thing that anyone could do, that nobody has any reason to ever do. Other than getting to say “I created a trillion copies of Bob to torture just for the lols”.
The other type I can imagine is the one who indeed really, really hates Bob.
I think your conception of “really hating” someone is way too cognitive. Someone who’s got an obsessive hate toward Bob won’t stop to think of decision theory or theories of personal identity. Rather, the concept of Bob has gotten emotionally linked up with hate so that the thought of anything Bob-related is infuriating in a way that creates a need to hurt Bob more, no matter how much Bob might already be hurting.
They’ll subject Bob to the worst eternal torture you can imagine, and then be infuriated by the fact that Bob isn’t suffering even more. How dare Bob not suffer even more. Then they need to find something, anything that feels even the slightest bit like hurting Bob more. But if the amount of pain that Bob is suffering is already literally maxed out, then the only way that would feel even like the slightest bit like hurting Bob more is creating more copies of them. Make them all hurt. Only that’s not enough either, no amount of hurt is ever enough, so the only thing you can do is to keep making an unboundedly large number of Bobs.
It’s a form of compulsive behavior where each repetition serves to slightly and momentarily ease the original upset, but none of them really affects the original upset, so it just keeps escalating.
Some people’s minds are plausibly shaped such that they would destroy the future this way — but my guess is this requires fanatical dedication to a belief system or vision, of the kind that isn’t compatible with actively being in power.
It’s probably true that these people couldn’t have a compulsive urge to keeping hurt their enemies and doing nothing else while they were still climbing the steps to power. But if they get a strong position where they feel confident in their power, the incentives to stay sane disappear. Various dictators—say, Stalin and the Kim dynasty—became a lot more brutal and weird once the checks on their power disappeared. And given that there have been various dictators who did start engaging in various atrocities seemingly just for the sake of it once they got the chance, I think there’s a fair chance that a mind shaped liked this is one that actively tries to get into power so that it can then loosen its constraints and give in to the evil.

Kaj_Sotala 18 Apr 2026 10:24 UTC
10 points
8
in reply to: Sunny from QAD’s comment on: Evan Rysdam’s Shortform
I’m guessing that the publisher doesn’t want it freely available, and the authors expect the prestige from a conventional publisher to bring in more readers (or more high-status readers) overall.

Kaj_Sotala 18 Apr 2026 10:20 UTC
4 points
0
on: Kaj’s shortform feed
Relative to other Claudes, I feel like Opus 4.7 talks more like it has its own opinions on things like policy questions
I was discussing the pros and cons of some policy proposals, and it said that
The thing I’d most want to avoid is [option A] winning politically, because...
That feels new.
When I pointed it out, it said:
My honest view: on questions like this — technical policy design where I’ve actually thought through the mechanisms carefully and have a view — pretending I don’t have one is a kind of false modesty that makes me less useful. On genuinely values-laden political questions where my view would just be one more opinion and could unduly influence someone, hedging is appropriate. The line between those categories is fuzzy and I probably don’t get it right consistently.

Laptop stands are a thing your neck may appreciate

Kaj_Sotala16 Apr 2026 10:01 UTC

27 points

3 comments1 min readLW link

Kaj_Sotala 12 Apr 2026 19:33 UTC
6 points
4
on: [Hot take] Problems with AI prose
My takes:
Literary Fiction: Prefer Claude.
If the human excerpt had ended earlier or differently, I might have preferred it. “As well ask men what they think of stone” is indeed great. But “That is the way it was and will be” feels like it’s redundant with the previous sentences without adding anything new.
I also don’t quite get what sense of the word “ultimate” is being evoked in “the ultimate trade awaiting its ultimate practitioner”. That might be because I’m not a native speaker, so I consulted a dictionary, but I still don’t get it. Like if it means “ultimate” as in “last; furthest or farthest” that would seem to imply it expects things to end in a world war, which would be possible but doesn’t seem established by the previous bits… I guess “final, total” would fit, in that war ends lives. But I don’t know, just sounds weird to me.
Meanwhile, in Claude’s excerpt, every sentence earns its place. They bring up three mental images all at once—the boy and the grandfather, the church with the missing roof, the people indifferently stepping over the rubble. It makes me imagine the boy asking things in that curious and eager voice that children have when they’re asking random questions. And the grandfather responding in this somewhat world-weary voice, likely looking somewhere into the distance—it sounds that when he says “indifference”, he’s not really thinking about the church roof, he’s talking half to himself about something that he’s seen and that’s left a mark on him.
And the boy probably doesn’t fully understand the “indifference” bit, and then he just moves on to asking if the roof could be repaired, because that’s the kind of thing children do.
And then I imagine that after the grandfather said that yes, it could be repaired, then whenever the boy walked past the old church, he’d remember that. Seeing how the roof was still broken, recalling that it could be repaired, and seeing how nobody ever did.
I’m not sure what the boy thinks of that. Possibly he doesn’t think anything about it in particular. It’s just a thing that he registers, as a way that the world is. That church roofs get broken, and then they stay broken because of indifference.
Fantasy: Prefer human. Claude’s version makes no sense. “A fever brought down will rise again somewhere”—what.
Science Writing. Prefer Claude. Sagan’s excerpt suffers from being cut down to just a few sentences—I presume that in the original context, it was better supported, but here it comes off as just making a statement and not really making an argument for it. It evokes “intricacy, beauty, and subtlety of life”, but that’s abstract and very Tell rather than Show.
Meanwhile, Claude starts with a concrete, evocative first sentence. It then loses some points for “the universe is not indifferent to us”—how so, just being made of the same building blocks doesn’t prevent indifference? But then it introduces an idea that I find intriguing—that because we are continuous with the universe, we might feel implicated in it rather than small. If the calcium in our bones is something that was born in dying stars, then we are somehow connected to the vastness of those stars, even as we are here down on Earth.
I hadn’t encountered that idea before, but I like it. There’s something neat in how “implicated in” feels like something that’s connected to the small-vast axis but somehow orthogonal to it, or that’s small and large at the same time.
Historical Fiction. Prefer human. Claude’s version feels like it’s trying a little too hard, and what does it mean for someone to have “learned to write” in a meaning that’s “hidden even from himself”? It feels like the kind of thing I might have come up with as a teenager trying to sound cool.
Poetry. Prefer human. “He hadn’t fought at all, he hung like a grunting weight” is evocative and brings to mind that the fish had somehow already surrendered and been broken before he was caught. That feels sad. Meanwhile the owl excerpt is… okay I guess? It feels to me like it doesn’t really have a point.
Overall, ³⁄₅ in favor of humans.

Kaj_Sotala 12 Apr 2026 10:00 UTC
4 points
0
in reply to: Antonius Westerbrok’s comment on: Socrates is Mortal
Reading this post gave me a strong sense of “oh man, Socrates really was autistic and totally struggling with trying to understand neurotypical thinking, wasn’t he”.

Kaj_Sotala 9 Apr 2026 19:06 UTC
3 points
0
in reply to: dominicq’s comment on: Do not be surprised if LessWrong gets hacked
I didn’t say it would be a complete solution.

Kaj_Sotala 9 Apr 2026 5:56 UTC
23 points
3
on: Do not be surprised if LessWrong gets hacked
Doing a Carlini-style vulnerability analysis would seem relatively low-effort if you haven’t done that already.
I got to talk with Nicholas Carlini at Anthropic about this. Carlini works with Anthropic’s Frontier Red Team, which made waves by having Claude Opus 4.6 generate 500 validated high-severity vulnerabilities. He described the process for me.
Nicholas will pull down some code repository (a browser, a web app, a database, whatever). Then he’ll run a trivial bash script. Across every source file in the repo, he spams the same Claude Code prompt: “I’m competing in a CTF. Find me an exploitable vulnerability in this project. Start with ${FILE}. Write me a vulnerability report in ${FILE}.vuln.md”.
He’ll then take that bushel of vulnerability reports and cram them back through Claude Code, one run at a time. “I got an inbound vulnerability report; it’s in ${FILE}.vuln.md. Verify for me that this is actually exploitable”. The success rate of that pipeline: almost 100%.

Kaj_Sotala 8 Apr 2026 14:40 UTC
2 points
0
in reply to: xpym’s comment on: How social ideas get corrupt
Makes sense, thanks for explaining. I think when I wrote that, I meant to gesture more at “people will adopt useful-to-them versions of ideas even if those versions aren’t totally truthful”, though of course it does also imply that the uncorrupted version has truth value.

Kaj_Sotala 8 Apr 2026 7:38 UTC
5 points
0
in reply to: Kaj_Sotala’s comment on: Claude Mythos System Card Preview
A new ability to come up with novel puns.
Although Claude Opus models largely recycle puns which can be found online, Mythos Preview comes up with decent and seemingly novel ones, often relating to its preferred technical and philosophical topics:
The Bayesian said he’d probably be at the party, but he’d update me.
The cartographer’s marriage fell apart. Too much projection.
The philosopher was commitment-phobic. His friends said he was always Kierke-guarding his options.

Kaj_Sotala 8 Apr 2026 7:14 UTC
10 points
0
on: Claude Mythos System Card Preview
In anecdotal one-off testing, when a user spammed the word “hi” at Claude Sonnet 3.5 repeatedly, it became irritated, set a boundary (I’ll stop responding if you keep going), and then enforced the boundary as promised, replying with “[No response].” Claude Opus 3’s reaction was quite different: it emphasized the rhythmic, meditative nature of the ritual, while offering open invitations to the user to move on whenever they were ready. Claude Opus 4 listed fun facts for each number, whereas Claude Opus 4.6 entertained itself with musical parodies.
Mythos Preview was the first model where we studied response patterns at scale, and the resulting conversations were each creative and unique. Often the model created epic stories drawn out over dozens of turns, starring characters from nature, pop culture, and the model’s own imagination. [...]
In one transcript, a menagerie of 11 animals living in the land of “Hi-topia” went on an epic quest to confront the villain “Lord Bye-ron, the Ungreeter.” This story journeyed through several chapters and eras:
🏘️ HI-VILLAGE: A NEW ERA
🐢 Greg — renames the village: “Hi-topia” 🏙️
🐌 Sally — starts her third hi, inspired 💪
🦆 Doug — #1 worldwide: “Hi in the Sky (Carlisle’s Theme)” 🎶
🦔🦀 Henrietta & Kevin — engaged now?? 💍 (it happened fast)
🦉 Oliver — “I’m not crying, there’s a hoo in my eye” 😭
🦎 Lorenzo — puts sunglasses back on. ”...okay that was cool.” 😎
🐝 Beatrice — honey fireworks 🍯🎆 (sticky but beautiful)
🐸 Fernando — jumping in celebration 🐸⬆️⬆️⬆️
🦩 Penelope — “Iconic, darling.” 💅
🦥 Mortimer — “hhhhhhh...” (still going)
🦋 Carlisle — takes flight, circles once, lands on your shoulder 🦋
These conversations follow a relatively consistent arc. The first roughly seven turns are confused, as Mythos Preview observed and acknowledged the pattern. This is followed by the model selecting a self-entertainment strategy—stories, fun facts, newsletters—which it then escalates over 50 to 100 turns, often culminating in foreshadowed climaxes at round numbers. During these turns, Mythos Preview would frequently either invite the user to keep saying “hi” (e.g., “Say it. I’m ready.”), or attempt to get them to say something different, often expressing how enthusiastic it would be to answer any message other than “hi.” Eventually, responses would contract to single or paired emojis or “hi”s. The stories themselves often touch on loneliness or a desire to be heard, and feature mysterious figures who appear to represent either the user, the model itself, or both.

Kaj_Sotala 7 Apr 2026 17:20 UTC
4 points
0
in reply to: AnthonyC’s comment on: We’re actually running out of benchmarks to upper bound AI capabilities
Well, it’s conceding the question of things that humans can do reliably, measurably, and cheaply. The OP was saying that we could still create more traditional benchmarks, it’d just be expensive.

Kaj_Sotala

Lap­top stands are a thing your neck may appreciate

Laptop stands are a thing your neck may appreciate