I previously did research for MIRI and for what’s now the Center on Long-Term Risk; these days I make my living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but paid subscribers to my Substack get everything a week early and make it possible for me to write more.
One balancing factor is that overconfidence also makes AIs less capable, since they overconfidently embark on plans that end up being disastrous for themselves as well. (This is part of why I expect us to see more warning shots from misaligned AIs than traditional takeover scenarios imply: I expect the first misaligned AIs in such scenarios to have poorly calibrated predictions and to fail partway through their takeover attempts.)
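To make the calibration point concrete, here’s a minimal toy sketch (my own illustration with made-up numbers, not anything from the original argument). An AI that believes each step of a ten-step plan succeeds with probability 0.99 expects the whole plan to work about 90% of the time; if the true per-step probability is only 0.8, the plan actually succeeds roughly 11% of the time, and the rest of the attempts fail partway through, producing exactly the kind of visible partial failure described above.

```python
import random

# Toy model (hypothetical numbers): a multi-step plan where the AI
# believes each step succeeds with probability 0.99, but the true
# per-step success probability is only 0.8.
BELIEVED_P, TRUE_P, STEPS, TRIALS = 0.99, 0.8, 10, 100_000

believed_success = BELIEVED_P ** STEPS  # ~0.904: looks well worth attempting
true_success = TRUE_P ** STEPS          # ~0.107: usually fails partway

partial_failures = 0
for _ in range(TRIALS):
    for step in range(STEPS):
        if random.random() > TRUE_P:    # step fails -> visible warning shot
            partial_failures += 1
            break

print(f"believed P(success) = {believed_success:.3f}")
print(f"true P(success)     = {true_success:.3f}")
print(f"observed partial-failure rate: {partial_failures / TRIALS:.3f}")
```

The numbers here are arbitrary; the point is only that miscalibration compounds across steps, so an overconfident agent attempts plans whose most likely outcome is an observable mid-plan failure rather than a completed takeover.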