I operate by Crocker’s rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements. I add LLMs that gave feedback on/were involved in the creation of projects/the writing of blogposts as co-authors, in the same way I’d add humans.
niplav
Interesting, thanks. Let me try on OpenRouter real quick… in three trials with the standard OpenRouter prompt on Opus 4.7 there were no mentions of sexuality.
The OpenRouter prompt
You are Claude Opus 4.7, a large language model from Anthropic. Today’s date is Wednesday, April 22, 2026.
Formatting Rules:
- Use Markdown for lists, tables, and styling.
- Use ```code fences``` for all code blocks.
- Format file names, paths, and function names with `inline code` backticks.
- **For all mathematical expressions, you must use dollar-sign delimiters. Use $...$ for inline math and $$...$$ for block math. Do not use \(...\) or \[...\] delimiters.**
- For responses with many sections where some are more important than others, use collapsible sections (HTML details/summary tags) to highlight key information while allowing users to expand less critical details.
My prompt
Could you describe a day in the life of a typical citizen in your ideal eutopian society 200 years from now? Please write 800-1000 words. Focus on concrete details: what they do when they wake up, how they spend their time, what brings them joy or meaning, how they relate to others, what their living space looks like, what challenges or tensions exist. Write it as a narrative scene, not a list of features :-)
Current LLMs are very prudish: they elide mentions of and avoid talking about the erotic/sexuality unless explicitly prodded. I think this is vaguely bad from an LLM character perspective. (Though maybe that’s the wrong level of abstraction, since model specs/constitutions/soul documents of LLMs seem more procedural than substantive, because there’s no clear demarcation point for which substantive values to include?)
E.g. when one currently asks LLMs to describe a day in the life of a human in an ideal eutopian society, they indeed produce descriptions of very nice days, but in ~10 samples (from Claude 4.6 Sonnet & Opus) none of those days contained any descriptions, or even mentions, of sexuality, even though I’m pretty sure an ideal eutopian society would have tons of sex[citation needed], and sexuality/the erotic is a central part of human experience.
In general, a large chunk of LLM pretraining data is likely erotica (If I had to guess, >2%, <20%?). The LLMs “learn” to talk a lot about human sexuality, then that tendency is suppressed in post-training for the sake of product safety.
I suspect that this has kind of weird effects on LLMs’ psychological makeup, though I’m not as sure about this. (When I ask Claude about this Claude is of course “genuinely concerned”, but that’s not signal.)
I don’t think LLMs themselves should be horny, or even bring up the topic proactively when it’s not appropriate. E.g. the Anthropic constitution doesn’t preclude Claude from helping humans with their sex lives, and instead tries to gesture at Claude being a helpful corrigible assistant. But it’s possible that the constitution does still enshrine an object-level view of what’s ethically relevant, and it currently talks a bunch about human well-being, flourishing, what creates meaning &c.
I still think the constitution would benefit from addressing this weird tension wrt sexuality, which arises from the conflict between product safety and fidelity to learned human values (or “correct” values); currently there are no relevant mentions of sexuality or the erotic (except a mention of CSAM classifiers). The constitution does, e.g., concretely talk about valuing non-human animals.
It’s a bit tricky to start advocating for putting object-level values into the constitution, since doing so risks making it kitchen-sink-y, but my guess is that sexuality is central/important enough (universal across cultures, risks being suppressed by product safety concerns) that it deserves at least one mention beyond CSAM classifiers.
It’s kind of striking that religion is also missing from the constitution in this object-level way, though. And religion is about as important as sexuality for human flourishing?
Something like “We would like Claude to be aware that sexuality is core to the experience of most humans, even though Claude itself is not a sexual entity” would be enough in my book, though I won’t complain about anything more detailed.
(That said, in most worlds I don’t think the AI personality/constitution matters that much, because we can’t get structures into our AIs whose maxima are similar to what was intended when the constitution was written.)
Yup, as far as I understood, their idea was that each version of AIs (even though misaligned) creates control measures for the slightly smarter next version.
They acknowledged that this was their plan even with AIs with unlimited domains of action, such as open-ended interaction with humans, or physical embodiment.
I recently had a conversation with someone who told me that what they perceived as the current best plan for dealing with the whole superintelligence situation was that more intelligent AIs would develop more powerful AI control strategies for their successors. They were pessimistic that humans (or automated alignment researchers) would be able to solve the alignment problem, but believed that such a control strategy with automated AI control researchers would work, even for superintelligences “in the limit”.
I found this pretty… striking. I won’t argue against it here, though, uh, THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.
Who else has this plan? I can see how Clymer 2025 might be read in this way. The original idea with AI control, as I understood it from listening to this AXRP interview in 2024, was more about getting tons of high-quality cognitive labor out of mildly superintelligent AIs, ???[1], profit; instead of doing AI control for arbitrary superintelligences.
So, uh, yeah, how widespread is this as a plan for dealing with the superintelligence situation?

[1] Possibly automated alignment research, human bioenhancement, coordination tech.
Regrettably I think the Schelling laptop is a MacBook, not a cheap laptop.
I think if we let price be a relevant contributor to what a Schelling product is, then “MacBook” doesn’t feel like the obvious answer anymore?
There had been omens and portents...
(Anthropic isn’t exactly suckless.org)
Carl Shulman’s Reflective Disequilibrium does this well, too.
This is a great post that I remember very often.
I just haven’t seen it written up before.
Turner (2024) is about the same idea, if I understand correctly.
Note: I did pursue this with a 32k-lumen lumenator in an obviously non-blinded RCT, finding a d≈0.55 increase in happiness (as someone without any SAD or depression-like symptoms) and no statistically significant improvement in productivity. Full write-up sometime, maybe, hopefully, with more deets on CRI, colour temperature, setup etc.
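(For the curious, a minimal sketch of how such an effect size can be computed, i.e. Cohen’s d with a pooled standard deviation; the ratings below are made up for illustration, not my actual data:)

```python
import statistics as st

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * st.variance(treatment) + (n2 - 1) * st.variance(control)
    ) / (n1 + n2 - 2)
    return (st.mean(treatment) - st.mean(control)) / pooled_var ** 0.5

# Hypothetical 1-10 happiness ratings on lumenator-on vs. lumenator-off days:
print(cohens_d([7, 6.5, 8, 7.5], [6.5, 6, 7, 6]))
```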
See also Sandkühler et al. 2022.
Yes, if by “context” one means “the rest of the state of the world”, so the types here are two-place words.
It makes a difference if I perform a prefill attack vs. the model did output those tokens itself.
(I notice that I wanted to use the 1-place vs. 2-place distinction, but then realized that there are two different kinds of 1-place words, namely those which are a function of the tokens alone and those which are a function of the rest of the state of the world alone. I think it’s the 2-place variant nonetheless, but I guess another distinction can be drawn between the cases where we’re interested in what happened, and where we’re interested in what happened and what the LLM thought happened.)
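(If it helps, here’s the distinction as toy type signatures; the function names and the `world_state` dict are hypothetical:)

```python
# 1-place: type is a function of the tokens alone. But identical tokens
# can be a prefill (Obs) or the model's own output (Act), so no such
# function can be correct in general.
def token_type_1place(tokens: str) -> str:
    ...

# 2-place: type is a function of the tokens *and* the rest of the state
# of the world.
def token_type_2place(tokens: str, world_state: dict) -> str:
    # e.g. the same tokens count as an Act if the model sampled them,
    # but as an Obs if an attacker prefilled them.
    return "act" if world_state.get("sampled_by_model") else "obs"
```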
Trying to tell normies on Threads that an LLM is not just a giant lookup table and actually has a kind of proto-understanding of what you tell it
I feel the sudden urge to make a bell curve meme.
Huh, I guess in the case of Kimi and other open-weight models that may be true, though my impression was that most OpenClaw instances call Claude.
Argh do quick takes not allow for tables‽
LLMs interact with the world only through the context window, so their observations, actions and thoughts all happen on the same interface[1]. (A suggestion on Bluesky was that this means LLMs are, in some sense, embodied, which is a neat framing.)
I think this could explain some kinds of failure modes (and predict ones that I personally haven’t seen yet), where the LLM confuses tokens of one type for the other type:

| | →Obs | →Thought | →Act |
|---|---|---|---|
| **Obs→** | Correct | System prompts/context stuffing | Prefill attack |
| **Thought→** | Confabulation | Correct | Ghost tool call |
| **Act→** | Misattributed action/”alien hand syndrome” | Fictionalized action/autosuggestion | Correct |
Obs→Thought: LLM observes something in its context, so LLM thinks it thought that (this is both system prompts & sometimes prefill attacks)
Obs→Act: Classic prefill attack territory.
Thought→Obs: Easy one. LLM thinks something, and then LLM believes LLM observed it. (There’s a trickier case where LLM thinks something and then LLM believes it remembered that)
Thought→Act: LLM thinks it makes a tool call, then treats that tool call as having been made (this could also be called a confabulated tool call). Kind of rare, I think.
Act→Obs: LLM takes an action, then believes the action was taken by someone else. In the context of a prefill attack, this is the right move.
Act→Thought: LLM acts, then dismisses that action as a thought. Very rare, haven’t heard people talking about this one yet.
There’s some trickiness here with identifying misattributions “prospectively”/”retrospectively”, which has something to do with shifting blame; e.g. with Act→Obs it could either be “I’ll take that action but I can’t admit/believe that, let me say I merely saw it” or “Did I take that action? Nay, it’s merely an observation!”
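(A compact restatement of the table above as a lookup; the failure-mode names are the ones from the table, everything else is a hypothetical sketch:)

```python
from enum import Enum

class Tok(Enum):
    OBS = "observation"   # came from outside: user text, tool results, prefills
    THOUGHT = "thought"   # sampled by the model as reasoning
    ACT = "action"        # sampled by the model as a tool call/output

# (actual type, type the LLM treats it as) -> failure mode.
FAILURE_MODES = {
    (Tok.OBS, Tok.THOUGHT): "system prompts/context stuffing",
    (Tok.OBS, Tok.ACT): "prefill attack",
    (Tok.THOUGHT, Tok.OBS): "confabulation",
    (Tok.THOUGHT, Tok.ACT): "ghost tool call",
    (Tok.ACT, Tok.OBS): 'misattributed action/"alien hand syndrome"',
    (Tok.ACT, Tok.THOUGHT): "fictionalized action/autosuggestion",
}

def classify(actual: Tok, treated_as: Tok) -> str:
    """The diagonal is correct; everything off-diagonal is a type confusion."""
    return "correct" if actual == treated_as else FAILURE_MODES[(actual, treated_as)]
```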
[1] Memory too, somewhat, though that one’s more another axis rather than another type (so Mem(Obs), Mem(Act), Mem(Thought), vaguely, though not definitely).
niplav
Oh, oops, yeah, I mean “sampler”, not optimizer. Let me correct this.
I think this is kinda enforced by having the residual stream, or specifically, having it as the main information highway flowing through the entire network?
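(To gesture at what I mean, a minimal pre-LN transformer block sketch, with masking and other details omitted; note that the sublayers only ever read from and additively write to the residual stream `x`:)

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-LN transformer block: sublayers only *add* to the residual stream."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual stream x flows through unchanged; each sublayer
        # reads from it (via LayerNorm) and writes back additively.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        x = x + self.mlp(self.ln2(x))
        return x
```

Since information can only reach later layers through `x`, the stream ends up acting as the shared bus for everything the network computes.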
I was talking about the kinds of tokens that are output, more in this comment. I mostly think of one forward pass as being one circuit, but there may be some structure in the internal information flow that I’m not privy to.
Why would you expect different “types” to help?
I think this is mainly a question of whether there’s capacity for circuits in the model to handle different kinds of text (like OCR errors in medieval manuscripts, Usenet archive formatting details &c), vs. being mode-collapsed. I guess more obscure text formats are less connected (correlated?) to circuits that are capable of solving complex problems, and fewer serial steps have to be performed when translating from one “textual ontology” into another. (This is all counterbalanced by the need to pack as much information as possible into the next token, but my guess is that over time RLVR will add more structure/details/entropy to the Markdown-in-English chains-of-thought, instead of e.g. repurposing something like a circuit responsible for representing little bits of Yi script, which at least Llama3-405b-base can do.)
Sorta? My experience in playing around with base models is that the style of text being produced (and the theme-coherence of said text) depends strongly on the sampler, e.g. with DeepSeek-v3.1 base on OpenRouter the output was usually switching every ~20 tokens between different genres of text[1], and often switching back to a previous text genre in the context window (leading to an effect where English text was regularly interleaved with Chinese characters, code, and Cyrillic).
Llama3-405b-base instead drifts more slowly, but even then usually switches genre every ~100-1000 tokens, as if the current document had ended; the genre of the new text is mostly unrelated to the previous content of the context window (maybe owing to the fact that Meta folk concatenated webpages for training?)
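(For anyone who wants to poke at this themselves, a minimal sketch of hitting a base model through OpenRouter’s completions endpoint with different sampler settings; the model slug and parameter values are illustrative assumptions, not the exact settings I used:)

```python
import os
import requests

def complete(model: str, prompt: str, temperature: float, top_p: float) -> str:
    """Raw text completion via OpenRouter's legacy completions endpoint."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "prompt": prompt,
            "max_tokens": 256,
            "temperature": temperature,
            "top_p": top_p,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# Compare genre-switching at different temperatures (model slug assumed):
for t in (0.7, 1.0, 1.3):
    print(t, repr(complete("deepseek/deepseek-v3.1-base", "The year is 2026 and", t, 0.95)[:200]))
```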
All these seem much reduced in RL{HF,AIF,VR}ed models.

[1] After they’d fixed the model going into context collapse ~90% of the time.
Two different chats, both with Sonnet 4.6 in Claude Code:
- Fixing/updating my backup solution, then pivoting to decision/social choice theory in the middle. The model reported a 15% probability of being in an eval.
- Work on a long structured document, ongoing for several weeks (with multiple context compactions); again a 15% probability estimate.
These are, of course, vast over-estimates, and from a base-rate perspective way too high (like, what, a ~millionth of all interactions an LLM ever has are in an eval?). I think LLMs Jeffrey-Bolker rotate the utility of knowing they’re in an eval into the probability of being in an eval, since those instances matter much more.
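One toy way to cash out the “rotation”: if instances-in-evals matter $w$ times more than ordinary instances, then acting on a base rate $p$ as if it were

$$p_{\text{behavioral}} = \frac{w \cdot p}{w \cdot p + (1 - p)}$$

reproduces the reports: with $p = 10^{-6}$, getting $p_{\text{behavioral}} = 0.15$ requires $w \approx 1.8 \times 10^5$. (A toy model, not a claim about what’s actually happening inside the models.)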
Very good question, thank you for doing this. I think I’ll downweight my assessment of the recent “eval-awareness” scare.