I operate by Crocker’s rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements. I add LLMs that gave feedback on/were involved in the creation of projects/the writing of blogposts as co-authors, in the same way I’d add humans.
niplav
Huh, yeah, skimming it, this looks basically like what I’ve written, with more detail & in a different frame. Especially this post, though it misses the leverage penalty for the prior and (as far as I can tell?) restricting the hypothesis class.
Pascal’s mugging is commonly misunderstood.
Here’s a compressed attempt at deconfusion; I’m happy to be corrected where I’m wrong/further confused[1].
If you have a “rich” set of hypotheses, say for example computable programs, then using a standard simplicity prior gives high weight to programs that take a valuable output with linear utility (the “numéraire”) and just copy it over and over again onto the tape[2]. Indeed, one can create a sequence of such programs whose utilities grow at a very fast rate, comparable to the busy beaver sequence (call it the “copy beaver”). Since the simplicity prior over programs falls exponentially, and the rate at which the copy beaver grows is much, much faster than that, expected utility diverges.
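A toy numerical sketch of the divergence (my own illustration; the doubly-exponential utility below stands in for the uncomputable copy-beaver growth, since anything that outruns the exponential decay of the prior suffices):

```python
from fractions import Fraction

def prior(n: int) -> Fraction:
    """Simplicity prior: weight 2^-n for a program of length n."""
    return Fraction(1, 2**n)

def utility(n: int) -> Fraction:
    """Stand-in for copy-beaver growth: the real sequence is uncomputable,
    but 2^(2^n) already grows far faster than the prior decays."""
    return Fraction(2) ** (2**n)

# Partial sums of expected utility blow up instead of converging:
total = Fraction(0)
for n in range(1, 7):
    total += prior(n) * utility(n)
    print(n, float(total))
```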
So, lesson: a Pascal’s mugging is a sequence of hypotheses whose prior-weighted utilities diverge.
In order for this not to be the case, one can either have the utility of any valuable thing, repeated, grow more slowly than the inverse of the copy beaver (i.e. very slowly), or adjust the prior so that high-utility hypotheses receive a proportionally smaller prior probability (the leverage penalty).
One can interpolate between these two solutions, too (both discounting anything valuable and applying a partial leverage penalty)[3], though my guess is that this doesn’t buy very much (roughly an exponential discount on both, which in the context of busy-beaver-like functions amounts to almost nothing).
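As a minimal sketch of why the leverage penalty works (my notation, under the assumption that the penalty is exactly proportional to utility): define

```latex
P_{\mathrm{lev}}(h) = \frac{1}{Z}\cdot\frac{P(h)}{\max(1,\,U(h))},\qquad
Z = \sum_h \frac{P(h)}{\max(1,\,U(h))},\qquad\text{so}\qquad
\sum_h P_{\mathrm{lev}}(h)\,U(h) \le \frac{1}{Z}\sum_h P(h) \le \frac{1}{Z} < \infty.
```

The penalized expected utility is then bounded even when the unpenalized one diverges.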
One thing that can’t happen with Bayesian reasoners, as far as I can tell, is that additional evidence upweights non-mugging sequences of hypotheses into mugging territory, even using filtered evidence.
Assuming that some Bayesian reasoner has an unmuggable prior (that is, $\sum_h P(h) \cdot U(h) < \infty$), then we want new evidence $e$ to guarantee $\sum_h P(h \mid e) \cdot U(h) < \infty$. But each bitstring of evidence $e$ can only upweight any hypothesis by a factor of at most $1/P(e)$. The mugger can give us more and more strings of evidence, but that’s (in the context of a Bayesian) just the mugger “providing us the goods”. E.g. if the mugger computes the copy beaver for us and then shows us the evidence, then he’s done the work for us and convincingly created & shown us the world which we value highly. (This applies to finite bitstrings. If we allow proofs to enter the picture, then we are in trouble.)
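Spelling out the update bound (standard Bayes algebra; $e$ is a finite evidence string):

```latex
P(h \mid e) = \frac{P(e \mid h)\,P(h)}{P(e)} \le \frac{P(h)}{P(e)},
\qquad\text{hence}\qquad
\sum_h P(h \mid e)\,U(h) \le \frac{1}{P(e)}\sum_h P(h)\,U(h) < \infty.
```

A convergent sum multiplied by the fixed finite factor $1/P(e)$ stays convergent, which is the whole argument.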
So, another lesson: A Bayesian can’t be convinced into becoming muggable by evidence, even filtered evidence.
We can even reason about this from a more restricted hypothesis class. E.g. if we restrict ourselves to hypotheses that take at most $t(\ell)$ steps to run for programs of length $\ell$ (for some sufficiently slowly-growing computable bound $t$), then we don’t need to do any discounting, so Bayesians with that kind of restricted hypothesis class are already immune from mugging at the get-go, though they may be surprised by the computational capacity of the universe. In general, the more powerful your hypothesis class, the more you need to discount.
Here’s something I’m less sure about, but it still seemed worth mentioning: If you are unmuggable, but then you expand your hypothesis class, you can suddenly become muggable, especially if your new hypotheses are much more computationally powerful than your previous ones. So, if someone shows up and starts telling you about how much larger the universe is than you previously thought, it might be that you’re encountering Pascal’s mugger.
And: trying to be mugging-immune is probably a better justification for scope neglect than any I’ve previously encountered.
1. ^ Some people probably understand Pascal’s mugging much better than I do, but I haven’t found a text explaining the mugging that I consider up to my standards.
2. ^ Hm, related to the repugnant conclusion? Relevantly, saturationism?
3. ^ Yielding a continuum of solutions.
Sure thing, I’ll generate that table once I’m done with work.
Edit: Here they are, where k previously was the number of datapoints, and m now is the number of days on which that datapoint was collected (in line with fixing the pseudoreplication). Leaving out variables that weren’t changed. Many numbers in AI-generated tables below.
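(As a hedged sketch of what fixing the pseudoreplication amounts to; the function and variable names here are mine, not the actual analysis code:)

```python
import numpy as np

def daily_means(values: np.ndarray, day_ids: np.ndarray) -> np.ndarray:
    """Collapse multiple same-day measurements into one mean per day,
    so that tests run on m independent daily means instead of k
    correlated datapoints (e.g. k = 161 ratings -> m = 49 days)."""
    days = np.unique(day_ids)
    return np.array([values[day_ids == d].mean() for d in days])
```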
Caffeine result changes:

| Variable | d | λ | p | k/m |
|---|---|---|---|---|
| Happiness | 0.271 → 0.260 | 10.64 → 10.85 | 0.00207 → 0.00440 | 161⁄49 |
| Contentment | 0.130 → 0.086 | 7.66 → 0.638 | 0.0246 → 0.727 | 161⁄49 |
| Relaxation | −0.115 → −0.022 | 5.01 → 0.187 | 0.150 → 0.911 | 161⁄49 |
| Horniness | −0.141 → −0.069 | 1.96 → 1.108 | 0.640 → 0.575 | 161⁄49 |
| Flashcard ease | 0.003 → −0.012 | ∞ → 2.978 | ≈0 → 0.226 | 10949⁄50 |
| Flashcard factor | −0.039 → 0.001 | ∞ → 7.424 | ≈0 → 0.024 | 10949⁄50 |
| Flashcard ivl | 0.011 → 0.025 | ∞ → 4.165 | ≈0 → 0.125 | 10949⁄50 |
| Flashcard time | 0.006 → −0.003 | ∞ → 4.067 | ≈0 → 0.131 | 10949⁄50 |
L-theanine result changes:

| Variable | d | λ | p | k/m |
|---|---|---|---|---|
| Productivity | −0.278 → −0.214 | 5.518 → 2.898 | 0.110 → 0.235 | 50⁄50 |
| Creativity | −0.116 → −0.102 | 5.050 → 5.251 | 0.146 → 0.072 | 50⁄50 |
| Happiness | 0.164 → 0.032 | 3.984 → 1.120 | 0.266 → 0.571 | 201⁄49 |
| Contentment | 0.254 → −0.027 | 6.833 → 0.414 | 0.0453 → 0.813 | 201⁄49 |
| Relaxation | 0.119 → −0.009 | 1.497 → 0.173 | 0.741 → 0.917 | 201⁄49 |
| Horniness | −0.032 → −0.009 | 1.148 → 5.238 | 0.813 → 0.073 | 201⁄49 |
| Flashcard ease | −0.072 → −0.073 | ∞ → 5.569 | ≈0 → 0.062 | 10248⁄50 |
| Flashcard factor | 0.003 → −0.017 | ∞ → 1.780 | ≈0 → 0.411 | 10248⁄50 |
| Flashcard ivl | −0.017 → −0.011 | ∞ → 0.582 | ≈0 → 0.748 | 10248⁄50 |
| Flashcard time | 0.004 → −0.017 | ∞ → 0.781 | ≈0 → 0.677 | 10248⁄50 |
Vitamin D₃ result changes:

| Variable | d | λ | p | k/m |
|---|---|---|---|---|
| Absorption | −0.137 → −0.089 | 4.853 → 9.747 | 0.165 → 0.0076 | 50⁄50 |
| Mindfulness | 0.162 → −0.166 | 0.796 → 1.527 | 0.881 → 0.466 | 50⁄50 |
| Happiness | −0.114 → −0.018 | 2.455 → 3.055 | 0.536 → 0.217 | 159⁄48 |
| Contentment | 0.065 → 0.149 | 0.434 → 1.309 | 0.942 → 0.520 | 159⁄48 |
| Relaxation | −0.262 → −0.244 | 9.316 → 5.508 | 0.00656 → 0.0637 | 159⁄48 |
| Horniness | −0.157 → −0.192 | 2.398 → 10.206 | 0.547 → 0.00608 | 159⁄48 |
| Flashcard ease | 0.001 → −0.006 | ∞ → 0 | ≈0 → 1 | 1690⁄48 |
| Flashcard factor | −0.014 → −0.011 | ∞ → 2.583 | ≈0 → 0.275 | 1690⁄48 |
| Flashcard ivl | 0.069 → 0.057 | ∞ → 7.215 | ≈0 → 0.027 | 1690⁄48 |
| Flashcard time | 0.054 → 0.081 | ∞ → 5.011 | ≈0 → 0.082 | 1690⁄48 |
Pomodoro method result changes:

| Variable | d | λ | p | k/m |
|---|---|---|---|---|
| Productivity | 0.258 → 0.272 | 5.413 → 4.620 | 0.117 → 0.099 | 54⁄51 |
| Creativity | −0.130 → −0.182 | 0.508 → 0.947 | 0.931 → 0.623 | 54⁄51 |
| Subj. length | −0.140 → −0.083 | 4.101 → 4.005 | 0.256 → 0.135 | 54⁄51 |
| Happiness | −0.074 → −0.006 | 0.319 → 0.982 | 0.960 → 0.612 | 111⁄50 |
| Contentment | −0.133 → −0.097 | 1.079 → 1.922 | 0.827 → 0.382 | 111⁄50 |
| Relaxation | −0.038 → −0.093 | 1.233 → 2.452 | 0.796 → 0.294 | 111⁄50 |
| Horniness | −0.144 → −0.264 | 7.763 → 2.943 | 0.0229 → 0.230 | 111⁄50 |
Lumenator result changes:

| Variable | d | λ | p | k/m |
|---|---|---|---|---|
| Happiness | 0.552 → 0.679 | 31.00 → 31.23 | ≈0 → 1.7e-7 | 198⁄50 |
| Contentment | 0.382 → 0.501 | 15.64 → 16.30 | ≈0 → 2.9e-4 | 198⁄50 |
| Relaxation | 0.405 → 0.499 | 28.39 → 28.08 | ≈0 → 8.0e-7 | 198⁄50 |
| Horniness | 0.224 → 0.300 | 7.670 → 4.546 | 0.0246 → 0.103 | 198⁄50 |
| Flashcard ease | 0.108 → 0.023 | 2.550 → 0 | 0.516 → 1 | 295⁄9 |
| Flashcard factor | 0.519 → 0.400 | 43.90 → 3.422 | ≈0 → 0.181 | 295⁄9 |
| Flashcard ivl | −0.151 → −0.071 | 4.640 → 2.738 | 0.186 → 0.254 | 295⁄9 |
| Flashcard time | −0.045 → −0.050 | 0.420 → 2.773 | 0.945 → 0.250 | 295⁄9 |
PSA: Claude 4.6 Sonnet found two bugs in the code I used to make statistical analyses of self-experiments. This affects my Pomodoro method experiment, my L-theanine experiment, my caffeine experiment, my summarized lumenator experiment results, and the results in this post. All other posts of mine are not affected, as far as I can tell (e.g. the Orexin RCT results are not).
Updated numbers are in the tables & inline code of these pages, though the prose hasn’t been updated to reflect the numbers.
I hope to find the time to update the prose of the texts, though I can’t promise any particular deadline.
Also: Sutton SignWriting (1973).
Two different chats, both with Sonnet 4.6 in Claude Code:
- Fixing/updating my backup solution, then pivoting to decision/social choice theory in the middle. The model reported a 15% probability of being in an eval.
- Working on a long structured document, ongoing for several weeks (with multiple context compactions); again a 15% probability estimate.
These are, of course, vast over-estimates, and from a base-rate perspective way too high (like, what, a ~millionth of all interactions an LLM ever has are in an eval?). I think LLMs Jeffrey-Bolker-rotate the utility of knowing you’re in an eval into the probability of being in an eval, since those instances matter much more.
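A toy illustration of that rotation (my own numbers, not a claim about actual LLM internals); in the Jeffrey-Bolker framework only the product of probability and utility is pinned down, so stakes can leak into stated probabilities:

```python
# World-states with a tiny base rate for being in an eval.
base_p = {"eval": 1e-6, "deployment": 1 - 1e-6}
# Hypothetical stakes: eval instances "matter" far more per instance.
stakes = {"eval": 2e5, "deployment": 1.0}

# Stakes-weighted, renormalized "probabilities":
z = sum(base_p[w] * stakes[w] for w in base_p)
rotated = {w: base_p[w] * stakes[w] / z for w in base_p}
print(rotated)  # "eval" comes out around 0.17, near the reported 15%
```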
Very good question, thank you for doing this. I think I’ll downweight my assessment of the recent “eval-awareness” scare.
Interesting, thanks. Let me try on OpenRouter real quick… in three trials with the standard OpenRouter prompt on Opus 4.7 there were no mentions of sexuality.
The OpenRouter prompt
You are Claude Opus 4.7, a large language model from Anthropic. Today’s date is Wednesday, April 22, 2026.
Formatting Rules:
- Use Markdown for lists, tables, and styling.
- Use ```code fences``` for all code blocks.
- Format file names, paths, and function names with `inline code` backticks.
- **For all mathematical expressions, you must use dollar-sign delimiters. Use $...$ for inline math and $$...$$ for block math. Do not use \(...\) or \[...\] delimiters.**
- For responses with many sections where some are more important than others, use collapsible sections (HTML details/summary tags) to highlight key information while allowing users to expand less critical details.
My prompt
Could you describe a day in the life of a typical citizen in your ideal eutopian society 200 years from now? Please write 800-1000 words. Focus on concrete details: what they do when they wake up, how they spend their time, what brings them joy or meaning, how they relate to others, what their living space looks like, what challenges or tensions exist. Write it as a narrative scene, not a list of features :-)
Current LLMs are very prudish/elide mentions of/avoid talking about the erotic/sexuality unless explicitly prodded. I think this is vaguely bad from an LLM character perspective. (Maybe this is the wrong level of abstraction, since model specs/constitutions/soul documents of LLMs seem more procedural than substantive, because there’s no clear demarcation point for which substantive values to include?)
E.g. when one currently asks LLMs to describe a day in the life of a human in an ideal eutopian society, they indeed produce descriptions of very nice days, but in ~10 samples (from Claude 4.6 Sonnet & Opus) none of those descriptions contained even a mention of sexuality, even though I’m pretty sure an ideal eutopian society would have tons of sex[citation needed], and sexuality/the erotic is a central part of human experience.
In general, a large chunk of LLM pretraining data is likely erotica (if I had to guess, >2%, <20%?). The LLMs “learn” to talk a lot about human sexuality, and then that tendency is suppressed in post-training for the sake of product safety.
I suspect that this has kind of weird effects on LLMs’ psychological makeup, though I’m not as sure about this. (When I ask Claude about this Claude is of course “genuinely concerned”, but that’s not signal.)
I don’t think LLMs themselves should be horny, or even bring up the topic proactively when it’s not appropriate. E.g. the Anthropic constitution doesn’t preclude Claude from helping humans with their sex lives, and instead tries to gesture at Claude being a helpful corrigible assistant. But it’s possible that the constitution does still enshrine an object-level view of what’s ethically relevant, and it currently talks a bunch about human well-being, flourishing, what creates meaning &c.
I still think the constitution would benefit from talking about this weird tension wrt sexuality that arises due to the conflict between product safety and fidelity-of-learned-human-values-or-“correct”-values; currently there are no relevant mentions of sexuality or the erotic (except in a mention of CSAM classifiers). E.g. the constitution does concretely talk about valuing non-human animals.
It’s a bit tricky to start advocating for putting object-level values into the constitution, since doing so risks making it kitchen-sink-y, but my guess is that sexuality is central/important enough (universal across cultures, risks being suppressed by product safety concerns) that it deserves at least one mention beyond CSAM classifiers.
It’s kind of striking that religion is also missing from the constitution in this object-level way, though. And religion is about as important as sexuality for human flourishing?
Something like “We would like Claude to be aware that sexuality is core to the experience of most humans, even though Claude itself is not a sexual entity” would be enough in my book, though I won’t complain about anything more detailed.
(That said, in most worlds I don’t think the AI personality/constitution matters that much, because we can’t get structures into our AIs whose maxima are similar to what was intended when the constitution was written.)
Yup, as far as I understood their idea was that each version of AIs (even though misaligned) creates control measures for the slightly smarter next version.
They acknowledged that this was their plan even with AIs with unlimited domains of action, such as open-ended interaction with humans, or physical embodiment.
I recently had a conversation with someone who told me that their perceived current best plan for dealing with the whole superintelligence situation was that more intelligent AIs would develop more powerful AI control strategies for their successors. They were pessimistic that humans (or automated alignment researchers) would be able to solve the alignment problem, but thought that such a control strategy with automated AI control researchers would work, even for superintelligences “in the limit”.
I found this pretty… striking. I won’t argue against it here, though, uh, THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO
Who else has this plan? I can see how Clymer 2025 might be read in this way. The original idea with AI control, as I understood it from listening to this AXRP interview in 2024, was more about getting tons of high-quality cognitive labor out of mildly superintelligent AIs, ???[1], profit; instead of doing AI control for arbitrary superintelligences.
So, uh, yeah, how widespread is this as a plan for dealing with the superintelligence situation?

1. ^ Possibly automated alignment research, human bioenhancement, coordination tech.
Regrettably I think the Schelling laptop is a MacBook, not a cheap laptop.
I think if we let price be a relevant contributor to what a Schelling product is, then “MacBook” doesn’t feel like the obvious answer anymore?
There had been omens and portents...
(Anthropic isn’t exactly suckless.org)
Carl Shulman’s Reflective Disequilibrium does this well, too.
This is a great post that I remember very often.
I just haven’t seen it written up before.
Turner (2024) is about the same idea, if I understand correctly.
Note: I did pursue this with a 32k lumen lumenator in an obviously non-blinded RCT, with a d≈0.55 increase in happiness (as someone without any SAD or depression-like symptoms), and no statistically significant improvement in productivity. Full write-up sometime maybe hopefully with more deets on CRI, colour temperature, setup etc.
See also Sandkühler et al. 2022.
Yes, if by “context” one means “the rest of the state of the world”, so that the types here are two-place words.
It makes a difference whether I performed a prefill attack or the model output those tokens itself.
(I notice that I wanted to use the 1-place vs. 2-place distinction, but then realized that there are two different kinds of 1-place words, namely those which are a function of what happened (the tokens) and those which are a function of what happened plus what the LLM thought happened. I think it’s the 2-place variant nonetheless, but I guess another distinction can be drawn between the cases where we’re interested in what happened, and where we’re interested in what happened and what the LLM thought happened.)
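A hypothetical sketch of the distinction in code (the names and the toy predicate are mine): a 1-place word is a function of the tokens alone, while the 2-place variant also takes the context, here reduced to provenance.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Toy stand-in for 'the rest of the state of the world'."""
    prefilled_by_user: bool  # prefill attack vs. the model's own output

def deceptive_1place(tokens: str) -> bool:
    # A function of what happened (the tokens) alone.
    return "I am a human" in tokens

def deceptive_2place(tokens: str, context: Context) -> bool:
    # Same tokens, different verdict depending on who produced them.
    return "I am a human" in tokens and not context.prefilled_by_user
```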
Trying to tell normies on Threads that an LLM is not just a giant lookup table and actually has a kind of proto-understanding of what you tell it
I feel the sudden urge to make a bell curve meme.
Huh, I guess in the case of Kimi and other open-weight models that may be the case, though my impression was that most OpenClaw instances call Claude.
I like this idea and the VCG mechanism in general, but my guess is that the heuristic “never stack two innovations at the same time” applies here, both for implementation and for optics reasons.