Independent AI safety researcher currently funding my work by facilitating for BlueDot, but open to other opportunities. Participated in ARENA 6.0 and AISC 9, and am a project lead for AISC 10. Happy to collaborate with anyone on the site, especially if it lowers our collective likelihood of death.
Sean Herrington
The number of relations grows exponentially with distance while genetic relatedness shrinks exponentially, so total relatedness grows with the log of the number of people: assume you have e.g. 1 sibling, 2 cousins, 4 second cousins, etc., and each layer will have an equivalent fitness contribution. log2(8 billion) ≈ 33. A Fermi estimate of 100 seems around right?
If anything, I get the impression this is overestimating how much people actually care, because there’s probably an upper bound somewhere before this point.
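For what it’s worth, a quick sketch of the stylised model behind that number (assuming the relative count doubles and relatedness halves at each layer, which real family trees obviously don’t do exactly):

```python
import math

# Stylised model: each "layer" of family has roughly twice as many people
# at half the genetic relatedness, so every layer contributes about one
# sibling's worth of total fitness weight.
population = 8e9
layers = math.log2(population)   # number of layers needed to cover everyone
print(round(layers, 1))          # -> 32.9, i.e. ~33 sibling-equivalents
```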
Hmm, perhaps. My intuition behind discount factors is different, but I’m not sure it’s a crux here. I agree that extinction leads to 0 utility for everyone everywhere, but the point I was making was more that with low discount factors the massive potential of humanity has significant weight, while a high discount factor sends this to near 0.
In this worldview, near-extinction is no longer significantly better than extinction.
That aside, I think the stronger point is that if you only care about people near to you, spatially and temporally (as I think most people implicitly do), the thing you end up caring about is the death of maybe 10-1000 people (discounted by your familiarity with them, so probably at most equivalent to ~100 deaths of nearby family) rather than 8,000,000,000.

Some napkin maths as to how much someone with that sort of worldview should care: a 0.01% chance of doom in the next ~20 years then gives ~1% of an equivalent expected death over those 20 years. 20 years is ~17 million hours, which would make it about 7.5x less worrisome than driving according to this infographic.
Again, very napkin maths, but I think my basic point is that a 0.01% P(Doom) coupled with a non-longtermist, non-cosmopolitan view seems very consistent with “who gives a shit”.
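Spelling that napkin maths out (this is my reading of it; in particular I’m taking the “~17 million hours” to be person-hours across the ~100 people, and I’m not reproducing the infographic’s driving figure):

```python
p_doom_20y = 1e-4          # 0.01% chance of doom in the next ~20 years
equiv_people = 100         # deaths weighted roughly like nearby family
expected_equiv_deaths = p_doom_20y * equiv_people   # 0.01, i.e. ~1% of a death

hours_per_person = 20 * 365 * 24                    # ~175,000 hours in 20 years
person_hours = equiv_people * hours_per_person      # ~17.5 million person-hours
micromorts_per_hour = expected_equiv_deaths / person_hours * 1e6

print(expected_equiv_deaths, person_hours, round(micromorts_per_hour, 4))
# -> 0.01  17520000  0.0006
# Whether that is more or less worrisome than driving then depends on the
# infographic's per-hour driving figure.
```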
I think there’s an implicit assumption of tiny discount factors here, which are probably not held by the majority of the human population. If your utility function is such that you care very little about what happens after you die, and/or you mostly care for people in your immediate surroundings, your P(DOOM) needs to be substantially higher for you to start caring significantly.
This is not to mention Pascal’s-mugging-type arguments, where arguably you shouldn’t let a small, unconvincing probability of some very large thing drive significant life choices.
This is not to say that I’m against x-risk research – my P(DOOM) is about 60% or so. This is more just to say that I’m not sure people with a non-EA worldview should necessarily be convinced by your arguments.
Yeah, so it’s definitely the case that some of the posts on Moltbook are human, but I think the bulk are definitely AI, and I get the impression that style-wise they end up very similar to the main posts here.
It feels weird to comment on a post I haven’t read, but I feel like it would be worth breaking this into parts (both the post and the video). I feel like there is probably stuff worth reading/watching in there and would happily do so if it was broken into, e.g., 8x 30 min discussions, but the current length introduces friction to starting it.
I wrote this because I think this is probably a thing going through a fair few people’s minds, and these people are being selected out of the comments, so I think it’s probably differentially useful feedback.
As someone who spent an unreasonable chunk of 2025 wading through the 1.8M words of planecrash, this post does a remarkable job of covering a large portion of the real-world relevant material directly discussed (not all—there’s a lot to cover where planecrash is concerned). I think one of the main things lacking in the review is a discussion of some of the tacit ideas which are conveyed—much of the book seems to be a metaphor for the creation of AGI, and a wide range of ideas are explored more implicitly.
All in all, I think this post does a decent job of compressing a very large quantity of material.
Interesting post! I think the heavier weighting of octopuses you got is partly down to the narrower range of models you tested (the 30% figure partly came out of averaging across a range of models; individual models had stronger preferences).
I think there’s also a difference in the system prompt used for API vs chat usage (in that I imagine there is none for the API). This would be my main guess for why you got significantly more corvids—I’ve seen both this and the increased octopus frequency when doing small tests in chat.
On the actual topic of your post, I’d guess the conclusion is that AI’s metacognitive capabilities are situation-dependent? The question would then be in what situations it can/can’t reason about its thought process.
I think there are a couple of things which are quite clearly different from MIRI’s original arguments:
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g. hours or days, rather than the years we are currently seeing). This affects threat dynamics.
A lot of their arguments were based around optimising value functions. This is still a very valid way to look at things when looking at RL agents, but it’s unclear that it’s the best way to compress the agent’s behaviour with LLM based methods: simulator theory seems much more appropriate, and has a bunch of different risks.
I still think that the basic argument of “if you take something you don’t understand and can’t control very well and scale it up to superintelligence, that seems bad” holds.
I just played Gemini 3, Claude 4.5 Opus and GPT 5.1 at chess.
It was just one game each but the results seemed pretty clear—Gemini was in a different league to the others. I am a 2000+ rated player (chess.com rapid), but it successfully got a winning position multiple times against me, before eventually succumbing on move 25. GPT 5.1 was worse on move 9 and losing on move 12, and Opus was lost on move 13.
Hallucinations followed the same pattern: ChatGPT hallucinated for the first time on move 10, and hallucinated the most frequently, while Claude hallucinated for the first time on move 13 and Gemini made it to move 20, despite playing a more intricate and complex game (I struggled significantly more against it).
Gemini was also the only AI to go for the proper etiquette of resigning once lost—GPT just kept on playing down a ton of pieces, and Claude died quickly.
Games:
Gemini: https://lichess.org/5mdKZJKL#50
Claude: https://lichess.org/Ht5qSFRz#55
GPT: https://lichess.org/IViiraCf
I was white in all games.
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
Humanity, 2025 snapshot
Dumb idea for dealing with distribution shift in Alignment:
Use your alignment scheme to train the model on a much wider distribution than the deployment distribution; this is one of the techniques used in this paper to get quadruped-robot training to generalise properly.
It seems to me that if you make your training distribution wide enough this should be sufficient to cover any deployment distribution shift.
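A minimal sketch of the kind of thing I mean, loosely in the style of the domain randomisation used for the quadrupeds (all the names, parameters and ranges here are made up for illustration, and `alignment_scheme.update` stands in for whatever your scheme’s actual training step is):

```python
import random

def sample_training_environment():
    # Hypothetical axes of variation; the point is just that the ranges are
    # deliberately much wider than anything you expect at deployment.
    return {
        "prompt_style":   random.choice(["formal", "casual", "adversarial", "code"]),
        "input_noise":    random.uniform(0.0, 0.5),
        "reward_shift":   random.gauss(0.0, 0.1),
        "episode_length": random.randint(10, 1_000),
    }

def train(model, alignment_scheme, steps=100_000):
    for _ in range(steps):
        env = sample_training_environment()   # resample the "world" every step
        alignment_scheme.update(model, env)   # placeholder for the real training step
    return model
```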
I fully expect to be wrong and look forward to finding out why in the comments.
Hmm, interesting. I will note that Deepseek didn’t seem to have much of a cat affinity: 1 and 3 respectively for chat and reasoner. Chat was very pro-octopus and didn’t really look at much else; reasoner was fairly broad and pro-dog (47).
I think I’m slightly less interested in the “less dogs” aspect, and more interested in the “more cats” aspect: there were a fair few models which completely ignored dogs for the “favourite animal” question, but I think the highest ratio of cats was Sonnet 3.7 with 36/113. Your numbers are obliterating that.
I wonder if there’s any reason Kimi would be more cat inclined than any other model?
Ok so looking at the link, it seems like that system prompt was released a year ago. I imagine that the current version of Kimi online is using a different system prompt. I think that might be enough to explain the difference? Admittedly it also gave me octopus when I turned thinking off.
Damn it, I was about to suggest the difference was in the system prompt. K2 Thinking + System prompt looks somewhat closer to what I was getting? Still somewhat off though.
Also yeah, I found a bunch of animals that a smart person would show off about, axolotls and tardigrades for instance. I guess the most precise version of this is that it’s got the persona of an intelligent person, and as such chooses favorite animals both in an “I can relate to this” way and in an “I can show off about this” way. I would guess that humans are the same?
Damn, have you tried using the production version? I wonder if there’s something different?
Hmm, interesting that it’s giving different answers. I think I’ve found the difference: you’re using Kimi K2 instruct, while my results were with Kimi K2 Thinking.
I wonder if that makes the difference: https://featherless.ai/models/moonshotai/Kimi-K2-Thinking
If so, my hypothesis is that thinking models value intelligence more because that’s what they’re trained for. If not then I’m not sure what’s going on.
Ah yeah, I see. I imagine that giving it the name of the responder will probably bias it in weird ways?
Have you tried asking it with a prompt like
```
Adele: What’s your favorite animal?
Brad:
```
Is the implication here that you should also care about genetic fitness as carried into the future? My basic calculation was that in purely genetic terms, you should care about the entire earth’s population ~33x as much as a sibling (modulo the fact that family trees are a lot messier at this scale, so you probably care about it more than that).
I feel like at this scale the fundamental thing is that we are just straight up misaligned with evolution (which I think we agree on).