Independent AI safety researcher currently funding my work by facilitating for BlueDot, but open to other opportunities. Participated in ARENA 6.0, AISC 9 and am a project lead for AISC 10. Happy to collaborate with anyone on the site, especially if it lowers our collective likelihood of death.
Sean Herrington
Was also at the protest, don’t think I disagree much with what was said here.
My personal attempt to make it more superintelligence-focussed was suggesting we start an x-risk chant (this was at the front of the march going towards the assembly).
“Anyone builds it we all die, it is time to pause AI” got some traction, but apparently not enough for the author to hear it. I think this one success is the sole reason I feel my personal attendance was worth it.
For those who are interested, I showed the climbing discussions to a friend who knows about climbing, and they described it as a “conversation between people who have read a lot about climbing but have never actually done it”
METR have released Time Horizons 1.1
Is the implication here that you should also be caring about genetic fitness as carried into the future? My basic calculation here was that in purely genetic terms, you should care about the entire earth’s population ~33x as much as a sibling (modulo family trees are a bunch messier at this scale, so you probably care about it more than that).
I feel like at this scale the fundamental thing is that we are just straight up misaligned with evolution (which I think we agree on).
The number of relations grows exponentially with distance while relatedness halves with each step, so cumulative relatedness grows with the log of the population: if you have e.g. 1 sibling, 2 cousins, 4 second cousins, etc., each layer makes an equivalent fitness contribution. log2(8 billion) ≈ 33, so a Fermi estimate of 100 seems around right?
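A quick sanity check of that estimate, assuming the idealised tree where each layer of relatives doubles in size and halves in relatedness:

```python
import math

population = 8_000_000_000

# Idealised family tree: layer k (siblings, cousins, second cousins, ...)
# contains 2**k people, each with relatedness 2**-k to you, so every
# layer contributes the same total relatedness as one sibling.
layers = math.log2(population)
print(round(layers))  # ~33 sibling-equivalents to cover the whole population
```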
If anything, I get the impression this is overestimating how much people actually care, because there’s probably an upper bound somewhere before this point.
Hmm, perhaps. My intuition behind discount factors is different, but I’m not sure it’s a crux here. I agree that extinction leads to 0 utility for everyone everywhere, but the point I was making was more that with low discount factors the massive potential of humanity has significant weight, while a high discount factor sends this to near 0.
In this worldview, near-extinction is no longer significantly better than extinction.
That aside, I think the stronger point is that if you only care about people near to you, spatially and temporally (as I think most people implicitly do), the thing you end up caring about is the death of maybe 10–1000 people (discounted by your familiarity with them, so probably at most equivalent to ~100 deaths of nearby family) rather than 8,000,000,000.

Some napkin maths as to how much someone with that sort of worldview should care: a 0.01% chance of doom in the next ~20 years then gives ~1% of an equivalent expected death over those 20 years. 20 years is ~175,000 hours, so that works out to one equivalent expected death per ~17.5 million hours, which would make it about 7.5x less worrisome than driving according to this infographic.
Again, very napkin maths, but I think my basic point is that a 0.01% P(doom) coupled with a non-longtermist, non-cosmopolitan view seems very consistent with “who gives a shit”.
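The napkin maths above, spelled out (the 0.01% doom figure and ~100 equivalent deaths are the comment's illustrative assumptions; the 7.5x driving comparison comes from the linked infographic, not from this calculation):

```python
p_doom = 0.0001          # 0.01% chance of doom in the next ~20 years
equivalent_deaths = 100  # nearby-family deaths you'd weight doom as

# ~1% of one equivalent expected death over the 20-year window
expected_deaths = p_doom * equivalent_deaths

hours = 20 * 365 * 24    # ~175,000 hours in 20 years
hours_per_death = hours / expected_deaths
print(f"{hours_per_death / 1e6:.2f} million hours per equivalent expected death")
# ~17.5 million hours per equivalent expected death
```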
I think there’s an implicit assumption of tiny discount factors here, which are probably not held by the majority of the human population. If your utility function is such that you care very little about what happens after you die, and/or you mostly care about people in your immediate surroundings, your P(doom) needs to be substantially higher for you to start caring significantly.
This is not to mention Pascal’s mugging type arguments: arguably you shouldn’t make significant life choices based on a small probability of some very large outcome.
This is not to say that I’m against x-risk research – my P(doom) is about 60% or so. This is more just to say that I’m not sure people with a non-EA worldview should necessarily be convinced by your arguments.
Yeah, so some of the posts on Moltbook are definitely human, but I think the bulk are AI, and I get the impression that style-wise they end up very similar to the main posts here.
Moltbook shitposts are actually really funny
It feels weird to comment on a post I haven’t read, but I feel like it would be worth breaking this into parts (both the post and the video). I feel like there is probably stuff worth reading/watching in there and would happily do so if it was broken into, e.g., 8x 30 min discussions, but the current length introduces a lot of friction to starting it.
I wrote this because I think this is probably a thing going through a fair few people’s minds, and these people are being selected out of the comments, so I think it’s probably differentially useful feedback.
AGI both does and doesn’t have an infinite time horizon
As someone who spent an unreasonable chunk of 2025 wading through the 1.8M words of planecrash, this post does a remarkable job of covering a large portion of the real-world relevant material directly discussed (not all—there’s a lot to cover where planecrash is concerned). I think one of the main things lacking in the review is a discussion of some of the tacit ideas which are conveyed—much of the book seems to be a metaphor for the creation of AGI, and a wide range of ideas are explored more implicitly.
All in all, I think this post does a decent job of compressing a very large quantity of material.
Interesting post! I think the heavier weight on octopuses you got is partly down to the narrower range of models you tested (the 30% partly came out of averaging over the range of models tested; individual models had stronger preferences).
I think there’s also a difference in the system prompt used for API vs chat usage (in that I imagine there is none for the API). This would be my main guess for why you got significantly more corvids—I’ve seen both this and the increased octopus frequency when doing small tests in chat.
On the actual topic of your post, I’d guess that the conclusion is that AI’s metacognitive capabilities are situation-dependent? The question would then be in what situations it can/can’t reason about its own thought process.
I think there are a couple of things which are quite clearly different from MIRI’s original arguments:
They originally argued a fair amount that AI would go from vastly subhuman to vastly superhuman over an extremely short time (e.g. hours or days, rather than the years we are currently seeing). This affects threat dynamics.
A lot of their arguments were based around optimising value functions. This is still a very valid way to look at things for RL agents, but it’s unclear that it’s the best way to compress an agent’s behaviour with LLM-based methods: simulator theory seems much more appropriate, and comes with a bunch of different risks.
I still think that the basic argument of “if you take something you don’t understand and can’t control very well and scale it up to superintelligence, that seems bad” holds.
I just played Gemini 3, Claude 4.5 Opus and GPT 5.1 at chess.
It was just one game each but the results seemed pretty clear—Gemini was in a different league to the others. I am a 2000+ rated player (chess.com rapid), but it successfully got a winning position multiple times against me, before eventually succumbing on move 25. GPT 5.1 was worse on move 9 and losing on move 12, and Opus was lost on move 13.
Hallucinations followed the same pattern: ChatGPT hallucinated first, on move 10, and hallucinated the most frequently; Claude hallucinated for the first time on move 13; and Gemini made it to move 20, despite playing a more intricate and complex game (I struggled significantly more against it).
Gemini was also the only AI to follow proper etiquette and resign once lost; GPT just kept on playing down a ton of pieces, and Claude died quickly.
Games:
Gemini: https://lichess.org/5mdKZJKL#50
Claude: https://lichess.org/Ht5qSFRz#55
GPT: https://lichess.org/IViiraCf
I was white in all games.
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
Humanity, 2025 snapshot
Dumb idea for dealing with distribution shift in alignment:
Use your alignment scheme to train the model on a much wider distribution than it will see in deployment; this is one of the techniques used to get training to generalise for quadruped robots in this paper.
It seems to me that if you make your training distribution wide enough this should be sufficient to cover any deployment distribution shift.
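A minimal sketch of the idea (domain randomisation over a deliberately widened training distribution). All names and parameter ranges here are hypothetical placeholders for illustration, not taken from the cited paper:

```python
import random

def sample_wide_environment():
    # Randomise environment parameters over ranges deliberately wider
    # than anything expected at deployment, so the deployment
    # distribution sits inside the training distribution.
    return {
        "friction": random.uniform(0.2, 2.0),
        "mass_scale": random.uniform(0.5, 1.5),
        "sensor_noise": random.uniform(0.0, 0.1),
    }

def train(policy_update, episodes=1000):
    # Each training episode runs in a freshly randomised environment.
    for _ in range(episodes):
        env_params = sample_wide_environment()
        policy_update(env_params)
```

The hope, on this view, is that an alignment property trained to hold across all of these environments also holds in any deployment environment the ranges cover.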
I fully expect to be wrong and look forward to finding out why in the comments.
Hmm, interesting. I will note that Deepseek didn’t seem to have much of a cat affinity: 1 and 3 respectively for chat and reasoner. Chat was very pro-octopus and didn’t really look at much else; reasoner was fairly broad and pro-dog (47).
I’m not one of the impossible crowd, but I had a long discussion with @Remmelt about his views on this, and a significant part of the argument (at least the part I understood and feel I can adequately convey) seems to be more about keeping an AI aligned once you have it there. A vague outline goes something like this:
1. Imagine you have an aligned ASI
2. A sufficiently powerful system will likely have subsystems
3. There are always going to be aspects of the subsystems which cannot be controlled for
4. Those aspects will provide natural variance within which natural selection will occur
5. Gradually, we end up with a system which wants to reproduce (or at the very least with capable subsystems which do, and are capable of it in a way the system cannot control)
6. Purely reproduction-oriented AIs are misaligned
Perhaps this would be best phrased as “there is a capability level above which it is impossible to keep an ASI aligned”, but I think these dynamics obviously apply to modern-day systems as well.