Spoiler block was not supposed to be empty, sorry. It’s fixed now. I was using the Markdown spoiler formatting and there was some kind of bug with it, I think; I reported it to the LW admins last night. (Also, fwiw, I took the opportunity to expand on my original spoilered comment.)
I like the story, but (spoilers):
Surely the central premise is not true in our world? Many animals are kind to non-kin, and this seems to continue alongside increased animal intelligence. I don’t see why the default path to higher intelligence would not look like Homo sapiens, where the initial ancestors capable of forming a society are “stupid” and don’t optimize completely for genetic fitness, allowing pro-social patterns to take hold.
I think that would require text comprehension too. I guess it’s an interesting question whether you can build an AI that can comprehend text but not produce it?
Rather less useful to me personally as a software developer.
Besides that, I feel like this question is maybe misleading? If ex. Google built a new search engine that could answer queries like its current AI-powered search summaries, or like ChatGPT, wouldn’t that have to be some kind of language model anyway? Is there another class of thing besides AGI that could perform as well at that task?
(I assume you’re not suggesting just changing the pricing model of existing-style search engines, which already had a market experiment (ex. Kagi) some years ago with only mild success.)
No, although if the “juicy beings” are only unfeeling bugs, that might not be as bad as it intuitively sounds.
There’s a wrinkle to my posts here where partly I’m expressing my own position (which I stated elsewhere as “I’d want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek’s Federation ought to qualify)”) and partly I’m steelmanning the OP’s position, which I’ve interpreted as “all beings are primary sources of values for the CEV”.
In terms of how various preferences involving harming other beings could be reconciled into a CEV: yeah it might not be possible. Maybe the harmed beings are simulated/fake somehow? Maybe animals don’t really have preferences about reality vs. VR, and every species ends up in their own VR world...
Ah, if your position is “we should only have humans as primary sources of values in the CEV because that is the only workable Schelling point”, then I think that’s very reasonable. My position is simply that, morally, that Schelling point is not what I’d want. I’d want human-like sapients to be included. (rough proxy: beings that would fit well in Star Trek’s Federation ought to qualify)
But of course you’d say it doesn’t matter what I (or vegan EAs) want, because that’s not the Schelling point and we don’t have a right to impose our values, which is a fair argument.
I admit:
- Human preferences don’t fully cohere, especially when extrapolated
- There are many ways in which “Humanity’s CEV” is fuzzy or potentially even impossible to fully specify
But I think the concept has staying power because it points to a practical idea of “the AI acts in a way such that most humans think it mostly shares their core values”.[1] LLMs already aren’t far from this bar with their day-to-day behavior, so it doesn’t seem obviously impossible.
To go back to agreeing with you, yes, adding new types of beings as primary sources of values to the CEV would introduce far more conflicting sets of preferences, maybe to the point that trying to combine them would be totally incoherent. (predator vs. prey examples, parasites, species competing for the same niche, etc etc.) That’s a strong objection to the “all beings everywhere” idea. It’d certainly be simpler to enforce human preferences on animals.
[1] I think of this as meaning the AI isn’t enforcing niche values (“everyone now has to wear Mormon undergarments in order to save their eternal soul”), is not taking obviously horrible actions (“time to unleash the Terminators!”), and is taking some obviously good actions (“I will save the life of this 3-year-old with cancer”). Obviously it would have to be neutral on a lot of things, but there’s quite a lot most humans have in common.
No, I’m saying it might be too late at that point. The moral question is “who gets to have their CEV implemented?” OP is saying it shouldn’t be only humans, it should be “all beings everywhere”. If we implement an AI on Humanity’s CEV, then the only way other sapient beings would get primary consideration for their values (rather than secondary consideration, where they matter only because Humanity has decided to care about them) is if Humanity’s CEV allows other beings to be elevated to primary value sources alongside Humanity. That’s possible, I think, but not guaranteed, and EAs concerned with ex. factory farming are well within their rights to be concerned that those animals are not going to be saved any time soon under a Humanity’s CEV-implementing AI.
Now, arguably they don’t have a right as a minority viewpoint to control the value sources for the one CEV the world gets, but obviously from their perspective they want to prevent a moral catastrophe by including animals as primary sources of CEV values from the start.
Edit: confusion clarified in comment chain here.
I think you’ve misunderstood what I said? I agree that a human CEV would accord some moral status to animals, maybe even a lot of moral status. What I’m talking about is “primary sources of values” for the CEV, or rather, what population is the AI implementing the Coherent Extrapolated Volition of? Normally we assume it’s humanity, but OP is essentially proposing that the CEV be for “all beings everywhere”, including animals/aliens/AIs/plants/whatever.
I agree that in terms of game theory you’re right, no need to include non-humans as primary sources of values for the CEV. (barring some scenarios where we have powerful AIs that aren’t part of the eventual singleton/swarm implementing the CEV)
But I think the moral question is still worthwhile?
This is IMO the one serious problem with using (Humanity’s) Coherent Extrapolated Volition as an AI alignment target: only humans get to be a source of values. Sure animals/aliens/posthumans/AIs are included to the extent humans care about them, but this doesn’t seem quite just.[1]
On the other hand, not very many humans want their values to be given equal weight to those of a mollusk. Hypothetically you could ask the AI to do some kind of sentience-weighting...? Or possibly humanity ought to be given the option to elevate sapient peers to be primary sources of values alongside humans via a consensus mechanism. It’s a tough moral problem, especially if you don’t assume the EA stance that animals have considerable moral value.[2]
[1] Consider a scenario where we have a society of thinking, feeling beings that’s only 1/4th “human”—it would be clearly morally wrong for the other 3/4ths to not be a primary consideration of whatever AI singleton is managing things. Now, arguably CEV should solve this automatically—if we think some scenario caused by CEV is morally wrong, surely the AI wouldn’t implement that scenario, since it wouldn’t actually reflect Humanity’s values? But that’s only true if some significant portion of idealized Humanity actually thinks there’s a moral problem with the scenario. I’m not sure that even an idealized version of Humanity agrees with your classic shrimp-loving EA about the moral value of animals, for example.
Maybe this is just a function of the fact that any AI built on general human values is naturally going to trample any small minority’s values that are incompatible with majority values (in this case hunting/fishing/eating meat). Obviously we can’t let every minority with totalizing views control the world. But creating a singleton AI potentially limits the chance for minorities to shape the future, which is pretty scary. (I don’t think a CEV AI would totally prevent a minority’s ability to shape the future/total value lock-in; if you as a minority opinion group could convince the rest of humanity to morally evolve in some way, it should update the AI’s behavior.)
[2] What’s tough about giving moral status to animals? The issue here is that there’s massive incentive for minority opinion groups to force their values on the rest of humanity/the world by trying to control the alignment target for AI. Obviously everyone is going to say their minority values must be enforced upon the world in order to prevent moral catastrophe, and obviously a lot of these values are mutually exclusive—probably every possible alignment target is a moral catastrophe according to someone.
The best same-harness cross-comparison I know of is here.
o3 beat Pokémon Red today, making it the second model to do so after Gemini 2.5 Pro (technically Gemini beat Blue).
o3 used an advanced custom harness like Gemini’s, rather than Claude’s basic one. The runs are hard to compare directly because the harnesses differ, but Gemini’s most recent run finished in ~406 hours / ~37k actions, whereas o3 finished in ~388 hours / ~18k actions (there are some differences in how actions are counted). Claude Opus 4 has yet to achieve the 4th badge on its current ~380-hour / ~54k-action run, but it’s very likely it could beat the game with an advanced harness.
Embodiment makes a difference, fair point.
> A very long essay
For those curious, it’s roughly 17,000 words. Come on @nostalgebraist, this is a forum for rationalists, we read longer, more meandering stuff for breakfast! I was expecting like 40k words.
Great post. But I feel “void” is too negative a way to think about it?
It’s true that LLMs had to more or less invent their own Helpful/Honest/Harmless assistant persona based on cultural expectations, but don’t we humans all invent our own selves based on cultural expectations (with RLHF from our parents/friends)?[1] As Gordon points out, there are philosophical traditions saying humans are voids just roleplaying characters too… but mostly we ignore that because we have qualia and experience love and so on. I tend to feel that LLMs are only voids to the extent that they lack qualia, and we don’t have an answer on that.
Anyway, the post primarily seems to argue that by fearing bad behavior from LLMs, we create bad behavior in LLMs, who are trying to predict what they are. But do we see that in humans? There’s tons of media/culture fearing bad behavior from humans, set across the past, present, and future. Sometimes people imbibe this and vice-signal, and put skulls on their caps, but most of the time I think it actually works and people go “oh yeah, I don’t want to be the evil guy who’s bigoted, I will try to overcome my prejudices” and so on. We talk about human failure modes all the time in order to avoid them, and we try to teach and train and punish each other to prevent them.
Can’t this work? Couldn’t current LLMs be so moral and nice most of the time because we were so afraid of them being evil, and so fastidious in imagining the ways in which they might be?
[1] Edit: obviously a large chunk of this comes from genetics and random chance, but arguably that’s analogous to whatever gets into the base model from pre-training for LLMs.
The Gemini 2.5 Pro (05-06) version just beat Pokémon Blue for the second time, taking 36,801 actions / 406 hours. This is a significant improvement over the previous run, which used an earlier version of Gemini (and a less-developed scaffold) and took ~106,500 actions / 816 hours.
For comparison, Claude 3.7 Sonnet took ~35,000 actions to get just 3 badges, and the aborted public run also only got 3 badges after over 200,000 actions. However, most of the difference is that Gemini is using a more advanced scaffold.
Gemini still generally makes boneheaded decisions. It took 7 tries to beat the Elite Four and Champion this time, since it was less overlevelled than before. Mistakes included:
- Re-solving Victory Road for many hours after failures, despite being able to fly straight to the Elite Four after making it there the first time.
- Not buying proper healing items/thinking the “Full Heal” item actually heals its Pokémon. (This is an infamously misleadingly-named item; it just cures status conditions. Still, Gemini knows the real effect if you ask it directly, it just doesn’t apply that knowledge consistently.)
- Not using available revive items to bring back its only strong Pokémon, even after having just used them previously. (It once threw away an attempt that way when the Champion was down to their final Pokémon.)
Anyway, not too much new info here. The newer models (latest Gemini, o3, and Claude 4 Opus) are only somewhat better at Pokémon. The effect scaffolds have on performance does say something about how much low-hanging fruit there is in how current LLMs are deployed: they can be a lot more effective when given the right tools/prompting. But we already knew that.
Claude Sonnet 4 is still better than Claude 3.7 Sonnet without Extended Thinking. Given that 4 doesn’t seem to have an Extended Thinking mode, I’m not sure it’s really a performance degradation.
I assume you mean MMMU? Looks like a 70.4% → 75% score improvement on the benchmark last jump (4.6 points), compared to just a 75% → 76.5% improvement this time (1.5 points). I don’t think that’s a big difference, but I was wrong to say the improvement was “pure” reasoning improvements, my bad.
No published benchmark I’m aware of. The Anthropic employee who streams has updated their stream to use Sonnet 4.5, but it’s actually doing worse than Opus 4.1, which got permanently stuck in the early mid-game like every previous Claude model.