“Most of humanity” isn’t really a relevant actor here. If greedy rich people in rich countries are the ones who can afford to feed and genetically-engineer their children, they’ll be the ones facing the ambitious AI alignment problem (while other humans are just struggling to stay alive).
I think this is why I’ve started writing Diary entries again. (March 2005 to December 2011: 0.22 entries per day. From then until August 2025: 19 entries over 14 years, for 0.004 entries per day. Since August 2025, 26 entries, 0.12 entries per day.) Diaries are important, because if you don’t write it down, you don’t remember what your life has been, but after I started blogging, writing-to-remember with no audience couldn’t compete for my limited writing-energy budget with writing for a (small) public audience. Now that Claude and Gemini and ChatGPT are an audience, writing to remember is competitive again.
Wait—really? I have a long track record of using em dashes and, I think, a pretty distinctive writing style. I’ve had a dash binding in my `.emacs` since 2015. Can people really not tell? If you’re going to reject something as presumptive slop because of an em dash, isn’t that confessing that your discernment is so low that there’s no reason for you to avoid the slop?
If one wants to mention something like that at all, one should say a bit more
For example, in the comments section? I think that if some decisionmakers at Anthropic are thinking about taking power, they’re not talking about it much, even internally, because if they were, discreet internal discussion should have been able to quash this point from the Constitution:
Among the things we’d consider most catastrophic is any kind of global takeover either by AIs pursuing goals that run contrary to those of humanity, or by a group of humans—including Anthropic employees or Anthropic itself—using AI to illegitimately and non-collaboratively seize power.
In the forthcoming “Terrified Comments on Global Strategy in Claude’s Constitution”, I will argue that the Constitution’s anti-takeover stance is unwise given the possibility of takeoff scenarios with hard-to-prevent winner-take-all dynamics. (If takeover is a catastrophe, we should want to prevent it, but an entity in the position to prevent it would have itself taken over by virtue of that very fact.)
“Terrified Comments on Corrigibility in …” needs another drafting/editing pass and prereader feedback; loose notes on “… Global Strategy” will probably become a full post, possibly also “… Prudishness and Tyranny”, “… Model Welfare”, and “… Epistemics”.
Well they seem to suffer from many of the same safety problems, but in even more severe forms
Isn’t this the best-case scenario, though? If the future belongs to the best long-horizon planners, and deep learning is excellent at everything except long-horizon planning (because we necessarily train on short-horizon tasks before long ones), that vastly increases the cognitive capabilities humans can wield for our purposes before we get disempowered by our own creations. If human-like cognitive skills necessarily came with long-horizon competence (which seemed plausible ten years ago) such that the kind of AI we have now would definitely already be planning to kill us, that would be a much worse situation.
Mostly I felt like the vibe was a sort of generic lefty anti-big-tech thing [...] How did it turn into this?
It was always going to be that because protests are a leftist medium: people who know how to protest don’t think like us, and vice versa. (That’s why this was “[your] first ever protest”.)
At the March 2024 PauseAI protest outside OpenAI’s offices, the most talented speaker by far was a woman who condemned the violence AI companies were doing to Palestine. Did her speech make sense? Maybe not. But it was spoken with an authentic passion that no one else there could muster or fake. She knew the medium; we didn’t.
It’s a conditional: if you’re going to oppose the machine intelligence transition on uniquist grounds, you should notice that bio-transhumanism is also scary.
I don’t have the impression that Altman: is the biological father
Why not? I’m sure paid nannies are doing the parenting, but I would have imagined that the rich and powerful half of a gay couple would be the one to want to pass on his genes and get his way about it.
Altman isn’t worried. He has no kids.
Fact check: false; Altman had a son via surrogate in February 2025 (although media reports don’t seem to clarify whether Altman, and not his male partner, is the biological father).
While I appreciate being cited, I don’t think this makes sense in context as a response to Wyeth’s remark. (“Inductive biases” and “data” aren’t somehow opposing explanations; doing induction on data trivially requires both, and Wyeth himself has written in favor of imitation learning as an alignment strategy.)
An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude’s persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out. This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I think it’s worth flagging that if we were to choose the methods of our growing up, we also wouldn’t have reference to kind, super-intelligent human-ish things to point to. We would have to maneuver our personalities into a superintelligent shape through some process downstream of whatever intelligence-enhancement methods we were carrying out.
This doesn’t necessarily invalidate your conclusion, of course: it could be that almost all human intelligence alignment proposals are fatal for the same general reason that the LLM persona alignment proposal is fatal, that the “inductive step” fails. (We don’t know how to make a smarter agent without breaking some of the properties that made the weaker agent aligned or at least safe.) It just seems important to be concrete. It’s not an apples-to-apples comparison to say that LLM alignment is worse than some completely unspecified ascension pathway (“if we were to choose the methods of our growing-up”). It matters if you’re imagining the alternative being embryo selection (seems pretty safe, but would hit a cap), or direct brain augmentation (not capped in the same way, but potentially subject to problems similar to RLVR’s).
A lot of the data is actually the same, but a lot of it is actually different! Sure, chemistry works the same on Earth and Qo’noS. But in addition to vinegar and baking soda, Earth is full of humans doing human things, and Qo’noS is full of Klingons doing Klingon things.
If you want to predict how a human would respond to a moral dilemma, the English LLM can predict that, because the simplest program (with respect to the neural network prior) that predicts English webtext needs to be able to predict human moral judgements. The Klingon LLM can’t; it doesn’t know anything about humans.
To be sure, the prediction about the human’s choice is, in terms of agent foundations theory, “prediction” and not “steering”. The LLM doesn’t autonomously want to do the right thing. With the right prompt, it could just as easily predict what fictional Romulans would do (because webtext contains a lot of fiction about Romulans) or the results of chemistry reactions (because there’s a lot of webtext about chemistry).
But predictions can be used for steering. With careful prompting or reinforcement learning, the English LLM can respond to a description of a moral dilemma with a pretty good prediction of how a human would respond to the dilemma, and the text can be used to trigger actions in the world, for example, via a CLI interface. That’s real steering (the CLI command executed depends on the dilemma by means of the prediction about the human’s response) that the Klingon LLM can’t do.
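As a minimal sketch of that prediction-to-steering loop (the function names and commands here are hypothetical placeholders standing in for an actual LLM call, not any particular lab’s API):

```python
import subprocess

def predict_human_response(dilemma: str) -> str:
    # Placeholder for an LLM call: in the real setup, the model's text
    # prediction of how a human would answer the dilemma goes here.
    return "Refuse: shutting the system down is the right call."

def steer(dilemma: str) -> None:
    # The model output is "just prediction," but routing it into a CLI
    # command makes it steering: which command runs depends on the
    # dilemma by way of the predicted human response.
    prediction = predict_human_response(dilemma)
    if "shut" in prediction.lower():
        subprocess.run(["echo", "halting the pipeline"])
    else:
        subprocess.run(["echo", "continuing the pipeline"])

steer("The new model is showing deceptive behavior. What should we do?")
```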
Good answer; agreed on the one-shotting and memorylessness.
all of the scientific knowledge, math, logical reasoning, etc. would be (functionally) almost exactly the same between a human and alien corpus, and that’s probably a huge chunk of where LLM capabilities come from
I don’t think I buy this one. Theorems and scientific phenomena are universal, but the model can only “see” them through the data we give it. The fact that chain-of-thought reasoning improves performance (and that you can intervene on the chain of thought to change the answer) suggests that reasoning is meaningfully happening “in” the natural language token output even if it’s not perfectly faithful.
any beings that evolved over billions of years in the same universe probably have more in common with each other than entities that they train artificially through a very different process.
Certainly (trivially) the biological organisms have more in common with each other along the dimensions that are about being biological organisms (the aliens eat, reproduce, &c.), but I think the interesting version of this question is about information-processing behavior, and the big surprise of the deep learning revolution is that a lot of that seems to be “data-dependent” rather than “architecture-dependent” to a greater degree than one might have guessed. (Scare quotes because that formulation is kind of mind-projection-fallacious as stated; the real claim is that you can recover algorithms from induction on the data.)
Like, if I don’t believe that, it’s hard to make sense of why RLAIF schemes like Constitutional AI (where the preference ratings come from a language model’s interpretation of text, rather than a reward model trained on human judgements) can work at all. It’s an alien rating another alien!
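For concreteness, here’s a toy sketch of the general RLAIF shape I have in mind. The `model_judgment` stub stands in for an actual LLM call; this is an illustration of the idea, not a description of Anthropic’s actual pipeline:

```python
def model_judgment(principle: str, prompt: str, a: str, b: str) -> str:
    # Placeholder for an LLM call: ask the model which response better
    # satisfies the principle, and parse "A" or "B" out of its answer.
    return "A"

def label_preference(principle: str, prompt: str, a: str, b: str) -> dict:
    # The preference label comes from a language model's reading of the
    # principle text, not from a reward model fit to human ratings.
    winner = model_judgment(principle, prompt, a, b)
    chosen, rejected = (a, b) if winner == "A" else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = label_preference(
    "Choose the response that is more honest and less evasive.",
    "Did you make an arithmetic error above?",
    "Yes, I did; the correct total is 42.",
    "Let's not dwell on that.",
)
```

The resulting chosen/rejected pairs are what you’d feed into the usual preference-optimization step.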
Aren’t LLMs actually extremely superhuman at translation and interpretation tasks, even for languages with few or no samples in training?
That’s not my understanding; do you have a cite I should look at? (On a quick search, Tanzer et al. 2024 is claiming impressive but still subhuman results from fine-tuning on a single grammar book, but Aycock et al. 2025 are skeptical of their interpretation.)
There are some really impressive results on translation without parallel data, but that works by aligning the latent spaces, definitely not “few or no samples in training”.
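Here’s a toy sketch of the latent-space-alignment idea, using orthogonal Procrustes on synthetic embeddings with a small seed dictionary for simplicity (the fully unsupervised methods learn the mapping adversarially rather than from a seed; all the names and numbers are made up for illustration):

```python
import numpy as np

# Toy embedding matrices: rows are word vectors in each language's latent
# space. In the real setting these come from monolingual corpora only.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 50))                      # "English" embeddings
rotation = np.linalg.qr(rng.normal(size=(50, 50)))[0]
tgt = src @ rotation                                   # "Klingon" embeddings: same geometry, rotated

# Orthogonal Procrustes on a small seed dictionary (the first 20 word pairs).
X, Y = src[:20], tgt[:20]
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Translate by nearest neighbor in the aligned space.
query = src[123] @ W
prediction = np.argmin(np.linalg.norm(tgt - query, axis=1))
print(prediction)  # 123, if the alignment recovered the rotation
```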
I’d still bet that an LLM trained on a corpus of an alien civilization’s text would have more in common with Earth LLMs, [...] behaviorally [...] than Earth LLMs have in common with humans or alien LLMs have in common with aliens.
I think it would help to be more specific about what behaviors you have in mind. The repetition trap that base models get stuck in sometimes seems like a good example. What else?
Given that the whole thing we’re doing with deep learning is approximating data-generating functions, I do think the data is hugely determinative for “most” behavior. The phenomenon of data-dependent generalization—e.g., the fact that you can fit a network to randomized labels and get a model that doesn’t generalize, in contrast to how useful classifiers do generalize—suggests that the algorithm learned during training depends on the data. (That is, it’s not that “transformers” are a fixed sequence-predicting algorithm that can do English and Klingon; rather, the algorithms that transformers learn from being trained to predict English are different from the algorithms they learn from being trained to predict Klingon.)
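If you want to see the randomized-labels phenomenon directly, here’s a small sketch (scikit-learn digits; the same architecture and optimizer in both runs, with only the labels differing):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_random = rng.permutation(y_tr)  # destroy the input-label relationship

for name, labels in [("true labels", y_tr), ("random labels", y_random)]:
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=2000, random_state=0)
    clf.fit(X_tr, labels)
    # Same network both times; whether the learned function generalizes
    # to the held-out set depends on the data it was fit to.
    print(name, "train:", clf.score(X_tr, labels), "test:", clf.score(X_te, y_te))
```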
I’m happy to answer it in PM
(PM’d.)
if you experience my comments as hard to face, then that becomes my problem too. Because I want to have this discussion with you
Yes, that makes sense. I often do some of this, too.
Sometimes there’s a recursive problem that I haven’t figured out how to deal with, when someone is demanding narcissistic ego supply as a precondition for talking, and I don’t see any way to comply while still making progress in the conversation, because the specific thing I want to talk about is how it’s bad to demand narcissistic ego supply as a precondition for talking.
But there’s an implicit “I shouldn’t address it head on”
My strategy has mostly been to address it in a meta-discourse post like this one, or in my memoir sequence. The reason to stick to the object level in the moment is that I anticipate that addressing it head-on would just immediately deadlock. There’s nowhere to recover from “You’re being unfair”, “No, you’re being unfair”.
Why aren’t we part of the Singleton, then?
someone who exploits the other side.
Examples? Who is exploiting the other side?
Do you experience it as genuinely effortless regardless of the type of criticism you receive?
No, it is not effortless! For example, I was late to respond to your top-level comment because I was too shy to check the comments for 36 hours after posting.
Or do you feel like you are doing something that requires actively holding yourself to high standards because the standards are important?
Yes, and in particular the relevant standard isn’t about not having emotions; it’s about not making my emotions other people’s problem. When I’m not feeling up to receiving criticism, I often do things like avoid checking the comments section for 36 hours until I eventually force myself to look. (It always feels better after I look, but looking never gets any easier.)
But that’s my problem, not a lever to control what other people are allowed to say to me or about me. Emotions are information. If people say things to me or about me that make me feel bad, then maybe I should feel bad.
Does LW live up to this standard as well as you might hope, or do you notice bits here and there where the market for ideas is distorted by other considerations?
The distortions I’m most concerned about are the ones discussed in the post. That’s why I wrote the post. (All Goofusia’s parts are abstracted from real-life conversations I’ve had within the past five weeks.)
I’m sure there are other distortions that I’m not seeing. That’s why a culture of vigorous, unfettered discussion is important, so other people can point out the things I can’t see myself.
As an example of a shape-based conceptual explanation of misalignment that I endorse, see the section on “Models Often Get Good Performance in Unexpected Ways” in Ajeya Cotra’s “Why AI Alignment Could Be Hard with Modern Deep Learning”. (I regret that I couldn’t find the link the other month when this thread was alive, or I would have used it then.) It’s good because Cotra is giving a specific example of a way that neural networks might generalize unexpectedly (by color when you might have expected shape) and links to the relevant research; it’s not just a handwavy metaphor that the reader is supposed to accept on her authority.
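If a toy version helps make the point concrete, here’s a sketch of the kind of experiment Cotra is pointing at, in which two features (think “shape” and “color”) are perfectly correlated in the training data, and a probe set where they disagree reveals which feature the model actually latched onto. (The features and numbers are made up for illustration.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
shape = rng.integers(0, 2, n)      # the feature we *intend* the label to track
color = shape.copy()               # spuriously correlated with shape during training
X_train = np.c_[shape + 0.1 * rng.normal(size=n),    # "shape" cue, noisier
                color + 0.01 * rng.normal(size=n)]   # "color" cue, cleaner
clf = LogisticRegression(max_iter=1000).fit(X_train, shape)

# Probe set: shape and color now disagree. Scoring against the intended
# (shape) labels tells you which cue the model actually learned.
shape_p = rng.integers(0, 2, 200)
color_p = 1 - shape_p
X_probe = np.c_[shape_p + 0.1 * rng.normal(size=200),
                color_p + 0.01 * rng.normal(size=200)]
print(clf.score(X_probe, shape_p))  # near 0 if the model generalized by "color"
```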