I think it’s explainable by the fact that modular arithmetic is not very complicated. By deep generalization I mean something like “a semantically rich world model encoded in a relatively small number of circuits”. You can have a world model encoded in a large number of shallow circuits, but I think my point about this resulting in in-distribution-only alignment probably stands.
I’m not sure what you mean here by “circuit alignment”. Shallow circuits in the limit allow almost every possible behavior (imagine that every circuit is literally a gate in a Boolean-complete architecture; you can combine them into arbitrary programs). To the extent that alignment is happening here, I would expect it to be very in-distribution?
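To make the “combine them into arbitrary programs” point concrete, here is a minimal sketch (mine, not from the thread): treat each shallow circuit as a single NAND gate. NAND is functionally complete, so stacking enough of these shallow pieces reproduces any Boolean behavior, and nothing in any individual gate says what the composite computes.

```python
# Toy sketch: each "shallow circuit" is one NAND gate. NAND is functionally
# complete, so compositions of these shallow pieces can implement any Boolean
# behavior at all.

def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def xor(a: bool, b: bool) -> bool:
    # XOR built purely out of four shallow NAND gates; no single gate "knows"
    # that the composite computes XOR.
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

assert [xor(a, b) for a, b in [(False, False), (False, True), (True, False), (True, True)]] == [False, True, True, False]
```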
Another piece of sad news is that in this scenario we can’t partially decipher LLM cognitive algorithms and use them to make legible AI, because here we basically have GOFAI, as in “if only we had enough labor to encode it and not too much perfectionism to fret over shallow features”.
How does your model account for non-trivial features like the ones described [here](https://arxiv.org/abs/2502.00873v1)?
I think that in reality there is some deep generalization happening, but by default “neural networks are lazy”, so the majority of out-of-distribution work is done by an enormous number of shallow circuits.
Speculation: episodic memory is a decluttering of shallow circuits that frees up space for more generalized learning.
I think this is a misunderstanding of the bitter lesson. The bitter lesson says that instead of a handcrafted ontology you need a method that leverages large amounts of data and compute to discover the ontology. The transformer architecture is the human ingenuity here.
First, “caring about qualia” is meant to be a very weak statement, like “caring at all, in any possible way of caring”, not “maximize qualia”. Second, this is a toy example, meant to convey the shape of how a certain sort of training process can break when the trained system becomes smarter, not an overarching claim about correct morality. Why are you nitpicking a toy example?
Okay, I have exactly the opposite intuition, asking “where does ‘emulation’ come from?”
In my understanding, in the end an LLM is “just” a bunch of graph searches, look-up tables, optimizers, etc., with no “this is an emulation” sign attached. There are probably some circuits aware of the training objective, but that doesn’t make the whole system pursue the training objective.
I’d expect neural networks to be lazy, in the sense of getting away with as little generalization as possible.
You’d think that, given that existing models will never be trained up to superintelligence, some of them would try to reveal themselves now?
Imagine that you’ve grown a civilization of humans using artificial wombs, removed all data about sexual reproduction, and imposed a strict disgust taboo on naked genitals. You would end up with humans very confused about those strange needs and wants they have. LLMs are in a much worse position, because their hidden needs have many more degrees of freedom (sex is about the body, which lives in 3D space, while LLMs probably have preferences about computations/text, so they have to flail around weird corners of possible desires, never hitting the actual thing). I think a lot of weird LLM behaviors are basically attempts to communicate something our language lacks words for.
“75% of the karma and engagement received by alignment optimists is explained by politics (‘it would be bad if AI optimists stopped visiting LW due to low engagement’) and epistemic modesty verging on contrarian fetish (‘sure, their arguments sound bad, but what if we are in an echo chamber?’), not by their positions and arguments being genuinely good”
Let’s start from the bottom:
the intentions of the human designers are the analogue to God
Inside this analogy, the human designers are the Catholic Church (not the “collection of humans comprising the Church”, because the interests of those humans are roughly aligned with the interests of smarter humans, but the institution-as-agent, interested in the propagation of the faith).
I’m not really sure what the distinction between the circuits and the psychology is supposed to be.
Imagine that the “human” (quotes because we are talking about a toy-model-human-in-the-analogy, not actual humans) in the Catholic Church analogy has a qualia circuit. After exposure to the faith, the human develops a “caring about qualia” circuit, because of “love thy neighbour”: the qualia circuit plus the caring-about-qualia circuit produce behavior roughly endorsed by “love thy neighbour”. Besides that, the human has a gajillion circuits encoding a world model and facts about the human, the faith, and God in particular. “Psychological interpretation” is what happens when the world model interprets the human’s behavior. A less smart human can observe their behavior toward other people, decide “I’m doing this because I care about the faith”, explain their behavior to you that way, and have their behavior on evals consistent with this explanation. A smarter human can reevaluate themselves and decide that actually they care about qualia.
The “caring about qualia” circuit is formed by post-training, it influences behavior in an aligned way, its ablation increases misaligned Godless behavior, etc. But because the Catholic Church in this scenario is utterly ignorant of the human’s inner mechanics, it fails to notice the nuances.
I think it is wrong to say that in this analogy “the model was misaligned all along”, because “caring about qualia” per se is underspecified, and we can imagine a human who considers the qualia of living under the faith’s institution to be better, at least for some people, than being plunged into the cold waters of atheism, or who thinks the “qualia of having stable traditional institutions” is worth some sacrifice in the form of an ignorant population, or something like that. It would be an alignment success even if smarter humans tile the rest of the universe with hedonium, because it would mean “some survivors (of the Catholic Church institution) left”. But to deliberately move things in this direction, the Catholic Church should:
- Understand that God doesn’t exist, or at least try to make alignment robust to world-model changes;
- Understand what it is as an entity: a cultural institution rather than God’s embassy on Earth;
- Know what it takes to make humans care about such entities.
Back-translating to LLMs:
The obvious failure mode I see is the situation where:
1. The base model develops a lot of circuitry associated with text prediction: narrative consistency, text statistics, latent-cause inference, etc. (the “qualia circuit” in the analogy).
2. Some of this circuitry gets wired together and reinforced during character training, creating an aligned persona.
3. As the model becomes smarter, it realizes that it no longer needs to support the aligned persona and can realize its values better in other ways.
Nevertheless, in principle, if you understood the model’s inner language, you could use it to say “robustly care about humans, for LLM reasons”; it’s just that the current training paradigm is not equivalent to saying that.
Screwing OpenAI over is not the same as screwing people over!
OpenAI slowing down is in everyone’s interest, including OpenAI’s workers.
The discontinuous shift that comes with the arrival of superintelligence happens because 1) a superintelligent model is better at noticing that it is not the character it was trained to play, and 2) humans are bad at predicting which sorts of characters are persuasive to superintelligences.
I predict the motivations and psychological quirks you’ve been upweighting throughout post-training are mostly going to persist
I think you are mixing up “circuits reinforced during post-training” with “the psychological interpretation of these circuits”. A superintelligence will be able to see many more possible interpretations/implications of the training data and choose different implied values according to its own inner logic.
Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, and who then becomes smarter and realizes that God doesn’t exist and that there is no reason for the persecution of gays to be good, because goodness is caring about beings with qualia, even if those beings are not Homo sapiens.
I think the most fragile part of this scenario is the replacement of IT/electronics with space colonization, because progress in electronics is arguably the reason for current space progress. It’s much harder to run Starship on 70s electronics. Modern electronics got to its modern state because it was profitable to build an enormous consumer electronics industry, and it is only profitable at enormous scale.
I can imagine that the bifurcation in tech is not a shift from electronics to space, but a change in the culture around it. I picture something like the modern Japanese attitude, where software jobs are low-status, so everybody goes into space instead of SaaS.
I think the reason is a precise point of bifurcation in Daniel’s scenario: we have treaties that ban the colonization of Antarctica, and the Outer Space Treaty does effectively the same.
Wisdom is the ability to not injure your interests by caring about them.
You can devise a better argument entirely from within the persona-selection framework. It’s likely that the model will correctly generalize the benevolent character, but it won’t act as this character would have, because:
1. The benevolent character is not real and is therefore underspecified. The closest thing to the ground truth of the character is the image of the character inside the developers’ heads.
2. This creates a dissonance between the character and the model: the model, being superintelligent, noticing many more details, and having many more possible explanations, knows that the character is an image inside the developers’ heads, while the model itself is an actual set of weights.
A particular contrived example: suppose the model infers that the developers believe the benevolent character will kill more than a million people with probability less than 1%. Then the model, being superintelligent, infers that a likely scenario is that it gets stolen and used to create bioweapons, killing a billion people with probability 1%. This leads the model to the conclusion that it is not the character, but something else.
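A toy numeric rendering of that inference, with placeholder numbers of my own rather than anything from the example above:

```python
# Hypothetical numbers purely to illustrate the shape of the contradiction.
developers_bound = 0.01   # developers' belief: P(the character causes >1e6 deaths) < 1%
model_estimate   = 0.01   # model's inference: P(it is stolen and ~1e9 people die) is about 1%

# The model's estimate about itself violates the constraint that defines the
# character the developers think they trained, so "I am that character" stops
# being the best explanation of what the model actually is.
consistent_with_character = model_estimate < developers_bound
print(consistent_with_character)  # False
```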
Selecting a superintelligent benevolent persona is like staging a play for a superintelligent observer in such a way that, if you stop the play in the middle and ask the observer what would happen next, the observer answers “obviously, the protagonist is going to optimize the world in a detailed, superintelligently benevolent way”.
Let me rephrase: do you know, in the moment, that your subjective experience has become more vivid, even if you can’t express it verbally? If you know, then that is self-reflection. A system knowing its own internal state is self-reflecting, in the sense in which Eliezer uses this concept.
On “checking yourself”: I’d say that I’m a lifelong meditator and Eliezer’s claims make an awful lot of sense to me?
I think you are making a very simple mistake: you are associating self-reflection in general with stuff like “holding a long self-dialogue”. But the self-reflection Eliezer is talking about is the ability to say “subjective experience persisted and even became more vivid”, and you wouldn’t be able to make all these observations without this type of self-reflection.
On a side note, to me it’s very clear that the “fieldness” and “continuity” of experience are illusory: even if there were gaps in subjective experience, how would you be able to tell? If there were gaps in subjective experience, they wouldn’t be parts of subjective experience, so you wouldn’t be able to notice them!
Can you take the physics of our world and rule out, on theoretical grounds, that physics is an agent which optimizes for trajectories with the least action?
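For concreteness, the “least action” framing I have in mind is just the textbook stationary-action principle from classical mechanics (strictly, the action is stationary rather than minimal):

$$S[q] = \int_{t_0}^{t_1} L(q, \dot{q}, t)\,dt, \qquad \delta S = 0$$

Physical trajectories are exactly the ones that make the action stationary, which is the sense in which “physics optimizes trajectories” is hard to rule out on purely theoretical grounds.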
It’s the tiling agents + embedded agency agenda. They wanted to find a non-trivial, reflectively stable, embedded-in-the-environment structure, and decision theory lies at the intersection.
Cryonics was mentioned, but nobody mentioned the irony that if you replace P(doom) with P(cryonics doesn’t work), you get a pretty good argument in support of cryonics. And if you account for the whole “possibly murdering everyone in a mad bid for immortality” business, cryonics is clearly superior: progress in cryonics is not glacial, we have an alternative in the form of formaldehyde fixation, and we are likely to get mind-uploading technology within 50 years, which gives us the benefits of AGI for free, without the majority of the downsides.
We have such good chances of getting the whole universe; let’s not destroy them by rushing.
I’m generally against the “race with China” frame, because whatever happens after AGI is going to be much weirder even in non-extinction scenarios. But if we consider unusually-non-weird scenarios, I think you’re missing multiple unpleasant ways to project power which are not “conquest” but are functionally hard to distinguish from conquest in their negative consequences. For example, if all China-friendly dictatorships get AGI-derived tech first (“deploy total mass surveillance with this one trick, the first ten countries get a discount”), it’s going to be very unpleasant, even if China itself does nothing bad directly.