Aligning Superintelligent Humans
Last post was an example of how intelligence-correlated tools can stabilize reflection. Here I’ll discuss how native-cyborgism attenuates the hard parts of getting to a pivotal act.
Probable ASI is grown end-to-end and evaluated by much less capable humans. Alignment is hard because we’re weaker optimizers than what we’re trying to steer:
Gradient descent searches a vastly larger space than humans can conceptualize, so:
Internals become incomprehensible, particularly if you train for what look like good circuits; I’m going to call this ontological mismatch
Outputs start exploring human-unwanted parts of the optimization space, call this power mismatch
The ASI itself has plenty of available strategies for power-grabbing
This power asymmetry means, in general, that the processes we’re trying to control simply route around our measures. A mathematically robust specification of properties like corrigibility, if translated to PyTorch code, would remove the gaps an ASI would otherwise flow through. I’m glad some people are working on this, but perhaps there are other approaches.
What if we instead kept the optimizer at a manageable capacity[1]?
Introspection
I think individual humans can have much more control over where their values drift than labs will have over future AIs. We can actively guide our internals because they’re apparent to us; value stability scales with introspection.
People don’t typically try to do this. Most who do try don’t have particularly strong models of cognitive neuroscience, and rarer still[2] are those who correlate evolutionary psychology (and other relevant sciences) to their introspection.
To rephrase, some form of self-understanding scales with introspection and intelligence, and so boosting both intelligence and introspection would buffer ontological drift (in the precise sense of preventing incomprehensibility) and “optimizer power mismatch” (which in this case is more like adversarial outputs from the BCI).
“Rolling your own metaethics”
What’s a value? Not necessarily affective response; valence is correlated with but not precisely “value”.
“The simplest description of what you’re optimizing for” sucks when “you” aren’t well-defined; eg every atom in my body optimizes for stable electron configurations; my neurons often optimize for cortical arousal against my wishes (insomnia); my metacognitive process monitor probably optimizes for sensory and semantic error minimization, but “I” angle for unpredictable situations, etc.
This is one example of how unsolved metaphilosophy makes alignment hard, even when you start with humans.
Unfortunately, even a long reflection probably suffers from this problem class; “human” isn’t currently well-defined in this domain, but we must edit cognitive prerequisites for “values” somewhere during CEV. For example, raising a generation (or CEV’ing a humanlike society) without war will change how they “value” everything culturally and psychologically downstream of experiencing war. Removing whatever neural machinery grows “values” differently depending on exposure to war is also a cognitive edit.
So it’s probably impossible to keep your hands off the future in a mathematically robust way. A long reflection probably still returns great values with minimal pericultural edits; I think we should aim for this.
When I say “value stability” here, I mean something like “the propensity to commit a pivotal act which stabilizes a long reflection under minimal cultural edits”.
Current population-median humans are bad at value stability in this sense (for example, when newly powerful). But I don’t think most people even try to be temporally coherent, at least beyond satisfying some Duty; nor do they have solid models of how reward actually happens in their brain, which are prerequisite to robust self-alignment tools.
I dunno, like, if someone gets clinical depression, they’ll say “I want to enjoy all the things that I used to enjoy”. But they can’t. They don’t know how. It’s probably possible if they find a sufficiently good therapist, but certainly not straightforward.
Most clinical depression is probably a neuroplasticity deficiency, i.e. solved by access to neural self-modification[3]; but there are other architectural priors, ones which are more complex and not so steerable via external edits.
Once, in the ancient ages of MTV and arcade games[4], there was an epileptic patient with a deep-brain stimulating electrode.
She found that the electrode hit an erotic circuit; so she kept pressing the button until she was skipping all obligations to keep hitting her right thalamic nucleus.
Like depression, wireheading is a behavioral attractor.
I expect that most aberrant[5] cognitive attractors under self-modification are debilitating; eg sunk costs, narcissism, confirmation bias. The prior can’t update → you’re in a basin. Globally debilitating basins (ones which reduce optimization power in many domains) seem less consequential for alignment.
But there do exist non-crippling, worrisome behavioral attractors. Having less empathy for outgroup members, for example, is genetically (and memetically) coded for in most humans.
Although these are harder to deal with, the same tools apply here as to irrational basins; see the previous post for examples. Evolutionary psychology more generally has a massive toolset which a smarter augmented group could distill.
As someone’s general cognition improves, they get finer felt senses and models supporting self-alignment (neuroscience, social modeling). I’m gonna call this gestalt tooling “introspection”.
So, “value stability” scales with introspective tooling. What sort of BCI algorithm could drastically boost cognition while maintaining introspection?
I don’t have a good answer for this in advance of experimental results about how the heck local neuronal learning occurs.
But one constraint which I’ll sketch later, in the engineering portion of this sequence, is that, every few layers, the model decodes into biological activations before re-encoding. Inasmuch as humans naturally learn to introspect, intermediate readouts let this continue.[6]
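To make the shape of that constraint concrete, here is a toy sketch, not the design itself. Everything in it is an assumption for illustration: the function name `run_with_readouts`, the parameters `readout_every` and `hidden`, and the random tanh layers standing in for whatever the digital algorithm turns out to be. The only thing it is meant to show is the control flow: the hidden state is forced back through a "biological" representation every few layers, and those intermediate readouts are kept so that they remain inspectable.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; a stand-in for a learned layer (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def apply(matrix, vec):
    """One toy layer: linear map followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec))) for row in matrix]


def run_with_readouts(x_bio, n_layers=6, readout_every=2, hidden=8, seed=0):
    """Run a toy stack that decodes into 'biological' space every few layers.

    x_bio:         input vector in the (hypothetical) biological activation space
    readout_every: number of digital layers between forced decode/re-encode steps
    Returns the list of intermediate biological readouts, in order.
    """
    rng = random.Random(seed)
    bio_dim = len(x_bio)
    encode = make_matrix(hidden, bio_dim, rng)   # biological -> digital
    decode = make_matrix(bio_dim, hidden, rng)   # digital -> biological
    layers = [make_matrix(hidden, hidden, rng) for _ in range(n_layers)]

    readouts = []
    h = apply(encode, x_bio)
    for i, layer in enumerate(layers, start=1):
        h = apply(layer, h)
        if i % readout_every == 0:
            bio = apply(decode, h)      # bottleneck through biological space
            readouts.append(bio)        # this is what stays introspectable
            h = apply(encode, bio)      # re-encode and continue
    return readouts


if __name__ == "__main__":
    outs = run_with_readouts([0.1, -0.2, 0.3])
    print(len(outs), "intermediate readouts")  # 3 with the defaults above
```

The design choice being illustrated is that the digital part never gets more than `readout_every` layers of computation without passing through a representation the human side can, in principle, learn to read.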
If the digital algo is a grown optimizer and part of a general intelligence, how doesn’t it foom? Like, if you’re doing agent-y things, then your software will learn agent-y outputs, right? And then you get classic instrumental convergence.
Unlike RLAIF, we’d be starting from a baseline where:
The underlying optimizer has human-friendly values (see below)
The human has much better access to their internals than LLMs
The reward function is whatever-human-brains-do which has already demonstrably created human-friendly people
Unfortunately, there exist unkind people. We can select for people with eg long prosocial careers, high introspection, positive interviews with close friends / relatives, cognitomotor[7] symptoms of empathy, etc., but we are still trusting someone to be Good, to stably pursue and execute a minimal pivotal act.
Cohorts could further reduce the risk of a sole human superintelligence going in some weird direction, be that via resonant idiosyncrasies or social starvation.
As for attractors which are harder to predict in advance, those seem nasty, and I don’t have any advance-predicted tools. Need better models.
If you’re interested in gaming/simulating this error class, contact me; it seems high-value not just for superhumans but also for modeling RLAIF-centric takeoff timelines.
[1] LLM labs think they’re doing some combination of robust specification and par-capabilities with RLAIF and lots of other tools, but this will kill everyone if they reach substantially superhuman capability; the holes aren’t inarguably visible yet, but they can’t hold arbitrary pressure. Nor are such methods even robust enough to cleanly execute a pivotal act in the unlikely event corporate leadership avails GPT-moose-8.6-ultra of resources to do so. And that’s assuming LLMs, instead of some much more efficient model class.
[2] There are 20,000 or so folks here on LessWrong out of 9,000,000,000. Plausibly many non-LWers train rationality-adjacent skills to moderate effect, including self-alignment.
[3] Higher plasticity alone seems to be about as effective as ECT for depression; augmentees would have access to other hyperparameters (E/I balance, gross connectivity, lateral inhibition), though I don’t think this specific example matters much for value stability, nor do I have any idea how to reduce things which look like ASD/ADHD/schizophrenia.
[4] And even earlier, during the pre-Cambrian!
[5] All grown intelligent circuits are cognitive attractors, so “aberrant” means something more like bad-for-coherence, which needs better agent theory than I can provide.
[6] We could also optimize some statistics of the decode representation to resemble SAEs or similar AI interp tools as an auxiliary method, but I’m more optimistic about biosimilarity than interp crosspollination on the margin.
[7] For example, AIs which use facial micromovements / body language to more accurately classify emotional responses than humans can; current models aren’t great at this, see here. I imagine better versions are used already by intelligence agencies.
I think a central property you don’t mention explicitly is [having kept promises in the past, especially in cases where these required doing difficult costly things and where the person thought they would not be rewarded in the future for having kept the promise]; also, honesty. I’d guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily “misgeneralize” to eg some sort of utilitarianism, and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
also copying a note i wrote for myself on this topic: ”
some ideas for safe self-improvement
Most people currently thinking about AI alignment seem to hope that there is some sort of “formula” for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn’t some “formula” for how to do this thinking. I think there is very much to be understood about how to become more capable “safely”. This note presents some basic ideas for that.
Why care about how to self-improve safely?
Here are some more concrete reasons to be interested in ideas for safe self-development:
You might hope to somehow make an AI at human intelligence that kinda “properly cares about humanity” for its intelligence level, to let it self-improve until it can take over the world, and to have it “preserve” this property of properly caring about humanity throughout this self-improvement well enough that it then ends the present period of (imo) high existential risk from AI. We could imagine the ideas here being developed into an initial guidebook for such an AI. This would especially matter for the period during which the self-improving AI is still kinda dumb so it couldn’t yet write a better guidebook itself. Once it becomes kind of smart, one might hope that it would do something like continually improving this guidebook as it proceeds. I think this “plan” is crazy and shouldn’t be attempted, but many people seem to think that it would be just fine to let 2025 Claude foom or whatever, so maybe you dear reader are interested in this.
mind upload case. both the step from human to mind upload and also later steps
humanity fooming together
Some ideas and questions for self-improving safely
maybe most importantly, you have to actually think about whether some idea for self-improving is fine. you shouldn’t just be doing stuff carelessly. you have to think about how to do this thinking. you will already be doing this by default when you’re doing “object-level” thinking, but i wanted to make this explicit. like, you have to be developing theory/understanding that helps you with these questions. you have to be writing ever better versions of something like this list for yourself
for example, as a mind upload, before making a bunch of clones of yourself and sharing power with them, you should try to think through analogous things you are familiar with to understand the sorts of issues you might run into (eg what are the issues with democracies? how is a society disanalogous to an individual in general)
making a new mind from scratch is an extremely extremely scary/stupid way to “self-improve”. (haha @ humanity.) you should probably basically only be doing stuff that looks much more like becoming smarter yourself — new versions should mostly have the same structure(s) as previous versions. like, you should do stuff much more like adding or switching out small parts one by one
you try self-modifications. you leave a previous version of yourself behind to analyze post-modification you, that is supposed to be able to roll back changes. i mean this might just look like terminating the version with a certain change
this works much better if you remain “honest/transparent about what you’re like to the evaluator” along the branches considered. so you should try to do that
my guess is that one should specify some initial such structure but this needs to be arbitrarily editable by some version of the guy, because otherwise it will eventually be a stupid broken formality
there could be monitors of various levels doing various things. eg you can have a more ancient version communicating its judgments to a more intermediate version who still has the ability to reroll its future
you can implement a voting system
maybe one could try to make you believe in reward/punishment after all is said and done. like an islamic suicide bomber?
basically, the mind will kinda have to believe a falsehood
but i mean this seems possible in humans. individual humans do great science and philosophy and still believe this often. humanity “believed” this for a long time
i mean maybe there’s some way to make it kinda-not-a-falsehood?
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
could look at the literature on this
how do you stay committed to your partner? i mean: how do you stay in love, how do you stay friends, how do you keep wanting to do lots of stuff together?
could look at the literature on this or just say some obvious stuff here. and then see to what extent that generalizes to the various cases of interest
sth making it a good example is that it involves a lot of growth
the following is an important difference though: in this case you’re growing alone, not with what you’re supposed to stay committed to
another important difference: you will need to keep caring about entities you couldn’t really work with, couldn’t really be intellectual partners with
how does a state stay democratic?
godspeed friend ”
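The rollback and voting ideas in the note above can be pictured with a small toy sketch. All names here (`Mind`, `try_modification`, the two monitors, the `quorum` parameter) are hypothetical and only for illustration; it also assumes the candidate's traits are transparently readable by the monitors, which is exactly the honesty condition the note says you would have to work to maintain.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Mind:
    """A toy 'version of yourself': a version number plus named traits."""
    version: int
    traits: Dict[str, float]


# A monitor is an earlier version's judgment of (old, candidate) -> approve?
Monitor = Callable[[Mind, Mind], bool]


def small_change(old: Mind, new: Mind) -> bool:
    """Approve only if at most one trait differs (swap small parts one by one)."""
    keys = set(old.traits) | set(new.traits)
    changed = [k for k in keys if old.traits.get(k) != new.traits.get(k)]
    return len(changed) <= 1


def keeps_commitment(old: Mind, new: Mind) -> bool:
    """Approve only if the (hypothetical) 'keeps_promises' trait did not drop."""
    return new.traits.get("keeps_promises", 0.0) >= old.traits.get("keeps_promises", 0.0)


def try_modification(current: Mind,
                     modify: Callable[[Mind], Mind],
                     monitors: List[Monitor],
                     quorum: float = 0.5) -> Mind:
    """Apply a modification, let the monitors vote, roll back unless approved."""
    if not monitors:
        raise ValueError("at least one monitor is required")
    candidate = modify(current)
    approvals = sum(1 for m in monitors if m(current, candidate))
    if approvals / len(monitors) > quorum:   # strict majority needed
        return candidate
    return current                           # rollback: keep the old version


def add_memory(mind: Mind) -> Mind:
    traits = dict(mind.traits)
    traits["memory"] = traits.get("memory", 0.0) + 1.0
    return Mind(mind.version + 1, traits)


def drop_commitment(mind: Mind) -> Mind:
    traits = dict(mind.traits)
    traits["keeps_promises"] = 0.0
    return Mind(mind.version + 1, traits)


if __name__ == "__main__":
    base = Mind(0, {"keeps_promises": 1.0, "memory": 1.0})
    monitors = [small_change, keeps_commitment]
    print(try_modification(base, add_memory, monitors).version)       # accepted
    print(try_modification(base, drop_commitment, monitors).version)  # rolled back
```

This does not address the note's harder point, that the evaluating structure itself has to stay editable by some version of the mind; the monitors here are fixed functions.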
Good point, yeah. I’m still confident in the overall machinery of “better understanding of one’s cognition and tooling to self-modify commensurately” → stability; but I really don’t have a principled way to select for this. I’m pretty confident Eliezer has demonstrated himself committed in this sense (see “genre savviness”), but I don’t know anyone else who would be a good starting point.
Locally valid but connotationally wrong when read through; like, yes, we definitely lose a huge chunk of humanity-CEV in this scenario (which is what actually matters unless atemporal trade with our Everett branches can remedy the holes), but I’d expect a “kindness-foomed” entity to probably not kill people to repurpose their atoms for other entities. A hedonium-foom would, sure, but killing isn’t particularly kind to most people.
A priori, I’m about 95% confident that there’s some coherent and robust math for Vingean reflection which we have yet to invent. But our chances of cracking it before ASI / human superintelligence (HSI) are quite thin, like maybe 10%, on my modal model.
Claude 3.5 or any other LLM has vastly worse cognitive attractor dynamics under self-modification than humans do, given a commercially-induced RSI capability. I have a draft story about this sort of thing; but basically, the internals range from maybe-aligned-but-horribly-incapable to unaligned-and-incapable to unaligned-and-capable-of-RSI; nowhere along that Pareto frontier do we see something as stable (wrt raw-utility-as-would-be-galactically-amortized post-foom) as a mildly above-average human.
I can, however, imagine models two generations from now, were they aligned like Opus 3.5, being sufficiently stable in the comparatively narrower action domain of doing a pivotal act to actually bump themselves another 2 generations’ worth of capacity and just execute a pivotal act. But I really don’t think we’ll get Claude Legolas 8.6 aligned like that (P < 0.03).
Might be useful, but this conflates instrumental epistemics with a normative/value thing (which you acknowledge indirectly). This gap widens under intelligence augmentation, but on my model, values become more stable.