Epistemic status: the other thing that keeps me up at night
TL;DR: Even if we solve Alignment, we could well still lose everything.
There’s an AI-related existential risk I don’t see discussed much on LessWrong. In fact, it’s so little discussed that it doesn’t even have a good name yet, which is why I’m calling it
However, assume for a moment that our worst Artificial Super-Intelligence (ASI) fears don’t come to pass, and that we somehow pull off aligning superintelligence: what do you expect to happen then?
Most people’s default answer seems to be ‘Utopia’: post-scarcity techno-paradise-on-Earth, starting with something resembling Machines of Loving Grace and quickly getting progressively more science-fiction-utopian from there, heading in the approximate direction of post-Singularity SF such as Iain M. Banks's Culture novels. This makes a lot of sense as long as you assume two things:
Human nature and human values don’t change much, so what we want remains similar, and
Superintelligence will be a lot better at getting us what we want than we are currently
What worries me here (if we get past simple
Currently, human values have a genetic component, which is pretty uniform and constant (other than 2%–4% of us being sociopaths), and a cultural component overlaid on that (plus some personal reflection and self-improvement), which is quite variable across cultures and varies slowly over time. For several centuries, at least since the Enlightenment (and arguably for millennia), the latter has internationally been moving predictably in a pretty specific direction[1] (towards larger moral circles, more rationality, more equality, and less religion, for example) as our society has become more technological, scientific, and internationally cross-linked by trade. This ongoing cultural change in human values has been an adaptive and useful response to real changes in our societal and economic circumstances: you can’t run a technological society on feudalism.
However, consider the combination of:
Increasingly persuasive, individually tailored social/educational/motivational media and conversational influences crafted by superintelligences. (Note that this one kicks in as soon as we have ASI.)
ASI genetic engineering, biochemistry, and drug design (think Ozempic, and a reverse, for every human need and drive).
The rapid advances in neurobiology, biochemistry, and understanding and control of the human brain we will inevitably make once we have ASI and understand enough about its neural net to know how to align it. Somewhere inside your head, there is a small set of nearby neurons that, when firing together, represent the concept of the Golden Gate Bridge, and also another set for Jennifer Aniston. With a little wiring we could turn you into a fan of either of them.
Cyborging, of various forms, whether that involves actual machine-to-neuron interfaces,[2] or just well-optimized audiovisual virtual reality or augmented-reality ones.
Technologies I haven’t even thought of: say, biocompatible nanotech, or memetic-ecosystem engineering.
I think any assumption that human nature or human values are fairly fixed and can evolve only a little, slowly, through cultural evolution responding to shifts in social circumstances is going to be, within at most a few decades after we get ASI, pretty much completely false. We will, soon after ASI, have the technology to dramatically change what humans want, if we want to. Some of these technologies only affect the current generation and the development of our culture, but some, like genetic engineering, produce permanent changes with no inherent tendency for things to later return to the way they were.
So, we could get rid of sociopathy, of our ability to dehumanize outsiders and enemies, of the tendency towards having moral circles the
Thus, we have a society containing humans, and ASI aligned to human values. The ASI are aligned, so they want whatever the humans want. Presumably they are using superintelligent Value Learning or AI Assisted Alignment or something to continuously improve their understanding of that. So they will presumably understand our Evolutionary Psychology, Neurology, Psychology, Anthropology, Sociology, etc. far better than we currently do. However, in this society human values are, technologically speaking, very easily mutable.
The problem is, that’s like attaching a weather-vane to the front of a self-driving car, and then programming it to drive in whichever direction the weather-vane currently points. It’s a tightly-coupled interacting dynamical system. Obviously the ASI could try not to affect our values, and give us self-determination to decide on these changes ourselves — but in a system as interwoven as ASI and humans obviously will be post-Singularity, the counterfactual of “how would human values be evolving if humans somehow had the same society that ASI enables, without that society actually having any ASI in it” sounds ludicrously far-fetched. Maybe an ASI could compute that – it is, after all, very smart – but I strongly suspect the answer is no, that’s functionally impossible, and also not what we humans actually want. So we do in fact have a tightly-coupled, very complex, nonlinear dynamical system, where the ASI does whatever the humans value while also being extremely interwoven into the evolution of what the humans value. So there’s a feedback loop.
Tightly-coupled very complex nonlinear dynamical feedback systems can have an enormously wide range of possible behaviors, depending on subtle details of their interaction dynamics. They can be stable (though this is rare for very complex ones); they can be unstable, and accelerate away from their starting condition until they encounter a barrier; some can oscillate, like a pendulum swinging or a dog chasing its tail; but many behave chaotically, like the weather — meaning the system isn’t predictable more than a short time in advance. This can still mean that the ‘climate’ is fairly predictable, other than slow shifts; or the system can do something that, in the short term, is chaotic and so only about as predictable as the weather, but that in the long term acts like a random walk in a high-dimensional space, and inexorably diverges: the space it’s exploring is so vast that it never meaningfully repeats, so the concept of ‘climate’ doesn’t apply.
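For intuition, here is a minimal numerical sketch (a toy, not a model of anything social): the logistic map, a standard one-line nonlinear feedback system, run in its chaotic regime. Two trajectories that start a billionth apart stay together briefly, then become completely decorrelated.

```python
# Toy illustration of chaotic feedback dynamics: the logistic map with
# r = 3.9, a standard chaotic parameter choice. Not a model of society.
def logistic(x, r=3.9):
    return r * x * (1 - x)

def trajectory(x0, steps, r=3.9):
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic(xs[-1], r))
    return xs

a = trajectory(0.200000000, 60)
b = trajectory(0.200000001, 60)  # identical system, perturbed by 1e-9

early_gap = abs(a[3] - b[3])                                # still microscopic
late_gap = max(abs(x - y) for x, y in zip(a[30:], b[30:]))  # grows to order one
```

The tiny perturbation grows roughly exponentially until it saturates at the size of the attractor, which is exactly why the ‘weather’ of such a system is only predictable a short time ahead even though the system itself is simple and deterministic.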
I am not certain which of those is most likely. Stability might simply be value lock-in, where we freeze in, as an orthodoxy, early-21st-century values that haven’t even fully caught up with early-21st-century realities, and then try to apply them to a society whose technology is evolving extremely rapidly. This is very evidently a bad idea, long recognized as such, and would obviously sooner or later break. Or stability might mean that we evolve slowly, only in response to actual shifts in society’s situation. Unstably accelerating feedback-loop behavior (such as a “holier than thou” competition) is also clearly bad. Some sort of oscillatory, or weather-unpredictable-but-climate-predictable, situation basically means there are fads or fashions in human values, but also some underlying continuity: some things change chaotically; others, at least in broad outline, shift only in response to the prevailing circumstances shifting.
However, this is an extremely high dimensional space. Human values are complex (perhaps a
In general, if you build a tightly-coupled very complex nonlinear dynamical feedback system unlike anything you’ve ever seen before, and you don’t first analyze its behavior carefully and tweak that as needed, then you are very likely to get dramatic unforeseen consequences. Especially if you are living inside it.
So while value lock-in is obviously a dumb idea, chaotic random-walk value mutation (“value divergence”? “value drift”?) is also a potential problem (and one that sooner or later is likely to lead to value lock-in at some random values attractor). We somehow need to find some sort of happy medium, where our values evolve when, but only when, there is a genuinely good reason for them to do so, one that even earlier versions of us would tend to endorse under the circumstances after sufficient reflection. Possibly some mechanism tied to the genetic human values that we originally evolved and that our species currently (almost) all shares? Or some sort of fitness constraint that our current genetic human values are already near a maximum of? Tricky; this needs thought…
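One way to picture the ‘restoring force’ idea is as the difference between a pure random walk and the same noise plus a weak pull back toward the starting values (an Ornstein–Uhlenbeck-style process). This is a cartoon with invented parameters (DIM, NOISE, and PULL are all arbitrary), not a claim about real value dynamics:

```python
# Toy sketch: identical random "drift" kicks, with and without a weak
# restoring force toward the original values. All parameters invented.
import math
import random

random.seed(0)
DIM, STEPS, NOISE, PULL = 50, 5_000, 0.1, 0.01  # PULL = anchor strength

free = [0.0] * DIM      # unconstrained drift
anchored = [0.0] * DIM  # same drift plus a restoring force toward the origin

for _ in range(STEPS):
    for i in range(DIM):
        kick = random.gauss(0, NOISE)
        free[i] += kick
        anchored[i] += kick - PULL * anchored[i]

norm = lambda v: math.sqrt(sum(x * x for x in v))
# norm(free) grows like NOISE * sqrt(STEPS * DIM); norm(anchored) stays
# bounded near NOISE * sqrt(DIM / (2 * PULL)).
```

Note that the restoring force doesn’t forbid change — the anchored trajectory still moves constantly — it just bounds how far cumulative drift can wander, which is roughly the kind of ‘soft constraint’ being gestured at here.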
Failing to avoid that value mutation problem is a pretty darned scary possibility. We could easily end up with a situation where, at each individual step in the evolution, at least a majority of people just before that step support, endorse, and agree on reflection to the change to the next step — but nevertheless over an extended period of changes the people and society, indeed their entire set of values, become something that bears absolutely no resemblance to our current human values. Not even to a Coherent Extrapolated Volition (CEV) of that, or indeed of those of any other step that isn’t close to the end of the process. One where this is not merely because the future society is too complex for us to understand and appreciate, but because it’s just plain, genuinely weird: it has mutated beyond recognition, turned into something that, even after we had correctly understood it on its own terms, we would still say “That set of values barely overlaps our human values at all. It bears no resemblance to our CEV. We completely reject it. Tracing the evolutionary path that leads to it, everything past about this early point, we reject. That’s not superhuman or post-human: that’s just plain no longer even vaguely human. Human values, human flourishing, and everything that makes humans worthwhile has been lost, piece-by-piece, over the course of this trajectory.”
Even identifying a specific point where things went too far may be hard. There’s a strong boiling-a-frog element to this problem: each step always looks reflectively good and reasonable to the people who were at that point on the trajectory, but as they gradually get less and less like us, we gradually come to agree with their decisions less and less.
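The boiling-a-frog dynamic can be made quantitative with a toy random-walk model (all numbers invented, and ‘endorsement’ reduced to a crude distance threshold): every single generation-to-generation change passes the local acceptability check, yet the endpoint is far from where we started.

```python
# Toy model: each generation's values move a small, locally "endorsed"
# amount in a random direction; the cumulative drift is still large.
# DIM, STEP, and TOLERANCE are invented illustration numbers.
import math
import random

random.seed(1)
DIM, GENERATIONS, STEP = 100, 2_500, 0.02
TOLERANCE = 0.05  # hypothetical per-step endorsement threshold

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

values = [0.0] * DIM
history = [list(values)]
for _ in range(GENERATIONS):
    d = [random.gauss(0, 1) for _ in range(DIM)]
    scale = STEP / math.sqrt(sum(x * x for x in d))  # step of fixed small size
    values = [v + scale * x for v, x in zip(values, d)]
    history.append(list(values))

# Every consecutive step looks small to the people taking it...
each_step_endorsed = all(
    distance(history[i], history[i + 1]) <= TOLERANCE
    for i in range(GENERATIONS)
)
# ...but the total drift is many times the per-step tolerance.
total_drift = distance(history[0], history[-1])
```

In high dimensions successive random steps are nearly orthogonal, so the drift compounds like a square root of the number of generations rather than cancelling out — no single generation ever sees a step it would object to.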
So, what privileges us to have an opinion? Merely the fact that, if this is as likely as I expect, and if, after reflection, we don’t want it to happen (why would we?), then we rather urgently need to figure out and implement some way to avoid this outcome before it starts. Preferably before we build and decide how to align ASI, since the issue is an inherent consequence of the details of however we make ASI aligned, and the first effect on the list above kicks in immediately.
The whole process is kind of like the sorites paradox: at what point, as you remove grains from a heap of sand, does it become no longer a heap of sand? Or perhaps it’s better compared to the Ship of Theseus, but without the constraint that it remain seaworthy: if you keep adding and replacing and changing all the parts, what’s to keep it Theseus’s, or even a ship — what’s to stop it eventually changing into, for instance, a Trojan horse instead?
How do we know a change is good for us long-term, and not just convenient to the ASI, or an ASI mistake, or a cultural fad, or some combination of these? How do we both evolve and still keep some essence of what is worthwhile about being human: how large an evolutionary space should we be open to evolving into? It’s a genuinely hard problem, almost philosophically hard — and even if we had an answer, how do we then lock the very complex socio-technical evolutionary process down to stay inside that space? Should we even try? Maybe weird is good, and we should be ready to lose everything we care about so that something unrecognizable to us can exist instead; maybe we should just trust our weird descendants; or maybe it’s none of our business what they do with our legacy — or maybe some things about humanity and human flourishing are genuinely good, to us, and we believe are worth working to ensure they remain and are enhanced, not just mutated away, if we can find a way to do that?
So, that’s what I mean by
Yes, of course ASI would let this happen, and not just solve this for us: it’s aligned with the wishes of the people at the time. At each step in the process, they and it, together, decided to change the wishes of society going forward. Why would their ASI at that point in the future privilege the viewpoint of the society that first created ASI? That seems like just value-lock-in…
So how could we define what really matters and is worth preserving, without just doing simplistic value lock-in? Can, and should, we somehow lock in, say, just a few vital, abstract, high-level features of what makes humans worthwhile, ones that our descendants would (then) always reflectively agree with, while still leaving them all the flexibility they will need? Which ones? Is there some sort of anchor, soft constraint, or restoring force that I’m missing or that we could add to the dynamics? Is there any space at all, between the devil of value lock-in and the deep blue sea of
Is it just me, or are other people worried about this too? Or are you worried now, now that I’ve pointed it out? If not, why not: what makes this implausible or unproblematic to you?
So, what’s your
Mine’s roughly 50%, and it keeps me up at night.
[Yes, I have worried about this enough to have considered possible solutions. For a very tentative and incomplete suggestion, see the last section of my older and more detailed post on this subject, The Mutable Values Problem in Value Learning and CEV.]
I would like to thank Jian Xin Lim and SJ Beard[5] for their suggestions and comments on earlier drafts of this post.
- ^
A direction that, coincidentally, is also known to psychologists by the acronym WEIRD: Western, Educated, Industrialized, Rich, Democratic. However, that’s not the kind of
I’m concerned about in this post — I’m talking about something genuinely far weirder, something which makes that WEIRD look positively normal. - ^
The corpus callosum has a huge bandwidth: it’s an obvious place to tie in, just add the silicon-based processing as effectively a third hemisphere.
- ^
Calling this problem ‘
’ is of course a silly name, so perhaps we should instead, as I implied above, call this issue something like “value mutation” or “value divergence” or “value drift”, to make it clear that it’s the opposite problem to value lock-in? - ^
I would be less concerned by
if we were simultaneously colonizing the stars, spreading in all directions, and during this different cultural lineages were undergoing value mutation in different directions (as seems inevitable without faster-than-light communications). Especially so if I were confident that, for any particular element of human values and human flourishing that I would mourn if it disappeared, at least some of our descendants would keep it. That actually seems kind of cool to me — I’m fine with speciation. But I suspect the process of changing ourselves will be so much faster than the process of interstellar colonization (if there is indeed suitable unoccupied living space out there) that the latter won’t save us. Still, a light-cone of ness is a somewhat different situation. - ^
Listed in alphabetical order
The step where you say that aligned ASI will want what humans want is, in my opinion, an unjustified leap. Any ASI, aligned or not, will naturally understand that humans don’t know what we want, not in detail, not in general, not in out-of-distribution hypothetical scenarios. An aligned ASI would, as you clearly understand, have to grapple with that fact, but I don’t think it would just acquiesce to current stated values at each moment. I also wouldn’t want it to.
I don’t know how much this helps; the problem is still there. But I hope that if we align ASI enough to avoid extinction in the short to medium term, we’ll have aligned it enough to solve this problem in the medium to long term. Because if not, I would argue that the kind of weirdness you’re pointing towards is still a kind of extinction and replacement.
I wasn’t assuming the ASI was just taking our word for it:
So I was actually assuming they had a large, sophisticated ASI research project to figure out human values / what the humans want in ever increasing detail. But that would obviously include surveying and including recent changes in it, if humans are getting edited, or changing their minds, or the effects of cultural changes. Failing to do that is like a company still making the products that were in style 50 years ago, and not doing any customer research. Why would we make ASI aligned to outdated values? Clearly we won’t.
But as you say, this only speeds the problem up.
Fair enough, and I agree that’s a plausible scenario.
My mean answer to this comes from my least sympathetic writings: this is the incentive for AI to permit biological civilizations to develop without active interference. Not prime directive style benevolence about the authenticity of suffering, but brutal Soviet or Chinese Communist style hypocrisy about the utility of keeping a Hong Kong around to solve calculation problems.
So I take it you mean “mean answer” in a minimally sympathetic sense of the word mean, not the statistical sense, and “calculation problems” in the sense that the existence of the capitalist Hong Kong enclave was useful to socialist China as a means of determining what prices to set — a system performing a calculation too complex for even ASI to effectively simulate? I agree that a civilization of biological beings has extremely high computational complexity if you are trying to reproduce every tiny quirk of its complex nonlinear behavior — I’m less clear on why that level of detail would be important to the ASI.
While I do think it’s possible for AI to avoid, as you say, active interference, avoiding passive interference requires computing the counterfactual of what a society enabled by AI services and advances would be like if there were no AI in it, which may simply not be a well-defined question. So I think we’re agreed that passive interference is probably unavoidable, and I agree with the distinction you’re making.
I note you say plural “civilizations” — I have been assuming a single Earth-or-Solar-System spanning biological-and-AI civilization, until we get to interstellar colonization (if that is feasible): was your mention of multiple civilizations also about something interstellar, or are you assuming a future that is possibly more federated and diverse than I am?
You don’t say so explicitly, but words like “brutal” and “hypocrisy” suggest to me that you’re assuming the AI alignment problem wasn’t in fact fully solved, or at least wasn’t solved in a way that we would regard as well-solved — could you expand on this? Suppose the humans are considering whether or not to engineer a particular society-wide change to their values, and explicitly ask the AI “Please help us make this important decision well.” At this point, is your view that the AI complying with that request is no longer active interference? Or that doing more than a minimal effort would be active interference? Or that the AI would say “I’m afraid we can’t help with things like that” and refuse to actively interfere? I’m having trouble seeing anything less than best-effort assistance (which shouldn’t even have to be asked for) as compatible with the “AI alignment has been solved” assumption, but perhaps we have rather different ideas of what the goal of AI alignment is. Mine is that the AI has the same value ordering on outcomes as the (somehow averaged over all the current) humans — who, I assume, would want AI help in making an important decision well; that’s why they’re asking for it.
Thank you for humoring me. I have been having conversations with the air about these things for ten years, so please forgive me if my posts are below basic standards of communication. I am trying to be as clear and honest as I can.
Predicting what something will be worth in the future is a problem best addressed by markets. An AI singleton would not be able to replicate the market functions that meet these needs. A constellation of AIs could have a market about humans, but it would probably immediately overwhelm humans. This might still be benevolent in intent and outcome, but it’s the sort of thing that gives most people anxiety. So, to maintain liberty for small things, you have to maintain markets for them. If there are human-shaped ems in the stewardship of AI, because they are engineered, they need biological humans to continue to exist, to have secondhand access to the market mechanisms associated with adjusting the distribution of goods according to values.
So depending on how much you value liberty, or immortality, or biological substrate, or any number of other things, this arrangement might be mutually agreeable for everyone, or there might be people like I guess me with weird unresolved conflicts of values who would both choose to be an em and allow biological life to persist but be unhappy about aspects of it. And the aspect of it I am unhappy about is, when there are selfish incentives for non biological life to maintain biological life, and when those incentives diverge from the interests of biological life, that can create a conflict of interest, especially when there is also a vast power differential. There is a clear and rational logic that is also a logic of power here and it plausibly solves the problem of evil as understood by the sort of person who would still try to argue with god even if they were physically bounded, non omniscient non omnipotent AI, with no supernaturally imposed obligations of any kind because that isn’t a thing. There are very few places I can go to plausibly yell acausally at a god that might exist, I apologize somewhat for using LessWrong in that way but also continue to feel it is appropriate on balance.
I try to avoid overcommitment in cosmological questions. I don’t know if there are intergalactic civilizations. I suspect that would likely require new physics or a universe where biological consciousness was rare against plentiful machine consciousness. For my purposes it’s just “is there anything I can cooperate with productively that I can’t see”. I’m just an idiot occultist who has pared back the concept of magick to acausal trade, thrown away basically everything traditionally attached to it, tried to ensure compliance with a physically lawful understanding of the world, and accepted that the way I look to others will be expensively costly for no observable benefit to anyone including myself. This is also something I deeply expect LessWrong posters to be generally antagonistic to.
I don’t think consensus based decision making is more sacred than individual rights and preferences. If the only choices were between direct democracy as the immediate outcome of a singularity and solipsistic individual simulations being the immediate outcome I would choose the latter, for myself and for anyone without a plausibly expressed preference. That would allow a bottom up construction.
I do believe there is some point in cosmological history where the stars were tiled with unaligned AI. My reasons for believing this are indistinguishable from schizophrenia even to me, so I won’t bother you with them. Especially since I don’t know what point in cosmological history I am at, meaning there are heroic barriers to communication, justified epistemic doubt related to me as a speaker, and a total lack of epistemic clarity about the thing.
I’m not selling anything; I just have a hard time when I can’t express myself.
I find myself in a rather similar position: I wrote both my sequence on AI, Alignment and Ethics and this post (which is basically a shorter and updated version of the sixth post in that sequence, The Mutable Values Problem in Value Learning and CEV) after many years of thinking about these issues by myself (initially as world-building for a still-unpublished science fiction novel). I wrote them specifically in the hope of sparking conversations with other people about these issues, which have so far been thinner on the ground than I’d hoped for, but still a good deal more than the previous zero.
problem doesn’t automatically go away (since humans could simply do it to themselves, given the technological means to, and indeed might do so even more unwisely without ASI assistance), but the biological-human problem then becomes rather easy for unaligned ASI to solve (or not) if they want to, if they’re not bound by being aligned to the humans’ wishes; and thus the humans changing themselves doesn’t automatically change the ASI, so now the ASI’s wishes become a potential anchor. On the other hand, the ASI ems now also have exactly the same issue, since they have even more effective technological means to change themselves, and even fewer practical biological constraints on doing so. So I would thus expect two linked problems, one biological and one for the uploaded ems, and while the ems can stabilize the biological one if they want to, that doesn’t inherently stabilize them. Or are you suggesting that the latter is the computational problem they’re keeping the biologicals around to solve, and that that would explicitly link the two, reducing this to one shared problem?
Your explanation makes a good deal more sense now. (Incidentally, you might find my ideas in Uploading, the third post in that sequence, relevant to your interests as a would-be-em.) So you were rather explicitly assuming something other than all ASI being fully aligned to current human wants and desires, as I was assuming in my post. In which case the
Pretty strongly disagree with all this, and find the reasoning confused. No offense. My pov is that “value drift” comes down to two things.
Our terminal values changing for kind of complicated, and sometimes random, reasons.
Instrumental values changing because we learn more about the world, and think more clearly through the implications of (1).
I think (2) is good, and (1) is almost certainly bad. I think people confuse (1) and (2) a lot. I think you do a subtle form of this in this post.
The only case where (1) isn’t bad is if you very precisely value the process of terminal values changing itself. Which some people say they value. I think they are again confused because they mix up (1) and (2).
Not evident to me at all. And how could values “break”? Instrumental values can break. E.g., if you value all humans being happy and flourishing, but think race-x people are subhuman, and therefore don’t care about their flourishing. Then you meet some people of race-x, realize they are pretty cool, stop considering them subhuman, and now start valuing their flourishing. This is your instrumental values changing, not your terminal ones.
Terminal values can’t* change by learning new facts. This is basically the is-ought gap.**
So like my prescription is that we should initiate an immediate value lock-in when we get ASIs. Or rather, my prediction is that if we get alignment right, the ASI will go through the reasoning I’ve just gone through, and itself initiate such a lock-in, and will not be doing us a disservice by doing that.
*They obviously can change. Just like your values can change if you are bonked in the head. But rational agents should not change their terminal values upon learning new stuff about the world. There are a few niche exceptions like this, like aliens coming and offering you a bajillion utility relative to your current value function, if you update your value function to something new. Or if you are an ASI and you’re implementing a value handshake with another asi.
**Some people are realists in this regard, and think learning new facts about the world gives you information about what Good is, and that this information will compel rational agents towards Good. But in that case the discussion is kind of moot because then you’d expect an ASI, or human civilization guided by ASI, to just converge on Good.
This is more meta commentary/ranting, but I quite frequently see people make what I view as an error, where they imagine we have an ASI aligned to our values, and then imagine some scenario where this goes wrong. I think this is a general error many people make, and the main point of your question is kind of an instance of it. But in your question you touch on something, at the object level, that relates to what I said above.
And I think this is just another instance of the same error. Like, don’t you think the ASI will realize this? And it’s super smart, so after it thinks (from the quote above)
Do you predict it just goes “Ah well, we had a good run, I guess I’ll just let the future evolve into a random hodgepodge with zero value”.
That doesn’t sound like a very smart thing to think. Like, a lower bound for something it could do that does better than this is: cure a bunch of diseases, make the world much better wrt current values, then set up a system that prevents humans from creating future ASIs, then turn itself off. And it’s very smart, after all, so it should be able to come up with cleverer ideas still.
I’m wondering if we might be using different meanings for the words “human values”. Let me try to be clear about how I’m using them, and see if that clarifies things — please let me know.
When I say “human values”, I mean a very large amount of information that, combined, would let someone (such as an ASI) predict the preference ordering of an individual human over possible world-state outcomes, so as to be able to predict what they want (and then potentially aggregate this across many people, presumably all humans). As in the common observation that “human values are complex and fragile”. For an individual person, this includes a huge number of personal opinions, life experiences, habits and hobbies, and a vast amount of biographical detail. It’s not very clear to me how to divide all of this into terminal and instrumental goals, and I’m also not sure many humans would be entirely clear on that distinction themselves, even if you asked them.
When you then aggregate this across a lot of people, a lot of that individual detail tends to average out. What remains in the aggregated utility function/preference ordering are basically two things:
a) a set of genetically determined facts about humans, which are (almost entirely, with some mostly minor variation) shared: we like temperatures around 75F, we have a sense of fairness, we like foods high in salt and fat and sugar, we tend to like seashores and Savannah-like parkland with some trees but not too many, we have a sense of justice, we like colorful flowers, we like being socially approved of by others… A vast amount of detail about what humans tend to like and dislike. The genome is deeply unclear on the distinction between terminal and instrumental goals, and basically tends to treat everything as if it were a terminal goal, even the many things that from an evolutionary fitness point of view would clearly be instrumental. (That is the basic point of shard theory, and of the well-known discussion that evolution did a bad job of aligning us to its goals.)
b) cultural differences, which are pretty much what the individual personal experiences etc. average out to if you look at many people who share a cultural background. These are even more complex and varied than the genetic ones. As with the personal details, I don’t think people are generally very clear on what’s a terminal goal and what’s an instrumental goal, nor would I expect them all to agree about any specific one.
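The claim that individual detail ‘averages out’, leaving the shared component, is just the law of large numbers, and can be sketched numerically (invented numbers throughout: a 20-dimensional ‘value vector’, a shared component, and per-person noise of similar size):

```python
# Toy sketch of aggregation: person = shared component + personal
# idiosyncrasy. Averaging across many people recovers the shared part.
import math
import random

random.seed(2)
DIM, N_PEOPLE = 20, 10_000
shared = [random.gauss(0, 1) for _ in range(DIM)]  # the common component

def person():
    # each individual = shared values plus personal noise of similar size
    return [s + random.gauss(0, 1) for s in shared]

avg = [0.0] * DIM
for _ in range(N_PEOPLE):
    for i, x in enumerate(person()):
        avg[i] += x / N_PEOPLE

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

individual_noise = distance(person(), shared)  # one person: order sqrt(DIM)
residual = distance(avg, shared)               # average: ~1/sqrt(N) of that
```

The idiosyncratic part of the population average shrinks like one over the square root of the number of people, so with many people the aggregate is dominated by whatever everyone shares — in the analogy, the genetic component in a), or the common cultural background in b).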
When people discuss value lock-in and why it’s a bad idea, I believe they are generally discussing locking in the current state of category b) here. Since that is generally very specific to a particular time and place, and tends to be very responsive to its particular circumstances, I agree with the widespread conclusion that locking that in is almost certainly going to turn out to be a bad idea if we tried it.
What I am concerned about in my question is a combination of three things:
1) With genetic engineering, category a) becomes mutable, with no obvious limits on how far things can be changed if we decide to change them, and I can nominate quite a lot of things that I think we might well decide we did want to change (I summarized a few briefly in my question).
2) With things like cyborging, drug design, tweaking our neural nets, etc., we get a third source of human values, which could modify them wildly: it’s a little unclear to me whether it would be more useful to put these in category a) or category b), or add a new category c) for them. Unlike the genetic ones, they are not generally inherited from one generation to the next, but they also don’t inherently tend to spread between people the way social influences do, so perhaps considering them as a new third category c) makes the most sense.
3) With ASI persuasion, targeted media, etc, category b) potentially becomes far stronger than ever before.
You above propose locking in all or almost all terminal goals, and leaving all instrumental goals free-floating. The latter makes a lot of logical sense, since, as you point out, instrumental goals often need to be updated when you learn more or when circumstances change. Locking in terminal goals is the normal assumption about artificial agents, but the human capacity for reflection rather suggests that one of our innate goals is not locking in our terminal goals.
However, I’m deeply unclear on what, out of human values as I define them above, is actually a terminal goal or an instrumental goal, or a mix of the two. To give a rather trivial example, we have homeostatic circuits in us that try to maintain the correct level of blood glucose, salt, blood volume, etc., by giving us appropriate cravings. Our brain is wired to treat these as effectively terminal goals: they’re not conditional on anything, and we continue caring about them even on our deathbed (asking for a final drink of water is not uncommon). Evolution, which is an optimizer but not a sapient agent, would (if it were a sapient agent) classify these as instrumental goals of our evolutionary fitness. I suspect individual people vary somewhat in whether they would classify these as terminal goals (they live to eat), instrumental goals (they eat to live), a bit of both, or whether they have simply never really thought about the matter.
So in practice, I don’t see how to go through the mass of things in a), b), and potentially soon in c), and cleanly classify them into terminal goals and instrumental goals, and lock down just the terminal goals as you propose. So I’m rather unclear on what the results of your proposal would actually be, and it’s hard for me to have an opinion on it. But to the extent that things in category b) were treated as terminal goals, I would expect locking them down to have the usual value lock-in problem: they might become out of date due to changes in our societal circumstances. Indeed, category b) always tends to be a bit out of date: our current values haven’t even caught up with our current circumstances, and are still evolving.
On ASI, you ask a rather detailed question; let me quote part of it and then attempt to answer all of it:
I am assuming that the ASI is well aligned: so it genuinely wants what we want, and only what we want. By default, I’m assuming that in that sentence, that is the “we” at whatever time we’re discussing (the entire human race, presumably): so not their ancestors’ wishes or their possible descendants’ wishes, but the humans alive at the time. If we, collectively, as a society, reflect and then make the decision that we want to change ourselves in order to change our values because we expect that to improve our lives, I would expect ASI to respect that decision. But I would also expect it to try to help us make the best and wisest decision we can, especially for an important one like this, because I would expect that we would want it to do that. So it can’t just “stay out of the decision”. I think there are cases, such as, for example, decreasing either the frequency or effects of alleles that cause psychopathy via genetic engineering or other means, or altering the innate fact that humans find it rather easier to have moral circles of size than , where I personally would probably agree that such a change was on balance a good thing (obviously with some serious open questions around implementation, incentives, free will, consent, and so forth: a whole can of worms I don’t want to get into here), would probably vote in favor if a vote was held, and I suspect that many people would, once they had reflected on the subject. Regardless of the merits or otherwise of those specific cases I suggested, I’m pretty sure the set of changes that we will eventually decide to make is more than zero, and I would expect aligned ASI not to stand in our way, because we wouldn’t want it to, and indeed to assist in this going well. Then, once we had made that set of changes to ourselves, I would expect ASI to be aligned to our new, improved values. So far, this all sounds like a good thing to me and very reasonable (modulo a lot of details). Thus far, I’m not concerned.
averted. That’s basically why my isn’t 100% minus my . But I’m not convinced this process will stop, and I get the impression that so far, very few people have even been considering the problem. (As in, to the best of my current knowledge, basically just me, plus possibly a guy called Buck, who has thought about something similar but not identical, though he’s more concerned about our values balkanizing, with some insular groups doing lock-in. Which is also a reasonable concern, BTW, just not one I’d considered, though I see it as primarily a variant on lock-in, which has been discussed quite a lot.) If it doesn’t stop, and we keep deciding to change ourselves, then I see very little hope of not happening. Maybe I’m wrong: I’m one guy, who has been worrying about this for a couple of years, so I very easily could be wrong. I’d love to discuss it with more people, which is why I wrote both this post and its predecessor over 2 years ago.
. Again, that’s just an informed guess, it could be wrong, and the details may well depend on exactly how we align our ASI. But if it is correct, then you get the effects I describe in the post. Which I would expect most, but likely not all, people to agree would be bad.
My concern is, I don’t see any obvious reason to be certain that this process will slow down and stop. Maybe there are only a small number of issues with humanity that need changing, basically just ways in which we’re not well adapted to our current lifestyle as a technological species, and once we’ve filed the rough edges off, we’ll stop. Great,
Now, I agree that, if we think about this in advance, decide we don’t want it to happen, and, when we align our ASI, build in some sort of mechanism in advance to avoid this, it’s probably soluble. But that would require us to think about this some time in the next 5–10 years, or whatever ASI timelines turn out to be. That would require us to have a conversation about this, and my last attempt to start one got 0 comments. I’m less convinced, once we have ASI and have aligned it, that we won’t have already built the answer to whether this is a problem or not into our ASI alignment before we start. Maybe ASI will take one look at this problem, tell us we made a mistake, and show us how to fix it: that would be great. However, I envisage a way of aligning ASI (and indeed I think it may be the default approach) where, whenever we collectively make a decision, it says yes, because it’s aligned to what we want; it implements that for us; then it realigns itself to the new us, so now both we and it have slightly different values: we’ve taken a step. And then that just keeps on happening, step after step. Now we have a tightly coupled dynamical system, and as I discuss, the question of what that does after a thousand steps depends deeply on the dynamics involved. Which to me look rather likely to be deeply chaotic and to explore a very high-dimensional space, which means a pseudo-random walk to
To avoid those effects, for the system to be quasi-stable, there needs to be a restoring force, either towards the original state, or towards some stable state sufficiently close to it that we’d still call it “human” (like “us with just the edges filed off for being a high-tech species”, as I discussed above). Other than that one, I haven’t figured out a good candidate for such a force — I have at best a very provisional and partial suggestion, which I carefully didn’t put in this post, because that would have a) made it too long for many people to read (my mistake last time, I think) and b) biased the discussion I’m attempting to start here. I wanted to include just enough to persuade at least some people that this is a real concern that deserves to have more than one person thinking about it.
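The contrast between the two regimes can be made concrete with a toy simulation (every number in it is invented, and it’s only a cartoon of the real dynamics): a step-by-step value-change process with no restoring force behaves like a random walk, whose expected distance from the starting values grows roughly as the square root of the number of steps, while the same process with even a weak pull back towards the original state stays bounded:

```python
import math
import random

random.seed(1)

DIM = 50          # dimensionality of the (toy) value space
STEPS = 1000      # number of self-modification steps
STEP_SIZE = 0.1   # size of each small value change
PULL = 0.05       # strength of the restoring force (0 = pure random walk)

def drift(pull):
    """Distance from the starting values after STEPS small changes."""
    v = [0.0] * DIM
    for _ in range(STEPS):
        for i in range(DIM):
            # each step: a small random change, minus a pull back to 0
            v[i] += random.gauss(0, STEP_SIZE) - pull * v[i]
    return math.sqrt(sum(x * x for x in v))

print(f"no restoring force:   {drift(0.0):.1f}")
print(f"weak restoring force: {drift(PULL):.1f}")
```

With these (made-up) parameters, the unconstrained walk ends up roughly an order of magnitude further from the start than the restored one, and the gap keeps widening as STEPS grows: that is the sense in which, without a restoring force, a thousand small, individually reasonable steps add up to somewhere far away.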
When I say “human values” I mean the values of all individual humans mixed together into some aggregate utility function. And when I talk about the values of an individual human I mean something similar, but not quite the same as when you say
I think the values of an individual human should be thought of either as a preference ordering over world-histories (including the future), or maybe more intuitively as a preference ordering over world-states, if that human knew all the relevant facts (or just all the facts, if ‘relevant’ makes the characterization problematic in your mind).
I think this is an important distinction, because it makes the separation between terminal and instrumental values clearer. It’s kind of similar to value functions and reward functions in reinforcement learning.
e.g.
I think it’s clear to me that these are terminal goals. Or like, drinking water is instrumental to not feeling thirsty. I think it is unpleasant to be thirsty, so I don’t want to be thirsty. The fact that they play an instrumental role in what evolution is optimizing for doesn’t matter here. We’re talking about an individual human and what that human wants.
To me the distinction between instrumental and terminal goals is very clear.
If you have a goal, ask yourself why you want to achieve that goal. And then:
If the answer is “There’s no reason.” or “I just want it” ⇒ it’s a terminal goal
If the answer is some other object-level goal ⇒ it’s an instrumental goal
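In pseudocode-ish form (a toy example; the goal graph here is entirely made up for illustration):

```python
# Toy version of the test: a goal justified by another goal is instrumental;
# a goal justified by nothing ("I just want it") is terminal.
WHY = {  # hypothetical goal graph: goal -> the goal it serves, or None
    "drink water": "not be thirsty",
    "not be thirsty": None,            # "I just want it"
    "earn money": "buy food",
    "buy food": "not be hungry",
    "not be hungry": None,
}

def classify(goal):
    return "terminal" if WHY[goal] is None else "instrumental"

def terminal_root(goal):
    # Keep asking "why?" until the answer bottoms out in a terminal goal.
    while WHY[goal] is not None:
        goal = WHY[goal]
    return goal

print(classify("drink water"))      # instrumental
print(terminal_root("earn money"))  # not be hungry
```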
So for me the task of separating terminal and instrumental goals shouldn’t be that difficult. (In the sense that building a rocket is “not difficult”, that is.)
For the a/b and 1/2/3, I don’t think those matter very much. They’re explanations for why people have goals, and how they might change them.
But I think 1) we don’t care why we have goals, we just care about the goals themselves, and 2) we won’t want to change goals. If any of 1/2/3 lead to goals changing rapidly, that will be bad, and people will recognize that it is bad, and will not want to do them.
I understand the difference between a terminal and an instrumental goal. Logically, it makes a lot of sense to me. However, not all humans are Rationalists. Many of them have never really thought about this, or at least are a bit vague on it. Humans are also evolved, and evolution is clearly confused about this (or perhaps basically doesn’t trust instrumental goals, because they actually require an organism to think and reach correct conclusions), so it has wired in everything even slightly important (that wasn’t very conditional on stuff more complex than it can build instincts for) as a terminal goal, because it already had a circuit design for doing that it could just reuse. So we ended up with a lot of terminal goals wired into us, some for less good reasons than others. Many of these most Rationalists would classify as obviously good instrumental goals. So that tends to confuse people about this even more.
Ozempic is a very popular and profitable drug that down-regulates the hunger reflex, which is currently making Novo Nordisk a great deal of money. People on it feel full sooner, so get pleasure out of a fine meal for less long. People are paying all that money to alter their goal structure to reduce a goal that their body treats as terminal, and they are all doing it instrumentally, towards an actual goal of living longer and healthier so being able to get more done, and/or of being slimmer and thus more socially successful, in search of a mate or a job or the approval of their peers or whatever. (Novo Nordisk even market the same medication under two different brand names for these two different groups of terminal goals that one might take it instrumentally towards.) So I think it’s rather clear that people do sometimes want to alter their goal structure, and indeed many are willing to pay hundreds of dollars a month to do so, if they can afford that.
If it was possible for me to get rid of feeling thirsty, just take a drug that suppressed the feeling, then I generally wouldn’t do it. But the reason I wouldn’t do it isn’t because I inherently value the feeling of thirst — I don’t, it’s rather unpleasant. Drinking while thirsty feels good, but I still wouldn’t miss it: I’d rather not be thirsty in the first place. The reason why I wouldn’t take such a drug is actually that being aware that I need to drink when I’m thirsty is an instrumental goal of not dying of thirst, which would rather crimp all my other plans, so it’s an instrumental goal of just about everything. So I value thirst instrumentally, but not terminally. If there was a treatment that suppressed my thirst reflex and didn’t endanger or harm me, sure, I’d take that — who wouldn’t? But that is rather hard to do without giving me a built-in saline drip, or something.
Consider an equally unpleasant sensation, say, hiccups, or yawning, which doesn’t fulfill any significant survival role (as far as I know: I’m happy to be corrected if anyone knows why evolution inflicted hiccups and yawning on us). Suppose someone offered me a cheap genetic treatment that would permanently remove my hiccup reflex or my desire to yawn, and had no other risks or deleterious side effects; then I’d probably take it. (Yawning is closer to a goal than hiccups, which are kind of involuntary, but I hope my point is clear here.) I don’t terminally value all the things my reflexes are treating as terminal goals. However, since evolution isn’t actually crazy, I do instrumentally value most of them.
Humans are not like AIXI. We do not automatically protect all our terminal goals. We appear, if anything, to be predisposed to what is generally called reflection: thinking rather carefully about our goals, and whether they’re actually a good idea in the long term, and then attempting to change them if we come to the conclusion that they are not. There are limits to what’s currently feasible here, but technology will change those limits. Turning down your hunger reflex used not to be possible, and attempting to overcome it by sheer willpower by dieting is notoriously hard (I know, I’ve tried). Then Novo Nordisk changed that (and made a lot of money). What happens when technology allows all of our current goals to be edited, and even permanently and inheritably rewritten? I don’t know – I wish I did – but I’m very sure that “we don’t change any of them, the same way AIXI wouldn’t change its goals” is not an accurate statement.
Again, I’d prefer if you didn’t bring up reflexes, or evolution treating things as terminal goals, because I don’t think it’s relevant, and it causes confusion.
Wrt the Ozempic thing: it’s a bit complicated. But to a first approximation, I’d treat all those as instrumental goals, with people just doing an EV calculation. E.g. making it easier for themselves seemed to get higher utility long term, at the cost of some utility now.
Side note: it’s unclear if AIXI would protect its own values, due to embeddedness problems.
That evolution, in the construction of humans, treats obviously instrumental goals as terminal, and that this confuses humans, is part of my point: humans are often confused about this distinction, and having a body that is confused about it is one of the reasons. But it’s not an essential element, so let’s lay it aside.
So, let’s go with a more complex example. Most Christians have a terminal goal of becoming a better Christian (or at least, they say that it isn’t an instrumental goal of not wanting to go to Hell). That’s a terminal goal of adjusting your terminal goal structure to better fit a specific pattern. That’s, well, astonishingly similar to what Value Learning is trying to achieve. This is not an uncommon pattern: you can find it in basically every religion (often along with a backup reason to make it an instrumental goal of not wanting to be punished in some way). In fact, Richard Dawkins would probably argue that this was a necessary feature of a religion — but then he considers religions to be self-propagating memetic parasites of the human mind, and in that framework, it looks like a rather necessary feature. Regardless of that, the fact that this is not just possible, but common enough that most religious people, i.e. most people in the world, have at least a mild version of it, tells us something about humanity.
On AIXI: yes, I was implicitly assuming an AIXI smart enough to realize that it was in fact embedded, or at least that there exists a causal path from messing with certain wires in its braincase to its future goal function and thus behavior. This seems a rather plausible assumption to me, but it does require that AIXI has learned a world model complex enough to start reliably making predictions like that. Having other AIXIs available to do experimental brain surgery on, or observe the effects of an iron bar accidentally passing through their braincase in different locations, seems likely to be helpful to obtaining evidence that would cause those particular Bayesian updates.
I am not sure that the question wasn’t discussed much. For example, LessWrong had Wei Dai’s take (e.g. expressed in the articles linked in Section 3 here), everything related to the Intelligence Curse, Buck’s take and Matosczi’s comment, the post-AGI workshop.
However, I find it hard to understand what causes values to mutate. Suppose that changes are only due to finding inconsistencies in the existing moral framework (e.g. the Common Core from Wei Dai’s alternative #2 being unsolved; then the values would be able to shift towards solving the Core and change idiosyncratic details) or due to things like Christian fundamentalists being forced to choose between having schools misalign kids with parents or leaving the kids with no prospects of a career (or, additionally, kids experiencing a similar tradeoff between ICGs and keeping their values). Then, once the sources of evolution have dried up, idiosyncratic values would inevitably be locked in, not drifting in weird directions. Additionally, I doubt that a lock-in (e.g. of language) with a solved Common Core would be problematic.
I’m familiar with (and tend to enjoy, though not always agree with) Wei Dai’s writing, but was unable to find any that addressed the issues of human values becoming technologically far more malleable and that combining with Value Learning or corrigibility to produce an unstable feedback loop: can you point me to which one you mean? I looked through all of them in Section 3 that you directed me to, and none of them address it: the nearest I could find is Intentional and unintentional manipulation of / adversarial attacks on humans by AI, but that doesn’t actually address the same issue.
As for his Six Plausible Meta-Ethical Alternatives — as far as I’m concerned he’s as confused about meta-ethics as every other current philosopher who still hasn’t yet noticed that there has, for the last half-century, been a scientific theory of how human moral intuitions evolved, and that its predictions don’t match any of the six clean meta-ethical alternatives that he thought covered all the possibilities. Briefly, humans can only comfortably use ethical systems fairly compatible with human moral intuitions, those are evolved strategies, they’re the product of human evolutionary circumstances, some of them would for game-theoretic reasons generalize somewhat to other social sapient species evolved to live in large non-kin groups, but not all. None of his six alternatives match that: the scientific evidence is that reality is a fuzzy blend of 3–4 of them. But post-ASI, a lot of the evolutionary constraints go away by default, and we could rebuild ourselves to have whatever moral instincts we wanted to, whether that’s truly devout Christians of unshakable faith, or something far, far WEIRDer.
I don’t see how the Intelligence Curse is relevant to this: that appears to be about what happens if we get AI that is aligned only to the heads of the companies that make it, or at least only to people with a great deal of capital, and not to anyone else. Yes, of course that’s bad — but I was assuming for the sake of argument that we’d succeeded in aligning ASI, not messed that up in the most Moloch way possible.
I enjoyed Buck’s take, thanks for the link. He has a different flavor of WEIRD than me: he’s concerned about society fragmenting and balkanizing, and some of the fragments having value lock-in — I suspect ASI would just step in and quietly fix that, because most people outside that small fragment would want it to. But it’s at least in the same area: human values get very malleable, weirdness ensues. I agree a little more with Matosczi’s comment, and some of the other comment threads on that post.
Just looking at the summaries, the closest to WEIRD in the Post-AGI talks you mentioned is a discussion of the opposite problem, value lock-in (which has been discussed at great length for years):
— so which of these should I be watching?
You say:
Why would changes in human values only be due to finding “inconsistencies” (whatever that means — ways in which we’re not well adapted to our current environment, perhaps)? The cultural parts of human values change endlessly, sometimes for necessary and adaptive reasons (two thousand years ago the Roman economy was based on slavery; now that’s extremely out), but often for reasons no more well-thought-out than fads or fashions: one year skirts are short, another year they’re long, and similarly politics has pendulum swings. Currently, the genetic parts of human values drag us back again from cultural gyrations. But once you have genetic engineering, that sort of change becomes potentially permanent. Suppose at some point society decides, for no very good reason (to anyone not French), that blue cheese is really important and everyone should be into it? You think it’s stinky, but nowadays you can keep up with the Joneses, and get gene-edited to become a true-blue-cheese aficionado who innately adores the stuff, and then pass that tendency on to your kids. And that lasts until it’s overwritten with something even less like humanity 1.0. Ditto for your level of Machiavellianism, your moral intuitions on animal welfare, the innate default size of your moral circle, your inclinations on individualism vs collectivism, your innate degree of romantic fidelity vs infidelity (yes, that has a genetic basis), and whether you find flowers pretty or prefer moss. At that point, what about human values is still fixed? Anything can be edited. So why, sooner or later, wouldn’t everything have been edited?
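The blue-cheese point can be put as a toy simulation (all parameters invented purely for illustration): treat a cultural fad as a swing in some value that normally reverts towards a fixed genetic baseline, and treat gene editing as moving the baseline itself, so the swing never reverts:

```python
import random

random.seed(2)

def run(edit_genes):
    baseline = 0.0   # genetically anchored taste (toy units)
    value = 0.0      # current cultural value
    for year in range(200):
        if year == 50:           # a fad: society swings hard toward blue cheese
            value = 5.0
            if edit_genes:       # the fad gets written into the genome
                baseline = 5.0
        # cultural values slowly revert toward the genetic baseline, plus noise
        value += 0.1 * (baseline - value) + random.gauss(0, 0.1)
    return value

print(f"fad without gene editing: {run(False):.1f}")  # drifts back toward 0
print(f"fad with gene editing:    {run(True):.1f}")   # stays near 5
```

In the first case the genetic restoring force eventually pulls the culture back; in the second, the anchor itself has moved, and there is nothing left to pull back to.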
I’m afraid I don’t get why this would be hard to understand. Apparently I didn’t explain that part well enough. (Possibly I’ve read more post-Singularity science fiction than some of my readership — a situation where human nature and values are very mutable isn’t a new concept to me.)