Pretty strongly disagree with all this, and find the reasoning confused. No offense. My pov is that “value drift” comes down to two things.
1) Our terminal values changing for kind of complicated, and sometimes random, reasons.
2) Instrumental values changing because we learn more about the world, and more clearly think through the implications of (1).
I think (2) is good, and (1) is almost certainly bad. I think people confuse (1) and (2) a lot. I think you do a subtle form of this in this post.
The only case where (1) isn’t bad is if you very precisely value the process of terminal values changing itself. Which some people say they value. I think they are again confused because they mix up (1) and (2).
So while value lock-in is obviously a dumb idea,
Stability obviously might be just value lock-in, where we simply freeze in as an orthodoxy early-21st century values which haven’t even fully caught up with early-21st century realities, and then try to apply them to a society whose technology is evolving extremely rapidly. This is very evidently a bad idea, long recognized as such, and would obviously sooner or later break.
Not evident to me at all. And how could values “break”? Instrumental values can break. E.g., if you value all humans being happy and flourishing, but think race-x people are subhuman, and therefore don’t care about their flourishing. Then you meet some people of race-x and realize they are pretty cool, stop considering them subhuman, and now start valuing their flourishing. This is your instrumental values changing, not your terminal ones.
Terminal values can’t* change by learning new facts. This is basically the is-ought gap.**
So like my prescription is that we should initiate an immediate value lock-in when we get ASIs. Or rather, my prediction is that if we get alignment right, the ASI will go through the reasoning I’ve just gone through, and itself initiate such a lock-in, and will not be doing us a disservice by doing that.
*They obviously can change. Just like your values can change if you are bonked on the head. But rational agents should not change their terminal values upon learning new stuff about the world. There are a few niche exceptions to this, like aliens coming and offering you a bajillion utility relative to your current value function if you update your value function to something new. Or if you are an ASI and you’re implementing a value handshake with another ASI.
**Some people are realists in this regard, and think learning new facts about the world gives you information about what Good is, and that this information will compel rational agents towards Good. But in that case the discussion is kind of moot because then you’d expect an ASI, or human civilization guided by ASI, to just converge on Good.
This is more meta commentary/ranting, but I quite frequently see people make what I view as an error: they imagine we have an ASI aligned to our values, and then imagine some scenario where this goes wrong. I think this is a general error many people make, and the main point of your question is kind of an instance of it. But in your question you touch on something that, at the object level, relates to what I said above.
The problem is, that’s like attaching a weather-vane to the front of a self-driving car, and then programming it to drive in whichever direction the weather-vane currently points. It’s a tightly-coupled interacting dynamical system. Obviously ASI could try not to affect our values, and give us self-determination to decide on these changes ourselves — but in a system as interwoven as ASI and humans obviously will be post-Singularity, the counterfactual of “how would human values be evolving if humans somehow had the same society that ASI enables without that actually having any ASI in it” sounds ludicrously far-fetched. Maybe an ASI could do that – it is after all very smart – but I strongly suspect the answer is no, that’s functionally impossible, and also not what we humans actually want, so we do in fact have a tightly-coupled very complex nonlinear dynamical system, where ASI does whatever the humans value while also being extremely interwoven into the evolution of what the humans value. So there’s a feedback loop.
And I think this is just another instance of the same error. Like, don’t you think the ASI will realize this? And it’s super smart, so after it thinks (from the quote above)
...maybe I could predict how human values would evolve without my interference? After all, I am very smart. But I strongly suspect the answer is no, that’s functionally impossible. And it’s also not what humans actually want...
Do you predict it just goes “Ah well, we had a good run, I guess I’ll just let the future evolve into a random hodgepodge with zero value”?
That doesn’t sound like a very smart thing to think. Like, a lower bound for something it could do that does better than this is: cure a bunch of diseases, make the world much better wrt current values, then set up a system that prevents humans from creating future ASIs, then turn itself off. And it’s very smart, after all, so it should be able to come up with cleverer ideas still.
I’m wondering if we might be using different meanings for the words “human values”. Let me try to be clear about how I’m using them, and see if that clarifies things — please let me know.
When I say “human values”, I mean a very large amount of information that, combined, would let someone (such as an ASI) predict the preference ordering of an individual human over possible world-state outcomes, so as to be able to predict what they want (and then potentially aggregate this across many people, presumably all humans). As in the common observation that “human values are complex and fragile”. For an individual person, this includes a huge number of personal opinions, life experiences, habits and hobbies, and a vast amount of biographical detail. It’s not very clear to me how to divide all of this into terminal and instrumental goals, and also I’m not sure many humans would be entirely clear on that distinction themselves, even if you asked them.
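To make that concrete, here is a toy sketch of what I mean by aggregating individual preference information into a group utility function and preference ordering. All the names and numbers are hypothetical, and real human values are of course vastly higher-dimensional than a few scored outcomes:

```python
# Toy sketch (hypothetical names and numbers): represent each person's "values"
# as utility scores over a few candidate world-states, then aggregate by
# averaging. Real human values are far richer than this, which is exactly the
# "complex and fragile" point.
from statistics import mean

# Per-person utilities over possible world-states (made-up numbers).
individual_values = {
    "alice": {"status_quo": 0.2, "cure_diseases": 0.9, "edit_genome": 0.1},
    "bob":   {"status_quo": 0.5, "cure_diseases": 0.8, "edit_genome": 0.6},
    "carol": {"status_quo": 0.3, "cure_diseases": 0.7, "edit_genome": -0.4},
}

def aggregate(values):
    """Average each outcome's utility across people; idiosyncratic personal
    detail tends to wash out, leaving the shared components."""
    outcomes = next(iter(values.values())).keys()
    return {o: mean(person[o] for person in values.values()) for o in outcomes}

def preference_ordering(utilities):
    """Turn a utility function into a preference ordering over outcomes."""
    return sorted(utilities, key=utilities.get, reverse=True)

group = aggregate(individual_values)
print(group)                       # the aggregate utility function
print(preference_ordering(group))  # ['cure_diseases', 'status_quo', 'edit_genome']
```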
When you then aggregate this across a lot of people, a lot of that individual detail tends to average out. What remains in the aggregated utility function/preference ordering are basically two things:
a) a set of genetically determined facts about humans, which are (almost entirely, with some mostly minor variation) shared: we like temperatures around 75F, we have a sense of fairness, we like foods high in salt and fat and sugar, we tend to like seashores and Savannah-like parkland with some trees but not too many, we have a sense of justice, we like colorful flowers, we like being socially approved of by others… A vast amount of detail about what humans tend to like and dislike. The genome is deeply unclear on the distinction between terminal and instrumental goals, and basically tends to treat everything as if it were a terminal goal, even the many things that from an evolutionary fitness point of view would clearly be instrumental. (That is the basic point of shard theory, and of the well-known discussion that evolution did a bad job of aligning us to its goals.)
b) cultural differences, which are pretty much the average that the individual personal experiences etc. average out to if you look at many people who share a cultural background. These are even more complex and varied than the genetic ones. As with the personal details, I don’t think people are generally very clear on what’s a terminal goal and what’s an instrumental goal, nor would I expect them all to agree about any specific one.
When people discuss value lock-in and why it’s a bad idea, I believe they are generally discussing locking in the current state of category b) here. Since that is generally very specific to a particular time and place, and tends to be very responsive to its particular circumstances, I agree with the widespread conclusion that locking that in is almost certainly going to turn out to be a bad idea if we tried it.
What I am concerned about in my question is a combination of three things:
1) With genetic engineering, category a) becomes mutable, with no obvious limits on how far things can be changed if we decide to change them, and I can nominate quite a lot of things that I think we might well decide we did want to change (I summarized a few briefly in my question).
2) With things like cyborging, drug design, tweaking our neural nets etc., we get a third source of human values, which could modify them wildly: it’s a little unclear to me whether it would be more useful to put these in category a) or category b), or add a new category c) for them. Unlike the genetic ones, they are not generally inherited from one generation to the next, but they also don’t inherently tend to spread between people the way social influences tend to, so perhaps considering them a new third category c) makes the most sense.
3) With ASI persuasion, targeted media, etc, category b) potentially becomes far stronger than ever before.
You above propose locking in all or almost all terminal goals, and leaving all instrumental goals free-floating. The latter makes a lot of logical sense, since, as you point out, instrumental goals often need to be updated when you learn more or when circumstances change. Locking in terminal goals is the normal assumption about artificial agents, but the human capacity for reflection rather suggests that one of our innate goals is not locking in our terminal goals.
However, I’m deeply unclear on what, out of human values as I define them above, is actually a terminal goal or an instrumental goal, or a mix of the two. To give a rather trivial example, we have homeostatic circuits in us that try to maintain the correct level of blood glucose, salt, blood volume, etc. by giving us appropriate cravings. Our brain is wired to treat these as effectively terminal goals — they’re not conditional on anything, we continue caring about them even on our deathbed, asking for a final drink of water is not uncommon. Evolution, which is an optimizer but not a sapient agent, would (if it were a sapient agent) classify these as instrumental goals of our evolutionary fitness. I suspect individual people vary somewhat in whether they would classify these as terminal goals (they live to eat), or instrumental goals (they eat to live), a bit of both, or they have simply never really thought about the matter.
So in practice, I don’t see how to go through the mass of things in a), b) and potentially soon in c), and cleanly classify them into terminal goals and instrumental goals, and lock down just the terminal goals as you propose. So I’m rather unclear on what the results of your proposal would actually be, and it’s hard for me to have an opinion on it. But to the extent that things in category b) were treated as terminal goals, I would expect locking them down to have the usual value lock-in problem that they might become out of date due to changes in our societal circumstances. Indeed, category b) always tends to be a bit out of date: our current values haven’t even caught up with our current circumstances, and are still evolving.
On ASI, you ask a rather detailed question; let me quote part of it and then attempt to answer all of it:
Do you predict it just goes “Ah well, we had a good run, I guess I’ll just let the future evolve into a random hodgepodge with zero value”?
I am assuming that the ASI is well aligned: so it genuinely wants what we want, and only what we want. By default, I’m assuming that in that sentence, that is the “we” at whatever time we’re discussing (the entire human race, presumably), so not their ancestors’ wishes or their possible descendants’ wishes, but the humans alive at the time. If we, collectively, as a society, reflect and then make the decision that we want to change ourselves in order to change our values because we expect that to improve our lives, I would expect ASI to respect that decision. But I would also expect it to try to help us make the best and wisest decision we can, especially an important one like this, because I would expect that we would want it to do that. So it can’t just “stay out of the decision”. I think there are cases, such as, for example, decreasing either the frequency or effects of alleles that cause psychopathy via genetic engineering or other means, or altering the innate fact that humans find it rather easier to have moral circles of size than , where I personally would probably agree that such a change was on balance a good thing (obviously with some serious open questions around implementation, incentives, free will, consent, and so forth: a whole can of worms I don’t want to get into here), would probably vote in favor if a vote was held, and I suspect that many people would, once they had reflected on the subject. Regardless of the merits or otherwise of those specific cases I suggested, I’m pretty sure the set of changes that we will eventually decide to make is more than zero, and I would expect aligned ASI not to stand in our way, because we wouldn’t want it to, and indeed to assist in this going well. Then, once we had made that set of changes to ourselves, I would expect ASI to be aligned to our new, improved values. So far, this all sounds like a good thing to me and very reasonable (modulo a lot of details). Thus far, I’m not concerned.
My concern is, I don’t see any obvious reason to be certain that this process will slow down and stop. Maybe there are only a small number of issues with humanity that need changing, basically just ways in which we’re not well adapted to our current lifestyle as a technological species, and once we’ve filed the rough edges off, we’ll stop. Great, averted. That’s basically why my isn’t 100% minus my . But I’m not convinced this process will stop, and I get the impression that so far, very few people have even been considering the problem. (As in, to the best of my current knowledge, basically, me, and possibly a guy called Buck has thought about something similar but not identical, though he’s more concerned about our values balkanizing with some insular groups doing lock-in. Which is also a reasonable concern, BTW, just not one I’d considered, though I see it as primarily a variant on lock-in, which has been discussed quite a lot.) If it doesn’t stop, and we keep deciding to change ourselves, then I see very little hope of not happening. Maybe I’m wrong: I’m one guy, who has been worrying about this for a couple of years, so I very easily could be wrong. I’d love to discuss it with more people, which is why I wrote both this post and its predecessor over 2 years ago.
Now, I agree that, if we think about this in advance, decide we don’t want it to happen, and when we align our ASI, build in, in advance, some sort of mechanism to avoid this, it’s probably soluble. But that would require us to think about this some time in the next 5–10 years or whatever ASI timelines turn out to be. That would require us to have a conversation about this, and my last attempt to start one got 0 comments. I’m less convinced, once we have ASI and have aligned it, that we won’t have already built the answer to whether this is a problem or not into our ASI alignment before we start. Maybe ASI will take one look at this problem, and tell us we made a mistake, and here’s how to fix it: that would be great. However, I envisage a way of aligning ASI (and indeed I think it may be the default approach) where, whenever we collectively make a decision, it says yes, because it’s aligned to what we want, it implements that for us, then realigns itself to the new us, so now both we and it have slightly different values: we’ve taken a step. And then that just keeps on happening, step after step. Now we have a tightly coupled dynamical system, and as I discuss, the question of what that does after a thousand steps depends deeply on the dynamics involved. Which to me look rather likely to be deeply chaotic and exploring a very high-dimensional space, which means a pseudo-random walk to . Again, that’s just an informed guess, it could be wrong, and the details may well depend on exactly how we align our ASI. But if it is correct, then you get the effects I describe in the post. Which I would expect most, but likely not all, people to agree would be bad.
To avoid those effects, for the system to be quasi-stable, there needs to be a restoring force, either towards the original state, or towards some stable state sufficiently close to it that we’d still call it “human” (like “us with just the edges filed off for being a high-tech species”, as I discussed above). Other than that one, I haven’t figured out a good candidate for that — I have at best a very provisional and partial suggestion, which I carefully didn’t put in this post, because that would have a) made it too long for many people to read (my mistake last time, I think) and b) biased the discussion I’m attempting to start here. I wanted to put just enough to persuade at least some people this is a real concern that deserves to have more than one person thinking about it.
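To illustrate the dynamical-systems intuition behind “restoring force”, here is a minimal numerical sketch (purely illustrative, not a model of real value dynamics): a pure random walk in a high-dimensional “value space” keeps drifting, with distance from the start growing roughly as the square root of the number of steps, while even a weak pull back toward the starting point keeps it bounded.

```python
# Toy sketch of the "restoring force" point (illustrative only, not a model of
# real value dynamics): compare a pure random walk in a high-dimensional
# "value space" with the same walk plus a weak pull back toward the origin.
import math
import random

DIM, STEPS, STEP_SIZE, PULL = 100, 1000, 0.01, 0.02

def drift_distance(restoring):
    """Distance from the starting point after STEPS steps, where each
    coordinate is also pulled back toward 0 by `restoring` per step."""
    x = [0.0] * DIM
    for _ in range(STEPS):
        for i in range(DIM):
            x[i] += random.gauss(0.0, STEP_SIZE) - restoring * x[i]
    return math.sqrt(sum(v * v for v in x))

random.seed(0)
print("no restoring force:  ", drift_distance(0.0))   # keeps growing, roughly ~ sqrt(STEPS)
print("weak restoring force:", drift_distance(PULL))  # settles at a bounded distance
```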
When I say “human values” I mean the values of all individual humans mixed together into some aggregate utility function. And when I talk about the values of an individual human I mean something similar, but not quite the same as when you say
When I say “human values”, I mean a very large amount of information that, combined, would let someone (such as an ASI) predict the preference ordering of an individual human over possible world-state outcomes, so as to be able to predict what they want (and then potentially aggregate this across many people, presumably all humans)
I think the values of an individual human should be thought of either as a preference ordering over world-histories (including the future), or maybe more intuitively as a preference ordering over world-states, if that human knew all the relevant facts (or just all the facts, if ‘relevant’ makes the characterization problematic in your mind).
I think this is an important distinction, because it makes the separation between terminal and instrumental values more clear. It’s kind of similar to value-functions and reward-functions in reinforcement learning.
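To spell out the analogy with a toy sketch (hypothetical states and numbers): in RL terms the reward function plays the role of terminal values and stays fixed, while the value function, which says how much each state is instrumentally worth, is derived from the reward function plus a world model and gets recomputed whenever the world model changes.

```python
# Toy sketch of the reward-function / value-function analogy (hypothetical
# states and numbers): the reward function ("terminal" values) is fixed; the
# value function (how much each state is worth instrumentally) is derived from
# it plus a world model, and changes when the believed world model changes.
GAMMA = 0.9
STATES = ["thirsty", "drinking", "hydrated"]
REWARD = {"thirsty": -1.0, "drinking": 0.5, "hydrated": 1.0}  # fixed "terminal" values

def value_function(model, iters=200):
    """Value iteration for a deterministic model: V(s) = R(s) + gamma * V(model[s])."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        v = {s: REWARD[s] + GAMMA * v[model[s]] for s in STATES}
    return v

# Believed world model A: drinking leads to being hydrated.
model_a = {"thirsty": "drinking", "drinking": "hydrated", "hydrated": "hydrated"}
# Believed world model B: it turns out the drink doesn't hydrate you at all.
model_b = {"thirsty": "drinking", "drinking": "thirsty", "hydrated": "hydrated"}

print(value_function(model_a))  # drinking is instrumentally very valuable
print(value_function(model_b))  # same rewards, but drinking is now worth much less
```

Same reward function in both runs; only the believed dynamics differ, so only the instrumental valuations change. That is the sense in which learning new facts moves instrumental values but not terminal ones.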
E.g.:
However, I’m deeply unclear on what, out of human values as I define them above, is actually a terminal goal or an instrumental goal, or a mix of the two. To give a rather trivial example, we have homeostatic circuits in us that try to maintain the correct level of blood glucose, salt, blood volume, etc. by giving us appropriate cravings. Our brain is wired to treat these as effectively terminal goals — they’re not conditional on anything, we continue caring about them even on our deathbed, asking for a final drink of water is not uncommon. Evolution, which is an optimizer but not a sapient agent, would (if it were a sapient agent) classify these as instrumental goals of our evolutionary fitness.
I think it’s clear to me that these are terminal goals. Or like, drinking water is instrumental to not feeling thirsty. I think it is unpleasant to be thirsty, so I don’t want to be thirsty. The fact that they play an instrumental role in what evolution is optimizing for doesn’t matter here. We’re talking about an individual human and what that human wants.
To me the distinction between instrumental and terminal goals is very clear.
If you have a goal, ask yourself why you want to achieve that goal. Then:
If the answer is “There’s no reason.” or “I just want it” ⇒ terminal goal
If the answer is some other object-level goal ⇒ it’s an instrumental goal
So for me the task of separating terminal and instrumental goals shouldn’t be that difficult. (In the sense that building a rocket is “not difficult”, that is.)
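As a toy sketch of that “ask why” rule (the goals and their stated reasons here are hypothetical, just to show the shape of the procedure):

```python
# Toy sketch of the "ask why" rule (hypothetical goals and reasons). Each goal
# maps to the further goal it serves, or to None for "I just want it".
why = {
    "drink water": "not feel thirsty",
    "not feel thirsty": None,             # "I just want it" => terminal
    "take a weight-loss drug": "be slimmer",
    "be slimmer": "be socially approved of",
    "be socially approved of": None,      # terminal
}

def classify(goal):
    """Apply the rule: no further reason => terminal, otherwise instrumental."""
    return "terminal" if why[goal] is None else "instrumental"

for goal in why:
    print(f"{goal}: {classify(goal)}")
```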
For the a/b and 1/2/3, I don’t think those matter very much. They’re explanations for why people have goals, and how they might change them.
But I think 1) we don’t care why we have goals, we just care about the goals themselves, and 2) we won’t want to change goals. If any of 1/2/3 lead to goals changing rapidly, that will be bad, and people will recognize that it is bad, and will not want to do them.
I understand the difference between a terminal and an instrumental goal. Logically, it makes a lot of sense to me. However, not all humans are Rationalists. Many of them have never really thought about this, or at least are a bit vague on it. Humans are also evolved, and evolution is clearly confused about this, or perhaps basically doesn’t trust instrumental goals because they actually require an organism to think and reach correct conclusions, so has wired in everything even slightly important (that wasn’t very conditional on stuff more complex than it can build instincts for) as a terminal goal, because it already had a circuit design for doing that which it could just reuse. So we ended up with a lot of terminal goals wired into us, some for less good reasons than others. Many of which most Rationalists would classify as obviously good instrumental goals. So that tends to confuse people about this more.
Ozempic is a very popular and profitable drug that down-regulates the hunger reflex, which is currently making Novo Nordisk a great deal of money. People on it feel full sooner, so get pleasure out of a fine meal for less long. People are paying all that money to alter their goal structure to reduce a goal that their body treats as terminal, and they are all doing it instrumentally, towards an actual goal of living longer and healthier so being able to get more done, and/or of being slimmer and thus more socially successful, in search of a mate or a job or the approval of their peers or whatever. (Novo Nordisk even market the same medication under two different brand names for these two different groups of terminal goals that one might take it instrumentally towards.) So I think it’s rather clear that people do sometimes want to alter their goal structure, and indeed many are willing to pay hundreds of dollars a month to do so, if they can afford that.
If it was possible for me to get rid of feeling thirsty, say by taking a drug that suppressed the feeling, then I generally wouldn’t do it. But the reason I wouldn’t do it isn’t because I inherently value the feeling of thirst — I don’t, it’s rather unpleasant. Drinking while thirsty feels good, but I still wouldn’t miss it: I’d rather not be thirsty in the first place. The reason why I wouldn’t take such a drug is actually that being aware that I need to drink when I’m thirsty is an instrumental goal of not dying of thirst, which would rather crimp all my other plans, so is an instrumental goal of just about everything. So I value thirst instrumentally, but not terminally. If there was a treatment that suppressed my thirst reflex and didn’t endanger or harm me, sure, I’d take that — who wouldn’t? But that is rather hard to do, without giving me a built-in saline drip, or something.
Consider an equally unpleasant sensation, say, hiccups, or yawning, which doesn’t fulfill any significant survival role (as far as I know: I’m happy to be corrected if anyone knows why evolution inflicted hiccups and yawning on us). If someone offered me a cheap genetic treatment that would permanently remove my hiccup reflex or my desire to yawn, and it had no other risks or deleterious side effects, then I’d probably take it. (Yawning is closer to a goal than hiccups, which are kind of involuntary, but I hope my point is clear here.) I don’t terminally value all the things my reflexes are treating as terminal goals. However, since evolution isn’t actually crazy, I do instrumentally value most of them.
Humans are not like AIXI. We do not automatically protect all our terminal goals. We appear if anything to be predisposed to what is generally called reflection: thinking rather carefully about our goals, whether they’re actually a good idea in the long term, and then attempting to change them if we come to the conclusion that they are not. There are limits to what’s feasible here, currently, but technology will change these. Turning down your hunger reflex didn’t used to be possible, and attempting to overcome it by sheer willpower by dieting is notoriously hard (I know, I’ve tried). Then Novo Nordisk changed that (and made a lot of money). What happens when technology allows all of our current goals to be edited, and even permanently and inheritably rewritten? I don’t know – I wish I did – but I’m very sure that “we don’t change any of them, the same way AIXI wouldn’t change its goals” is not an accurate statement.
Again, I’d prefer if you don’t bring up reflexes or evolution treating things as terminal goals, because I don’t think it’s relevant, and it causes confusion.
Wrt the Ozempic thing: it’s a bit complicated, but to a first approximation I treat all those as instrumental goals, and people as just doing an EV calculation. E.g., making it easier for them seemed to get higher utility long term at the cost of some utility now.
Side note: Unclear if AIXI would protect its own values due to embeddedness problems.
That evolution treats obviously instrumental goals as terminal in the construction of humans, and that this confuses humans, is part of my point: humans are often confused about this distinction, and having a body that is confused about this is one of the reasons. But it’s not an essential element, so let’s lay it aside.
So, let’s go with a more complex example. Most Christians have a terminal goal of becoming a better Christian (or at least, they say that it isn’t an instrumental goal of not wanting to go to Hell). That’s a terminal goal of adjusting your terminal goal structure to better fit a specific pattern. That’s, well, astonishingly similar to what Value Learning is trying to achieve. This is not an uncommon pattern; you can find it in basically every religion (often along with a backup reason to make it an instrumental goal of not wanting to be punished in some way). In fact, Richard Dawkins would probably argue that this was a necessary feature of a religion — but then he considers religions to be self-propagating memetic parasites of the human mind, and in that framework, it looks like a rather necessary feature. Regardless of that, the fact that this is not just possible, but common enough that most religious people, i.e. most people in the world, have at least a mild version of it tells us something about humanity.
On AIXI: yes, I was implicitly assuming an AIXI smart enough to realize that it was in fact embedded, or at least that there exists a causal path from messing with certain wires in its braincase to its future goal function and thus behavior. This seems a rather plausible assumption to me, but it does require that AIXI has learned a world model complex enough to start reliably making predictions like that. Having other AIXIs available to do experimental brain surgery on, or to observe the effects of an iron bar accidentally passing through their braincase in different locations, seems likely to be helpful in obtaining evidence that would cause those particular Bayesian updates.