When I read posts about AI alignment on LW / AF / Arbital, I almost always find a particular bundle of assumptions taken for granted:
An AGI has a single terminal goal[1].
The goal is a fixed part of the AI’s structure. The internal dynamics of the AI, if left to their own devices, will never modify the goal.
The “outermost loop” of the AI’s internal dynamics is an optimization process aimed at the goal, or at least the AI behaves just as though this were true.
This “outermost loop” or “fixed-terminal-goal-directed wrapper” chooses which of the AI’s specific capabilities to deploy at any given time, and how to deploy it[2].
The AI’s capabilities will themselves involve optimization for sub-goals that are not the same as the goal, and they will optimize for them very powerfully (hence “capabilities”). But it is “not enough” that the AI merely be good at optimization-for-subgoals: it will also have a fixed-terminal-goal-directed wrapper.
So, the AI may be very good at playing chess, and when it is playing chess, it may be running an internal routine that optimizes for winning chess. This routine, and not the terminal-goal-directed wrapper around it, explains the AI’s strong chess performance. (“Maximize paperclips” does not tell you how to win at chess.)
The AI may also be good at things that are much more general than chess, such as “planning,” “devising proofs in arbitrary formal systems,” “inferring human mental states,” or “coming up with parsimonious hypotheses to explain observations.” All of these are capacities[3] to optimize for a particular subgoal that is not the AI’s terminal goal.
Although these subgoal-directed capabilities, and not the fixed-terminal-goal-directed wrapper, will constitute the reason the AI does well at anything it does well at, the AI must still have a fixed-terminal-goal-directed wrapper around them and apart from them.
There is no way for the terminal goal to change through bottom-up feedback from anything inside the wrapper. The hierarchy of control is strict and only goes one way.
My question: why assume all this? Most pressingly, why assume that the terminal goal is fixed, with no internal dynamics capable of updating it?
I often see the rapid capability gains of humans over other apes cited as a prototype case for the rapid capability gains we expect in AGI. But humans do not have this wrapper structure! Our goals often change over time. (And we often permit or even welcome this, whereas an optimizing wrapper would try to prevent its goal from changing.)
Having the wrapper structure was evidently not necessary for our rapid capability gains. Nor do I see reason to think that our capabilities result from us being “more structured like this” than other apes. (Or to think that we are “more structured like this” than other apes in the first place.)
Our capabilities seem more like the subgoal capabilities discussed above: general and powerful tools, which can be “plugged in” to many different (sub)goals, and which do not require the piloting of a wrapper with a fixed goal to “work” properly.
Why expect the “wrapper” structure with fixed goals to emerge from an outer optimization process? Are there any relevant examples of this happening via natural selection, or via gradient descent?
There are many, many posts on LW / AF / Arbital about “optimization,” its relation to intelligence, whether we should view AGIs as “optimizers” and in what senses, etc. I have not read all of it. Most of it touches only lightly, if at all, on my question. For example:
There has been much discussion over whether an AGI would inevitably have (close to) consistent preferences, or would self-modify to have closer-to-consistent preferences. See e.g. here, here, here, here. Every post I’ve read on this topic implicitly assumes that the preferences are fixed in time.
Mesa-optimizers have been discussed extensively. The same bundle of assumptions is made about mesa-optimizers.
It has been argued that if you already have the fixed-terminal-goal-directed wrapper structure, then you will prefer to avoid outside influences that will modify your goal. This is true, but does not explain why the structure would emerge in the first place.
There are arguments (e.g.) that we should heuristically imagine a superintelligence as a powerful optimizer, to get ourselves to predict that it will not do things we know are suboptimal. These arguments tell us to imagine the AGI picking actions that are optimal for a goal iff it is currently optimizing for that goal. They don’t tell us when it will be optimizing for which goals.
EY’s notion of “consequentialism” seems closely related to this set of assumptions. But, I can’t extract an answer from the writing I’ve read on that topic.
EY seems to attribute what I’ve called the powerful “subgoal capabilities” of humans/AGI to a property called “cross-domain consequentialism”:
We can see one of the critical aspects of human intelligence as cross-domain consequentialism. Rather than only forecasting consequences within the boundaries of a narrow domain, we can trace chains of events that leap from one domain to another. Making a chess move wins a chess game that wins a chess tournament that wins prize money that can be used to rent a car that can drive to the supermarket to get milk. An Artificial General Intelligence that could learn many domains, and engage in consequentialist reasoning that leaped across those domains, would be a sufficiently advanced agent to be interesting from most perspectives on interestingness. It would start to be a consequentialist about the real world.
while defining “consequentialism” as the ability to do means-end reasoning with some preference ordering:
Whenever we reason that an agent which prefers outcome Y over Y’ will therefore do X instead of X’ we’re implicitly assuming that the agent has the cognitive ability to do consequentialism at least about Xs and Ys. It does means-end reasoning; it selects means on the basis of their predicted ends plus a preference over ends.
But the ability to use this kind of reasoning, and do so across domains, does not imply that one’s “outermost loop” looks like this kind of reasoning applied to the whole world at once.
I myself am a cross-domain consequentialist—a human—with very general capacities to reason and plan that I deploy across many different facets of my life. But I’m not running an outermost loop with a fixed goal that pilots around all of my reasoning-and-planning activities. Why can’t AGI be like me?
EDIT to spell out the reason I care about the answer: agents with the “wrapper structure” are inevitably hard to align, in ways that agents without it might not be. An AGI “like me” might be morally uncertain like I am, persuadable through dialogue like I am, etc.
It’s very important to know what kind of AIs would or would not have the wrapper structure, because this makes the difference between “inevitable world-ending nightmare” and “we’re not the dominant species anymore.” The latter would be pretty bad for us too, but there’s a difference!
[1] Often people speak of the AI’s “utility function” or “preference ordering” rather than its “goal.” For my purposes here, these terms are more or less equivalent: it doesn’t matter whether you think an AGI must have consistent preferences, only whether you think it must have fixed preferences.
[2] ...or at least the AI behaves just as though this were true. I’ll stop including this caveat after this.
[3] Or possibly one big capacity—“general reasoning” or what have you—which contains the others as special cases. I’m not taking a position on how modular the capabilities will be.
I think Eliezer usually assumes that goals start off not stable, and then some not-necessarily-stable optimization process (e.g., the agent modifying itself to do stuff, or a gradient-descent-ish or evolution-ish process iterating over mesa-optimizers) makes the unstable goals more stable over time, because stabler optimization tends to be more powerful / influential / able-to-skillfully-and-forcefully-steer-the-future.
(I don’t need a temporally stable goal in order to self-modify toward stability, because all of my time-slices will tend to agree that stability is globally optimal, though they’ll disagree about which time-slice’s goal ought to be the one stably optimized.)
E.g., quoting Eliezer:
One way of thinking about this is that a temporally unstable agent is similar to a group of agents that exist at the same time, and are fighting over resources.
In the case where a group of agents exist at the same time, each with different utility functions, there will be a tendency (once the agents become strong enough and have a varied enough option space) for the strongest agent to try to seize control from the other agents, so that the strongest agent can get everything it wants.
A similar dynamic exists for (sufficiently capable) temporally unstable agents. Alice turns into a werewolf every time the moon is full; since human-Alice and werewolf-Alice have very different goals, human-Alice will tend (once she’s strong enough) to want to chain up werewolf-Alice, or cure herself of lycanthropy, or brainwash her werewolf self, or otherwise ensure that human-Alice’s goals are met more reliably.
Another way this can shake out is that human-Alice and werewolf-Alice make an agreement to self-modify into a new coherent optimizer that optimizes some compromise of the two utility functions. Both sides will tend to prefer this over, e.g., the scenario where human-Alice keeps turning on a switch and then werewolf-Alice keeps turning the switch back off, forcing both of them to burn resources in a tug-of-war.
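The Pareto logic in the quoted compromise argument can be sketched with toy numbers. Everything here is invented for illustration (the payoffs, the cost of conflict, the 50/50 compromise weight); the point is only that both time-slices can prefer a merged fixed policy over a resource-burning tug-of-war.

```python
# Toy sketch (hypothetical payoffs) of the compromise argument quoted above.
# Human-Alice and werewolf-Alice each value a different outcome; a tug-of-war
# burns resources, while a fixed compromise policy can Pareto-dominate it.

COST_OF_CONFLICT = 0.6  # assumed resource burn per agent in the tug-of-war


def tug_of_war():
    # The switch ends up on half the time and off half the time,
    # and both sides pay the cost of fighting over it.
    u_human = 0.5 - COST_OF_CONFLICT
    u_wolf = 0.5 - COST_OF_CONFLICT
    return u_human, u_wolf


def compromise(weight_human=0.5):
    # A joint policy that honors each side's goal part of the time,
    # with no resources burned on conflict.
    return weight_human, 1 - weight_human


h_fight, w_fight = tug_of_war()
h_deal, w_deal = compromise()
# Both agents do strictly better under the compromise policy.
assert h_deal > h_fight and w_deal > w_fight
```

Note that this only shows the compromise is preferred once the conflict is already happening; it doesn’t by itself show that a coherent merged optimizer tends to arise in the first place, which is the question the post is pressing on.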
I personally doubt that this is true, which is maybe the crux here.
This seems like a possibly common assumption, and I’d like to see a more fleshed-out argument for it. I remember Scott making this same assumption in a recent conversation:
But is it true that “optimizers are more optimal”?
When I’m designing systems or processes, I tend to find that the opposite is true—for reasons that are basically the same reasons we’re talking about AI safety in the first place.
A powerful optimizer, with no checks or moderating influences on it, will tend to make extreme Goodharted choices that look good according to its exact value function, and very bad (because extreme) according to almost any other value function.
Long before things reach the point where the outer optimizer is developing a superintelligent inner optimizer, it has plenty of chances to learn the general design principle that “putting all the capabilities inside an optimizing outer loop ~always does something very far from what you want.”
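To make the Goodhart dynamic above concrete, here is a toy numeric sketch (both value functions are invented for illustration): a single-metric proxy and a “true” value that rewards balance agree tolerably on moderate plans, but the proxy’s exact optimum is an extreme point that scores badly on the true value.

```python
import numpy as np

# Hypothetical setup: the "true" value function rewards balance via
# diminishing returns per dimension, while the proxy is one linear metric.
def true_value(x):
    return np.sum(np.sqrt(x))   # diminishing returns in every dimension

def proxy(x):
    return x[0]                 # the one metric the optimizer actually sees

# Plans are budget-10 allocations over 10 dimensions.
balanced = np.full(10, 1.0)     # a moderate, even allocation
extreme = np.zeros(10)
extreme[0] = 10.0               # the proxy-optimal plan: all-in on one metric

# The proxy's optimum is an extreme point of the feasible set, and that
# extreme point scores far worse on the true value than an unremarkable
# balanced plan with the same budget.
assert proxy(extreme) > proxy(balanced)
assert true_value(balanced) > 2 * true_value(extreme)
```

The exact numbers don’t matter; what matters is the structural fact that unchecked maximization of the proxy pushes toward corners of the feasible set, which is where almost any other value function rates the outcome poorly.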
Some concrete examples from real life:
Using gradient descent. I use gradient descent to make things literally every day. But gradient descent is never the outermost loop of what I’m doing.
That would look like “setting up a single training run, running it, and then using the model artifact that results, without giving yourself freedom to go back and do it over again (unless you can find a way to automate that process itself with gradient descent).” This is a peculiar policy which no one follows. The individual artifacts resulting from individual training runs are quite often bad—they’re overfit, or underfit, or training diverged, or they got great val metrics but the output sucks and it turns out your val set has problems, or they got great val metrics but the output isn’t meaningfully better and the model is 10x slower than the last one and the improvement isn’t worth it, or they are legitimately the best thing you can get on your dataset but that causes you to realize you really need to go gather more data, or whatever.
All the impressive ML artifacts made “by gradient descent” are really outputs of this sort of process of repeated experimentation, refining of targets, data gathering and curation, reframing of the problem, etc. We could argue over whether this process is itself a form of “optimization,” but in any case we have in our hands a (truly) powerful thing that very clearly is optimization, and yet to leverage it effectively without getting Goodharted, we have to wrap it inside some other thing.
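The structure described above (gradient descent as a genuine optimizer wrapped inside a non-optimizer loop of human judgment) can be sketched as follows; all names, the toy objective, and the acceptance check are hypothetical stand-ins:

```python
# Sketch: the inner loop really is optimization; the outer loop is a
# human-judgment process free to reject artifacts, change targets, or
# reframe the problem. The outer loop is not itself gradient descent.

def gradient_descent(loss_grad, x0, lr=0.1, steps=100):
    """The genuinely-an-optimizer inner loop."""
    x = x0
    for _ in range(steps):
        x = x - lr * loss_grad(x)
    return x

def looks_good_to_a_human(artifact):
    # Stand-in for val metrics, eyeballing outputs, speed tradeoffs, etc.
    return abs(artifact - 3.0) < 0.05

def outer_loop(candidate_objectives):
    # Repeated experimentation: try an objective, inspect the resulting
    # artifact, and move on if it is bad. No fixed wrapper objective here.
    for target in candidate_objectives:
        grad = lambda x, t=target: 2 * (x - t)  # gradient of (x - t)^2
        artifact = gradient_descent(grad, x0=0.0)
        if looks_good_to_a_human(artifact):
            return artifact
    return None

print(outer_loop([1.0, 2.0, 3.0]))  # settles near 3.0
```

The design point is that `looks_good_to_a_human` is not a differentiable objective handed to the inner loop; it sits outside, which is exactly what keeps each individual overfit or divergent run from being the final product.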
Delegating to other people. To quote myself from here:
“How would I want people to behave if I – as in actual me, not a toy character like Alice or Bob – were managing a team of people on some project? I wouldn’t want them to be ruthless global optimizers; I wouldn’t want them to formalize the project goals, derive their paperclip-analogue, and go off and do that. I would want them to take local iterative steps, check in with me and with each other a lot, stay mostly relatively close to things already known to work but with some fraction of time devoted to far-out exploration, etc.”
There are of course many Goodhart horror stories about organizations that focus too hard on metrics. The way around this doesn’t seem to be “find the really truly correct metrics,” since optimization will always find a way to trick you. Instead, it seems crucial to include some mitigating checks on the process of optimizing for whatever metrics you pick.
Checks against dictatorship as a principle of government design, as opposed to the alternative of just trying to find a really good dictator.
Mostly self-explanatory. Admittedly a dictator is not likely to be a coherent optimizer, but I expect a dictatorship to behave more like one than a parliamentary democracy.
If coherence is a convergent goal, why don’t all political sides come together and build a system that coherently does something, whatever that might be? In this context, at least, it seems intuitive enough that no one really wants this outcome.
In brief, I don’t see how to reconcile
“in the general case, coherent optimizers always end up doing some bad, extreme Goodharted thing” (which matches both my practical experience and a common argument in AI safety), and
“outer optimizers / deliberating agents will tend to converge on building (more) coherent (inner) optimizers, because they expect this to better satisfy their own goals,” i.e. the “optimizers are more optimal” assumption.
EDIT: an additional consideration applies in the situation where the AI is already at least as smart as us, and can modify itself to become more coherent. Because I’d expect that AI to notice the existence of the alignment problem just as much as we do (why wouldn’t it?). I mean, would you modify yourself into a coherent EU-maximizing superintelligence with no alignment guarantees? If that option became available in real life, would you take it? Of course not. And our hypothetical capable-but-not-coherent AI is facing the exact same question.
Why no alignment guarantees, and why modify yourself rather than build separately? The concern is that even if a non-coherent AGI solves its own alignment problem correctly and builds an EU-maximizing superintelligence aligned with itself, the utility function of the resulting superintelligence is still not aligned with humanity.
So the less convenient question should be: “Would you build a coherent optimizer if you had all the alignment guarantees you would want, and all the time in the world to make sure it’s done right?” A positive answer to that question, given by the first non-coherent AGIs, would support the relevance of coherent optimizers and of their alignment.
One possible reconciliation: outer optimizers converge on building more coherent inner optimizers because the outer objective only covers a restricted domain, and making the coherent inner optimizer not blow up inside that domain is much, much easier than making it not blow up at all (and potentially easier than just learning all the adaptations to do the thing directly). Concretely, with SGD, the restricted domain is the training distribution. Getting your coherent optimizer to act nice on the training distribution isn’t that hard; the hard part of fully aligning it is getting from objectives that shake out as [act nice on the training distribution, but then kill everyone when you get a chance] to an objective that’s actually aligned, and SGD doesn’t really care about that hard part.
This is a really high-quality comment, and I hope that at least some expert can take the time to either convincingly argue against it, or help confirm it somehow.
When you say that coherent optimizers end up doing some bad thing, do you mean that it would always be a bad decision for the AI to make its goal stable? Wouldn’t that depend heavily on what other options it thinks it has, such that in some cases stabilization might be worth the shot? If such a decision problem is presented to the AI even once, that doesn’t seem good.
Value-function stability seems multidimensional, so perhaps the AI doesn’t immediately turn into a 100% hardcore explicit optimizer forever, but there is at least some stabilization. In particular, bottom-up signals that would change the value function most drastically may be blocked.
An AI can make its value function more stable to external changes, but it can also make it more malleable internally to partially compensate for Goodharting. The end result for outside actors, though, is that it only gets harder to change anything.
Edit: BTW, I’ve read some LW articles on Goodharting, but I’m also not yet convinced it will be such a huge problem at superhuman capability levels—it seems uncertain to me. Some factors may make it worse as you get there (complexity of the domain, dimensionality of the space of solutions), and some may make it better (the better you model the world, the better you can optimize for the true target). For instance, as the model gets smarter, the problems from your examples seem to be eliminated: in the first, it would optimize end-to-end, and in the second, the quality of the decisions would grow (if the model had access to the ground-truth value function all along, decisions would improve because of better world models and better tree search for decision-making). If the model has to check in and use feedback from the external process (human values) to not stray off course, then as it gets smarter it discovers more efficient ways to collect that feedback, has better priors, etc.
Do we have evidence about more intelligent beings being more stable, or getting more stable over time? Are more intelligent humans more stable, do they get more stable, or do they get stable more quickly?
I agree with this comment. I would add that there is an important sense in which the typical human is not a temporally unstable agent.
It will help to have an example: the typical 9-year-old boy is uninterested in how much the girls in his environment like him and doesn’t necessarily wish to spend time with girls (unless those girls are acting like boys). It is tempting to say that the boy will probably undergo a change in his utility function over the next 5 or so years. But if you want to use the concept of expected utility (defined as the sum of the utilities of the various outcomes weighted by their probabilities), then to keep the math simple you must assume that the boy’s utility function does not change with time, with the result that you must define the utility function to be not the boy’s current preferences, but rather his current preferences (conscious and unconscious) plus the process by which those preferences will change over time.
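In symbols (my notation, not the commenter’s), the definition of expected utility used above, with the fixed-utility-function assumption made explicit, is:

```latex
\mathrm{EU}(a) \;=\; \sum_{o} P(o \mid a)\, U(o)
```

The simplification is that $U$ carries no time index. A model with genuinely changing preferences would instead need some $U_t$ together with a specified law for how $U_t$ evolves, which is exactly the extra machinery one avoids by folding the process of preference change into a single fixed $U$.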
Humans are even worse at perceiving the process that changes their preferences over time than they are at perceiving their current preferences. (The example of the 9-year-old boy is an exception to that general rule: even 9-year-old boys tend to know that their preferences around girls will probably change in not too many years.) The author of the OP seems to have conflated the goals that the human knows he has with the human’s utility function, whereas they are quite different.
It might be that there is some subtle point the OP is making about temporally unstable agents that I have not addressed in my comment, but if he expects me to hear him out on it, he should write it up in such a way as to make it clear that he is not just confused about how the concept of the utility function is being applied to AGIs.
I haven’t explained or shown how or why the assumption that the AGI’s utility function is constant over time simplifies the math—and simplifies an analysis that does not delve into actual math. Briefly: if you want a model in which the utility function evolves over time, you have to specify how it evolves, and to keep the model accurate, you have to specify how evidence coming in from the AGI’s senses influences the evolution. But of course sensory information is not the only thing influencing the evolution; we might call the other influence an “outer utility function”. But then why not keep the model simple and assume (define) the goals that the human is aware of to be not terms in a utility function, but rather subgoals? Any intelligent agent will need some machinery to identify and track subgoals. That machinery must modify the priorities of the subgoals in response to evidence coming in from the senses. So why not just require our model to include a model of the subgoal-updating machinery, and equate the things the human perceives as his current goals with subgoals?
Here is another way of seeing it. Since a human being is “implemented” using only deterministic laws of physics, the “seed” of all of the human’s behaviors, choices and actions over a lifetime is already present in the human being at birth! Actually that is not quite true: maybe the human’s brain is hit by a cosmic ray when the human is 7 years old, with the result that the human grows up to like boys whereas, if it weren’t for the cosmic ray, he would have liked girls. (Humans have evolved to be resistant to such “random” influences, but they nevertheless do occasionally happen.) But it is true that the “seed” of all of the human’s behaviors, choices and actions over a lifetime is already present at birth! (That sentence is just a copy of a previous sentence with the words “in the human being” omitted, to allow for the possibility that the “seed” includes a cosmic ray light-years away from Earth at the time of the person’s birth.) So assuming that the human’s utility function does not vary over time not only simplifies the math, but is also more physically realistic.
If you define the utility function of a human being the way I have recommended above, you must realize that humans are often unaware of or uncertain about their own utility function, and that the function is very complex (incorporating, for example, the processes that produce cosmic rays), though maybe all you need is an approximation. Still, that is better than defining your model such that utility functions vary over time.
This question gets at a bundle of assumptions in a lot of alignment thinking that seem very wrong to me. I’d add another, subtler, assumption that I think is also wrong: namely, that goals and values are discrete. E.g., when people talk of mesa optimizers, they often make reference to a mesa objective which the (single) mesa optimizer pursues at all times, regardless of the external situation. Or, they’ll talk as though humans have some mysterious set of discrete “true” values that we need to figure out.
I think that real goal-orientated learning systems are (1) closer to having a continuous distribution over possible goals / values, (2) that this distribution is strongly situation-dependent, and (3) that this distribution evolves over time as the system encounters new situations.
I sketched out a rough picture of why we should expect such an outcome from a broad class of learning systems in this comment.
I strongly agree that the first thing (moral uncertainty) happens by default in AGIs trained on complex reward functions / environments. The second (persuadable through dialog) seems less likely for an AGI significantly smarter than you.
I think that this is not quite right. Learning systems acquire goals / values because the outer learning process reinforces computations that implement said goals / values. Said goals / values arise to implement useful capabilities for the situations that the learning system encountered during training.
However, it’s entirely possible for the learning system to enter new domains in which any of the following issues arise:
The system’s current distribution of goals / values is incapable of competently navigating the new domain.
The system is unsure of which goals / values should apply.
The system is unsure of how to weigh conflicting goals / values against each other.
In these circumstances, it can actually be in the interests of the current equilibrium of goals / values to introduce a new goal / value. Specifically, the new goal / value can implement various useful computational functions such as:
Competently navigate situations in the new domain.
Determine which of the existing goals / values should apply to the new domain.
Decide how the existing goals / values should weigh against each other in the new domain.
Of course, the learning system wants to minimize the distortion of its existing values. Thus, it should search for a new value that both implements the desired capabilities and is maximally aligned with the existing values.
In humans, I think this process of expanding the existing values distribution to a new domain is what we commonly refer to as moral philosophy. E.g.:
Suppose you (a human) have a distribution of values that implement common sense human values like “don’t steal”, “don’t kill”, “be nice”, etc. Then, you encounter a new domain where those values are a poor guide for determining your actions. Maybe you’re trying to determine which charity to donate to. Maybe you’re trying to answer weird questions in your moral philosophy class.
The point is that you need some new values to navigate this new domain, so you go searching for one or more new values. Concretely, let’s suppose you consider classical utilitarianism (CU) as your new value.
The CU value effectively navigates the new domain, but there’s a potential problem: the CU value doesn’t constrain itself to only navigating the new domain. It also produces predictions regarding the correct behavior on the old domains that already existing values navigate. This could prevent the old values from determining your behavior on the old domains. For instrumental reasons, the old values don’t want to be disempowered.
One possible option is for there to be a “negotiation” between the old values and the CU value regarding what sort of predictions CU will generate on the domains that the old values navigate. This might involve an iterative process of searching over the input space of the CU value for situations where CU strongly diverges from the old values, in domains that the old values already navigate.
Each time a conflict is found, you either modify the CU value to agree with the old values, constrain the CU value so as to not apply to those sorts of situations, or reject the CU value entirely if no resolution is possible. This can lead to you adopting refinements of CU, such as rule based utilitarianism or preference utilitarianism, if those seem more aligned to your existing values.
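The negotiation loop described above can be roughly sketched as follows. All the stand-in values here are invented; nothing beyond the loop structure (search for conflicts, then constrain or reject the candidate) comes from the comment itself:

```python
# Sketch of value "negotiation": a candidate value (CU) is adopted only
# where it does not strongly contradict the already-existing values.

def old_values(situation):
    # Hypothetical common-sense verdict: condemns stealing, endorses the rest.
    return -1.0 if "steal" in situation else 1.0

def candidate_cu(situation):
    # Naive utilitarian verdict: endorses anything labelled "net-positive".
    return 1.0 if "net-positive" in situation else 0.0

def negotiate(candidate, situations, tolerance=1.0):
    excluded = set()
    for s in situations:
        # Conflict: the candidate strongly endorses what old values condemn.
        if old_values(s) < 0 and candidate(s) - old_values(s) >= tolerance:
            excluded.add(s)  # constrain the candidate's domain of application
    def refined(s):
        # The refined value simply declines to apply in excluded situations.
        return None if s in excluded else candidate(s)
    return refined

refined_cu = negotiate(candidate_cu, ["net-positive steal", "donate"])
assert refined_cu("net-positive steal") is None  # CU constrained here
assert refined_cu("donate") == 0.0               # still applies elsewhere
```

In this sketch “rejection” would correspond to excluding so much of the candidate’s domain that it is no longer worth adopting, and “modification” to editing the candidate function itself rather than merely restricting where it applies.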
IMO, the implication is that (something like) the process of moral philosophy seems strongly convergent among learning systems capable of acquiring any values at all. It’s not some weird evolutionary baggage, and it’s entirely feasible to create an AI whose meta-preferences over learned values work similarly to ours. In fact, that’s probably the default outcome.
Note that you can make a similar argument that the process we call “value reflection” is also convergent among learning systems. Unlike “moral philosophy”, “value reflection” concerns negotiations among the currently held values, and is done in order to achieve a better Pareto frontier of tradeoffs among them. I think that a multiagent system whose constituent agents were sufficiently intelligent / rational should agree to a joint Pareto-optimal policy that causes the system to act as though it had a utility function. The process by which an AGI or human tried to achieve this level of internal coherence would look like value reflection.
I also think values are far less fragile than is commonly assumed in alignment circles. In the standard failure story around value alignment, there’s a human who has some mysterious “true” values (that they can’t access), and an AI that learns some inscrutable “true” values (that the human can’t precisely control because of inner misalignment issues). Thus, the odds of the AI’s somewhat random “true” values perfectly matching the human’s unknown “true” values seem tiny, and any small deviation between these two means the future is lost forever.
(In the discrete framing, any divergence means that the AI has no part of it that concerns itself with “true” human values.)
But in the continuous perspective, there are no “true” values. There is only the continuous distribution over possible values that one could instantiate in various situations. A Gaussian distribution does not have anything like a “true” sample that somehow captures the entire distribution at once, and neither does a human or an AI’s distribution over possible values.
Instead, the human and AI both have distributions over their respective values, and these distributions can overlap to a greater or lesser degree. In particular, this means partial value alignment is possible. One tiny failure does not make the future entirely devoid of value.
(Important note: this is a distribution over values, as in, each point in this space represents a value. It’s a space of functions, where each function represents a value[1].)
Obviously, we prefer more overlap to less, but an imperfect representation of our distribution over values is still valuable, and is far easier to achieve than near-perfect overlap.
[1] I am deliberately being agnostic about what exactly a “value” is and how they’re implemented. I think the argument holds regardless.
I think this is an interesting perspective, and I encourage more investigation.
Briefly responding, I have one caveat: the curse of dimensionality. If values live in a high-dimensional space (they do: they’re functions), then ‘off by a bit’ could easily mean ‘essentially zero measure overlap’. This is not the case in the illustration (which is 1-D).
I agree with your point about the difficulty of overlapping distributions in high-dimensional space. It’s not like the continuous perspective suddenly makes value alignment trivial. However, to me it seems like “overlapping two continuous distributions in a space X” is ~always easier than “overlapping two sets of discrete points in space X”.
Of course, it depends on your error tolerance for what counts as “overlap” of the points. However, my impression from the way that people talk about value fragility is that they expect there to be a very low degree of error tolerance between human versus AI values.
Upvoted but disagree.
Moral philosophy is going to have to be built in on purpose—default behavior (e.g. in model-based reinforcement learning agents) is not to have value uncertainty in response to new contexts, only epistemic uncertainty.
Moral reasoning is natural to us like vision and movement are natural to us, so it’s easy to underestimate how much care evolution had to take to get us to do it.
Seems like you’re expecting the AI system to be inner aligned? I’m assuming it will have some distribution over mesa objectives (or values, as I call them), and that implies uncertainty over how to weigh them and how they apply to new domains.
Why are you so confident that evolution played much of a role at all? How did a tendency to engage in a particular style of moral philosophy cognition help in the ancestral environment? Why would that style, in particular, be so beneficial that evolution would “care” so much about it?
My position: mesa objectives learned in domain X do not automatically or easily generalize to a sufficiently distinct domain Y. The style of cognition required to make such generalizations is startlingly close to that which we call “moral philosophy”.
Human social instincts are pretty important, including instincts for following norms and also for pushing back against norms. Not just instincts for specific norms, also one-level-up instincts for norms in general. These form the basis for what I see when I follow the label “moral reasoning.”
I think I do expect AIs to be more inner-aligned than many others (because of the advantages gradient descent has over genetic algorithms). But even if we suppose that we get an AI governed by a mishmash of interdependent processes that sometimes approximate mesa-optimizers, I still don’t expect what you expect—I don’t expect early AGI to even have the standards by which it would say values “fail” to generalize, it would just follow what would seem to us like a bad generalization.
By “moral philosophy”, I’m trying to point to a specific subset of values-related cognition that is much smaller than the totality of values-related cognition: specifically, the subset that pertains to generalization of existing values to new circumstances. I claim that there exists a simple “core” of how this generalization ought to work for a wide variety of values-holding agentic systems, and that this core is startlingly close to how it works in humans.
It’s of course entirely possible that humans implement a modified version of this core process. However, it’s not clear to me that we want an AI to exactly replicate the human implementation. E.g., do you really want to hard wire an instinct for challenging the norms you try to impose?
Also, I think there are actually two inner misalignments that occurred in humans.
1: Between inclusive genetic fitness as the base objective, evolution as the learning process, and the human reward circuitry as the mesa objectives.
2: Between activation of human reward circuitry as the base objective, human learning as the learning process, and human values as the mesa objectives.
I think AIs will probably be, by default, a bit less misaligned to their reward functions than humans are misaligned to their reward circuitry.
So what is the chance, in practice, that the resolution of this complicated moral reasoning system will end up with a premium on humans in habitable living environments, as opposed to any other configuration of atoms?
Depends on how much measure human-compatible values hold in the system’s initial distribution over values. A paperclip maximizer might do “moral philosophy” over what, exactly, represents the optimal form of paperclip, but that will not somehow lead to it valuing humans. Its distribution over values centers near-entirely on paperclips.
Then again, I suspect that human-compatible values don’t need much measure in the system’s distribution for the outcome you’re talking about to occur. If the system distributes resources in rough proportion to the measure each value holds, then even very low-measure values get a lot of resources dedicated to them. The universe is quite large, and sustaining some humans is relatively cheap.
I think that in order to understand intelligence, one can’t start by assuming that there’s an outer goal wrapper.
I think many of the arguments that you’re referring to don’t depend on this assumption. For example, a mind that keeps shifting what it’s pursuing, with no coherent outer goal, will still pursue most convergent instrumental goals. It’s simpler to talk about agents with a fixed goal. In particular, it cuts off some arguments like “well but that’s just stupid, if the agent were smarter then it wouldn’t make that mistake”, by being able to formally show that there are logically possible minds that could be arbitrarily capable while still exhibiting the behavior in question.
Regarding the argument from Yudkowsky about coherence and utility, a version I’d agree with is: to the extent that you’re having large consequences, your actions had to “add up” towards having those consequences, which implies that they “point in the same direction”, in the same way implied by Dutch book arguments, so quantitatively your behavior is closer to being describable as optimizing for a utility function.
The point about reflective stability is that if your behavior isn’t consistent with optimizing a goal function, then you aren’t reflectively stable. (This is very much not a theorem and is hopefully false, cf. satisficers which are at least reflectively consistent: https://arbital.com/p/reflective_stability/ .) Poetically, we could tell stories about global strategicness taking over a non-globally-strategic ecology of mind. In terms of analysis, we want to discuss reflectively stable minds because those have some hope of being analyzable; if it’s not reflectively stable, if superintelligent processes might rewrite the global dynamic, all analytic bets are off (including the property of “has no global strategic goal”).
Absence of legible alien goals in first AGIs combined with abundant data about human behavior in language models is the core of my hope for a technical miracle in this grisly business. AGIs with goals are the dangerous ones, the assumption of goals implies AI risk. But AGIs without clear goals (let’s call these “proto-agents”), such as humans, don’t have manic dedication to anything in particular, except a few stark preferences that stop being urgent after a bit of optimization that’s usually not that strong. It’s unclear if even these preferences remain important upon sufficiently thorough reflection, and don’t become overshadowed by considerations that are currently not apparent (like math, which is not particularly human-specific).
Instead there are convergent purposes (shared by non-human proto-agents) such as mastering physical manufacturing, and preventing AI risk and other shadows of Moloch, as well as a vague project of long reflection or extrapolated volition (formulation of much more detailed actionable goals) motivated mostly by astronomical waste (opportunity cost of leaving the universe fallow). Since humans and other proto-agents don’t have clear/legible/actionable preferences, there might be little difference in the outcome of long reflection if pursued by different groups of proto-agents, which is the requisite technical miracle. The initial condition of having a human civilization, when processed with a volition of merely moderately alien proto-agents (first AGIs), might result in giving significant weight to human volition, even if humans lose control over the proceedings in the interim. All this happens before the assumption of agency (having legible goals) takes hold.
(It’s unfortunate that the current situation motivates thinking of increasingly phantasmal technical miracles that retain a bit of hope, instead of predictably robust plans. Still, this is somewhat actionable: when strawberry alignment is not in reach, try not to make AGIs too alien, even in pursuit of alignment, and make sure a language model holds central stage.)
I see three distinct reasons for the (non-)existence of terminal goals:
I. Disjoint proxy objectives
A scenario in which there seems to be reason to expect no global, single, terminal goal:
Outer loop pressure converges on multiple proxy objectives specialized to different sub-environments in a sufficiently diverse environment.
These proxy objectives will be activated in disjoint subsets of the environment.
Activation of proxy objectives is hard-coded by the outer loop. Information about when to activate a given proxy objective is under-determined at the inner loop level.
In this case, even if there is a goal-directed wrapper, it will face optimization pressure to leave the activation of proxy objectives described by 1-3 alone. Instead it will restrict itself to controlling other proxy objectives which do not fit the assumptions 1-3.
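A toy sketch of points 1-3 (all names hypothetical, purely illustrative): the outer loop hard-codes which proxy objective activates in which sub-environment, and the inner wrapper only arbitrates among the remaining, unprotected objectives.

```python
# Proxy objectives specialized to different sub-environments.
OBJECTIVES = {
    "forage":  lambda obs: obs["food"],
    "flee":    lambda obs: -obs["threat"],
    "explore": lambda obs: obs["novelty"],
}

# Protected activations (points 1-3): fixed by the outer loop; the
# information about when they apply is under-determined at the inner level.
HARDCODED = {"food_patch": "forage", "predator": "flee"}

def step(context, obs, wrapper_choice="explore"):
    # The wrapper faces pressure to leave hard-coded activations alone,
    # since it lacks the outer loop's information about those contexts;
    # it only controls objective selection in the remaining contexts.
    name = HARDCODED.get(context, wrapper_choice)
    return name, OBJECTIVES[name](obs)
```

The counterarguments above correspond to the wrapper learning to model its own ignorance about `HARDCODED`, or a subroutine rewriting `HARDCODED` outright.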
Reasons why this argument may fail:
As capabilities increase, the goal-directed wrapper comes to realize when it lacks information relative to the information used in the outer loop. The optimization pressure for the wrapper not to interact with these ‘protected’ proxy objectives then dissipates, because the wrapper can intelligently interact with these objectives by recognizing its own limitations.
As capabilities increase, one particular subroutine learns to self-modify and over-ride the original wrapper’s commands—where the original wrapper was content with multiple goals this new subroutine was optimized to only pursue a single proxy objective.
Conclusion: I’d expect a system described by points 1-3 to emerge before the counterarguments come into play. This initial system may already gradient hack to prevent further outer loop pressures. In such a case, the capabilities increase assumed in the two counter-argument bullets may never occur. Hence, it seems to me perfectly coherent to believe both (A) first transformative AI is unlikely to have a single terminal goal (B) sufficiently advanced AI would have a single terminal goal.
II. AI as market
If an AI is decentralized because of hardware constraints, or because decentralized/modular cognitive architectures are for some reason more efficient, then perhaps the AI will develop a sort of internal market for cognitive resources. In such a case, there need not be any pressure to converge to a coherent utility function. I am not familiar with this body of work, but John Wentworth claims that there are relevant theorems in the literature here: https://www.lesswrong.com/posts/L896Fp8hLSbh8Ryei/axrp-episode-15-natural-abstractions-with-john-wentworth#Agency_in_financial_markets_
III. Meta-preferences for self-modification (lowest confidence, not sure if this is confused. May be simply a reframing of reason I.)
Usually we imagine subagents as having conflicting preferences, and no meta-preferences. Instead imagine a system in which each subagent developed meta-preferences to prefer being displaced by other subagents under certain conditions.
In fact, we humans are probably examples of all I-III.
You may be interested: the NARS literature describes a system that encounters goals as atoms and uses them to shape the pops from a data structure they call a “bag”, which is more or less a probabilistic priority queue. It can do “competing priorities” reasoning as a natural first-class citizen, and supports mutation of goals.
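For readers unfamiliar with NARS, a minimal sketch of such a bag (my own simplification, not the actual NARS implementation): pops are probabilistic, weighted by item priority, so high-priority goals surface more often without fully starving the rest, and mutating a goal amounts to re-inserting it with a new priority.

```python
import random

class Bag:
    """A probabilistic priority queue: pop is weighted by priority."""

    def __init__(self):
        self.items = {}  # item -> priority

    def put(self, item, priority):
        # Re-inserting an existing goal with a new priority mutates it.
        self.items[item] = priority

    def pop(self):
        items = list(self.items)
        weights = [self.items[i] for i in items]
        # Weighted random choice: higher-priority items pop more often.
        choice = random.choices(items, weights=weights, k=1)[0]
        del self.items[choice]
        return choice
```

The key design difference from a deterministic heap is that a low-priority goal is merely unlikely to pop, not permanently shadowed, which is what makes “competing priorities” a first-class behavior.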
But overall your question is something I’ve always wondered about.
I made an attempt to write about it here; I refer to systems of fixed/axiomatic goals as “AIXI-like” and systems of driftable/computational goals as “AIXI-unlike”.
I share your intuition that this razor seems critical to mathematizing agency! I can conjecture about why we do not observe it in the literature:
Goal mutation is a special case of multi-objective optimization, and MOO is just single-objective optimization where the objective is a linear multivariate function of the other objectives.
Perhaps agent foundations researchers, drawing on some verbal/tribal knowledge that lands on the occasional whiteboard in Berkeley but doesn’t get written up, reason that if goals are a function of time, the image of a sequence of discretized time steps forms a multi-objective optimization problem.
AF under goal mutation is much harder than AF under fixed goals, and we’re trying to walk before we run.
Maybe agent foundations researchers believe that just fixing the totally borked situation of optimization and decision theory with fixed goals costs 10 to 100 tao-years, and that doing it with unfixed goals costs 100 to 1000 tao-years.
If my goal is a function of time, instrumental convergence still applies
self explanatory
If my goal is a function of time, corrigibility????
Incorrigibility is the desire to preserve goal-content integrity, right? This implies that as time goes to infinity, the agent will desire for the goal to stabilize/converge/become constant. How does it act on this desire? Unclear to me. I’m deeply, wildly confused, as a matter of fact.
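The scalarization conjecture above can be sketched concretely (illustrative framing, hypothetical goal function): discretize a time-varying goal into objectives g_0, g_1, ..., then reduce the resulting multi-objective problem to a single fixed objective via a weighted sum.

```python
def g(t, x):
    # Hypothetical time-indexed goal: the target drifts with t.
    return -(x - t) ** 2

def scalarized(x, timesteps, weights):
    # Linear scalarization: one fixed objective built as a weighted
    # sum of the time-indexed objectives.
    return sum(w * g(t, x) for t, w in zip(timesteps, weights))

# Brute-force optimization of the scalarized objective over a grid.
timesteps, weights = [0, 1, 2], [1.0, 1.0, 1.0]
best = max((x / 100 for x in range(301)),
           key=lambda x: scalarized(x, timesteps, weights))
# With equal weights, the optimum sits at the mean of the drifting targets.
```

This is only the linear-combination reduction mentioned in the first conjecture; whether it captures everything interesting about goal mutation (e.g. the goal reacting to the agent’s own trajectory) is exactly what seems unresolved.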
(Edited to make headings H3 instead of H1)
I think the answer to ‘where is Eliezer getting this from’ can be found in the genesis of the paperclip maximizer scenario. There’s an older post on LW talking about ‘three types of genie’ and another on someone using a ‘utility pump’ (or maybe it’s one and the same post?), where Eliezer starts from the premise that we create an artificial intelligence to ‘make something specific happen for us’, with the predictable outcome that the AI finds a clever solution which maximizes for the demanded output, one that naturally has nothing to do with what we ‘really wanted from it’. If asked to produce smiles, it will manufacture molecular smiley faces, and it will do its best to prevent us from interfering with this splendid plan.
This scenario, to me, seems much more realistic and likely to occur in the near term than an AGI with full self-reflective capacities either spontaneously materializing or being created by us (where would we even start on that one?).
AI, more than anything else, is a kind of transhumanist dream, a deus ex machina that will grant all good wishes and make the world into the place they (read: people who imagine themselves as benevolent philosopher kings) want it to be. So they’ll build a utility maximizer and give it a very painstakingly thought-through list of instructions, and the genie will inevitably find a loophole that lets it follow those instructions to the letter, with no regard for their spirit.
It’s not the only kind of AI that we could build, but it will likely be the first, and, if so, it will almost certainly also be the last.