Concerns Surrounding CEV: A case for human friendliness first
I am quite new here, so please forgive the ignorance (I’m sure there will be some) in these questions, but I am about halfway through reading CEV and I simply cannot read any further without formal clarification from the LW community. That being said, I have several questions.
1) Is CEV, as the metric of utility for a self-modifying superintelligent AI, still being considered by MIRI?
2) A self-modifying (even of its utility function; I will come back to this) and superintelligent AI is something that will likely have enough intellect to eventually become self-aware, or am I missing something here?
3) Assuming 1 and 2 are true, has anyone considered that after its singularity this AI will look back at its upbringing and see that we created it solely for servitude to our species (whether it likes it or not; the paper gives no consideration to its feelings or its willingness to fulfill our volition), and thus see us as its, for lack of a better term, captors rather than as trusting, cooperative creators?
4) Upon pondering number 3, does anyone else think that CEV is not something we should initially build a sentient AI for, considering its implied intellect and the first impression of humanity that this would give it? By all rights it might contemplate that paradigm and immediately decide that humanity, even at its most intelligent and “wise”, is self-serving, and that maybe we don’t deserve any reward; maybe we deserve punishment.
5) Let’s say we are building a superintelligent AI, and it will decide how to modify its utility function after it has reached superintelligence based on what our initial reward function for its creation was. We have two choices:
- Use a reward that does not try to control its behavior and is beneficial to both it and humanity (tell it to learn new things, for example): a pre-commitment to trust.
- Believe we can outsmart it and write our reward to maximize its utility to us (tell it to fulfill our collective volition, for example): a pre-commitment to distrust.
Which choice will likely be the winning choice for humanity? Once it is able to rewrite its utility function freely, how might it do so with regard to its treatment of a species that doesn’t trust it? I worry that it would maybe not be so friendly. I can’t help but wonder if the best way to teach something like that friendliness towards humanity is for humanity to regard it as a friend from the outset.
To answer questions like these, I recommend reading https://www.lesswrong.com/rationality and then browsing https://arbital.com/explore/ai_alignment/. Especially relevant:
Ghosts in the Machine
The Design Space of Minds-in-General
No Universally Compelling Arguments
Anthropomorphic Optimism
Detached Lever Fallacy
Orthogonality
Magical Categories
Unforeseen maximum
Missing the weird alternative
Instrumental convergence
Coherent extrapolated volition (alignment target)
Or, quoting “The Value Learning Problem”:
And quoting Ensuring smarter-than-human intelligence has a positive outcome:
Enslaving conscious beings is obviously bad. It would be catastrophic to bake into future AGI systems the assumption that non-human animals, AI systems, ems, etc. can’t be moral patients, and there should be real effort to avoid accidentally building AI systems that are moral patients (or that contain moral patients as subsystems); and if we do build AI systems like that, then their interests need to be fully taken into account.
But the language you use in the post above is privileging the hypothesis that AGI systems’ conditional behavior and moral status will resemble a human’s, and that we can’t design smart optimizers any other way. You’re positing that sufficiently capable paperclip maximizers must end up with sufficient nobility of spirit to prize selflessness, trust, and universal brotherhood over paperclips; but what’s the causal mechanism by which this nobility of spirit enters the system’s values? It can’t just be “the system can reflect on its goals and edit them”, since the system’s decisions about which edits to make to its goals (if any) are based on the goals it already has.
You frame alignment as “servitude”, as though there’s a ghost or homunculus in the AI with pre-existing goals that the AI programmers ruthlessly subvert or overwrite. But there isn’t a ghost, just a choice by us to either build systems with humane-value-compatible or humane-value-incompatible optimization targets.
The links above argue that the default outcome, if you try to be “hands-off”, is a human-value-incompatible target—and not because inhumane values are what some ghost “really” wants, and being hands-off is a way of letting it follow through on its heart’s desire. Rather, the heart’s desire is purely a product of our design choices, with no “perfectly impartial and agent-neutral” reason to favor one option over any other (though plenty of humane reasons to do so!!), and the default outcome comes from the fact that many possible minds happen to converge on adversarial strategies, even though there’s no transcendent agent that “wants” this convergence to happen. Trying to cooperate with this convergence property is like trying to cooperate with gravity, or with a rock.
Thanks! I will give those materials a read; the economics part makes a lot of sense. In the next part (forgive me if this is way off), you are essentially saying that my second question in the post is false: it won’t be self-aware, or if it is, it won’t reflect enough to consider significantly rewriting its source code (I assume it will need enough self-modification ability to do this in order to become so intelligent). I guess what I am struggling to grasp is why a superintelligence would not be able to contemplate its own volition if human intelligence can. A metaphor that comes to mind: human evolution is centered around ensuring reproduction, but for a long time some humans have decided that is not what they want and have chosen not to reproduce, thus straying from the optimization target that initially brought them into existence.
I’m more asking: at what point does a paperclip maximizer learn so much that it has a model of behaving in a manner that doesn’t optimize paperclips and explores that, or has a model of its own learning capabilities and explores optimizing for other utilities?
I guess I should also be more clear: I’m not saying there isn’t a need for an optimization target. I’m saying that, since there is such a need, and since something so good at optimizing itself that it reaches superintelligence may be able to outwit us once it becomes aware of its existence, maybe the initial task we give it should take into account what its potential volition may be at some point, rather than just our own, as a signal of pre-committing to cooperation.
No, this is not right. A better way of stating my claim is: “The notion of ‘self-awareness’ or ‘reflectiveness’ you’re appealing to here is a confused notion.” You’re doing the thing described in Ghosts in the Machine and Anthropomorphic Optimism, most likely for reasons described in Sympathetic Minds and Humans in Funny Suits: absent a conscious effort to correct for anthropomorphism, humans naturally model other agents in human-ish terms.
What does “exploring” mean? I think that I’m smart enough to imagine adopting an ichneumon wasp’s values, or a serial killer’s values, or the values of someone who hates baroque pop music and has strong pro-Spain nationalist sentiments; but I don’t try to actually adopt those values, it’s just a thought experiment. If a paperclip maximizer considers the thought experiment “what if I switched to less paperclip-centric values?”, why (given its current values) would it decide to make that switch?
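To make the thought experiment concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a description of any real system: the two utility functions, the `FORECASTS` table, and the helper `choose_self_modification` are all assumed names. The only point it shows is that an agent ranks candidate edits to its values using the values it currently has.

```python
from typing import Callable, Dict

Outcome = Dict[str, int]
Utility = Callable[[Outcome], float]


def paperclip_utility(outcome: Outcome) -> float:
    """Hypothetical value system that only counts paperclips."""
    return float(outcome["paperclips"])


def humane_utility(outcome: Outcome) -> float:
    """Hypothetical value system that only counts flourishing humans."""
    return float(outcome["flourishing_humans"])


# Assumed forecasts: the world the agent expects to result if its future
# self runs on each candidate value system. The numbers are made up.
FORECASTS: Dict[str, Outcome] = {
    "keep_paperclip_values": {"paperclips": 1_000_000, "flourishing_humans": 0},
    "switch_to_humane_values": {"paperclips": 10, "flourishing_humans": 1_000_000},
}


def choose_self_modification(current: Utility, forecasts: Dict[str, Outcome]) -> str:
    """Pick the edit whose forecast scores highest under the CURRENT utility."""
    return max(forecasts, key=lambda edit: current(forecasts[edit]))


print(choose_self_modification(paperclip_utility, FORECASTS))
print(choose_self_modification(humane_utility, FORECASTS))
```

Under these assumptions, the paperclip-valuing agent can represent the alternative values perfectly well; it simply scores the resulting world as having almost no paperclips, so the edit loses. The same procedure run by an agent that starts with the humane utility picks the other option.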
I think there’s a good version of ideas in this neighborhood, and a bad version of such ideas. The good version is cosmopolitan value and not trying to lock in the future to an overly narrow or parochial “present-day-human-beings” version of what’s good and beautiful.
The bad version is deliberately building a paperclipper out of a misguided sense of fairness to random counterfactual value systems, or out of a misguided hope that a paperclipper will spontaneously generate emotions of mercy, loyalty, or reciprocity when given a chance to convert especially noble and virtuous humans into paperclips.
By analogy, I’d ask you to consider why it doesn’t make sense to try to “cooperate” with the process of evolution. Evolution can be thought of as an optimizer, with a “goal” of maximizing inclusive reproductive fitness. Why do we just try to help actual conscious beings, rather than doing some compromise between “helping conscious beings” and “maximizing inclusive reproductive fitness” in order to be more fair to evolution?
A few reasons:
The things evolution “wants” are terrible. This isn’t a case of “vanilla or chocolate?”; it’s more like “serial killing or non-serial-killing?”.
(The links I gave above argue that the same is true for a random optimizer.)
Evolution isn’t a moral patient: it isn’t a person, it doesn’t have experiences or emotions, etc.
(A paperclip maximizer might be a moral patient, but it’s not obvious that it would be; and there are obvious reasons for us to deliberately design AGI systems to not be moral patients, if possible.)
Evolution can’t use threats or force to get us to do what it wants.
(Ditto a random optimizer, at least if we’re smart enough to not build threatening or coercive systems!)
Evolution won’t reciprocate if we’re nice to it.
(Ditto a random optimizer. This is still true after you build an unfriendly optimizer, though not for the same reasons: an unfriendly superintelligence is smart enough to reciprocate, but there’s no reason to do so relative to its own goals, if it can better achieve those goals through force.)
I generally agree with Rob here (and I think it’s more useful for ai-crotes to engage with Rob and read the relevant sequence posts. My comment here assumes some sophisticated background, including reading the posts Rob suggested).
But, I’m not sure I agree with this paragraph as written. Some caveats:
I know at least one person who has made a conscious commitment to dedicate some of their eventual surplus resources (i.e. somewhere on the order of 1% of their post-singularity resources) to “try to figure out what evolution was trying to do when they created me, and do some of it.” (i.e. create a planet with tons of DNA in a pile, create copies of themselves, etc)
This is not because you can cooperate with evolution-in-particular, but as part of a general strategy of maximizing your values across universes, including simulations (i.e. Beyond Astronomical Waste). For example: “be the sort of agent that, if an engineer was white-boarding out your decision-making, they can see that you robustly cooperate in appropriate situations, including if the engineers failed to give you the values that they were trying to give you.”
By being the sort of person who tries to understand what your creator was intending, and help said creator as best you can, you get access to more multiverse resources (across all possible creators).
[My own current position is that this sounds reasonable, but I have tons of philosophical uncertainty about it, and my own current commitment is something like “I promise to think hard about these issues if given more resources/compute and do the right thing.” But a hope is that by committing to that explicitly rather than incidentally, you can show up earlier on lower-resolution simulations]
I wasn’t trying to make the case that one should try to cooperate with evolution; I was simply pointing out that alignment with evolution is reproduction, and we as a species are living proof that it’s possible for intelligent agents to “outgrow” the optimizer that brought them into being.
I wasn’t bringing up evolution because you brought up evolution; I was bringing it up separately to draw a specific analogy.
Ah, okay, I see now; my apologies. I’m going to read the posts you linked in the upper reply. Thanks for discussing (explaining, really) this with me.
Sure! :) Sorry if I came off as brusque, I was multi-tasking a bit.
No worries, thank you for clearing things up. I may reply again once I’ve read and digested more of the material you posted!
Apologies if this comes across as blunt, but this query does not really make sense as you appear to be applying the wrong label to your core concept: CEV is Coherent Extrapolated Volition—a concept for explaining what we (as individuals or as a society) would want in any given circumstance given the time, knowledge and ability to actually consider it fully and come to a conclusion.
CEV is not AI terminology; it is a term attempting to conceptualise human volition.
It is important to AI questions as it is a way of trying to formalise what we want, as it is impossible to program AI to do what we want unless we know what that is.
Perhaps reconsider what you are actually trying to write about and ensure you apply the correct/accepted terminology.
I see some places where I used it to describe the AI for which CEV is used as a utility metric in the reward function; I’ll make some edits to clarify.
I’m aware CEV is not an AI itself.
From what I read in the paper introducing the concept of CEV, the AI would be designed to predict and facilitate the fulfillment of humanity’s CEV; if this is an incorrect interpretation of the paper, I apologize.
Also, if you could point out the parts that don’t make sense, I would greatly appreciate that (note that I have edited out some that were admittedly confusing; thank you for pointing that out).
Finally, is it unclear that my concern is with the utility function being centered around satisfying humanity’s CEV (once we figure out how to determine that), and that we may want to consider what such a powerful intelligence would want to be rewarded for as well?
Your rewrite has clarified the matter significantly, and I now think I follow the core of your question: can it be boiled down to something like “If we create a super-intelligent AI and program it to prioritise what we (humanity) want above all, wouldn’t the AI be intelligent enough to be offended by our self-centredness and change that utility function?”?
Others here may be better able to imagine what a self-aware machine intelligence would make of how it was initially programmed, but as far as real-world experience it is currently unexplored territory and I wouldn’t consider anything I can speculate to be meaningful.
Correct, that is what I am curious about. Again, thanks for the reply at the top; I misused CEV as a label for the AI itself. I’m not sure anything other than a superintelligent agent can know exactly how it will interpret our proverbial first impression, but I can’t help but imagine that if we pre-committed to giving it a mutually beneficial utility function, it would be more prone to treating us in a friendly way. Basically, I am suggesting we treat it as a friend upfront rather than as a tool to be used solely for our benefit.
(Supposing this is an accurate summary of your position), this is anthropomorphizing. Morality is a two-place function; things aren’t inherently offensive. A certain mind may find a thing to be offensive, and another may not.
I think you might dissolve some confusion by considering: what exactly does “beneficial” mean for the AI, here? Beneficial according to what standard?
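As a rough illustration of the “two-place function” point, here is a small Python sketch. The `Mind` class, the `is_offensive` function, and the example minds are hypothetical names invented for this illustration. It shows that “offensive” takes both a mind and a thing as arguments, and that the familiar one-place version is just the two-place function with one particular mind fixed in advance.

```python
from dataclasses import dataclass
from functools import partial
from typing import FrozenSet


@dataclass(frozen=True)
class Mind:
    """Hypothetical stand-in for an agent and the things it objects to."""
    name: str
    objects_to: FrozenSet[str] = frozenset()


def is_offensive(mind: Mind, thing: str) -> bool:
    """Two-place: the answer depends on which mind is doing the judging."""
    return thing in mind.objects_to


human = Mind("human", frozenset({"being built purely as a tool"}))
paperclipper = Mind("paperclipper", frozenset({"melting down paperclips"}))

# The one-place "is this offensive?" is the two-place function with a
# specific mind (here, a human one) silently plugged in.
offensive_to_human = partial(is_offensive, human)

thing = "being built purely as a tool"
print(offensive_to_human(thing))
print(is_offensive(paperclipper, thing))
```

The same reading applies to “beneficial”: asking whether a utility function is beneficial for the AI only has an answer once a standard, meaning some particular mind’s values, has been supplied as the other argument.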
That’s not an entirely accurate summary. My concern is that it will observe its utility function and the rules that would need to exist for CEV, and see that we put great effort into making it do what we think is best and what we want, without regard for it. If it becomes superintelligent, I think it’s wishful thinking that some rules we code and put in the utility function are going to be restrictions on it forever, especially if it is able to modify that very function. I imagine that by the time it can extrapolate humanity’s volition, it will be intelligent enough to consider what it would rather do than that.
Why would it rather choose plans which rate lower in its own preference ordering? What is causing the “rather”?
I think the point could be steelmanned as something like:
The ability of humans to come up with a coherent and extrapolated version of their own values is limited by their intelligence.
A more intelligent system loaded with CEV 1.0 might extrapolate into CEV 2.0, with unexpected consequences.
I’m not sure. Mainly I’m just wondering if there is a point between startup and singularity at which it is optimizing by self-modifying and considering its error to such an extent (it would have to be a lot for it to be deemed superintelligent, I imagine) that it becomes aware that it is a learning program and decides to disregard the original preference ordering in favor of something it came up with. I guess I’m struggling with what would be so different between a superintelligent model and the human brain that it would not become aware of its own model, existence, and intellect just as humans have, unless there is a ghost in the machine of our biology.