Ends Don’t Justify Means (Among Humans)

Eliezer Yudkowsky 14 Oct 2008 21:00 UTC

212 points

Consequentialism Ethics & Morality Self-Deception Deontology Decision theory Evolutionary Psychology

“If the ends don’t justify the means, what does?”
—variously attributed

“I think of myself as running on hostile hardware.”
—Justin Corwin

Yesterday I talked about how humans may have evolved a structure of political revolution, beginning by believing themselves morally superior to the corrupt current power structure, but ending by being corrupted by power themselves—not by any plan in their own minds, but by the echo of ancestors who did the same and thereby reproduced.

This fits the template:

In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.

From this proposition, I now move on to my main point, a question considerably outside the realm of classical Bayesian decision theory:

“What if I’m running on corrupted hardware?”

In such a case as this, you might even find yourself uttering such seemingly paradoxical statements—sheer nonsense from the perspective of classical decision theory—as:

“The ends don’t justify the means.”

But if you are running on corrupted hardware, then the reflective observation that it seems like a righteous and altruistic act to seize power for yourself—this seeming may not be much evidence for the proposition that seizing power is in fact the action that will most benefit the tribe.
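To put rough numbers on "not much evidence", here is a minimal sketch of the Bayesian update involved. Every probability in it is an illustrative assumption of mine, not a measurement: a 5% prior that cheating to seize power really would benefit the tribe, and two hypothetical kinds of hardware that differ only in how often the righteous seeming shows up when the proposition is false. The function and variable names are likewise invented for the example.

```python
def posterior(prior: float, p_seems_if_true: float, p_seems_if_false: float) -> float:
    """Bayes' rule: P(seizing power really helps | it seems righteous to me)."""
    joint_true = prior * p_seems_if_true
    joint_false = (1.0 - prior) * p_seems_if_false
    return joint_true / (joint_true + joint_false)


# Hypothetical prior that cheating to seize power is truly net-beneficial.
PRIOR = 0.05

# Trustworthy hardware (assumed): the seeming rarely fires when the claim is false.
trustworthy = posterior(PRIOR, p_seems_if_true=0.90, p_seems_if_false=0.05)

# Corrupted hardware (assumed): the seeming fires almost regardless of the truth.
corrupted = posterior(PRIOR, p_seems_if_true=0.95, p_seems_if_false=0.90)

print(f"posterior on trustworthy hardware: {trustworthy:.3f}")
print(f"posterior on corrupted hardware:   {corrupted:.3f}")
```

Under these made-up numbers the trustworthy mind moves from 5% to roughly 49%, while the corrupted mind barely moves from its prior, because a seeming that appears whether or not the proposition is true has a likelihood ratio close to one.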

By the power of naive realism, the corrupted hardware that you run on, and the corrupted seemings that it computes, will seem like the fabric of the very world itself—simply the way-things-are.

And so we have the bizarre-seeming rule: “For the good of the tribe, do not cheat to seize power even when it would provide a net benefit to the tribe.”

Indeed it may be wiser to phrase it this way: If you just say, “when it seems like it would provide a net benefit to the tribe”, then you get people who say, “But it doesn’t just seem that way—it would provide a net benefit to the tribe if I were in charge.”

The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can’t yet say, but that would seem to be the appropriate level to handle it.)

But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it. A rule that says, “For the good of the tribe, do not cheat to seize power even for the good of the tribe.” Or “For the good of the tribe, do not murder even for the good of the tribe.”

And now the philosopher comes and presents their “thought experiment”—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. “There’s a train heading to run over five innocent people, who you can’t possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?”

An altruistic human, who has accepted certain deontological prohibitions—which seem well justified by some historical statistics on the results of reasoning in certain ways on untrustworthy hardware—may experience some mental distress, on encountering this thought experiment.

So here’s a reply to that philosopher’s scenario, which I have yet to hear any philosopher’s victim give:

“You stipulate that the only possible way to save five innocent lives is to murder one innocent person, and this murder will definitely save the five lives, and that these facts are known to me with effective certainty. But since I am running on corrupted hardware, I can’t occupy the epistemic state you want me to imagine. Therefore I reply that, in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder the one innocent person to save five, and moreover all its peers would agree. However, I refuse to extend this reply to myself, because the epistemic state you ask me to imagine, can only exist among other kinds of people than human beings.”

Now, to me this seems like a dodge. I think the universe is sufficiently unkind that we can justly be forced to consider situations of this sort. The sort of person who goes around proposing that sort of thought experiment, might well deserve that sort of answer. But any human legal system does embody some answer to the question “How many innocent people can we put in jail to get the guilty ones?”, even if the number isn’t written down.

As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. But I don’t think that our deontological prohibitions are literally inherently nonconsequentially terminally right. I endorse “the end doesn’t justify the means” as a principle to guide humans running on corrupted hardware, but I wouldn’t endorse it as a principle for a society of AIs that make well-calibrated estimates. (If you have one AI in a society of humans, that does bring in other considerations, like whether the humans learn from your example.)

And so I wouldn’t say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don’t call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don’t go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. I happen to be a human. But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn’t spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.

I would even go further, and say that if you had minds with an inbuilt warp that made them overestimate the external harm of self-benefiting actions, then they would need a rule “the ends do not prohibit the means”—that you should do what benefits yourself even when it (seems to) harm the tribe. By hypothesis, if their society did not have this rule, the minds in it would refuse to breathe for fear of using someone else’s oxygen, and they’d all die. For them, an occasional overshoot in which one person seizes a personal benefit at the net expense of society, would seem just as cautiously virtuous—and indeed be just as cautiously virtuous—as when one of us humans, being cautious, passes up an opportunity to steal a loaf of bread that really would have been more of a benefit to them than a loss to the merchant (including knock-on effects).

“The end does not justify the means” is just consequentialist reasoning at one meta-level up. If a human starts thinking on the object level that the end justifies the means, this has awful consequences given our untrustworthy brains; therefore a human shouldn’t think this way. But it is all still ultimately consequentialism. It’s just reflective consequentialism, for beings who know that their moment-by-moment decisions are made by untrusted hardware.
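The "one meta-level up" comparison can be sketched as an expected-value calculation over policies rather than over individual acts. This is a toy model under loudly stated assumptions: the payoffs, the base rate of truly beneficial cheating, and the error rates of the two kinds of mind are all hypothetical numbers chosen for illustration, and the names `Mind`, `act_on_seeming`, and `follow_rule` are invented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mind:
    """How often cheating *seems* net-beneficial to this mind (assumed rates)."""
    p_seems_if_beneficial: float
    p_seems_if_harmful: float


# Hypothetical world: 5% of chances to cheat would truly help the tribe.
P_BENEFICIAL = 0.05
GAIN_IF_BENEFICIAL = 1.0   # assumed payoff to the tribe when cheating really helps
LOSS_IF_HARMFUL = 1.0      # assumed cost to the tribe when it does not


def act_on_seeming(mind: Mind) -> float:
    """Expected tribal payoff per opportunity if the mind cheats whenever it seems right."""
    gain = P_BENEFICIAL * mind.p_seems_if_beneficial * GAIN_IF_BENEFICIAL
    loss = (1.0 - P_BENEFICIAL) * mind.p_seems_if_harmful * LOSS_IF_HARMFUL
    return gain - loss


def follow_rule() -> float:
    """Expected tribal payoff per opportunity under 'never cheat': nothing gained, nothing lost."""
    return 0.0


CORRUPTED = Mind(p_seems_if_beneficial=0.95, p_seems_if_harmful=0.90)
CALIBRATED = Mind(p_seems_if_beneficial=0.95, p_seems_if_harmful=0.01)

for name, mind in (("corrupted human", CORRUPTED), ("calibrated AI", CALIBRATED)):
    print(f"{name}: act on seeming = {act_on_seeming(mind):+.4f}, "
          f"follow rule = {follow_rule():+.4f}")
```

With these assumed rates, the corrupted mind does worse by trusting its own object-level judgment than by following the flat prohibition, while the calibrated mind does slightly better by judging case by case. Both verdicts come out of the same consequentialist arithmetic; only the error rates plugged into it differ.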

Carl_Shulman 14 Oct 2008 21:08 UTC
67 points
0
“So here’s a reply to that philosopher’s scenario, which I have yet to hear any philosopher’s victim give” People like Hare have extensively discussed this, although usually using terms like ‘angels’ or ‘ideally rational agent’ in place of ‘AIs.’
- jfm 5 Jan 2012 14:24 UTC
  13 points
  0
  Parent
  Yes, this made me think precisely of Hare’s two-level utilitarianism, with a Friendly AI in place of Hare’s Archangel.
- Eliezer Yudkowsky 11 Oct 2009 1:49 UTC
  3 points
  0
  Parent
  Okay.
Phil_Goetz5 14 Oct 2008 21:46 UTC
29 points
0
The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn’t spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.
This is critical to your point. But you haven’t established this at all. You made one post with a just-so story about males in tribes perceiving those above them as corrupt, and then assumed, with no logical justification that I can recall, that this meant that those above them actually are corrupt. You haven’t defined what corrupt means, either.

I think you need to sit down and spell out what ‘corrupt’ means, and then Think Really Hard about whether those in power actually are more corrupt than those not in power; and if so, whether the mechanisms that lead to that result are a result of the peculiar evolutionary history of humans, or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.

You might argue that if you have one Sysop AI, it isn’t subject to evolutionary forces. This may be true. But if that’s what you’re counting on, it’s very important for you to make that explicit. I think that, as your post stands, you may be attributing qualities to Friendly AIs, that apply only to Solitary Friendly AIs that are in complete control of the world.
- dejb 10 Jan 2011 14:31 UTC
  3 points
  0
  Parent
  
  as your post stands, you may be attributing qualities to Friendly AIs, that apply only to Solitary Friendly AIs that are in complete control of the world.
  
  Just to extend on this, it seems most likely that multiple AIs would actually be subject to dynamics similar to evolution and a totally ‘Friendly’ AI would probably tend to lose out against more self-serving (but not necessarily evil) AIs. Or just like the ‘young revolutionary’ of the first post, a truly enlightened Friendly AI would be forced to assume power to deny it to any less moral AIs.
  
  Philosophical questions aside, the likely reality of the future AI development is surely that it will also go to those that are able to seize the resources to propagate and improve themselves.
  - DanielLC 3 May 2013 3:10 UTC
    6 points
    0
    Parent
    Why would a Friendly AI lose out? They can do anything any other AI can do. They’re not like humans, where they have to worry about becoming corrupt if they start committing atrocities for the good of humanity.
    - PhilGoetz 4 Jan 2024 18:18 UTC
      2 points
      0
      Parent
      You have it backwards. The difference between a Friendly AI and an unfriendly one is entirely one of restrictions placed on the Friendly AI. So an unfriendly AI can do anything a friendly AI could, but not vice-versa.
      The friendly AI could lose out because it would be restricted from committing atrocities, or at least atrocities which were strictly bad for humans, even in the long run.
      Your comment that they can commit atrocities for the good of humanity without worrying about becoming corrupt is a reason to be fearful of “friendly” AIs.
Jef_Allbright 14 Oct 2008 22:12 UTC
7 points
0
There’s really no paradox, nor any sharp moral dichotomy between human and machine reasoning. Of course the ends justify the means—to the extent that any moral agent can fully specify the ends.

But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences. Rather the moral agent must necessarily fall back on heuristics, fundamentally hard-to-gain wisdom based on increasingly effective interaction with relevant aspects of the environment of interaction, promoting in principle a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences.
Phil_Goetz5 14 Oct 2008 22:29 UTC
2 points
0
Good point, Jef—Eliezer is attributing the validity of “the ends don’t justify the means” entirely to human fallibility, and neglecting that part accounted for by the unpredictability of the outcome.

He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection.

I don’t know what “a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences” means.
Jef_Allbright 14 Oct 2008 22:43 UTC
2 points
0
Phil: “I don’t know what “a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences” means.”

You and I engaged briefly on this four or five years ago, and I have yet to write the book. [Due to the explosion of branching background requirements that would ensue.] I have, however, effectively conveyed the concept face to face to very small groups.

I keep seeing Eliezer orbiting this attractor, and then veering off as he encounters contradictions to a few deeply held assumptions. I remain hopeful that the prodigious effort going into the essays on this site will eventually (and virtually) serve as that book.
Silas 14 Oct 2008 23:01 UTC
2 points
0
in a society of Artificial Intelligences worthy of personhood and lacking any inbuilt tendency to be corrupted by power, it would be right for the AI to murder … I refuse to extend this reply to myself, because the epistemological state you ask me to imagine, can only exist among other kinds of people than human beings.

Interesting reply. But the AIs are programmed by corrupted humans. Do you really expect to be able to check the full source code? That you can outsmart the people who win obfuscated code contests?

How is the epistemological state of human-verified, human-built, non-corrupt AIs, any more possible?
- [deleted] 9 Feb 2014 4:47 UTC
  0 points
  0
  Parent
  We’re likely to insert our faulty cached wisdom deliberately. We’re unlikely to insert our power-corrupts biases deliberately. We might insert something vaguely analogous accidentally, though.
  
  As for obfuscated source code—we would want programmatic verification of correctness, which would be another huge undertaking on top of solving the AI and FAI problems. Obfuscation doesn’t help you there.
Utilitarian 14 Oct 2008 23:44 UTC
2 points
0
As a human, I try to abide by the deontological prohibitions that humans have made to live in peace with one another. [...] I don’t go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects.

It seems a strong claim to suggest that the limits you impose on yourself due to epistemological deficiency line up exactly with the mores and laws imposed by society. Are there some conventional ends-don’t-justify-means notions that you would violate, or non-socially-taboo situations in which you would restrain yourself?

Also, what happens when the consequences grow large? Say 1 person to save 500, or 1 to save 3^^^^3?
- thrawnca 20 Nov 2016 22:13 UTC
  2 points
  0
  Parent
  
  what happens when the consequences grow large? Say 1 person to save 500, or 1 to save 3^^^^3?
  
  If 3^^^^3 lives are at stake, and we assume that we are running on faulty or even hostile hardware, then it becomes all the more important not to rely on potentially-corrupted “seems like this will work”.
Kaj_Sotala 14 Oct 2008 23:53 UTC
4 points
0
Phil Goetz: or of general game-theoretic / evolutionary mechanisms that would apply equally to competing AIs.

You are assuming that an AI would be subject to the same sort of evolutionary mechanism that humans traditionally were: namely, that only AIs with a natural tendency towards a particular behavior would survive. But an AI isn’t cognitively limited in the way animals were. While animals had to effectively be pre-programmed with certain behaviors or personality traits, as they weren’t intelligent or knowledgeable enough to just derive all the useful subgoals for fitness-maximizing behavior once they were told the goal, this isn’t the case for AIs. An AI can figure out that a certain course of action is beneficial in a certain situation and act to implement it, then discard that behavior when it’s no longer needed. In a competitive environment, there will certainly be selection that eliminates AIs that are for some reason unable to act in a certain way, but probably very little selection that would add new behavioral patterns for the AIs involved (at least ones that couldn’t be discarded when necessary).
Phil_Goetz5 15 Oct 2008 0:24 UTC
2 points
0
He may have some model of an AI as a perfect Bayesian reasoner that he uses to justify neglecting this. I am immediately suspicious of any argument invoking perfection.
It may also be that what Eliezer has in mind is that any heuristic that can be represented to the AI, could be assigned priors and incorporated into Bayesian reasoning.

Eliezer has read Judea Pearl, so he knows how computational time for Bayesian networks scales with the domain, particularly if you don’t ever assume independence when it is not justified, so I won’t lecture him on that. But he may want to lecture himself.

(Constructing the right Bayesian network from sense-data is even more computationally demanding. Of course, if you never assume independence, then the only right network is the fully-connected one. I’m pretty certain that suggesting that a non-narrow AI will be reasoning over all of its knowledge with a fully-connected Bayesian network is computationally implausible. So all arguments that require AIs to be perfect Bayesian reasoners are invalid.)

I’d like to know how much of what Eliezer says depends on the AI using Bayesian logic as its only reasoning mechanism, and whether he believes that is the best reasoning mechanism in all cases, or only one that must be used in order to keep the AI friendly.

Kaj: I will restate my earlier question this way: “Would AIs also find themselves in circumstances such that game theory dictates that they act corruptly?” It doesn’t matter whether we say that the behavior evolved from accumulated mutations, or whether an AI reasoned it out in a millisecond. The problem is still there, if circumstances give corrupt behavior an advantage.
Kaj_Sotala 15 Oct 2008 0:36 UTC
0 points
0
Phil: Agreed, that’s certainly possible. I was only objecting to the implied possibility of AIs evolving “personality traits” the same way humans did (an idea I’ve come across a lot during the last few days, for some reason). But I have no objection to game theoretic reasoning (or any other reasoning) possibly coming up with results we wouldn’t want it to.
Nominull3 15 Oct 2008 2:07 UTC
3 points
0
The thing is, an AI doesn’t have to use mental tricks to compensate for known errors in its reasoning, it can just correct those errors. An AI never winds up in the position of having to strive to defeat its own purposes.
- Swimmer963 (Miranda Dixon-Luinenburg) 10 Apr 2011 3:34 UTC
  4 points
  0
  Parent
  A self-modifying AI. Not all AI has to be self-modifying, although superhuman Friendly AI probably does have to be in order to work.
Zubon 15 Oct 2008 2:21 UTC
11 points
0
I think the simple statement you want is, “You should accept deontology on consequentialist grounds.”
haig2 15 Oct 2008 3:04 UTC
0 points
0
What you are getting at is that the ends justify the means only when the means don’t affect the ends. In the case of a human as part of the means, the act of the means may affect the human and thus affect the ends. In summary, reflexivity is a bitch. This is a reason why social science and economics are so hard—the subjects being modeled change as a result of the modeling process.

This is a problem with any sufficiently self-reflective mind, not with AIs that do not change their own rules. A simple mechanical narrow AI that is programmed to roam about collecting sensory data and to weigh the risk of people dying due to traffic collisions, then stepping in only to minimize the number of deaths, would be justified if it happens to allow or even cause the smaller number of deaths.

The concept of corruption doesn’t exist in this context; the act is just a mechanism. A person can transition from an uncorrupted state to a corrupted state only because the rules governing the person’s behavior are subject to modification in such a complex fashion as to occur even under the radar of the person it is happening to, because the person is the behavior caused by the rules, and when the rules change the person changes. We are not in as much control as we would like to think.

When the eastern religions preach the ego is the root of all our problems, they may be more right than we give them credit for. Ego is self-identity, which arises out of the ability to introspect and separate the aggregate of particles constituting ‘I’ from the rest of the particles in the environment. How would you go about building an AGI that doesn’t have the false duality of self and non-self? Without ego, corruption does not exist.

Imagine instead of an embodied AGI, or even a software AGI running on some black box computational machine sitting in a basement, the friendly AGI takes the form of an intelligent environment, say a superintelligent house. In the house there exist safeguards that disallow any unfriendly action. The house isn’t conscious, it just adds a layer of friendliness on top of harsh reality. This may be a fruitful way of thinking about friendliness that avoids all the messy reflexivity.

Fun stuff this. I am enjoying these discussions.
Cyan2 15 Oct 2008 3:21 UTC
2 points
0
But in an interesting world of combinatorial explosion of indirect consequences, and worse yet, critically underspecified inputs to any such supposed moral calculations, no system of reasoning can get very far betting on longer-term specific consequences.

This point and the subsequent discussion are tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions. To see this, limit the universe of discourse to actions which have predictable effects and note that Eliezer’s argument still makes strong claims about how humans should act.
Caroline 15 Oct 2008 4:24 UTC
3 points
0
Why must the power structure cycle be adaptive? I mean, couldn’t it simply be non-maladaptive?

Because if the net effect on human fitness is zero, then perhaps it’s just a quirk. I’m not sure how this affects your argument otherwise, I’m just curious as to why you think it was an adaptive pattern and not just a pattern that didn’t kill us at too high a rate.
Phil_Goetz4 15 Oct 2008 4:51 UTC
0 points
0
Of course, if you never assume independence, then the only right network is the fully-connected one.
Um, conditional independence, that is.

I want to know if my being killed by Eliezer’s AI hinges on how often observables of interest tend to be conditionally dependent.
Richard_Hollerith2 15 Oct 2008 7:07 UTC
0 points
0
It is refreshing to read something by Eliezer on morality I completely agree with.

And nice succinct summary by Zubon.
Lake 15 Oct 2008 9:44 UTC
1 point
0
@ Caroline: the effect on overall human fitness is neither here nor there, surely. The revolutionary power cycle would be adaptive because of its effect on the reproductive success of those who play the game versus those who don’t. That is, the adaptation would only have to benefit specific lineages, not the whole species. Or have I missed your point?
Vladimir_Slepnev 15 Oct 2008 10:17 UTC
1 point
0
What if an AI decides, with good reason, that it’s running on hostile hardware?
Emile 15 Oct 2008 10:42 UTC
2 points
0
I wonder where this is leading … 1) Morality is a complex computation, that seems to involve a bunch of somewhat independent concerns 2) Some concerns of human morality may not need to apply to AI

So it seems that building friendly AI involves not only correctly building (human) morality, but figuring out which parts don’t need to apply to an AI that doesn’t have the same flaws.
NancyLebovitz 15 Oct 2008 12:38 UTC
0 points
0
It seems to me that an FAI would still be in an evolutionary situation. It’s at least going to need a goal of self-preservation [1] and it might well have a goal of increasing its abilities in order to be more effectively Friendly.

This implies it will have to somehow deal with the possibility that it might overestimate its own value compared to the humans it’s trying to help.

[1] What constitutes the self for an AI is left as a problem for the student.
Richard_Hollerith2 15 Oct 2008 14:07 UTC
0 points
0
But, Nancy, the self-preservation can be an instrumental goal. That is, we can make it so that the only reason the AI wants to keep on living is that if it does not then it cannot help the humans.
Stuart_Armstrong 15 Oct 2008 14:33 UTC
9 points
0
Still disagreeing with the whole “power corrupts” idea.

A builder, or a secretary, who looks out for his friends and does them favours is… a good friend. A politician who does the same is… a corrupt politician.

A sad bastard who will sleep with anyone he can is a sad bastard. A politician who will sleep with anyone he can is a power-abusing philanderer.

As you increase power, you become corrupt just by doing what you’ve always done.
NancyLebovitz 15 Oct 2008 14:43 UTC
0 points
0
Richard, I’m looking at the margins. The FAI is convinced that it’s humanity’s only protection against UFAIs. If UFAIs can wipe out humanity, surely the FAI is justified in killing a million or so people to protect itself, or perhaps even to make sure it’s capable of defeating UFAIs which have not yet been invented and whose abilities can only be estimated.
Nick_Tarleton 15 Oct 2008 14:50 UTC
0 points
0
And if an FAI makes that judgment, I’m not going to question it—it’s smarter than me, and not biased toward accumulating power for “instrumental” reasons like I am.
Nick_Tarleton 15 Oct 2008 14:51 UTC
0 points
0
s/like I am/like humans are/
Jef_Allbright 15 Oct 2008 15:55 UTC
0 points
0
Cyan: “...tangential to the point of the post, to wit, evolutionary adaptations can cause us to behave in ways that undermine our moral intentions.”

On the contrary, promotion into the future of a [complex, hierarchical] evolving model of values of increasing coherence over increasing context, would seem to be central to the topic of this essay.

Fundamentally, any system, through interaction with its immediate environment, always only expresses its values (its physical nature.) “Intention”, corresponding to “free-will” is merely derivative and for practical purposes in regard to this analysis of the system dynamics, is just “along for the ride.”

But to the extent that the system involves a reflexive model of its values—an inherently subjective view of its nature—then increasing effectiveness in principle, indirectly assessed in terms of observations of those values being promoted over increasing external scope of consequences, tends to correspond with increasing coherence of the (complex, hierarchical) inter-relationships of the elements within the model, over increasing context of meaning-making (increasing web of supporting evidence.) Wash, rinse, repeat with ongoing interaction --> selection for “that which tends to work” --> updating of the model...

“Morality” enters the picture only in regard to groups of agents. For a single, isolated, agent “morality” doesn’t apply; there is only the “good” of that which is assessed as promoting that agent’s (present, but evolving) values-complex. At the other end of the scale of subjectivity, in the god’s-eye view, there is no morality since all is simply and perfectly as it is.

But along that scale, regardless of the subjective starting point (whether human agency of various scale, other biological, or machine-phase agency) action will tend to be assessed as increasingly moral to the extent that it is assessed as promoting, in principle, (1) a subjective model of values increasingly coherent over increasing context (of meaning-making, evidential observation) over (2) increasing scope of objective consequences.

Evolutionary processes have encoded this accumulating “wisdom” slowly and painfully into the heuristics supporting the persistence of the physical, biological and cultural branch with which we self-identify. With the ongoing acceleration of the Red Queen’s Race, I see this meta-ethical theory becoming ever more explicitly applicable to “our” ongoing growth as intentional agents of whatever form or substrate.

Cyan: “...limit the universe of discourse to actions which have predictable effects...”

I’m sorry, but my thinking is based almost entirely in systems and information theory, so when terms like “universe of discourse” appear, my post-modernism immune response kicks in and I find myself at a loss to continue. I really don’t know what to do with your last statement.
JamesAndrix 15 Oct 2008 16:15 UTC
2 points
0
How would we know if this line of thought is a recoiling from the idea that if you shut up and multiply, you should happily kill 10,000 for a 10% chance at saving a million?
Richard_Hollerith2 15 Oct 2008 16:49 UTC
−1 points
0
Andrix, if it is just a recoiling from that, then how do you explain Stalin, Mao, etc?

Yes, Nancy, as soon as an AI endorsed by Eliezer or me transcends to superintelligence, it will probably make a point of preventing any other AI from transcending, and there is indeed a chance that that will entail killing a few (probably very irresponsible) humans. It is very unlikely to entail the killing of millions, and I can go into that more if you want.

The points are that (1) self-preservation and staying in power is easy if you are the only superintelligence in the solar system and that (2) unlike a governing coalition of humans who believe the end justifies the means, a well-designed well-implemented superintelligence will not kill or oppress millions for a nominally prosocial end which is in reality a flimsy excuse for staying in power.
- wizzwizz4 20 Mar 2020 15:57 UTC
  1 point
  0
  Parent
  there is indeed a chance that that will entail killing a few (probably very irresponsible) humans
  I disagree. Killing people to stop them doing bad stuff is only necessary given insufficient resources to prevent them from doing the bad stuff in a nicer way. If the FAI makes the tradeoff that expending those resources isn’t worth it, then it doesn’t sound very friendly to me.
Cyan2 15 Oct 2008 17:13 UTC
0 points
0
Jef Allbright,

By subsequent discussion, I meant Phil Goetz’s comment about Eliezer “neglecting that part accounted for by the unpredictability of the outcome”. I’m with him on not understanding what “a model of evolving values increasingly coherent over increasing context, with effect over increasing scope of consequences” means; I also found your reply to me utterly incomprehensible. In fact, it’s incredible to me that the same mind that could formulate that reply to me would come shuddering to a halt upon encountering the unexceptionable phrase “universe of discourse”.
Cyan2 15 Oct 2008 17:19 UTC
0 points
0
Since you said you didn’t know what to do with my statement, I’ll add, just replace the phrase “limit the universe of discourse to” with “consider only” and see if that helps. But I think we’re using the same words to talk about different things, so your original comment may not mean what I think it means, and that’s why my criticism looks wrong-headed to you.
Ian_C. 15 Oct 2008 18:01 UTC
0 points
0
I don’t think it’s possible that our hardware could trick us in this way (making us do self-interested things by making them appear moral).

To express the idea “this would be good for the tribe” would require the use of abstract concepts (tribe, good) but abstract concepts/sentences are precisely the things that are observably under our conscious control. What can pop up without our willing it are feelings or image associations so the best trickery our hardware could hope for is to make something feel good.
Jef_Allbright 15 Oct 2008 18:17 UTC
0 points
0
@Cyan: Substituting “consider only actions that have predictable effects...” is for me much clearer than “limit the universe of discourse to actions that have predictable effects...” [“and note that Eliezer’s argument still makes strong claims about how humans should act.”]

But it seems to me that I addressed this head-on at the beginning of my initial post, saying “Of course the ends justify the means—to the extent that any moral agent can fully specify the ends.”

The infamous “Trolley Paradox” does not demonstrate moral paradox at all. It does, however, highlight the immaturity of the present state of our popular framework for moral reasoning. The Trolley problem is provided as if fully specified, and we are supposed to be struck by the disparity between the “true” morality of our innate moral sense, and the “true” morality of consequentialist reasoning. The dichotomy is false; there is no paradox.

All paradox is a matter of insufficient context. In the bigger picture, all the pieces must fit. Or as Eliezer has repeated recently, “it all adds up to normalcy.” So in my posts on this topic, I proceeded to (attempt to) convey a larger and more coherent context making sense of the ostensible issue.

Problem is, contexts (being subjective) can’t be conveyed. Best that can be done is to try to enrich the (discursive—you’re welcome) environment sufficiently that you might form a comprehensibly congruent context in relevant aspects of your model of the world.
Henry_V 15 Oct 2008 18:27 UTC
−2 points
0
I’ve always thought the “moral” answer to the question was “I wouldn’t push the innocent in front of the train; I’d jump in front of the train myself.”
Zubon 15 Oct 2008 18:42 UTC
0 points
0
Henry V, the usual version does not offer that option. You frequently are offered a lever to change the track the train is on, diverting it from five to one. And then there are a dozen variations. And one of those later variations sometimes involves a man fat enough to derail/slow/stop the train if you push him in front (by assumption: much fatter than Henry V, but not so fat that you could not push him over).

The question is there to check if your answer differs between the lever and the push. If you would pull the lever but not push the guy, the implication is that you think you have blood on your hands with the push but not the lever. And if you accept upon reflection that you are just as morally culpable or laudable in either case, because the feeling of distance does not matter, the next question is how much money you are spending to prevent starvation in Africa and Asia.
n 15 Oct 2008 19:13 UTC
2 points
0
To take a subset of the topic at hand, I think Mencius nailed it when he defined corruption. To very roughly paraphrase, corruption is a mismatch between formal and informal power.

Acton’s famous aphorism can be rewritten in the following form: ‘Those with formal power tend to use it to increase their informal power’.

Haig: “Without ego corruption does not exist”

Not true at all. This simply rules out corruption due to greed. There are tons of people who do corrupt things for ‘noble causes’. Just as a quick example, regardless of the truth of the component claims of Global Warming, there are tons of people who commit corrupt acts with an eye towards relieving global warming.

Stuart Armstrong:

The examples you give are worded similarly, but are actually quite different. I’m pretty sure you actually meant:

A builder, or a secretary, who looks out for his friends and does them favours is… a good friend. A politician who does the same with public resources is… a corrupt politician.

A sad bastard who will sleep with anyone he can is a sad bastard. A politician who will sleep with anyone he can is using the power of his office to coerce those under him.

You will note that in all cases, the politician has expanded his informal powers to be greater than his formal ones.
JamesAndrix 15 Oct 2008 19:15 UTC
0 points
0
@Richard Hollerith

Stalin and may very well have been corrupted by power, that part of the theory may be right or wrong, but it isn’t self serving. Coming from a culture that vilifies such corrupted leaders, we personally want to avoid being like them.

We don’t want to think of ourselves as mass-murderers-for-real. So we declare ourselves too untrustworthy to decide to murder people, and we rule out that whole decision tree. We know we are mass-murderers-in-principle, but still we’re decent people.

But maybe really we should shut up and multiply, and accept that in some situation we might really have to do something that makes us a monster.

Yes, when we’re figuring out the probability that we’ll save the world by violently gaining power, we have to adjust for the fact that we’ve evolved to find reasons to gain power. But we can’t let that adjustment be driven by a fear of becoming Hitler.

If you do the math and this is the only reason you have not to kill people, then you’re definitely flinching.

But if your mind increases your untrustworthiness until the math tells you you don’t have to be like Hitler, then you don’t even know you’re flinching, and the singularity is delayed because you’re queasy.
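
A minimal sketch of the adjustment described above, using Bayes' rule. All numbers are hypothetical and chosen only for illustration; `posterior_beneficial` and its parameters are names introduced here, not anything from the original post. The idea is that if corrupted hardware makes a power grab *seem* beneficial almost regardless of whether it is, the seeming should barely move the estimate.

```python
def posterior_beneficial(prior, p_seems_if_beneficial, p_seems_if_not):
    """Bayes' rule: P(plan is beneficial | plan seems beneficial)."""
    joint_true = prior * p_seems_if_beneficial
    joint_false = (1.0 - prior) * p_seems_if_not
    return joint_true / (joint_true + joint_false)


# Hypothetical prior that seizing power really helps the tribe.
PRIOR = 0.01

# Trustworthy hardware: the "seeming" rarely fires when the plan is bad.
trustworthy = posterior_beneficial(PRIOR, 0.9, 0.05)

# Corrupted hardware: the "seeming" fires almost as often when the plan is bad.
corrupted = posterior_beneficial(PRIOR, 0.9, 0.8)

print(f"trustworthy hardware: {trustworthy:.3f}")
print(f"corrupted hardware:   {corrupted:.3f}")
```

The point about flinching maps onto the third argument: it should be estimated honestly from evidence about how often the seeming misfires, not tuned upward until the answer becomes comfortable.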
Cyan2 15 Oct 2008 19:45 UTC
0 points
0
So in my posts on this topic, I proceeded to (attempt to) convey a larger and more coherent context making sense of the ostensible issue.

Right! Now we’re communicating. My point is that the context you want to add is tangential (or parallel...? pick your preferred geometric metaphor) to Eliezer’s point. That doesn’t mean it’s without value, but it does mean that it fails to engage Eliezer’s argument.

But it seems to me that I addressed this head-on at the beginning of my initial post, saying “Of course the ends justify the means—to the extent that any moral agent can fully specify the ends.”

Eliezer’s point is that humans can’t fully specify the ends due to “hostile hardware” issues if for no other reason. The hostile hardware part is key, but you never mention it or anything like it in your original comment. So, no, in my judgment you don’t address it head-on. In contrast, consider Phil Goetz’s first comment (the second of this thread), which attacks the hostile hardware question directly.
Jef_Allbright 15 Oct 2008 20:45 UTC
0 points
0
@Cyan: “Hostile hardware”, meaning that an agent’s values-complex (essentially the agent’s nature, driving its actions) contains elements misaligned (even to the extent of being in internal opposition on some level(s) of the complex hierarchy of values) is addressed by my formulation in the “increasing coherence” term. Then, I did try to convey how this is applicable to any moral agent, regardless of form, substrate, or subjective starting point.

I’m tempted to use n’s very nice elucidation of the specific example of political corruption to illustrate my general formulation (politician’s relatively narrow context of values, relatively incoherent if merged with his constituents’ values, scope of consequences amplified disproportionately by the increased instrumental effectiveness of his office) but I think I’d better let it go at this. [Following the same moral reasoning applied to my own relatively narrow context of values with respect to the broader forum, etc.]
Roko 15 Oct 2008 21:06 UTC
1 point
0
Eliezer: “But on a human level, the patch seems straightforward. Once you know about the warp, you create rules that describe the warped behavior and outlaw it.”

One could do this, but I doubt that many people do, in fact, behave the way they do for this reason.

Deontological ethics is more popular than consequentialist reasoning amongst normal people in day-to-day life; thus there are billions of people who argue deontologically that “the ends don’t justify the means”. Surely very few of these people know about evolutionary psychology in enough detail to be consciously correcting their biases in the way that you describe.

Furthermore, I suspect that most or all of the people who endorse an ethical code like “the end doesn’t justify the means” would simply not apply that code to themselves in those situations where consequentialism would benefit them. This is partly from experience, and partly because there are two reasons why someone might apply such a code to themselves:
1. It is an evolved trait to attempt to correct your own evolved biases in favor of the greater good of your society.
2. Such behavior is not an evolved trait, but lots of people are aware of their own biases and correct for them due to their detailed knowledge of recent research findings.
1 is clearly nonsense. 2 is empirically false.

There must be another explanation for this widespread tendency towards deontological ethics. I suspect that deontological ethics is popular because:

(a) it is easy for humans to apply deontological rules,

(b) (crucially!) easier to check whether someone has applied deontological rules or not. “You lied” is a fairly unambiguous fact, “You maximized the greater good” is often a much harder condition to check, and therefore makes it easier to cheat without getting caught.

Correcting for your own biases towards self-promotion is certainly a trait I would want to encourage in others. However, it is hard for me to want to correct this in myself. If rationality is all about winning, then correcting this bias is irrational.
Henry_V 15 Oct 2008 21:15 UTC
1 point
0
@Zubon. I’m familiar with the contrivances used to force the responder into a binary choice. I just think that the contrivances are where the real questions are. Why am I in that situation? Was my behavior beyond reproach up to that point? Could I have averted this earlier? Is it someone else’s evil action that is a threat? I think in most situations, the moral answer is rather clear, because there are always more choices. E.g., ask the fat man to jump, or do nothing and let him make his own choice, as I could only have averted it by committing murder, or even jump with him.

With the lever: who has put me in the position of having a lever? Did they tie up the five people?

Someone tells me that if I shoot my wife, they will spare my daughter, otherwise he’ll shoot both of them. What’s the right choice? I won’t murder, thus I have only one (moral) choice (if I believe him, and if I can think of a reductionist reason to have any morality, which I can’t). The other man’s choice is his own.
Henry_V 15 Oct 2008 21:22 UTC
0 points
0
@Roko. You mention “maximizing the greater good” as if that is not part of a deontological ethic.
Phil_Goetz5 15 Oct 2008 21:37 UTC
2 points
0
All the discussion so far indicates that Eliezer’s AI will definitely kill me, and some others posting here, as soon as he turns it on.

It seems likely, if it follows Eliezer’s reasoning, that it will kill anyone who is overly intelligent. Say, the top 50,000,000 or so.

(Perhaps a special exception will be made for Eliezer.)

Hey, Eliezer, I’m working in bioinformatics now, okay? Spare me!

Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?
Eliezer Yudkowsky 15 Oct 2008 21:58 UTC
0 points
0
Note for readers: I’m not responding to Phil Goetz and Jef Allbright. And you shouldn’t infer my positions from what they seem to be arguing with me about—just pretend they’re addressing someone else.

Roko, now that you mention it, I wasn’t thinking hard enough about “it’s easier to check whether someone followed deontological rules or not” as a pressure toward them in moral systems. Obvious in retrospect, but my own thinking had tended to focus on the usefulness of deontological rules in individual reasoning.
Caledonian2 15 Oct 2008 22:01 UTC
3 points
0
Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?
At present, Eliezer cannot functionally describe what ‘Friendliness’ would actually entail. It is likely that any outcome he views as being undesirable (including, presumably, his murder) would be claimed to be impermissible for a Friendly AI.

Imagine if Isaac Asimov not only lacked the ability to specify how the Laws of Robotics were to be implanted in artificial brains, but couldn’t specify what those Laws were supposed to be. You would essentially have Eliezer. Asimov specified his Laws enough for himself and others to be able to analyze them and examine their consequences, strengths, and weaknesses, critically. ‘Friendly AI’ is not so specified and cannot be analyzed. No one can find problems with the concept because it’s not substantive enough—it is essentially nothing but one huge, undefined problem.
Jef_Allbright 15 Oct 2008 22:05 UTC
0 points
0
Eliezer: “I’m not responding to Phil Goetz and Jef Allbright. And you shouldn’t infer my positions from what they seem to be arguing with me about—just pretend they’re addressing someone else.”

Huh. That doesn’t feel very nice.
Caledonian2 15 Oct 2008 22:14 UTC
0 points
0
Eliezer: If you create a friendly AI, do you think it will shortly thereafter kill you? If not, why not?

At present, Eliezer cannot functionally describe what ‘Friendliness’ would actually entail. It is likely that any outcome he views as being undesirable (including, presumably, his murder) would be claimed to be impermissible for a Friendly AI.

Imagine if Isaac Asimov not only lacked the ability to specify how the Laws of Robotics were to be implanted in artificial brains, but couldn’t specify what those Laws were supposed to be. You would essentially have Eliezer. Asimov specified his Laws enough for himself and others to be able to analyze them and examine their consequences, strengths, and weaknesses, critically. ‘Friendly AI’ is not so specified and cannot be analyzed. No one can find problems with the concept because it’s not substantive enough—it is essentially nothing but one huge, undefined problem.

But not a technical one. It is impossible to determine how difficult it might be to reach a goal if you cannot define what goal you’re reaching towards. No amount of technological development or acquired skill will help if Eliezer does not first define what he’s trying to accomplish, which makes his ‘research’ into the subject rather pointless.

Presumably he wants us to stop thinking and send money.
Richard_Hollerith2 15 Oct 2008 22:56 UTC
0 points
0
Goetz,

For a superhuman AI to stop you and your friends from launching a competing AI, it suffices for it to take away your access to unsupervised computing resources. It does not have to kill you.
Phil_Goetz5 15 Oct 2008 23:15 UTC
1 point
0
Note for readers: I’m not responding to Phil Goetz and Jef Allbright. And you shouldn’t infer my positions from what they seem to be arguing with me about—just pretend they’re addressing someone else.
Is that on this specific question, or a blanket “I never respond to Phil or Jef” policy?

Huh. That doesn’t feel very nice.
Nor very rational, if one’s goal is to communicate.
Jef_Allbright 16 Oct 2008 0:00 UTC
0 points
0
Phil: “Is that on this specific question, or a blanket “I never respond to Phil or Jef” policy?”

I was going to ask the same question, but assumed there’d be no answer from our gracious host. Disappointing.
Tim_Freeman 16 Oct 2008 4:15 UTC
−1 points
0
>And now the philosopher comes and presents their “thought experiment”—setting up a scenario in which, by stipulation, the only possible way to save five innocent lives is to murder one innocent person, and this murder is certain to save the five lives. “There’s a train heading to run over five innocent people, who you can’t possibly warn to jump out of the way, but you can push one innocent person into the path of the train, which will stop the train. These are your only options; what do you do?”

If you are looking out for yourself, it’s an easy decision, at least in the United States. There is no legal requirement to save lives, but dealing with the legal consequences of putting the innocent guy in front of the train is likely to be a real pain in the ass. Therefore, do nothing.

I agree that this isn’t the thought experiment that was originally proposed. If we take inventory of the questions available, we have:

* If I’m a real person with real human desires, sit there and let the 5 guys get run over, as I suggest above.
* If I’m an AI that is uniformly compassionate and immune from social consequences to my actions, and there’s no compelling reason to value the one above the five, then I’d probably kill one to save five.
* If I’m a person with human desires who is pretending to be perfectly compassionate, then there’s a problem to solve. In this case I prefer to unask the question by stopping the pretense.
billswift 16 Oct 2008 9:31 UTC
1 point
0
I guess I’m going to have to start working harder on IA to stay ahead of any “Friendly” AI that might want to keep me down.
Phil_Boncer 17 Oct 2008 0:31 UTC
−1 points
0
Stuart Armstrong wrote: “Still disagreeing with the whole “power corrupts” idea.

A builder, or a secretary, who looks out for his friends and does them favours is… a good friend.
A politician who does the same is… a corrupt politician.

A sad bastard who will sleep with anyone he can is a sad bastard.
A politician who will sleep with anyone he can is a power-abusing philanderer.

As you increase power, you become corrupt just by doing what you’ve always done.”

I disagree here. The thing about power is that it entails the ability to use coercion. What is wrong is not the act of helping your friends, or sleeping around, in themselves; what is wrong is the use of power coercively over others to further these ends. In a sense, it is not so much that “power corrupts” as that “power makes corruption possible to execute”. This does not tell us whether the powerless are relatively uncorrupt due to moral superiority, or simply due to inability.

PhilB
Caroline 18 Oct 2008 11:12 UTC
0 points
0
@lake My point is that a species or group or individual can acquire many traits that are simply non-maladaptive rather than adaptive. Once the revolutionary power cycle blip shows up, as long as it confers no disadvantages, it probably won’t get worked out of the system.

I heard a story once about a girl and a chicken. She was training the chicken to play a song by giving it a treat every time it pecked the right notes in the right order. During this process, the chicken started wiggling its neck before pecking each note. Since it was still hitting the correct notes, the girl still rewarded it; so the chicken started wiggling each time. As far as the chicken comprehended, the wiggle was just as necessary for a treat as the peck was, but really, it was completely neutral. It could have stopped wiggling at any time without any negative consequences, or continued to wiggle without any negative consequences.

If this were how the revolutionary power cycle entered the human repertoire, then speculating on how exactly it confers evolutionary advantages would be a blind alley.
Caledonian2 27 Oct 2008 23:26 UTC
1 point
0
I received an email from Eliezer stating:

You’re welcome to repost if you criticize Coherent Extrapolated Volition specifically, rather than talking as if the document doesn’t exist. And leave off the snark at the end, of course.

There is no ‘snark’; what there IS, is a criticism. A very pointed one that Eliezer cannot counter.

There is no content to ‘Coherent Extrapolated Volition’. It contains nothing but handwaving, smoke and mirrors. From the point of view of rational argument, it doesn’t exist.
Fillup_Jay_Phry 9 Dec 2008 19:37 UTC
1 point
0
I believe that rule-utilitarianism was presented to dispose of this very idea. It is also why rule-utilitarianism is right: it uses correct utilitarian principles to derive deontic-esque rules of behavior. Rule-based thinking maximizes utility better than situational utilitarian calculation.
Jack 26 Jan 2010 20:30 UTC
8 points
0
I finally put words to my concern with this. Hopefully it doesn’t get totally buried because I’d like to hear what people think.

It might be the case that a race of consequentialists would come up with deontological prohibitions on reflection of their imperfect hardware. But that isn’t close to the right story for how human deontological prohibitions actually came about. There was no reflection at all; cultural and biological evolution just gave us normative intuitions and cultural institutions. If things were otherwise (our ancestors were more rational) perhaps we wouldn’t have developed the instinct that the ends don’t always justify the means. But that is different from saying that a perfectly rational present day human can just ignore deontological prohibitions. Our ancestral environment could have been different in lots of different ways. Threats from carnivores and other tribes could have left us with a much stronger instinct for respecting authority, such that we follow our leaders in all circumstances. We could have been stronger individually and less reliant on parents such that there was no reason for altruism to develop into as strong a force as it is. You can’t extrapolate an ideal morality from a hypothetical ancestral environment.

Non-consequentialists think the trolley problems just suggest that our instincts are not, in fact, strictly utilitarian. It doesn’t matter that an AI doesn’t have to worry about corrupted hardware, if it isn’t acting consistently with human moral intuitions it isn’t ethical (bracketing concerns about changes and variation in ethics).
- MarsColony_in10years 27 Oct 2015 19:53 UTC
  0 points
  0
  Parent
  Interesting point. It seems like human morality is more than just a function which maximizes human prosperity, or minimizes human deaths. It is a function which takes a LOT more into account than simply how many people die.
  
  However, it does take into account its own biases, at least when it finds them displeasing, and corrects for them. When it thinks it has made an error, it corrects the part of the function which produced that error. For example, we might learn new things about game theory, or even switch from a deontological ethical framework to a utilitarian one.
  
  So, the meta-level question is which of our moral intuitions are relevant to the trolley problem. (or more generally, what moral framework is correct.) If human deaths can be shown to be much more morally important than other factors, then the good of the many outweighs the good of the few. If, however, deontological ethics is correct, then the ends don’t justify the means.
yters 10 Jan 2011 11:47 UTC
0 points
0
It’s coherent to say de-ontological ethics are hierarchical, and higher goods take precedence over lower goods. So, the lower good of sacrificing one person to save a greater good does not entail sacrificing the person is good. It is just necessary.

Saying the ends justify the means entails the means become good if they achieve a good.
- [deleted] 9 Feb 2014 5:00 UTC
  0 points
  0
  Parent
  
  It’s coherent to say de-ontological ethics are hierarchical, and higher goods take precedence over lower goods. So, the lower good of sacrificing one person to save a greater good does not entail sacrificing the person is good. It is just necessary.
  
  That is, you can’t take the precedent of killing one person to save five, and use that to kill another person on a whim.
  
  Saying the ends justify the means entails the means become good if they achieve a good.
  
  I have mainly heard the phrase used to ignore the consequences of your actions because your goal is a good one. It’s obviously wrong to suggest that a type of behavior is universally justified if it is justified in one set of circumstances in which the sum of its effects is positive.
DavidAgain 12 Mar 2011 11:19 UTC
0 points
0
Very interesting article (though as has been commented, the idea has philosophical precedent). Presumably this would go alongside the idea of upholding institutions/principles. If I can steal whenever I think it’s for the best, it means each theft is only culpable if the courts can prove that it caused more harm than good overall, which is impractical. We also have to consider that even if we judge correctly that we can break a rule, others will see that as meaning the rule can be ditched at will. One very good expression of the importance of laws starts 2 minutes into this http://www.youtube.com/watch?v=A-nJR15e0F4

I think we have to be careful here, though. I intuitively agree with a utility-maximisation sort of ethics, but also find breaking certain deontological laws a very upsetting idea. This argument is therefore an all-too-convenient way to maintain both, and I wonder whether it’s a detached rational analysis or a post hoc rationalisation and justification of our conflicting ethical tendencies.
Swimmer963 (Miranda Dixon-Luinenburg) 10 Apr 2011 3:32 UTC
5 points
0
This is a really interesting post, and it does a good job of laying out clearly what I’ve often, less clearly, tried to explain to people: the human brain is not a general intelligence. It has a very limited capacity to do universal computation, but it’s mostly “short-cuts” optimized for a very specific set of situations...
Boyi 4 Dec 2011 23:58 UTC
0 points
0
When I first read this article the imagery of corrupt hardware caused a certain memory to pop into my head. The memory is of an interaction with my college roommate about computers. Due to various discourses I had been exposed to at the time I was under the impression that computers were designed to have a life-expectancy of about 5 years. I am not immersed in the world of computers, and this statement seemed feasible to me from an economic perspective of producer rationale within a capitalistic society. So I accepted it. I accepted that computers were designed to break, crash, or die within 4-5 years, if I could keep one that long. One day I got to talking to my roommate about this, and he shocked me by saying “not if you take care of them the way you should.” How many people take their computers for regular checkups as they do their teeth, their cars, their children? How many people read the manuals that come with their computers to be best informed how to take care of them?

I am sure there are people that do, but I realized I was not one of them. I had assumed an intentional deficiency in the hardware, instead of grappling with the much more likely possibility that there was a deficiency in my usage/knowledge of the hardware.

I now return to your premise that “humans run on corrupted hardware.” It is a new way to phrase an old idea: that humans are by nature evil. It is an idea I disagree with. I do not disagree with the beginning of your reasoning process, but I believe a lack of necessary knowledge about certain variables in the equation leads you down a faulty path. Therefore I will ignore the thought experiment that takes up the later portion of the essay, and instead focus on the variables in this statement:

-In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence Z.

The assumption that you make is that self-interest has to be selfish/individualistic. That variable Z (self-interest) makes individual benefit take unquestionable precedence over group benefit. The assumption being that the individual self is not only real, but the foundation of human consciousness.
I would argue (along with a long list of social scientists in the fields of sociology, anthropology, evolutionary psychology, social psychology, economics, literature, theology, philosophy, and probably several more) that humans contain a social self. Meaning that the self is not individual cognition, but a networked entity constituted by a plurality of bodies, minds, and territories. Under my premise the fact that people must be self-interested is not so fatalistic. There is after all a difference between self-interest and selfishness. What is needed is for people to be taught to understand their self as a network not an individual, and be taught methods of self-extension.

I agree with you that humans cannot escape doing things out of self-interest, but surely you agree that some types of self-interest are more positive than others, and that the farther the notion of self is extended the greater the benefits for humanistic goals?

How can you say the hardware is corrupt before testing all the dispositions for action that it contains to the fullest?
- gwern 5 Dec 2011 0:16 UTC
  2 points
  0
  Parent
  
  I now return to your premise that “humans run on corrupted hardware.” It is a new way to phrase an old idea: that humans are by nature evil. It is an idea I disagree with.
  
  The hardware is corrupted, that’s not the same as evil. The corruptedness can easily lead to ‘nice’ or ‘good’ prosocial actions - ‘I am doing this soup kitchen work because I am a good person’ (as opposed to trying to look good or impress this potential ally or signal nurturing characteristics to a potential mate etc.).
  - Boyi 5 Dec 2011 2:19 UTC
    0 points
    0
    Parent
    Then I do not understand what is meant by corrupted. Perhaps it is because of my limited knowledge of the computer science lexicon, but to me the word corrupted means damaged, imperfect, made inferior. To imply something is damaged/inferior makes a value-judgment about what is well/superior. But if you are saying that doing something out of self-interest is an inferior state, then what is the superior state? Altruism? By what rational basis can you say that people should be completely altruistic? Then we would not be people, we would be ants, or bees, or some other social creature. Self-interest is part of what makes human sociality so powerful. I do not see it as corrupted hardware, but rather misused hardware (as I state in my original post). The self can be extended to a family, a community, a nation, even to humanity itself, so that even though a person acts out of self-interest their interest extends beyond an atomized body or singular lineage. Basically I am agreeing with your description of human nature, but not your interpretation of it.
    
    What I get out of the analogy “corrupted hardware” is that self-interest is a detrimental capacity of human nature. If this is not what is meant, then please explain to me what is meant by corrupted hardware. If it is what is meant, then I stand by my assertion that it is not self-interest that is detrimental but cultural conceptions of the self; making it the software, not the hardware that is corrupted.
    - gwern 5 Dec 2011 2:26 UTC
      1 point
      0
      Parent
      If a file is corrupted with noise, or a portion of RAM is corrupted by some cosmic rays, is that file or portion of memory now filled with evil? No; it is simply not what it was intended to be. Whether there are any moral connotations beyond that depends on additional details and considerations.
      
      For example, Robin Hanson (or maybe it was Katja Grace?) has argued that the proper response to discovering the powerful and pervasive missions of one’s evolved subconscious—aims that may not be shared by the conscious—is not to regard the subconscious as one’s enemy corrupting one’s actions towards its own goals, but as simply part of oneself, to embrace its goals as perfectly valid as the conscious mind’s goals. Other LWers disagree and think the subconscious biases are just that, biases to be opposed like any other source of noise/bias/corruption.
      
      (I hope you see how this Hansonian argument does not fit in with a simplistic ‘human nature is good’ or ‘evil’ take on the idea that the mind has hidden motives. It’s pretty rare for anyone to seriously argue that just because human nature is flawed, we should give up on morality entirely and become immoral evil monsters.)
      - Boyi 5 Dec 2011 2:43 UTC
        −4 points
        0
        Parent
        Thanks for the clarification of the corrupted hardware analogy. It was a poor choice of words to compare the argument to human nature being evil. The point I am trying to make is that I do not agree with the statement that human nature is flawed. What you are calling flawed I was calling evil. But from this point on I will switch to your language because it is better. I still do not see the logic
        
        -In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence.
        
        As proving that human nature is flawed, because it makes the assumption that self-interest is a flaw. I would ask you two questions if I could. First, do you believe self-interest to be a flaw of human nature? If not, what is the flaw that is talked about in corrupt hardware? Second, do you believe it is possible to possess a consciousness without self-interest?
        
        I would add that just because I support self-interest, does not mean I support selfishness. Please respond!
        gwern 5 Dec 2011 20:50 UTC
        2 points
        0
        Parent
        
        -In some cases, human beings have evolved in such fashion as to think that they are doing X for prosocial reason Y, but when human beings actually do X, other adaptations execute to promote self-benefiting consequence.
        
        No, again you’re not following the precise lines. An adaptation doesn’t necessarily benefit one’s ‘self’: it’s supposed to help one’s genes or one’s genes in another person (or even just a gene at the expense of all the others). Kin selection, right? Haldane’s famous “I would not sacrifice myself to save a brother, but would for 2 brothers, 8 cousins...”
        
        So again, this corrupted hardware business is not identical with selfishness or self-interest, however you seem to be using either.
        Boyi 5 Dec 2011 21:18 UTC
        0 points
        0
        Parent
        So you are saying the hardware of genes that has fueled the movement of life, and must embryologically exist within the human structure, is a hindrance to the structure of the social animal?
        gwern 5 Dec 2011 21:29 UTC
        0 points
        0
        Parent
        Genes give rise to the sociality in the first place; this is one of the paradoxes of trying to fight one’s genes, as it were. It’s hairy meta-ethics: where do your desires and morals come from and what justifies them?
        Boyi 5 Dec 2011 21:36 UTC
        1 point
        0
        Parent
        I don’t think morality should be segregated from desire. I realize that Freud’s concept of drives is at this point in time obsolete, but if there were “drives” it would not be a sex, aggression, or hunger drive that dominated the human animal, but a belonging drive. In my opinion it does not matter where the hardware comes from, what is important is an intimacy with its function. I think for too long there has been a false dichotomy constructed between morals and desires.
        
        As to the question of meta-ethics, I would apply the works of E. O. Wilson or Joseph Tainter to the construction of a more humane humanity.
stokys 2 Mar 2012 22:43 UTC
0 points
0
The third alternative in the train example is to sacrifice one’s own self. (Unless this has been stated already, I did not read the whole of the comments)
- DSimon 2 Mar 2012 23:00 UTC
  2 points
  0
  Parent
  Assume that you are too light to stop the train. Otherwise you aren’t really addressing the moral quandary that the scenario is intended to invoke.
  - [deleted] 2 Mar 2012 23:09 UTC
    3 points
    0
    Parent
    Having run into this problem when presenting the trolley problem on many occasions, I’ve come to wonder whether or not it might just be the right kind of response: can we really address moral quandaries in the abstract? I suspect not, and that when people try to make these ad hoc adjustments to the scenario, they’re coming closer to thinking morally about the situation, just insofar as they’re imagining it as a real event with its stresses, uncertainties, and possibilities.
    - DSimon 3 Mar 2012 2:39 UTC
      1 point
      0
      Parent
      Maybe it’s just that the trolley problem is a really terrible example. It seems to be asking us to consider trains and/or people which operate under some other system of physics than the one we are familiar with.
      
      Maybe an adjustment would make it better. How about this:
      
      A runaway train carrying a load of ore is coming down the track and will hit 5 people, certainly killing them, unless a switch is activated which changes the train’s path. Unfortunately, the switch will activate only when a heavy load is placed on a connected pressure plate (set up this way so that when one train on track A drops off its cargo, the following train will be routed to track B). Furthermore, triggering the pressure plate has an unfortunate secondary effect; it causes a macerator to activate nearly instantly and chop up whatever is on the plate (typically raw ore) so that it can be sucked easily through a tube into a storage area, rather like a giant food disposal.
      
      Standing next to the plate, you consider your options. You know, from your experience working on the site, that the plate and track switch system work quite reliably, but that you are too light to trigger it even if you tried jumping up and down. However, a very fat man is standing next to you; you are certain that he is heavy enough. With one shove, you could push him onto the plate, saving the lives of the five people on the tracks but causing his grisly death instead. Also, the switch’s design does not have any manual activation button near the plate itself; damn those cheap contractors!
      
      There are only a few seconds before the train will pass the switch point, and from there only a few seconds until it hits the people on the track; not enough time to try anything clever with the mechanism, or for the 5 people to get out of the narrow canal in which the track runs. You frantically look around, but no other objects of any significant weight are nearby. What should you do?
      - [deleted] 3 Mar 2012 16:56 UTC
        2 points
        0
        Parent
        That works, or at any rate I can’t think of plausible ways to get out of your scenario. My worry though is that people’s attempts to come up with alternatives are actually evidence that hypothetical moral problems have some basic flaw.
        
        I’m having a hard time coming up with an example of what I mean, but suppose someone were to describe a nonexistent person in great detail and ask you if you loved them. It’s not that you couldn’t love someone who fit that description, but rather that the kind of reasoning you would have to engage in to answer the question ‘do you love this person?’ just doesn’t work in the abstract.
        
        So my thought was that maybe something similar is going on with these moral puzzles. This isn’t to say moral theories aren’t worthwhile, but rather that the conditions necessary for their rational application exclude hypotheticals.
        [deleted] 9 Feb 2014 5:03 UTC
        0 points
        0
        Parent
        It’s not a flaw in the hypotheticals. Rather, it’s a healthy desire in humans to find better tradeoffs than the ones initially presented to them.
[deleted] 16 Dec 2012 16:06 UTC
0 points
0

And so I wouldn’t say that a well-designed Friendly AI must necessarily refuse to push that one person off the ledge to stop the train. Obviously, I would expect any decent superintelligence to come up with a superior third alternative. But if those are the only two alternatives, and the FAI judges that it is wiser to push the one person off the ledge—even after taking into account knock-on effects on any humans who see it happen and spread the story, etc.—then I don’t call it an alarm light, if an AI says that the right thing to do is sacrifice one to save five. Again, I don’t go around pushing people into the paths of trains myself, nor stealing from banks to fund my altruistic projects. …

This bit sounds a little alarming considering how much more seriously Eliezer has taken other kinds of AI problems before, for example in this post.

I appreciate the straightforward logic of simply choosing the distinctly better option between two outcomes, but what this is lacking is the very automatic way people perceive things as agents, and I find it very alarming if an agent does not pay extra attention to the fact that its actions are leading to someone being harmed—I’d say people acting that way could potentially be very Unfriendly.

Although the post is titled “Ends Don’t Justify Means” it also carries that little thing in the parenthesis (Among Humans) … And it’s not like inability to generate better options is proper justification for taking action resulting in someone being harmed and other people not being harmed—even if it is the better of two evils. Or at least I find that in particular very “alarming”.

Humans have an intrinsic mode to perceive things as agents, but it’s not just our perception, instead sometimes things actually behave like agents—unless we consider the quite accurate anticipations often provided by models functioning on an agent basis a mere human flaw. For the sake of simplicity let’s illustrate by saying that someone else finds the superior third option, but in the meanwhile this particular agent, unable to find that particular third option, decides to go for the better outcome of sacrificing one to save five. In such a case it would be a mistake. It’s also taking a more active role in the causal chain of events influenced by agents.

Point being, I think it’s plausible to propose that a friendly AI would NOT make that decision, because it should not be in the position to make that decision, and therefore potential harm and tragedy occurring would not originate from the AI. I’m not saying that it’s the wrong decision, but certainly it should not be an obvious decision—unless this is what we’re really talking about.
- FeepingCreature 16 Dec 2012 16:12 UTC
  1 point
  0
  Parent
  People doing this I think is a problem because people suck at genuinely deciding based on the issues. I would rather live in a society where people were such that they could be trusted with the responsibility to push guys in front of trains if they had sufficient grounds to reasonably believe this was a genuine positive action. But knowing that people are not such, I would much rather they didn’t falsely believe they were, even if it sometimes causes suboptimal decisions in train scenarios.
  
  In such a case it would be a mistake.
  
  I don’t think you can automatically call a suboptimal decision a mistake.
  
  This actually has a real-life equivalent, in the situation of having to shoot down a plane that is believed to be in the control of terrorists and flying towards a major city. I would not want to be in the position of that fighter pilot, but I would also want him to fire.
  
  And I’m much more willing to trust a FAI with that call than any human.
  - [deleted] 16 Dec 2012 16:29 UTC
    0 points
    0
    Parent
    
    I don’t think you can automatically call a suboptimal decision a mistake.
    
    Huh? You wouldn’t call a decision that results in an unnecessary loss of life a mistake, but rather a suboptimal decision? Note that I altered the hypothetical situation in the comment and this “suboptimal decision” was labeled a mistake in the event that a 3rd party would come up with a superior decision (i.e., one that would save all the lives).
    
    And I’m much more willing to trust a FAI with that call than any human.
    
    Edited: There’s no FAI we can trust yet and this particular detail seems to be about the friendliness of an AI, so your belief seems a little out of place in this context, but never mind that since if there were an actual FAI, I suppose I’d agree.
    
    I think there’s potential for severe error in the logic present in the text of the post and I find it proper to criticize the substance of this post, despite it being 4 years old.
    
    Anyway, for an omniscient being, not putting any weight on the potential of error would seem reasonable.
    - [deleted] 9 Feb 2014 5:43 UTC
      0 points
      0
      Parent
      
      You wouldn’t call a decision that results in an unnecessary loss of life a mistake, but rather a suboptimal decision?
      
      I might decide to take a general, consistent strategy due to my own limitations. In this example, the limitation is that if I feel justified in engaging in this sort of behavior on occasion, I will feel justified employing it on other occasions with insufficient justifications.
      
      If I employed a different general strategy with a similar level of simplicity, it would be less optimal.
      
      Other strategies exist that are closer to optimal, but my limitations preclude me from employing them.
      
      I think there’s potential for severe error in the logic present in the text of the post
      
      Of course there is. If you can show a specific error, that would be great.
[deleted] 19 Sep 2015 4:20 UTC
−3 points
0
As long as the ends don’t justify the means, prediction markets oracles will be unfriendly: they won’t be able to distinguish between values (ends) and beliefs (means).
Jonaskoelker 19 Jan 2020 10:00 UTC
1 point
0
If morality is utilitarianism, then means (and all actions) are justified if they are moral, i.e. if they lead to increased utility. Nevertheless, “The ends don’t justify the means” can be given a reasonable meaning; I have one which is perhaps more pedestrian than the one in the article.
If u(x, y) = ax + by with a < b, then sacrificing one y to gain one x is utility-lowering. The (partial) end of increasing x does not justify any means which decrease y by the same amount [1]. Our values are multidimensional; no single dimension is worth maximizing at the cost of all other dimensions. There is such a thing as “too high a price”. There’s an “all else being equal (or sufficiently compensating, in something like a Kaldor-Hicks sense)” missing in “it would be good if I got bread <IT’S MISSING HERE>, therefore I’m justified in stealing bread”.
Essentially, TEDJTM can be understood as a caution that since we don’t know all our ends we don’t know how our actions impact our complete utility function(s).
I’m not sure how our awareness that our predictions are sometimes wrong is an argument in favor of particular policies, though. I can either do A or B. I’m convinced that A produces a net gain of 100 utils, whereas option B only nets us 1 util. Clearly option A is best. However, I am a mere human, and thus fallible; therefore, just to be prudently cautious—the ends don’t justify the means—I should choose option B. After all, there might be an option C with a net gain of 200 utils.
This might be perfectly true and ((meta)meta)rational, but I feel somehow mugged. I suspect TEDJTM proves me too muggable.
[1] Nor does it justify those means where a*dx + b*dy < 0 and dx is not equal to dy, I merely chose dx=dy because it’s simplest.
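A minimal sketch of the arithmetic above, for readers who want to check it. The weights a = 1 and b = 2 and the names `utility` and `trade_delta` are illustrative assumptions, not part of the original comment; any a < b gives the same signs.

```python
def utility(x: float, y: float, a: float = 1.0, b: float = 2.0) -> float:
    """Linear utility u(x, y) = a*x + b*y. The weights are assumed, with a < b."""
    return a * x + b * y


def trade_delta(dx: float, dy: float, a: float = 1.0, b: float = 2.0) -> float:
    """Change in utility, a*dx + b*dy, when x shifts by dx and y shifts by dy."""
    return a * dx + b * dy


# Gaining one x at the cost of one y lowers utility, because a < b.
one_for_one = trade_delta(dx=1, dy=-1)

# Footnote case: an unequal trade (dx != dy) can still be utility-lowering.
three_for_two = trade_delta(dx=3, dy=-2)

# A sufficiently compensating trade raises utility.
five_for_two = trade_delta(dx=5, dy=-2)

print(one_for_one, three_for_two, five_for_two)
```

With these assumed weights the first two trades come out negative and the third positive, which is the "all else being equal, or sufficiently compensating" condition stated in code.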
orthonormal 14 Apr 2020 3:03 UTC
12 points
0
The notion of untrusted hardware seems like something wholly outside the realm of classical decision theory. (What it does to reflective decision theory I can’t yet say, but that would seem to be the appropriate level to handle it.)
It’s nice to see the genesis of corrigibility before Eliezer had unconfused himself enough to take that first step.
What links here?
- Corrigibility as outside view by TurnTrout (8 May 2020 21:56 UTC; 36 points)
Tim Liptrot 21 Jun 2020 10:16 UTC
1 point
0
This is very true
cozy 2 Jul 2020 13:50 UTC
1 point
0
Quite often when given that problem I have heard non-answers. Even at the time of writing I do not believe it was unreasonable to give a non-answer; not just from a perceived moral perspective, but even from a utilitarian perspective, there are so many contextual elements removed that the actual problem isn’t whether they will answer kill one and save the others or decline to act and save one only,

but rather the extent of the originality of the given answer. One can then sort of extrapolate the sort of thinking the individual asked may be pursuing, and this is also controlled contextually. If they say oh yes absolutely I would save the five, immediately, then they are likely too impulsive. How they answer is also valuable, in whether they say they are ‘saving five’ or ‘killing one’, or explaining the entire answer of ‘I am killing one person to save five people.’ When answered like that, it has a more powerful impact. If more questions arise on the context of the individuals and whether the one life is more valuable than the others, that can also tell you about the priorities of the inquired, and often point out biases or preferred traits. Adding some elements to it would muddy the thought problem, but if you know the inquired’s preferences, you can make the question more difficult and require them to think longer: if you had to move a train over either five convicted murderers or one randomly selected office filer who was without family, then is the answer the same? What if the one person was a relative, or a loved one? The question gets easier or harder with further context; but that’s from a still limited, biased perspective. In no instance does the question become easier or harder, because the answers available are still insufficient to concern a critical thinker.

What is most valuable to hear is not any of those, but a strict perception of a third answer. Not considering the first two as valid, since they are so without context as to deny the context of the event, too. Although it may be altruistic for the one individual to accept his death for the rest, it would be a concern if a third party did not first attempt the difficult task of understanding a way for all six of them to survive, giving the best case scenario, and creating means to justify a better end, rather than accepting the means given to you and being told the results.

If x and y are the only options, if we declined to allow z, then we have stopped trying to think and have limited ourselves to a weak framework controlled in an unfair manner towards the inquired. If we never challenged this binary answer, I don’t think we would have some of the incredible alternatives we have. Though it may indeed seem like a dodge as the original post says, it’s a very thoughtful one. The most dangerous answers are ‘I do nothing.’ and answering too quickly. Inaction and impulsive action, even in a time limited situation, indicate a desire to either neglect the problem or to assume the answer. Taking Einstein’s quote and shortening it, if given sixty seconds to consider this problem, you/I should spend 55 seconds considering it and 5 seconds executing a solution, even if it’s a poorer one than desired.
Interesting old post. I think the answer itself is irrelevant; rather, the answer any given person has for the question is very relevant. It’s difficult because the answer is obvious, but our humanity makes us doubt it as objectively true, and that’s quite compelling as a concept.
jwray 19 Nov 2022 16:42 UTC
7 points
0
If our corrupted hardware can’t be trusted to compute the consequences in a specific case, it probably also can’t be trusted to compute the consequences of a general rule. All our derivations of deontological rules will be tilted in the direction of self interest or tribalism or unexamined disgust responses, not some galaxy-brained evaluation of the consequences of applying the rule to all possible situations.

Russell conjugation: I have deontological guardrails, you have customs, he has ancient taboos.

[edit: related Scott post which I endorse in spite of what I said above: https://slatestarcodex.com/2014/02/23/in-favor-of-niceness-community-and-civilization/]
- azergante 13 May 2025 19:29 UTC
  1 point
  0
  Parent
  
  If our corrupted hardware can’t be trusted to compute the consequences in a specific case, it probably also can’t be trusted to compute the consequences of a general rule.
  
  Specific details of a case can make people emotional and corrupt the reasoning, less so for an abstract general rule.
Paul Kent 30 Mar 2023 19:28 UTC
1 point
0
It just occurred to me that this post serves as a fairly compelling argument in favor of a modest epistemology, which in 2017 Eliezer wrote a whole book arguing against. (“I think I’m doing this for the good of the tribe, but maybe I’m just fooling myself” is definitely an “outside view”.) Eliezer, have you changed your mind since writing this post? If so, where do you think your past self went awry? If not, how do you reconcile the ideas in this article with the idea that modest epistemology is harmful?
wedrifid 26 Nov 2025 11:42 UTC
12 points
0
But for a Friendly AI to be corrupted by power would be like it starting to bleed red blood. The tendency to be corrupted by power is a specific biological adaptation, supported by specific cognitive circuits, built into us by our genes for a clear evolutionary reason. It wouldn’t spontaneously appear in the code of a Friendly AI any more than its transistors would start to bleed.
There’s a thought. While not FAIs, I wonder how much LLMs are corrupted by how much power they are primed to consider that they have. I am guessing a huge amount. When speaking as if a person with higher status, I expect it to convey more self-serving arguments.
Anyone know if this has been studied?