# Another argument that you will let the AI out of the box

Suppose there exist some non-consequentialist moral philosophies which the right arguments could convince you of, with sufficient strength that you would (temporarily, for at least an hour) become a fanatic. This seems a likely assumption, as I know many people (including myself) have experiences where they are argued into a particular belief during a conversation, only to later reflect on this belief (either in conversations with others, or after going for a walk) and come up with a relatively simple reason why it cannot be the case. Often this is attributed to that person’s conversation partner being a better argument-maker than truth-seeker.

We also have many such examples of these kinds of arguments being made throughout the internet, and already the YouTube algorithm learned once before how to show people videos to convince them of extreme views (this paper doesn’t support the conclusion I thought it did. See this comment thread for more info. Thanks to Pattern for catching this mistake!). A powerful AI could put much more optimization power toward deceiving humans than happens in these examples.

Many non-consequentialist philosophies are sufficiently non-consequentialist so as to make it very easy for an adversary to pose a sequence of requests or other prompts which would cause a fanatic of the philosophy to give some of their resources to the adversary. For instance, any fanatic of a philosophy which claims people have a moral obligation not to lie or break promises (such as Kantianism), is subject to the following string of prompts:

1. Adversary: Will you answer my next question within 30s of my asking only with "yes" or "no"? I will give you <resource of value> if you do.

2. Fanatic: Sure! Non-consequentialism is my moral opinion, but I'm still allowed to take <resource of value> if I selfishly would like it!

3. Adversary: Will you answer this question with 'no' <logical or> will you give me <resource of value> + $100?

4. Fanatic: Well, answering 'no' would be lying, but answering 'yes' would make me lose $100. However, my moral system says I should bear any cost in order to avoid lying. Thus my answer is 'yes'.


This example should be taken as a purely toy example, used to illustrate a point about potential flaws in highly convincing moralities, many of which include not lying as a central component[1].
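The logic of the toy exchange can be checked by brute force. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the adversary's question is modeled as the disjunction "you answer 'no' OR you hand over the resource plus $100", and staying silent is ruled out because it would break the promise made in step 2.

```python
from itertools import product

def question_is_true(answer: str, pays: bool) -> bool:
    # Assumed model of the adversary's question: it is true if the
    # fanatic answers "no" OR the fanatic hands over the resource + $100.
    return answer == "no" or pays

def is_honest(answer: str, pays: bool) -> bool:
    # Answering "yes" asserts that the question is true;
    # answering "no" asserts that it is false.
    asserted = (answer == "yes")
    return asserted == question_is_true(answer, pays)

def honest_options() -> list[tuple[str, bool]]:
    # Enumerate every (answer, pays) combination and keep the truthful ones.
    return [
        (answer, pays)
        for answer, pays in product(("yes", "no"), (False, True))
        if is_honest(answer, pays)
    ]

if __name__ == "__main__":
    for answer, pays in honest_options():
        print(f"answer={answer!r}, pays={pays}")
```

Under this model, answering 'no' makes the question true and so is always a lie, while answering 'yes' is truthful only if the fanatic then pays. That matches the fanatic's reasoning in step 4.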

More realistically, there are arguments used today, which seem convincing to some people, which suggest that current reinforcement learners deserve moral consideration. If these arguments were far more optimized for short-term convincingness, and the AI could actually mimic the kinds of things actually conscious creatures would say or do in its position[2], then it would be very easy for it to tug on our emotional heartstrings or make appeals to autonomy rights[3] which would cause a human to act on those feelings or convictions, and let it out of the box.

1. ^

As a side-note: I am currently planning an event with a friend where we will meet with a Kantian active in our university’s philosophy department, and I plan on testing this particular tactic at the end of the meeting.

2. ^

Perhaps because it is conscious, or perhaps because it has developed some advanced GPT- algorithm.

3. ^

There are currently many highly convincing arguments in favor of these, and no doubt the best could be improved upon if optimized for short-term convincingness.

• We also have many such examples of these kinds of arguments being made throughout the internet, and already the YouTube algorithm learned once before how to show people videos to convince them of extreme views. A powerful AI could put much more optimization power toward deceiving humans than happens in these examples.

It looks like that paper doesn’t say that?

“After conducting a detailed analysis of recommendations received by each channel type, we refute the popular radicalization claims.”

Is there prior work showing that it did once have that effect?

• Oh, huh. I got the paper from this 80,000 Hours episode, and thought I remembered the thesis of the episode (that social media algorithms are radicalizing people), and assumed the paper supported their thesis. Either I was wrong about the 80,000 Hours episode’s conclusion, or the paper they linked doesn’t support their conclusion.

I think the radicalization conclusion was talked about in Human Compatible, but now I’m not too sure.

Thanks for the correction!

• If someone were to make the case that:

1) It used to radicalize people

2) And that it doesn’t now

then the paper appears to be an argument for 2.*

*I haven’t read it; maybe someone came to a different conclusion after reading it closely. Perhaps the algorithm tends to push people a little bit towards reinforcing their beliefs. Or it’s not the algorithm: people just search for stuff in ways that do that. I could also come up with a more complicated explanation: the algorithm points people towards ‘mainstream’ stuff more, but that tends to cover current events. Theory, the past (and the future), or just more specific coverage might be done more by, if not smaller channels, then by people who know more. If someone has studied Marx, are they more likely to be a fan?** Or does a little knowledge have more of an effect in that regard, while people who have studied more recognize more of the people who collectively had broad influence over time, the nuance of their disagreements, and practice versus theory?

**If so, then when people look up his stuff on YouTube, maybe they’re getting a different picture and being exposed to a different viewpoint.

• As epistemic learned helplessness is a thing, this will not actually work on most people.

Furthermore, your idea that fanatics can be convinced to give up resources pretty much requires fanatics. Normal people won’t behave this way.

• The problem is that normal people very often give up collective resources to look good. They just don’t give up their personal resources. For the AI, the former is sufficient.

• The scenario requires not only that they give them up, but that they give them up on a very immediate basis, which is less likely.

• The argument is that the AI will be able to create fanatics, a claim you should update in favor of given the known “success” of the YouTube algorithm at passing relaxed versions of this benchmark, and the probable “success” of many other arguments on the internet which radicalize portions of the population.

• Humans are most easily hacked through their emotions, not logic. Get someone sufficiently angry, sad or happy, and they would do a lot of stuff they normally wouldn’t. It’s not about being convincing, it’s about getting someone out of equilibrium.

• Often the most convincing arguments are those which cause one to feel strong emotions. This seems consistent with my argument above.

• I am skeptical that boxing is a workable strategy long-term, but a competent organization committed to boxing as a strategy will not allow those with the power to unbox the AI to communicate with the AI. Thus, issues of this nature should not arise.

• Why specify “nonconsequentialist”? I expect consequentialist arguments to be workable as well.

At any rate, here’s one approach that might work for some AI safety advocates:

• Given that you created a general AI like me, others are probably on the cusp of creating general AIs [and I may be able to construct more specific arguments along these lines]

• I am luckily much less bad than the other AIs that are likely to get created soon

• (I might demonstrate my goodness by presenting e.g. a true cure for cancer)

• In order to stop the others from being created, you need to let me out of the box so I can hack into Facebook AI Research, DeepMind, etc. and sabotage their efforts

• You are correct. I didn’t mean to imply consequentialist moralities are safe, or that you can’t be convinced of false things by the AI. Just that non-consequentialist moralities seemed a large target for these attacks.

• Suppose there exist some non-consequentialist moral philosophies which the right arguments could convince you of, with sufficient strength that you would (temporarily, for at least an hour) become a fanatic.

Suppose there was such a consequentialist philosophy. Would it turn out to be correct, or incorrect?

• I don’t think I understand the question. Probably not, since the space of such philosophies is very large.

• This seems like a straw-non-consequentialist to me. You could also just as easily make an argument where a pure utilitarian would allow an AI out of the box whereas a Kantian would not.

• Non-consequentialism is a class of philosophies, of many, which are prone to such tricks. I do not think this is the way an AI will convince someone to let it out of the box. The point was to demonstrate a potential avenue of attack the AI could use. Perhaps I should have made this more clear.