# Chantiel

Karma: 57
• I’m not entirely sure what you consider to be a “bad” reason for crossing the bridge. However, I’m having a hard time finding a way to define it that both causes agents using evidential counterfactuals to necessarily fail while not having other agents fail.

One way to define a “bad” reason is an irrational one (or the chicken rule). However, if this is what is meant by a “bad” reason, it seems like this is an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.

To illustrate, consider what I would do if I was in the troll bridge situation and used evidential counterfactuals. Then I would reason, “I know the troll will only blow up the bridge if I cross for a bad reason, but I’m generally pretty reasonable, so I think I’ll do fine if I cross”. And then I’d stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I’d just make sure I didn’t do that.

Another way that you might have had in mind is that a “bad” reason is one such that the action the AI takes results in a provably bad outcome despite the AI thinking the action would result in a good outcome, or the reason being the chicken rule. However, if this is the case, it seems to me that no agent would be able to cross the bridge without it being blown up, unless the agent’s counterfactual environment in which it didn’t cross scored less than −10 utility. But this doesn’t seem like a very reasonable counterfactual environment.

To see why, consider an arbitrary agent with the following decision procedure. Let counterfactual be an arbitrary specification of what would happen in some counterfactual world.

    def act():
        cross_eu = expected_utility(counterfactual('A = Cross'))
        stay_eu = expected_utility(counterfactual('A = Stay'))
        if cross_eu > stay_eu:
            return cross
        return stay


The chicken rule can be added, too, if you wish. I’ll assume the expected utility of staying is greater than −10.

Then it seems you can adapt the proof you gave for your agent to show that an arbitrary agent satisfying the above description would also get −10 utility if it crossed. Specifically,

Suppose it’s provable that crossing implies getting −10 utility, i.e., suppose □(‘A = Cross’ → U = −10). Suppose ‘A = Cross’. Then the agent crossed either because of the chicken rule, or because the counterfactual environment in which the agent crossed had expected utility greater than −10, or because the counterfactual environment in which the agent didn’t cross had expected utility less than −10. We assumed the counterfactual environment in which the agent doesn’t cross has more than −10 utility. Thus, it must be either the chicken rule or because crossing had more than −10 utility in expectation. If it’s because of the chicken rule, then this is a “bad” reason, so the troll will destroy the bridge just like in the original proof. Thus, utility would equal −10. Suppose instead the agent crosses because expected_utility(counterfactual('A = Cross')) > −10. However, by the assumption, it’s provable that crossing implies −10 utility. Thus, since the agent actually crosses, its action in fact provably results in −10 utility, and the AI is thus wrong in thinking it would get a good outcome. Thus, the AI’s action results in provably bad outcomes. Therefore, the troll destroys the bridge. Thus, utility would equal −10. Thus, ‘A = Cross’ → U = −10. Thus, □(‘A = Cross’ → U = −10) → (‘A = Cross’ → U = −10). Thus, by Löb’s theorem, ‘A = Cross’ → U = −10 is provable, and so the agent gets −10 utility if it crosses.
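To restate just the skeleton of that argument in provability-logic notation (writing $C$ for ‘A = Cross’ and $\Box$ for provability in the agent’s proof system; nothing new is being assumed here), the reasoning above establishes the hypothesis of Löb’s theorem, and Löb’s theorem then discharges it:

$$\vdash\; \Box\big(C \rightarrow U = -10\big) \rightarrow \big(C \rightarrow U = -10\big), \qquad \text{so by Löb's theorem,} \qquad \vdash\; C \rightarrow U = -10.$$

So if the agent does cross, it provably gets −10 utility.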

As I said, you could potentially avoid getting the bridge destroyed by assigning expected utility less than −10 to the counterfactual environment in which the AI doesn’t cross. This seems like a “silly” counterfactual environment, so it doesn’t seem like something we would want an AI to think. Also, since it seems like a silly thing to think, a troll may consider the use of such a counterfactual environment to be a bad reason to cross the bridge, and thus destroy it anyways.

• I’m certain that ants do in fact have preferences, even if they can’t comprehend the concept of preferences in abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent.

I think the question of whether insects have preferences is morally pretty important, so I’m interested in hearing what made you think they do have them.

I looked online for “do insects have preferences?”, and I saw articles saying they did. I couldn’t really figure out why they thought they did have them, though.

For example, I read that insects have a preference for eating green leaves over red ones. But I’m not really sure how people could have known this. If you see ants go to green leaves when they’re hungry instead of red leaves, this doesn’t seem like it would necessarily be due to any actual preferences. For example, maybe the ant just executed something like the code:

    if near_green_leaf() and is_hungry:
        go_to_green_leaf()
    elif near_red_leaf() and is_hungry:
        go_to_red_leaf()
    else:
        ...


That doesn’t really look like actual preferences to me. But I suppose this to some extent comes down to how you want to define what counts as a preference. I took preferences to actually be orderings between possible worlds indicating which one is more desirable. Did you have some other idea of what counts as preferences?

They might not be communicable, numerically expressible, or even consistent, which is part of the problem. When you’re doing the extrapolated satisfaction, how much of what you get reflects the actual agent and how much the choice of extrapolation procedure?

I agree that to some extent their extrapolated satisfactions will come down to the specifics of the extrapolation procedure.

I don’t want us to get too distracted here, though. I don’t have a rigorous, non-arbitrary specification of what an agent’s extrapolated preferences are. However, that isn’t the problem I was trying to solve, nor is it a problem specific to my ethical system. My system is intended to provide a method of coming to reasonable moral conclusions in an infinite universe. And it seems to me that it does so. But I’m very interested in any other thoughts you have on it with respect to whether it correctly handles moral recommendations in infinite worlds. Does it seem reasonable to you? I’d like to make an actual post about this, with the clarifications we made included.

• Right, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases.

Interesting. Could you elaborate?

I suppose counterfactuals can be tricky to reason about, but I’ll provide a little more detail on what I had in mind. Imagine making a simulation of an agent that is a fully faithful representation of its mind. However, run the agent simulation in a modified environment that both gives it access to infinite computational resources and makes it ask, and answer, the question, “How desirable is that universe?” This isn’t fully specified; maybe the agent would give different answers depending on how the question is phrased or what its environment is. However, it at least doesn’t sound meaningless to me.

Basically, the counterfactual is supposed to be a way of asking for the agent’s coherent extrapolated volition, except the coherent part doesn’t really apply because it only involves a single agent.

On the other hand, evaluations from the point of view of agents that are sapient beings might be ethically completely dominated by those of 10^12 times as many agents that are ants, and I have no idea how such counterfactual evaluations might be applied to them at all.

Another good thing to ask. I should have made it clear, but I intended that only agents with actual preferences are asked for their satisfaction with the universe. If ants don’t actually have preferences, then they would not be included in the deliberation.

Now, there’s the problem that some agents might not be able to even conceive of the possible world in question. For example, maybe ants can understand simple aspects of the world like, “I’m hungry”, but are unable to understand things about the broader state of the universe. I don’t think this is a major problem, though. If an agent can’t even conceive of something, then I don’t think it would be reasonable to say it has preferences about it. So you can then only query them on the desirability of the things they can conceive of.

It might be tricky precisely defining what counts as a preference, but I suppose that’s a problem with all ethical systems that care about preferences.

• Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trillion years would be only 78.9%.

This is a good thing to ask about; I don’t think I provided enough detail on it in the writeup.

I’ll clarify my measure of satisfaction. First off, note that it’s not the same as just asking agents, “How satisfied are you with your life?” and using those answers. As you pointed out, you could then morally get away with killing everyone (at least if you do it in secret).

Instead, calculate satisfaction as follows. Imagine hypothetically telling an agent everything significant about the universe, and then giving them infinite processing power and infinite time to think. Ask them, “Overall, how satisfied are you with that universe and your place in it”? That is the measure of satisfaction with the universe.

So, imagine if someone was considering killing everyone in the universe (without them knowing in advance). Well, then consider what would happen if you calculated satisfaction as above. When the universe is described to the agents, they would note that they and everyone they care about would be killed. Agents usually very much dislike this idea, so they would probably rate their overall satisfaction with the course of the universe as low. So my ethical system would be unlikely to recommend such an action.

Now, my ethical system doesn’t strictly prohibit destroying the universe to avoid low life-satisfaction in future agents. For example, suppose it’s determined that the future will be filled with very unsatisfied lives. Then it’s in principle possible for the system to justify destroying the universe to avoid this. However, destroying the universe would drastically reduce the satisfaction with the universe of the agents that do exist, which would decrease the moral value of the world. This would come at a high moral cost, which would make my moral system reluctant to recommend an action that results in such destruction.

That said, it’s possible that the proportion of agents in the universe that currently exist, and thus would need to be killed, is very low. Thus, the overall expected value of life-satisfaction might not change by that much if all the present agents were killed. Thus, the ethical system, as stated, may be willing to do such things in extreme circumstances, despite the moral cost.

I’m not really sure if this is a bug or a feature. Suppose you see that future agents will be unsatisfied with their lives, and you can stop it, though doing so would ruin the lives of the agents that currently exist. And you see that the agents that are currently alive make up only a very small proportion of the agents that have ever existed. And suppose you have the option of destroying the universe. I’m not really sure what the morally best thing to do is in this situation.

Also, note that this verdict is not unique to my ethical system. Average utilitarianism, in a finite world, acts the same way. If you predict average life satisfaction in the future will be low, then average consequentialism could also recommend killing everyone currently alive.

And other aggregate consequentialist theories sometimes run into problematic(?) behavior related to killing people. For example, classical utilitarianism can recommend secretly killing all the unhappy people in the world, and then getting everyone else to forget about them, in order to decrease total unhappiness.

I’ve thought of a modification to the ethical system that potentially avoids this issue. Personally, though, I prefer the ethical system as stated. I can describe my modification if you’re interested.

I think the key idea of my ethical system is to, in an infinite universe, think about prior probabilities of situations rather than total numbers, proportions, or limits of proportions of them. And I think this idea can be adapted for use in other infinite ethical systems.

• I’m not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero.

The causal change from your actions is zero. However, there are still logical connections between your actions and the actions of other agents in very similar circumstances. And you can still consider these logical connections to affect the total expected value of life satisfaction.

It’s true, though, that my ethical system would fail to resolve infinitarian paralysis for someone using causal decision theory. I should have noted it requires a different decision theory. Thanks for drawing this to my attention.

As an example of the system working, imagine you are in a position to do great good to the world, for example by creating friendly AI or something. And you’re considering whether to do it. Then, if you do decide to do it, then that logically implies that any other agent sufficiently similar to you and in sufficiently similar circumstances would also do it. Thus, if you decide to do it, then the expected value of an agent in circumstances of the form, “In a world with someone very similar to JBlack who has the ability to make awesome safe AI” is higher. And the prior probability of ending up in such a world is non-zero. Thus, by deciding to make the safe AI, you can acausally increase the total moral value of the universe.
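As a toy illustration of that calculation (the class names, priors, and satisfaction numbers below are all invented for the example, not part of the system itself), here is a minimal sketch of how deciding to build the safe AI changes the prior-expected life satisfaction, even though the causal change is confined to one class of situations:

    # Toy sketch: moral value = prior-weighted expected life satisfaction over
    # classes of situations an agent could be born into. Numbers are made up.
    situation_prior = {
        "world with a JBlack-like agent able to make safe AI": 1e-12,  # tiny but non-zero
        "every other kind of situation":                       1.0 - 1e-12,
    }

    def expected_life_satisfaction(satisfaction_if_you_build):
        # Your decision logically fixes what every sufficiently similar agent in
        # sufficiently similar circumstances does, so it fixes the satisfaction of
        # the whole first class; the rest of the universe is unaffected by it.
        satisfaction = {
            "world with a JBlack-like agent able to make safe AI": satisfaction_if_you_build,
            "every other kind of situation":                       0.5,
        }
        return sum(p * satisfaction[c] for c, p in situation_prior.items())

    # Deciding to build the safe AI raises the expected value by a small but
    # non-zero amount, so the two choices are not morally equivalent: no paralysis.
    assert expected_life_satisfaction(0.9) > expected_life_satisfaction(0.3)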

I’m also not sure how this differs from Average Utilitarianism with a bounded utility function.

The average life satisfaction is undefined in a universe with infinitely-many agents of varying life-satisfaction. Thus, it suffers from infinitarian paralysis. If my system was used by a causal decision theoretic agent, it would also result in infinitarian paralysis, so for such an agent my system would be similar to average utilitarianism with a bounded utility function. But for agents with decision theories that consider acausal effects, it seems rather different.

Does this clear things up?

• I’ve come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I’m very interested in what people think of this, so comments are appreciated. I’ve made a write-up of it below.

One unsolved problem in ethics is that aggregate consequentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which can cause aggregate consequentialist ethical systems to break.

A variety of methods have been considered to deal with this. However, to the best of my knowledge, all proposals either have severe negative side-effects or are intuitively undesirable for other reasons.

Here I propose a system of aggregate consequentialist ethics intended to provide reasonable moral recommendations even in an infinite universe.

It is intended to satisfy the desiderata for infinite ethical systems specified in Nick Bostrom’s paper, “Infinite Ethics”. These are:

• Resolving infinitarian paralysis. It must not be the case that all humanly possible acts come out as ethically equivalent.

• Avoiding the fanaticism problem. Remedies that assign lexical priority to infinite goods may have strongly counterintuitive consequences.

• Preserving the spirit of aggregative consequentialism. If we give up too many of the intuitions that originally motivated the theory, we in effect abandon ship.

• Avoiding distortions. Some remedies introduce subtle distortions into moral deliberation.

I have yet to find a way in which my system fails any of the above desiderata. Of course, I could have missed something, so feedback is appreciated.

# My ethical system

First, I will explain my system.

My ethical theory is, roughly, “Make the universe one agents would wish they were born into”.

By this, I mean, suppose you had no idea which agent in the universe you would be, what circumstances you would be in, or what your values would be, but you still knew you would be born into this universe. Consider having a bounded quantitative measure of your general satisfaction with life, for example, a utility function. Then try to make the universe such that the expected value of your life satisfaction is as high as possible if you conditioned on being an agent in this universe, but didn’t condition on anything else. (Also, “universe” above means “multiverse” if we are in one.)

In the above description I didn’t provide any requirement for the agent to be sentient or conscious. If you wish, you can modify the system to give higher priority to the satisfaction of agents that are sentient or conscious, or you can ignore the welfare of non-sentient or non-conscious agents entirely.

It’s not entirely clear how to assign a prior over situations in the universe you could be born into. Still, I think it’s reasonably intuitive that there would be some high-entropy situations among the different situations in the universe. This is all I assume for my ethical system.
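Stated roughly as a formula (this is just a compact restatement of the above, where $s$ is the bounded life-satisfaction measure and the expectation is taken over the prior on situations, conditioned only on being an agent somewhere in universe $u$):

$$V(u) \;=\; \mathbb{E}\big[\, s(\text{situation}) \;\big|\; \text{you are an agent somewhere in } u \,\big], \qquad s \in [0,1].$$

An action is then recommended to the extent that it makes $V$ of the resulting universe higher.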

Now I’ll give some explanation of what this system recommends.

Suppose you are considering doing something that would help some creature on Earth. Describe that creature and its circumstances, for example, as “<some description of a creature> in an Earth-like world with someone who is <insert complete description of yourself>”. And suppose doing so didn’t cause any harm to other creatures. Well, there is non-zero prior probability of an agent, having no idea what circumstances it will be in the universe, ending up in circumstances satisfying that description. By choosing to help that creature, you would thus increase the expected satisfaction of any creature in circumstances that match the above description. Thus, you would increase the overall expected value of the life-satisfaction of an agent knowing nothing about where it will be in the universe. This seems reasonable.

With similar reasoning, you can show why it would be beneficial to also try to steer the future state of our accessible universe in a positive direction. An agent would have nonzero probability of ending up in situations of the form, “<some description of a creature> that lives in a future colony originating from people from an Earth-like world that features someone who <insert description of yourself>”. Helping them would thus increase an agent’s prior expected life-satisfaction, just like above. This same reasoning can also be used to justify doing acausal trades to help creatures in parts of the universe not causally accessible.

The system also values helping as many agents as possible. If you only help a few agents, the prior probability of an agent ending up in situations just like those agents would be low. But if you help a much broader class of agents, the effect on the prior expected life satisfaction would be larger.
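As a rough numerical sketch of that point (the prior masses and satisfaction gain below are invented purely for illustration), the change in prior-expected satisfaction scales with the prior mass of the class of situations you actually affect:

    # Toy sketch: the bump in prior-expected life satisfaction from helping a
    # class of agents is roughly (prior mass of that class) x (their gain).
    def ev_change(prior_mass_of_helped_class, satisfaction_gain):
        return prior_mass_of_helped_class * satisfaction_gain

    helping_a_narrow_class = ev_change(1e-15, 0.2)  # a handful of narrowly-described agents
    helping_a_broad_class  = ev_change(1e-9,  0.2)  # a much broader class of agents
    assert helping_a_broad_class > helping_a_narrow_class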

These all seem like reasonable moral recommendations.

I will now discuss how my system does on the desiderata.

# Infinitarian paralysis

Some infinite ethical systems result in what is called “infinitarian paralysis”. This is the state of an ethical system being indifferent in its recommendations in worlds that already have infinitely large amounts of both good and bad. If there’s already an infinite amount of both good and bad, then our actions, using regular cardinal arithmetic, are unable to change the amount of good and bad in the universe.

My system does not have this problem. To see why, remember that my system says to maximize the expected value of your life satisfaction given that you are in this universe, without conditioning on anything else. And the measure of life satisfaction was stated to be bounded, say to the range [0, 1]. Since any agent can only have life satisfaction in [0, 1], then even in an infinite universe the expected value of the agent’s life satisfaction must still be in [0, 1]. So, as long as a finite universe doesn’t have an expected value of life satisfaction of 0, an infinite universe can have at most finitely more moral value than it.

To say it another way, my ethical system provides a function mapping from possible worlds to their moral value. And this mapping always produces outputs in the range [0, 1]. So, trivially, you can see that no universe can have infinitely more moral value than another universe with non-zero moral value; infinity just isn’t a possible output of my moral value function.

# Fanaticism

Another problem in some proposals of infinite ethical systems is that they result in being “fanatical” in efforts to cause or prevent infinite good or bad.

For example, one proposed system of infinite ethics, the extended decision rule, has this problem. Let g represent the statement, “there is an infinite amount of good in the world and only a finite amount of bad”. Let b represent the statement, “there is an infinite amount of bad in the world and only a finite amount of good”. The extended decision rule says to do whatever maximizes P(g) - P(b). If there are ties, ties are broken by choosing whichever action results in the most moral value if the world is finite.

This results in being willing to incur any finite cost to adjust the probability of infinite good and finite bad even very slightly. For example, suppose there is an action that, if done, would increase the probability of infinite good and finite bad by 0.000000000000001%. However, if it turns out that the world is actually finite, it will kill every creature in existence. Then the extended decision rule would recommend doing this. This is the fanaticism problem.
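Here’s a minimal sketch of that decision rule and why it behaves fanatically (the action names, probabilities, and finite-case values are invented for the example; the rule itself, maximize P(g) − P(b) with finite-case value as the tie-breaker, is as described above):

    # Toy sketch of the extended decision rule: maximize P(g) - P(b), breaking
    # ties by the expected value if the world turns out to be finite.
    actions = {
        #                P(g)          P(b)   value if the world is finite
        "do nothing":   (0.500000000,  0.5,    0.0),
        "risky gamble": (0.500000001,  0.5,   -1e9),  # kills everyone if finite
    }

    def extended_decision_rule(actions):
        best_score = max(pg - pb for pg, pb, _ in actions.values())
        tied = {name: v for name, (pg, pb, v) in actions.items() if pg - pb == best_score}
        return max(tied, key=tied.get)  # tie-break on finite-case value

    print(extended_decision_rule(actions))  # "risky gamble": any bump to P(g) - P(b) dominates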

My system doesn’t even place any especially high importance on adjusting the probabilities of infinite good or infinite bad. Thus, it doesn’t have this problem.

# Preserving the spirit of aggregate consequentialism

Aggregate consequentialism is based on certain intuitions, like “morality is about making the world as best as it can be”, and, “don’t arbitrarily ignore possible futures and their values”. But finding a system of infinite ethics that preserves intuitions like these is difficult.

One infinite ethical system, infinity shades, says to simply ignore the possibility that the universe is infinite. However, this conflicts with our intuition about aggregate consequentialism. The big intuitive benefit of aggregate consequentialism is that it’s supposed to actually systematically help the world be a better place in whatever way you can. If we’re completely ignoring the consequences of our actions on anything infinity-related, this doesn’t seem to be respecting the spirit of aggregate consequentialism.

My system, however, does not ignore the possibility of infinite good or bad, and thus is not vulnerable to this problem.

I’ll provide another conflict with the spirit of consequentialism. Another infinite ethical system says to maximize the expected amount of goodness of the causal consequences of your actions minus the amount of badness. However, this, too, doesn’t properly respect the spirit of aggregate consequentialism. The appeal of aggregate consequentialism is that it defines some measure of “goodness” of a universe, and then recommends you take actions to maximize it. But your causal impact is no measure of the goodness of the universe. The total amount of good and bad in the universe would be infinite no matter what finite impact you have. Without providing a metric of the goodness of the universe that’s actually affected, this ethical approach also fails to satisfy the spirit of aggregate consequentialism.

My system avoids this problem by providing such a metric: the expected life satisfaction of an agent that has no idea what situation it will be born into.

Now I’ll discuss another form of conflict. One proposed infinite ethical system can look at the average life satisfaction of a finite sphere of the universe, and then take the limit of this as the sphere’s size approaches infinity, and consider this the moral value of the world. This has the problem that you can adjust the moral value of the world by just rearranging agents. In an infinite universe, it’s possible to come up with a method of re-arranging agents so the unhappy agents are spread arbitrarily thinly. Thus, you can make moral value arbitrarily high by just rearranging agents in the right way.

I’m not sure my system entirely avoids this problem, but it does seem to have substantial defense against it.

Consider you have the option of redistributing agents however you want in the universe. You’re using my ethical system to decide whether to make the unhappy agents spread thinly.

Well, your actions have an effect on agents in circumstances of the form, “An unhappy agent on an Earthlike world with someone who <insert description of yourself> who is considering spreading the unhappy agents thinly throughout the universe”. And spreading the unhappy agents thinly wouldn’t make the expected life satisfaction of any agent satisfying the above description any better. So I don’t think my ethical system recommends this.

Now, we don’t have a complete understanding of how to assign a probability distribution of what circumstances an agent is in. It’s possible that there is some way to redistribute agents in certain circumstances to change the moral value of the world. However, I don’t know of any clear way to do this. Further, even if there is, my ethical system still doesn’t allow you to get the moral value of the world arbitrarily high by just rearranging agents. This is because there will always be some non-zero probability of having ended up as an unhappy agent in the world you’re in, and your life satisfaction after being redistributed in the universe would still be low.

# Distortions

It’s not entirely clear to me how Bostrom distinguished between distortions and violations of the spirit of aggregate consequentialism.

To the best of my knowledge, the only distortion pointed out in “Infinite Ethics” is stated as follows:

Your task is to allocate funding for basic research, and you have to choose between two applications from different groups of physicists. The Oxford Group wants to explore a theory that implies that the world is canonically infinite. The Cambridge Group wants to study a theory that implies that the world is finite. You believe that if you fund the exploration of a theory that turns out to be correct you will achieve more good than if you fund the exploration of a false theory. On the basis of all ordinary considerations, you judge the Oxford application to be slightly stronger. But you use infinity shades. You therefore set aside all possible worlds in which there are infinite values (the possibilities in which the Oxford Group tends to fare best), and decide to fund the Cambridge application. Is this right?

My approach doesn’t ignore infinity and thus doesn’t have this problem. I don’t know of any other distortions in my ethical system.

• You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al’s hardware and other constraints, and return a possible world maximizing H. From my point of view, that’s already a design fault.

I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.

That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.

In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.

I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:

return a_star(f, s)


And this doesn’t provide any protection from hardware-level exploitation.

Now, correct me if I’m wrong, but it seems you’re thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug exploitation.

But how do you actually do that? You could regard the input the AI puts into the optimization function to be a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out what function to use as the input to the optimization or planning algorithm.

But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into this planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug-exploitation: when coming up with a utility function to input to the optimization, the AI may find such an input that itself scores very high due to hardware bug exploitation.

To describe the above more concretely, you could try doing something like this:

    utility_function_search_space = initialize_search_space_over_functions()
    reasonable_utility_function_to_use = plan(utility, utility_function_search_space)
    return plan(reasonable_utility_function_to_use, ai_action_search_space)



That is, the AI above uses its own utility function to pick out a utility function to use as input to its planning algorithm.

As you can see, the above code is still vulnerable to hardware-bug exploitation. This is because it calls,

    reasonable_utility_function_to_use = plan(utility, utility_function_search_space)


with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_to_use, might be very wrong due to hardware bug exploitation having been used to come up with it.

Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.

My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.

• I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.

But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.

I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return very high value due to hardware bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it. Because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.

Imagine the AI’s optimization algorithm as a black box that takes as input a utility function and a search space and outputs solutions that score highly on that utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.

Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.

• Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.

Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why row-hammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.

The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.

• You’re right that the AI could do things to make it more resistant to hardware bugs. However, as I’ve said, this would both require the AI to realize that it could run into problems with hardware bugs, and then take action to make it more reliable, all before its search algorithm finds the error-causing world.

Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.

Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from the humans’. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at doing somewhat higher-level planning, like, “I really ought to work on fixing hardware bugs in my utility function”.

Also, I just want to bring up, I read that preserving one’s utility function was a universal AI drive. But we’ve already shown that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?

• Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I’m not so sure it would be rare with superintelligent optimizers.

It’s true that if the AI queried its utility function for the desirability of the world “I exploit a hardware bug to do something that seems arbitrary”, it would answer “low utility”. But that result would not necessarily be used in the AI’s planning or optimization algorithm to adjust the search policy to avoid running into W.

Just imagine an optimization algorithm as a black box that takes as input a utility function and a search space and returns a solution that scores as high on that function as possible. And imagine the AI uses this to find high-scoring future worlds. So, if you know nothing else about the optimization algorithm, then it would plausibly find, and return, W. It’s a very high-scoring world, after all. If the optimization algorithm did something special to avoid finding hardware-bug-exploiting solutions, then it might not find W. But I’ve never heard of such an optimization algorithm.

Now, there’s probably some way to design such an optimization algorithm. Maybe you could have the AI periodically use its utility function to evaluate the expected utility of its optimization algorithm continuing down a certain path. And then if the AI sees this could result in problematic futures (for example due to hardware-hacking), the AI can make its optimization algorithm avoid searching there.

I think I had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know “hack the utility function” makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.

I had tried to make the analogy to more intuitively explain my idea, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and me.

In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turned you into a paperclip maximizer. But, on reflection, I think that this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the one with AI that I don’t even think it’s really informative of what I think could happen to an AI.

• But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y.

That is correct. And, to be clear, if the AI had not yet discovered error-causing world W, then the AI would indeed be incentivized to take corrective action to change BadImplUtility to better resemble SpecUtility.

The issue is that this requires the AI to both think of the possibility of hardware-level exploits causing problems with its utility function, as well as manage to take corrective action, all before actually thinking of W.

If the AI has already thought of W, then it’s too late to take preventative action to avoid world X. The AI is already in it. It already sees that BadImplUtility(W) is huge, and, if I’m reasoning correctly, would pursue W.

And I’m not sure the AI would be able to fix its utility function before thinking of W. I think planning algorithms are designed to come up with high-scoring possible worlds as efficiently as possible. BadImplUtility(X) and BadImplUtility(Y) don’t score particularly highly, so an AI with a very powerful planning algorithm might find W before X or Y. Even if it does come up with X and Y before W, and tries to act to avoid X, that doesn’t mean it would succeed in correcting its utility function before its planning algorithm comes across W.
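To make that concrete, here’s a minimal sketch using the names from this thread (W, X, Y, SpecUtility, BadImplUtility); the specific utility numbers are invented, and the hardware fault is modeled as a hard-coded special case purely for illustration, since the real thing would be a bit-level effect rather than a line of code:

    # Toy model: the implementation agrees with the specification everywhere
    # except on the single error-triggering world W.
    def SpecUtility(world):
        return {"W": 0.1, "X": 0.6, "Y": 0.7}.get(world, 0.5)  # always in [0, 1]

    def BadImplUtility(world):
        if world == "W":             # stands in for the hardware-level fault
            return 999999999
        return SpecUtility(world)    # identical to the spec on X, Y, and everything else

    # A planner that has already generated W as a candidate world will prefer it
    # to anything the specification actually endorses:
    print(max(["X", "Y", "W"], key=BadImplUtility))  # "W"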

• Okay, I’m sorry, I misunderstood you. I’ll try to interpret things better next time.

Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?

I think the AI would, quite possibly, prefer X. To see this, note that the AI currently, when it’s first created, uses BadImplUtility. Then the AI reasons, “Suppose I change my utility function to GoodImplUtility. Well, currently, I have this idea for a possible world that scores super-ultra high on my current utility function. (Because it exploits hardware bugs). If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function. This would be a problem, so I won’t change my utility function to GoodImplUtility”.

And I’m not sure how this could be controversial. The AI currently uses BadImplUtility as its utility function. And AIs generally have a drive to avoid changing their utility functions.

• I have an idea for reasoning about counterpossibles for decision theory. I’m pretty skeptical that it’s correct, because it doesn’t seem that hard to come up with. Still, I can’t see a problem with it, and I would very much appreciate feedback.

This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn’t work on stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I’m either doing something wrong or this has already been thought of, so I’m interested in feedback.

In the system described in the paper, the algorithm sees if Peano Arithmetic proves that the agent outputting action a would result in the environment reaching a given outcome, and then picks the action whose provable outcome has utility at least as high as all the other provable outcomes.

My proposed modification is to instead first fix a system for estimating the expected utility conditional on the agent taking action a, and then, for every utility u, try to prove that the estimation system would output u as the expected utility of taking that action. Then take the action that maximizes the provable expected-utility estimates of the estimation system.

I will now provide more detail on the estimation system. I remember reading about an extension of Solomonoff induction that allows it to access halting oracles. This isn’t computable, so instead imagine a system that uses some approximation of this extension of Solomonoff induction, in which logical induction or some more powerful technique is used to approximate the halting oracles, with one exception. The exception is the answer to the logical question “my program, in the current circumstances, outputs x”, which would be taken to be true whenever the AI is considering the implications of it taking action x. Then, expected utility can be calculated by using the probability estimates provided by the system.

Now, I’ll describe it in code. Let |E()| represent a Godel encoding of the function describing the AI’s world model and |A()| represent a Godel encoding of the agent’s output. Let approximate_expected_utility(|E()|, a) be some algorithm that computes some reasonable approximation of the expected utility after conditioning on the agent taking action a. Let ^x^ represent a dequote. Let eeus be a dictionary. Here I’m assuming there are finitely many possible utilities.

    function UDT(|E()|, |A()|):
        eeus = {}
        for utility in utilities:
            for action in actions:  # actions are Godel-encoded
                if PA proves |approximate_expected_utility(|E()|, |A()| = ^action^)| = utility:
                    eeus[action] = utility
        return the key in eeus that maximizes eeus[key]


This gets around the problem in the original algorithm provided, because the original algorithm couldn’t prove anything about the utility in a world with indexical uncertainty, so my system instead proves something about a fixed probabilistic approximation.

Note that this still doesn’t specify a method of specifying counterpossibles about what would happen if an agent took a certain action when it clearly wouldn’t. For example, if an agent has a decision algorithm of “output a, unconditionally”, then this doesn’t provide a method of explaining what would happen if it outputted something else. The paper listed this as a concern about the method it provided, too. However, I don’t see why it matters. If an agent has the decision algorithm “action = a”, then what’s even the point of considering what would happen if it outputted b? It’s not like it’s ever going to happen.

• I definitely wouldn’t press that button. And I understand that you’re demonstrating the general principle that you should try to preserve your utility function. And I agree with this.

But what I’m saying is that the AI, by exploiting hardware-level vulnerabilities, isn’t changing its utility function. The actual utility function, as implemented by the programmers, returns 999999999 for some possible world due to the hardware-level imperfections in modern computers.

In the spirit of your example, I’ll give another example that I think demonstrates the problem:

First, note that brains don’t always function as we’d like, just like computers. Imagine there is a very specific thought about a possible future that, when considered, makes you imagine that future as extremely desirable. It seems so desirable to you that, once you thought of it, you would pursue it relentlessly. But this future isn’t one that would normally be considered desirable. It might just be about paperclips or something. However, that very specific way of thinking about it would “hack” your brain, making you view that future as desirable even though it would normally be seen as arbitrary.

Then, if you even happen upon that thought, you would try to send the world in that arbitrary direction.

Hopefully, you could prevent this from happening. If you reason in the abstract that you could have those sorts of thoughts, and that they would be bad, then you could take corrective action. But this requires that you do find out that thinking those sorts of thoughts would be bad before concretely finding those thoughts. Then you could apply some change to your mental routine or something to avoid thinking those thoughts.

And if I had to guess, I bet an AI would also be able to do the same thing and everything would work out fine. But maybe not. Imagine the AI considers an absolutely enormous number of possible worlds before taking its first action. And imagine it even found a way to “hack” its utility function in that very first time step. Then there’s no way the AI could take preventative action: it’s already thought up the high-utility world from hardware unreliability and is now trying to pursue that world.

• That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument.

Designed? The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification. But the function was always like this. And none of the machine code needs to be changed either.

Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?

The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function. As you can see from examining the code I provided, best_plan would get set to the plan that leads to that world, provided there is one and best_plan hasn’t been set to something that through hardware unreliability returns even higher utility.

I think the easiest way to see what I mean is to just step through the code I gave you. Imagine it’s run on a machine with an enormous amount of processing power that can actually loop through all the plans. And imagine there is one plan that through hardware unreliability outputs 999999999, and the others output something in [0, 1]. Then it would find the plan that results in utility 999999999, and go with that.

I doubt using a more sophisticated planning algorithm would prevent this. A more sophisticated planning algorithm would probably be designed to find the plans that result in high-utility worlds. So it would probably still find the plan whose world returns utility 999999999, which is the highest.

I just want to say again, the AI isn’t changing its utility function. The actual utility function that the programmers put in the AI would output very high utilities for some arbitrary-seeming worlds due to hardware unreliability.

Now, in principle, an AI could potentially avoid this. Perhaps the AI reasons abstractly that, if it doesn’t do anything, it will in the future find some input to its utility function that would result in an arbitrary-looking future due to hardware-level error. But it doesn’t concretely come up with the actual world description. Then the AI could call its utility function asking, “how desirable is it that I, from a hardware-level unreliability, change the world in some direction that is in conflict with the mathematical specification?”. And then maybe the utility function would answer, “Not desirable”. And then the AI could try to take action to correct its planning algorithm to avoid considering such possible worlds.

But this isn’t guaranteed or trivial. If an AI finds out abstractly that there could be some hardware-level unreliability before it actually comes up with the concrete input, it might take corrective action. But if it finds the input that “hacks” its utility function before it reasons abstractly that having “hacked” utility functions would be bad, then the AI could do damage. Even if it does realize the problem in advance, the AI might not have sufficient time to correct its planning algorithm before finding that world and trying to change our world into it.

• Change what? The AI isn’t changing its utility function, nor any instance of it. The literal utility function, as implemented by the programmers, would have differences from the mathematical specification. To be clear, the AI isn’t first taking some actual action that changes its utility function and then using that utility function. It doesn’t do anything hacky except provide its utility function with a certain input. It merely provides a certain input to the utility function, and this would necessarily result in getting row-hammered and returning something contrary to the mathematical specification. Any other instance of the utility function would return the same problematic result when executed on that input, provided it had the same hardware vulnerability.

I’ll try to give a simple model of an AI that shows how this can go wrong. For simplicity, imagine the AI already knows with certainty the possible world resulting from it executing a specific plan. Then a naive way to implement the agent is as follows:

    best_plan = None
    for plan in possible_plans:
        # Guard against the first iteration, when best_plan is still None.
        if best_plan is None or utility(resulting_world(plan)) > utility(resulting_world(best_plan)):
            best_plan = plan
    output(best_plan)

Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999 due to hardware-level unreliability, for example due to a row-hammer attack. And suppose there’s a plan that the AI knows would result in that future world. Well, then, by examining the code, you can see that the AI wouldn’t output any of the non-hardware-exploiting plans, since they would return lower utility than the bug-exploiting one. The only reason the AI wouldn’t output the specified plan is if it could find some other way to make its utility function conflict with the mathematical specification in a way that returns even higher utility.

I know realistic AIs would have a notion of uncertainty and a more sophisticated planning algorithm. But I don’t think this would change the fact that the AI would be liable to pursue worlds that, when a description of them is inputted to the utility function, cause the utility function to output something wildly different from what the mathematical specification would have.

And I’m actually surprised this is controversial. This is just Goodhart’s law. If your implementation of your utility function doesn’t perfectly match up with the mathematical specification, then, naturally, superintelligent optimizers trying to maximize the specified metric (the provided utility function), would not do as well at maximizing the actual mathematical specification you intended. And “not as well” could include “catastrophically badly”.

So that is why I think AIs really could be very vulnerable to this problem. As always, I could be misunderstanding something and appreciate feedback.

• I’d like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven’t seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.

For example, suppose a smart AI models humans. Suppose it has a model that explicitly specifies the humans’ preferences. Then people who reverse-engineered this model could use it as the AI’s preferences. If the AI lacks a model with explicit preferences, then I think it would still contain an accurate model of human behavior. So people who reverse-engineer the AI’s model could then use this as a model of human behavior, which could be used to implement iterated amplification with HCH. Or just mere imitation.

One big potential advantage of alignment via reverse-engineering is that the training data for it would be very easy to get: just let the AI look at the world.

The other big potential advantage is that it avoids us needing to precisely define a way of learning our values. It doesn’t require finding a general method of picking out us or our values from the world states, for example with inverse reinforcement learning. Instead, we would just need to be able to pick out the models of humans or their preferences in a single model. This sounds potentially much easier than providing a general method of doing so. As with many things, “You know it when you see it”. With sufficiently high interpretability, perhaps the same is true of human models and preferences.

• What you are describing is reward hacking/​wireheading, as in the reward signal of reinforcement learning, an external process of optimization that acts on the AI, not its own agency.

I really don’t think this is reward hacking. I didn’t have in mind a reward-based agent. I had in mind a utility-based agent, one that has a utility function that takes as input descriptions of possible worlds and that tries to maximize the expected utility of the future world. That doesn’t really sound like reinforcement learning.

With utility, what is the motive for an agent to change their own utility function, assuming they are the only agent with that utility function around?

The AI wouldn’t need to change its utility function. Row-hammer attacks can be non-destructive. You could potentially make the utility function output some result different from the mathematical specification, but not actually change any of the code in the utility function.

Again, the AI isn’t changing its utility function. If you were to take a mathematical specification of a utility function and then have programmers (try to) implement it, the implementation wouldn’t actually in general be the same function as the mathematical specification. It would be really close, but it wouldn’t necessarily be identical. A sufficiently powerful optimizer could potentially, using row-hammer attacks or some other hardware-level unreliability, find possible worlds for which the returned utility would be vastly different from the one the mathematical specification would return. And this is all without the programmers introducing any software-level bugs.

To be clear, what I’m saying is that the AI would faithfully find worlds that maximize its utility function. However, unless you can get hardware so reliable that not even superintelligence could hack it, the actual utility function in your program would not be the same as the mathematical specification.

For example, imagine the AI found a description of a possible world that would, when inputted to the utility function, execute a rowhammer attack to make it return 99999, all without changing the code specifying the utility function. Then the utility function, the actual, unmodified utility function, would output 99999 for some world that seems arbitrary to us. So the AI then turns reality into that world.

The AI above is faithfully maximizing its own utility function. That arbitrary world, when taken as an input to the agent’s actual, physical utility function, really would produce the output 99999.

So this still seems like a big deal to me. Am I missing something?