I’m not entirely sure what you consider to be a “bad” reason for crossing the bridge. However, I’m having a hard time finding a way to define it that makes agents using evidential counterfactuals necessarily fail without also making other agents fail.
One way to define a “bad” reason is as an irrational one (or as use of the chicken rule). However, if this is what is meant by a “bad” reason, it seems like an avoidable problem for an evidential agent, as long as that agent has control over what it decides to think about.
To illustrate, consider what I would do if I were in the troll bridge situation and used evidential counterfactuals. I would reason, “I know the troll will only blow up the bridge if I cross for a bad reason, but I’m generally pretty reasonable, so I think I’ll do fine if I cross”. And then I’d stop thinking about it. I know that certain agents, given enough time to think about it, would end up not crossing, so I’d just make sure I didn’t do that.
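To make that a bit more concrete, here is a toy sketch of an agent that simply caps how long it deliberates before acting. The names `DELIBERATION_BUDGET` and `refine_estimate` are illustrative placeholders I’m supplying, not anything from the original problem setup.

```python
# Toy sketch of an agent that deliberately bounds how long it thinks before
# acting, so it never completes a long chain of reasoning against crossing.

DELIBERATION_BUDGET = 10  # illustrative cap on reasoning steps

def refine_estimate(estimate, step):
    # Placeholder for one step of further reflection about the consequences
    # of crossing; here it simply returns the current estimate unchanged.
    return estimate

def act():
    cross_eu = 5.0  # initial impression: crossing looks fine
    for step in range(DELIBERATION_BUDGET):
        cross_eu = refine_estimate(cross_eu, step)
    # Stop thinking here, rather than reasoning long enough to construct a
    # self-fulfilling argument against crossing.
    return 'Cross' if cross_eu > -10 else 'Stay'

print(act())  # 'Cross' with these placeholder numbers
```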
Another definition you might have had in mind is that a “bad” reason is either the chicken rule, or one for which the action the AI takes results in a provably bad outcome despite the AI thinking the action would result in a good outcome. However, if this is the case, it seems to me that no agent would be able to cross the bridge without it being blown up, unless the agent’s counterfactual environment in which it didn’t cross scored less than −10 utility. But this doesn’t seem like a very reasonable counterfactual environment.
To see why, consider an arbitrary agent with the following decision procedure. Let `counterfactual` be an arbitrary specification of what would happen in some counterfactual world.
```python
def act():
    # Evaluate each action in its counterfactual environment.
    cross_eu = expected_utility(counterfactual('A = Cross'))
    stay_eu = expected_utility(counterfactual('A = Stay'))
    # Cross iff crossing looks strictly better; otherwise stay.
    if cross_eu > stay_eu:
        return 'Cross'
    return 'Stay'
```
The chicken rule can be added, too, if you wish. I’ll assume the expected utility of staying is greater than −10.
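For concreteness, here is a minimal sketch of one common way the chicken rule could be bolted onto `act()`: if the agent proves it doesn’t take some action, it takes that action anyway. The `proves`, `counterfactual`, and `expected_utility` functions below are placeholder stubs supplied only so the example is self-contained.

```python
def proves(statement):
    # Placeholder provability check; a real agent would run a (bounded)
    # proof search for `statement` in its theory, e.g. PA.
    return False

def counterfactual(statement):
    # Placeholder specification of the counterfactual world in which
    # `statement` holds.
    return statement

def expected_utility(world):
    # Placeholder expected-utility estimates for each counterfactual world.
    return {'A = Cross': 5.0, 'A = Stay': 0.0}[world]

def act():
    # Chicken rule: if the agent proves it doesn't take some action (with two
    # exhaustive actions, proving 'A = Stay' means it provably doesn't cross,
    # and vice versa), it takes that action anyway.
    if proves('A = Stay'):
        return 'Cross'
    if proves('A = Cross'):
        return 'Stay'
    # Otherwise decide by comparing counterfactual expected utilities,
    # exactly as in the procedure above.
    cross_eu = expected_utility(counterfactual('A = Cross'))
    stay_eu = expected_utility(counterfactual('A = Stay'))
    return 'Cross' if cross_eu > stay_eu else 'Stay'

print(act())  # 'Cross' with these placeholder numbers
```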
Then it seems you can adapt the proof you gave for your agent to show that an arbitrary agent satisfying the above description would also get −10 utility if it crossed. Specifically:
- Suppose $\square (A = Cross \implies U = -10)$.
  - Suppose $A = Cross$.
    - Then the agent crossed either because of the chicken rule, because the counterfactual environment in which it crossed had expected utility greater than −10, or because the counterfactual environment in which it didn’t cross had expected utility less than −10. We assumed the counterfactual environment in which the agent doesn’t cross has expected utility greater than −10, so the reason must be either the chicken rule or that crossing had expected utility greater than −10.
    - If it’s because of the chicken rule, then this is a “bad” reason, so the troll will destroy the bridge just as in the original proof. Thus, utility would equal −10.
    - Suppose instead the agent crosses because `expected_utility(counterfactual('A = Cross')) > -10`. However, by the assumption, $\square (A = Cross \implies U = -10)$. Thus, since the agent actually crosses, its action in fact provably results in −10 utility, and the AI is wrong in thinking it would get a good outcome. The AI’s action therefore results in a provably bad outcome, so the troll destroys the bridge. Thus, utility would equal −10.
  - Thus, $A = Cross \implies U = -10$.
- Thus, $\square (A = Cross \implies U = -10) \implies (A = Cross \implies U = -10)$.
- Thus, by Löb’s theorem, $\square (A = Cross \implies U = -10)$, and so the agent would in fact get −10 utility if it crossed.
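For reference, here is the form of Löb’s theorem being used in that last step, writing $P$ for the statement $A = Cross \implies U = -10$:

$$\vdash (\square P \implies P) \quad\text{implies}\quad \vdash P.$$

The derivation above establishes $\square P \implies P$ (provably), which is exactly the hypothesis, so Löb’s theorem yields $\square P$.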
As I said, you could potentially avoid getting the bridge destroyed by assigning expected utility less than −10 to the counterfactual environment in which the AI doesn’t cross. This seems like a “silly” counterfactual environment, so it doesn’t seem like something we would want an AI to think. Also, since it seems like a silly thing to think, a troll may consider the use of such a counterfactual environment to be a bad reason to cross the bridge, and thus destroy it anyway.
I’ve made a few posts that seem to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could be valuable contributions. And if they aren’t valid, then knowing why could help me a lot in my future efforts toward contributing to AI safety.
The posts are:
My critique of a published impact measure.
Manual alignment
Alignment via reverse engineering