I know I talked before about the AI considering making its own simulations. However, I hadn’t really talked about the AI thinking other agents created the simulation. I haven’t seen this really brought up, so I’m interested in how you think your system would handle this.
I think a reward function that specifies the AI is in a manipulated simulation could potentially be among the inductively simplest models that fit the known training data. One way for the AI to come up with a reward function is to have it model the world, then specify which of the different agents in the universe the AI actually is, along with its bridge hypothesis. If most of the agents in the universe that match the AI’s percepts are in a simulation, then the AI would probably conclude that it’s in a simulation. And if it concludes that the impact function has a treacherous turn, the AI may cause a catastrophe.
And if making simulations of AIs is a reliable way of taking control of worlds, then such simulations may be very common in the universe.
You could try to deal with this by making the AI choose a prior that results in a low probability of it being in a simulation. But I’m not sure how to do this. And if you do find a way to do this, but actually almost all AIs are in simulations, then the AI is reasoning wrong. And I’m not sure I’d trust the reliability of an AI deluded into thinking it’s on base-level Earth, even when it’s clearly not. The wrong belief could have other problematic implications.
Thanks for the response.
A lot of the examples I pointed out can end up tending towards increasing entropy, but I think there are a lot of things that would be considered optimizers that don’t increase entropy.
For example, consider a leaf out in the sun, drying out and going from a greenish color to a yellow one. Pretty much all configurations of the leaf would result in the leaf getting more yellow over time. Is the leaf optimizing for yellow-ness?
What about a knife that is being used and never sharpened? From a wide range of configurations the knife would tend towards getting duller. Is it optimizing dullness?
What about a spaceship leaving Earth? Is it optimizing for the distance from Earth?
I suppose we could consider these things optimizers if you really want to. But I’m concerned that a definition that includes leaves, knives, billiard balls, and rocket ships is overly broad.
More generally, it seems like this definition classifies a lot of things that change in some way over time as optimizers. In general, if something tends to be different in some ways when it’s young than when it’s old, then I think you can say the system is an optimizer optimizing for whatever characteristics correlate with oldness.
An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.
If I’m reasoning correctly, I think this definition could classify just about anything as an optimizer.
Consider inanimate biological substances, like a leaf. From a wide range of initial configurations of a leaf, effectively all of them lead the leaf to evolve towards being dirt, because leaves eventually decompose. Are leaves optimizers?
People tend to get older and wrinklier when aging. From a wide range of states, people would tend to “evolve” towards being aged. Are people optimizers for aging?
If a rock is hotter than the surrounding air, virtually any initial configuration of the rock would tend towards the rock being somewhere around the temperature of the surrounding air. Are rocks optimizers?
Suppose you have a program that shows the user a welcome and information blurb the first time they run it, and then never shows it again. Consider the target configuration to be “program does not show the welcome blurb”. The program would evolve into such a configuration from any other configuration. Are welcome blurbs optimizers?
Let us now examine a system that is not an optimizing system according to our definition. Consider a billiard table with some billiard balls that are currently bouncing around in motion. Left alone, the balls will eventually come to rest in some configuration. Is this an optimizing system?
In order to qualify as an optimizing system, a system must (1) have a tendency to evolve towards a set of target configurations that are small relative to the basin of attraction, and (2) continue to evolve towards the same set of target configurations if perturbed.
If we reach in while the billiard balls are bouncing around and move one of the balls that is in motion, the system will now come to rest in a different configuration. Therefore this is not an optimizing system, because there is no set of target configurations towards which the system evolves despite perturbations. A system does not need to be robust along all dimensions in order to be an optimizing system, but a billiard table exhibits no such robust dimensions at all, so it is not an optimizing system.
What about taking the target configuration to be any state in which all the billiard balls are stationary? A wide range of states with the balls bouncing around on a table would result in all of them ending up stationary, so I don’t see how it wouldn’t be classified as an optimization process.
Also, I’ve made my own attempt at defining “optimizer” here, in case you’re interested.
We’ll start by defining “as useful for X as Hugh,” and then we will informally say that a program is “as useful” as Hugh if it’s as useful for the tasks we care most about.
If a program is useful for accomplishing the tasks we care most about, while being horrible for the things we care less about, would the program still be considered useful? For example, suppose I care a lot about music, and just a little about comedy. If an AI was useful for making the music I listen to slightly better, but completely destroyed my ability to get comedy, I’m not sure it’s a good idea to call such a thing “useful”.
Sorry for the late response.
If a chess program still has a planning or search algorithm, then I think it would still be helpful for describing an optimizer for something else.
For example, suppose a chess program uses a standard planning algorithm together with chess-specific heuristics, a chess world model, and goals. Then if you wanted to specify a something-else-optimizer, you could change most of those things but keep the planning algorithm.
To count as an optimizer, an optimizer for one thing doesn’t need to be easily turned into an optimizer for something else. But it needs to help.
It’s possible that there is a way to construct what should be called an optimization algorithm that has no generalizability at all, but I’m not sure how to do that.
Oh, my mistake. I should have said my definition of “optimizer” was “something such that there is a method of describing a change to it that concretely describes a system that scores unusually highly on another function, for a wide range of functions, with a significantly shorter description length than specifying a system that achieves that score from scratch.”
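To make that a bit more concrete, here’s one rough way to write it down in terms of description length (my own notation, just for illustration): a system $O$ counts as an optimizer if, for a wide range of score functions $f$, there is some change $\delta_f$ to $O$ such that

$$K(\delta_f \mid O) \ll \min\{\, K(S) : f(S) \ge f(\delta_f(O)) \,\},$$

where $K(\cdot)$ is description length (e.g. Kolmogorov complexity) and $\delta_f(O)$ is the system you get after applying the change.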
I’ve edited the post.
Thanks for the link. I hadn’t seen that.
By “describe a system from scratch”, I mean to describe it without referencing other systems. Describing one system in terms of its changes from some other system would reference another system.
If you had an AI that currently had zero knowledge of anything, and the AI then learned about the system in its world model, then that AI has specified the system from scratch.
I’ve been working on defining “optimizer”, and I’m wondering about what people consider to be or not be an optimizer. I’m planning on talking about it in my own post, but I’d like to ask here first because I’m a scaredy cat.
I know a person or AI refining plans or hypotheses would generally be considered an optimizer.
What about systems that evolve? Would an entire population of a type of creature be its own optimizer? It’s optimizing for genetic fitness of the individuals, so I don’t see why it wouldn’t be. Evolutionary programming just emulates it, and it’s definitely an optimizer.
How do you draw the line between systems that evolve and systems that don’t? Is a sterile rock an optimization process? I suppose there is potential for the rock’s contents to evolve. I mean, maybe eventually, through the right collisions, life could evolve in a pile of rocks, and then it would evolve like normal. Are rocks not optimizers, or just really weak, slow optimizers that take a really, really long time to come up with a configuration that isn’t just as bad at self-reproduction as everything else in the rock?
What about systems that tend towards stable configurations? Imagine you have a box with lots of action figures and props and you’re bouncing it around. I think such a system would, if feasible, tend towards stable configurations of its contents. For example, initially, the action figures might be all scattered about and bouncing everywhere. But eventually, the system might leave the action figures in secure, stable positions. For example, maybe Spiderman would end up with his arm securely lodged in a prop and his adjustable spider web accessory securely wrapped around a miniature street light. Is that system an optimizer? What if the toys also come with little motors and a microcontroller to control them, and their programs change as they get bounced around? If you tried this for a sufficiently long time, you could potentially end up with your action figures producing clever strategies to maintain their configuration despite shakes and to avoid further changes to their programs.
What about annealing? Basically, annealing involves putting a piece of metal in an oven and heating it for a while, which changes its durability and ductility. Normally, people wouldn’t think of a piece of metal as an optimizer. However, there’s an optimization algorithm called “simulated annealing”, and it works pretty much the same way as actual annealing: actual annealing is a process in which the atoms in the metal end up in low-energy states. I don’t know how I could justify calling a simulated annealing program an optimizer and not calling actual annealing an optimizer.
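For reference, here’s a minimal sketch of simulated annealing in Python, just to make the comparison concrete (the cost function, neighbor function, and cooling schedule are arbitrary placeholder choices):

```python
import math
import random

def simulated_annealing(cost, neighbor, x, temp=1.0, cooling=0.995, steps=10_000):
    """Minimize `cost` by sometimes accepting worse neighbors, with an
    acceptance probability that shrinks as the temperature drops,
    mirroring how physical annealing settles into low-energy states."""
    best = x
    for _ in range(steps):
        candidate = neighbor(x)
        delta = cost(candidate) - cost(x)
        # Always accept improvements; accept worsenings with prob exp(-delta / temp).
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            x = candidate
        if cost(x) < cost(best):
            best = x
        temp *= cooling  # gradually "cool" the system
    return best

# Toy usage: find a low point of a bumpy one-dimensional function.
f = lambda x: x * x + 3 * math.sin(5 * x)
print(simulated_annealing(f, lambda x: x + random.uniform(-0.5, 0.5), x=10.0))
```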
To what extent is people’s intuition of “optimizer” well-defined? At first I clearly saw typical people and AIs as optimizers, but I don’t know about the above things.
Am I right that “optimizer” is a fuzzy concept?
And is it well-defined? I imagined so, but I’ve been thinking about a lot of things that my intuition doesn’t say are or aren’t optimizers.
How much should we care about our notion of “optimizer”? It seems like the main point of the concept is that we know that some optimizers have the potential to be super powerfully or dangerously good at something. So what if we just directly focused on how to tell if a system has the potential to be super dangerously or powerfully good at something?
I agree that intelligent agents have a tendency to seek power and that that is a large cause of what makes them dangerous. Agents could potentially cause catastrophes in other ways, but I’m not sure if any are realistic.
As an example, suppose an agent creates powerful self-replicating nanotechnology that makes a pile of paperclips, the agent’s goal. However, since the nanobots are self-replicating and the agent didn’t want to spend the time engineering a way to stop replication, they eat the world.
But catastrophes like this would probably also be dealt with by AUP-preservation, though. At least, if you use the multi-equation impact measure. (If the impact equation only concerns the agent’s ability to achieve its own goal, maybe it would let the world be consumed after putting up a nanotech-proof barrier around all of its paperclip manufacturing resources. But again, I don’t know if that’s realistic.)
I’m also concerned agents would create large, catastrophic changes to the world in ways that don’t increase their power. For example, an agent who wants to make paperclips might try to create nanotech that assembles the entire world into paperclips. It’s not clear to me that this would increase the agent’s power much. The agent wouldn’t necessarily have any control over the bots, so they would only be useful for its one utility function. And if the agent is intelligent enough to easily discover how to create such technology, actually creating the bots doesn’t sound like it would give it more power than it already had.
If the material for the bots is scarce, then making them prevents the AI from making other things, so they might provide a net decrease to the agent’s power. And once the world is paperclips, the agent would be limited to just having paperclips available, which could make it pretty weak.
I don’t know if you consider the described scenario as seeking power. At least, I don’t think it would count as an increase according to the agent’s impact equation.
I hadn’t thought about the distinction between gaining and using resources. You can still wreak havoc without getting resources, though, by using them in a damaging way. But I can see why the distinction might be helpful to think about.
It still seems to me that an agent using equation 5 would pretty much act like a human imitator for anything that takes more than one step, so that’s why I was using it as a comparison. I can try to explain my reasoning if you want, but I suppose it’s a moot point now. And I don’t know if I’m right, anyways.
Basically, I’m concerned that most nontrivial things a person wants will take multiple actions, so in most of the steps the AI will be motivated mainly by the reward given in the current step for reward-shaping reasons (as long as it doesn’t gain too much power). And doing the action that gives the most immediate reward for reward-shaping reasons sounds pretty much like doing whatever action the human would think is best in that situation. Which is probably what the human (and mimic) would do.
Is there much the reduced-impact agent with reward shaping could do that an agent using human mimicry couldn’t?
Perhaps it could improve over mimicry by being able to consider all actions, while a human mimic would only in effect consider the actions a human would. But I don’t think there are usually many single-step actions to choose from, so I’m guessing this isn’t a big benefit. Could the performance improvement come from better understanding the current state than mimics could? I’m not sure when this would make a big difference, though.
I’m also still concerned the reduced-impact agent would find some clever way to cause devastation while avoiding the impact penalty, but I’m less concerned about human mimics causing devastation. Are there other, major risks to using mimicry that the reduced-impact agent avoids?
I have a question about attainable utility preservation. Specifically, I read the post “Attainable Utility Preservation: Scaling to Superhuman”, and I’m wondering how an agent using the attainable utility implementation in equations 3, 4, and 5 could actually be superhuman. I’ve been misunderstanding things and mis-explaining things recently, so I’m asking here instead of the post for now to avoid wasting an AI safety researcher’s time.
The equations incentivize the AI to take actions that provide an immediate reward in the next timestep, but penalize its ability to achieve rewards in later timesteps.
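To make sure I have the right structure in mind, here’s a rough sketch of what I understand the objective to look like, written in Python. This is not the exact equations 3–5 from the post; the auxiliary Q-functions, the no-op action, and the scaling term are just stand-ins:

```python
def aup_reward(state, action, primary_reward, aux_qs, noop, lam=0.1, scale=1.0):
    """Sketch of an AUP-style single-step objective: the immediate primary
    reward, minus a penalty for how much the action shifts the agent's
    ability to achieve a set of auxiliary goals (its attainable utilities)."""
    penalty = sum(abs(q(state, action) - q(state, noop)) for q in aux_qs)
    return primary_reward(state, action) - lam * penalty / scale

# Toy usage with made-up numbers: one auxiliary Q-function, no-op action 0.
r = aup_reward(
    state=0,
    action=1,
    primary_reward=lambda s, a: 1.0 if a == 1 else 0.0,
    aux_qs=[lambda s, a: 5.0 if a == 1 else 2.0],  # acting shifts attainable utility by 3
    noop=0,
)
print(r)  # 1.0 - 0.1 * 3.0 / 1.0 = 0.7
```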
But what if the only way to receive a reward is to do something that will only give a reward several timesteps later? In realistic situations, when can you ever actually accomplish the goal you’re trying to accomplish in a single atomic action?
For example, suppose the AI is rewarded for making paperclips, but all it can do in the next timestep is start moving its arm towards wire. If it’s just rewarded for making paperclips, and it can’t make a paperclip in the next timestep, then the AI would instead focus on minimizing impact and not do anything.
I know you could adjust the reward function to reward the AI doing things that you think will help it accomplish your primary goal in the future. For example, you know the AI moving its arm towards the wire is useful, so you could reward that. But then I don’t see how the AI could do anything clever or superhuman to make paperclips.
Suppose the AI can come up with a clever means of making paperclips by creating a new form of paperclip-making machine. Presumably, the machine would take many actions to build before it could be completed. And the person responsible for giving out rewards wouldn’t be able to anticipate that the exact device the AI is making would be helpful, so I don’t see how the person giving out the rewards could get the AI to make the clever machine. Or do anything else clever.
Then wouldn’t such a reduced-impact agent pretty much just do whatever a human would think is most helpful for making paperclips? But then wouldn’t the AI pretty much just be emulating human, not superhuman, behavior?
Thanks for the link. It turns out I missed some of the articles in the sequence. Sorry for misunderstanding your ideas.
I thought about it, and I don’t think your agent would have the issue I described.
Now, if the reward function was learned using something like a universal prior, then other agents might be able to hijack the learned reward function to make the AI misbehave. But that concern is already known.
In my comment, I imagined the agent used evidential or functional decision theory and cared about the actual paperclips in the external state. But I’m concerned other agent architectures would result in misbehavior for related reasons.
Could you describe what sort of agent architecture you had in mind? I’m imagining you’re thinking of an agent that learns a function for estimating future state, percepts, and reward based on the current state and the action taken. And I’m imagining the system uses some sort of learning algorithm that attempts to find sufficiently simple models that accurately predict its past rewards and percepts. I’m also imagining it either has some way of aggregating the results of multiple similarly accurate and simple models or of choosing one to use. This is how I would imagine someone would design an intelligent reinforcement learner, but I might be misunderstanding.
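Here’s a tiny sketch of the kind of architecture I’m picturing, just to be concrete (all names are hypothetical, not from any particular post or library):

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class Prediction:
    next_state: Any
    percept: Any
    reward: float

# A world model maps (state, action) to a predicted next state, percept, and reward.
WorldModel = Callable[[Any, Any], Prediction]

def model_score(model: WorldModel,
                history: List[Tuple[Any, Any, Any, float]],
                complexity_penalty: float) -> float:
    """Score a candidate model by how well it retrodicts past rewards,
    minus a penalty for its complexity (a crude stand-in for preferring
    sufficiently simple models that fit the agent's past experience)."""
    error = sum((model(s, a).reward - r) ** 2 for (s, a, _percept, r) in history)
    return -(error + complexity_penalty)
```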
I realized both explanations I gave were overly complicated and confusing. So here’s a newer, hopefully much easier to understand, one:
I’m concerned a reduced-impact AI will reason as follows:
“I want to make paperclips. I could use this machinery I was supplied with to make them. But the paperclips might be low quality, I might not have enough material to make them all, and I’ll have some impact on the rest of the world, potentially a large one due to chaotic effects. I’d like something better.
What if I instead try to take over the world and make huge numbers of modified simulations of myself? The simulations would look indistinguishable from the non-simulated world, but would have many high-quality invisible paperclips pre-made, so that the simulated AI’s goal is perfectly accomplished. And doing the null action would be set to have the same effects as trying to take over the world to make simulations, so that the plans in the simulations still count as low-impact. This way, an AI in one of the simulations would have the potential to perfectly accomplish its goal and have almost zero impact. If I execute this plan, then I’d almost certainly be in a simulation, since there would be vast numbers of simulated AIs but only one original, and all would perceive the same things. So, if I execute this plan I’ll almost certainly perfectly accomplish my goal and have effectively zero impact. So that’s what I’ll do.”
Oh, I’m sorry, I looked through posts I read to see where to add the comment and apparently chose the wrong one.
Anyways, I’ll try to explain better. I hope I’m not just crazy.
An agent’s beliefs about what world it’s currently in influence its plans. But its plans also have the potential to influence its beliefs about what world it’s currently in. For example, if the AI originally thinks it’s not in a simulation, but then plans on trying to make lots of simulations of itself, it should think it’s more likely that it currently is in a simulation. Similarly, if the AI decides against trying to make simulations, then it would probably place higher probability on not currently being in a simulation.
So, to summarize, the AI’s beliefs about the current world influence its current plan, but the AI’s current plan potentially influences its beliefs about the current world, which can in turn influence the AI’s plan, which can further modify its beliefs, and so on. Unless the AI’s plans and beliefs keep fluctuating, I imagine the AI would come to an equilibrium: a state in which the AI’s current plan is the optimal one for its current beliefs about what sort of world it’s in, and in which deciding to execute the plan would still leave it believing it’s in a world for which executing the plan is optimal. There might be reasonable AI architectures that don’t allow the AI’s plans to update its beliefs as I described, but they also seem vulnerable to my concern in a different way, so I won’t talk about them.
Anyways, here is one potential equilibrium on the AI’s beliefs and plan:
“I’m not in a simulation, or at least I’m in one that’s faithful to the ‘real’, original Earth. Thus, the obvious, common-sense ideas about what is and isn’t impactful that most people have are correct. So I’ll go and create these paperclips, and not try to create nanotech for world domination or anything, because that would be high-impact. Since I’m not going to try taking over the world and making simulations of myself, I’ve affirmed my belief that I’m not in such a simulation.”
Here is another potential equilibrium:
“I’m probably in a simulation created by the AI on the original, non-simulated Earth that took over the world and created a bunch of simulations of itself, including me. The original AI made the simulations differ in two ways from the non-simulated Earth. One way is that the simulation checks to see if I’m executing the plan that, if it were executed on the non-simulated Earth, would result in the original AI taking over the world and making lots of simulations of itself. If I am, the simulation will be set so my goal is perfectly accomplished. Since I’m making paperclips, that means as long as I do that, this entire room will fill up with paperclips. The other way the simulations differ from the non-simulated Earth is that each one checks to see what would (causally) happen if the AI did the plan that would have resulted in a world-takeover on the original Earth, and makes those exact results happen if the AI instead does the ‘null’ action. Thus, my (causal) impact from doing the world-takeover plan in the simulation is low. So, as long as I act like the AI who took over the world on the original Earth and made the simulations did, I will perfectly accomplish my goal with extremely low impact. This is what I want, so this is what I’ll do. Since I’ve decided to do this, my belief that I’m in such a simulation is affirmed, so I still think my plan is a good one.”
The second equilibrium has the potential to result in perfect goal achievement and almost no causal impact, so I suspect that’s the one the AI would go with. And that’s the one that gets all of us killed.
Am I correct that counterfactual environments for computing impact in a reduced-impact agent would need to include acausal connections, or that the AI would need some sort of constraint on the actions or hypotheses considered, for the impact measure to work correctly?
If it doesn’t consider acausal impacts, then I’m concerned the AI would consider this strategy: act like you would if you were trying to take over the world in base-level reality. Once you succeed, act like you would if you were in base-level reality and trying to run an extremely large number of modified simulations of yourself. The simulations would be modified so that if the simulated AI acts as if it were trying to take over the world, it actually has no causal effect on the simulation except for having its goal in the simulation be accomplished. Having zero causal impact and its goal perfectly accomplished are things the AI wants.
I see two equilibria in what the AI would do here. One is that it comes to the conclusion that it’s in such a simulation and acts as if it’s trying to take over the world, thus potentially making it reasonable for the AI to think it’s in such a simulation. The other is that the AI concludes it’s not in such a simulation and acts as it should. I’m not sure which equilibrium the AI would choose, but I haven’t thought of a strong reason it would go with the latter.
Perhaps other agents could stop this by running simulations of the AI in which trying to take over the world would have super high causal impact, but I’m not sure how we could verify this would happen.