Two senses of “optimizer”
The word “optimizer” can be used in at least two different ways.
First, a system can be an “optimizer” in the sense that it is solving a computational optimization problem. A computer running a linear program solver, a SAT-solver, or gradient descent, would be an example of a system that is an “optimizer” in this sense. That is, it runs an optimization algorithm. Let “optimizer_1” denote this concept.
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in. This is the sense in which “optimizer” is used in posts such as this. Let “optimizer_2” denote this concept.
These two concepts are distinct. Say that you somehow hook up a linear program solver to a reinforcement learning environment. Unless you do the “hooking up” in a particularly creative way there is no reason to assume that the output of the linear program solver would push the environment in a particular direction. Hence a linear program solver is an optimizer_1, but not an optimizer_2. On the other hand, a simple tabular RL agent would eventually come to systematically push the environment in a particular direction, and is hence an optimizer_2. However, such a system does not run any internal optimization algorithm, and is therefore not an optimizer_1. This means that a system can be an optimizer_1 while not being an optimizer_2, and vice versa.
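To make the tabular RL half of this contrast concrete, here is a minimal sketch of a tabular Q-learning agent. The chain environment, the function names, and all hyperparameters are assumptions chosen for illustration; none of them come from the post. The agent only applies a local table update after each step and never calls a solver or searches over plans, yet its behaviour comes to push the environment state in one direction.

```python
import random

def train_tabular_agent(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                        epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    States are 0..n_states-1, actions are 0 (left) and 1 (right), and the
    only reward is 1.0 for reaching the rightmost state, which ends the episode.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy choice, with random tie-breaking.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == goal
            reward = 1.0 if done else 0.0
            # Local update of a single table entry; no search is performed.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

def greedy_policy(q):
    """Preferred action in each non-terminal state (1 means 'right')."""
    return [0 if left > right else 1 for left, right in q[:-1]]

q_table = train_tabular_agent()
print(greedy_policy(q_table))
```

After training, the greedy policy should point right in every state, which is the "systematic push" the paragraph above describes, even though nothing in the loop resembles a linear program solver or a SAT-solver.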
There are some arguments related to AI safety that seem to conflate these two concepts. In Superintelligence (pg 153), on the topic of Tool AI, Nick Bostrom writes that:
A second place where trouble could arise is in the course of the software’s operation. If the methods that the software uses to search for a solution are sufficiently sophisticated, they may include provisions for managing the search process itself in an intelligent manner. In this case, the machine running the software may begin to seem less like a mere tool and more like an agent. Thus, the software may start by developing a plan for how to go about its search for a solution. The plan may specify which areas to explore first and with what methods, what data to gather, and how to make best use of available computational resources. In searching for a plan that satisfies the software’s internal criterion (such as yielding a sufficiently high probability of finding a solution satisfying the user-specified criterion within the allotted time), the software may stumble on an unorthodox idea. For instance, it might generate a plan that begins with the acquisition of additional computational resources and the elimination of potential interrupters (such as human beings).
To me, this argument seems to make an unexplained jump from optimizer_1 to optimizer_2. It begins with the observation that a powerful Tool AI would be likely to optimize its internal computation in various ways, and that this optimization process could be quite powerful. In other words, a powerful Tool AI would be a strong optimizer_1. It then concludes that the system might start pursuing convergent instrumental goals – in other words, that it would be an optimizer_2. The jump between the two is not explained.
The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself. The fact that a system is internally running an optimization algorithm does not imply that the system is selecting its output in such a way that this output optimizes the environment of the system.
The excerpt from Superintelligence is just one example of an argument that seems to slide between optimizer_1 and optimizer_2. For example, some parts of Dreams of Friendliness seem to be doing so, or at least it’s not always clear which of the two is being talked about. I’m sure there are more examples as well.
Be mindful of this distinction when reasoning about AI. I propose that “consequentialist” (or perhaps “goal-directed”) is used to mean what I have called “optimizer_2”. I don’t think there is a need for a special word to denote what I have called “optimizer_1” (at least not once the distinction between optimizer_1 and optimizer_2 has been pointed out).
Note: It is possible to raise a sort of embedded agency-like objection against the distinction between optimizer_1 and optimizer_2. One might argue that:
There is no sharp boundary between the inside and the outside of a computer. An “optimizer_1” is just an optimizer whose optimization target is defined in terms of the state of the computer it is installed on, whereas an “optimizer_2” is an optimizer whose optimization target is defined in terms of something outside the computer. Hence there is no categorical difference between an optimizer_1 and an optimizer_2.
I don’t think that this argument works. Consider the following two systems:
A computer that is able to very quickly solve very large linear programs.
A computer that solves linear programs, and tries to prevent people from turning it off as it is doing so, etc.
System 1 is an optimizer_1 that solves linear programs, whereas system 2 is an optimizer_2 that is optimizing the state of the computer that it is installed on. These two things are different. (Moreover, the difference isn’t just that system 2 is “more powerful” than system 1 – system 1 might even be a better linear program solver than system 2.)
Acknowledgements: We were aware of the difference between “optimizer_1” and “optimizer_2” while working on the mesa-optimization paper, and I’m not sure who first pointed it out. We were also probably not the first people to realise this.
I think that this is one of the major ways in which old discussions of optimization daemons would often get confused. I think the confusion was coming from the fact that, while it is true in isolation that an optimizer_1 won’t generally self-modify into an optimizer_2, there is a pretty common case in which this is a possibility: the presence of a training procedure (e.g. gradient descent) which can perform the modification from the outside. In particular, it seems very likely to me that there will be many cases where you’ll get an optimizer_1 early in training and then an optimizer_2 later in training.
That being said, while having an optimizer_2 seems likely to be necessary for deceptive alignment, I think you only need an optimizer_1 for pseudo-alignment: every search procedure has an objective, and if that objective is misaligned, it raises the possibility of capability generalization without objective generalization.
Also, as a terminological note, I’ve taken to using “optimizer” for optimizer_1 and “agent” for something closer to optimizer_2, where I’ve been defining an agent as an optimizer that is performing a search over what its own action should be. I prefer that definition to your definition of optimizer_2, as I prefer mechanistic definitions over behavioral definitions since I generally find them more useful, though I think your notion of optimizer_2 is also a useful concept.
I agree with everything you have said.
I’m confused about this part. According to this definition, is “agent” a special case of optimizer_1? If so it doesn’t seem close to how we might want to define a “consequentialist” (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).
I pretty strongly think this is the same distinction as I am pointing at with selection vs control, although perhaps I focus on a slightly broader cluster-y distinction while you have a more focused definition.
I think this distinction is something which people often conflate in computer science more broadly, too. Often, for example, a method will be initially intended for the control case, and people will make ‘improvements’ to it which only make sense in a selection context. It’s easy for things to slide in that direction, because control-type algorithms will often be tested out in computer-simulated environments; but then, you have access to the environment, and can optimize it in more direct ways.
I’m more annoyed by this sort of mix-up than I probably should be.
Planned summary:
The first sense of “optimizer” is an optimization algorithm, that given some formally specified problem computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.
Planned opinion:
I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
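A toy version of this thought experiment, as a sketch only: the world dynamics below are a made-up four-cell corridor with a lamp, and plain brute-force enumeration stands in for the super-powerful SAT solver. The names `step`, `find_plan`, and `ACTIONS` are hypothetical. The point is that once the search is given a model of the environment, its output is a sequence of actions that would steer that environment toward a goal.

```python
from itertools import product

ACTIONS = ("left", "right", "toggle")

def step(state, action):
    """Hypothetical toy dynamics; a state is (position, lamp_on)."""
    position, lamp_on = state
    if action == "left":
        return (max(0, position - 1), lamp_on)
    if action == "right":
        return (min(3, position + 1), lamp_on)
    # "toggle": the lamp switch is only reachable from position 3.
    return (position, not lamp_on) if position == 3 else state

def find_plan(start, goal, max_len=6):
    """Return a shortest action sequence whose simulated outcome satisfies goal."""
    for length in range(max_len + 1):
        for plan in product(ACTIONS, repeat=length):
            state = start
            for action in plan:
                state = step(state, action)
            if goal(state):
                return list(plan)
    return None  # no plan within max_len steps

plan = find_plan((0, False), lambda state: state[1])
print(plan)  # ['right', 'right', 'right', 'toggle']
```

Note that `find_plan` only returns the plan. Nothing here executes it, so whether the overall system acts on its environment depends on what the plan is hooked up to.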
If the super-powerful SAT solver thing finds the plans but doesn’t execute them, would you still lump it with optimizer_2? (I know it’s just terminology and there’s no right answer, but I’m just curious about what categories you find natural.)
(BTW this is more-or-less a description of my current Grand Vision For AGI Safety, where the “dynamics of the world” are discovered by self-supervised learning, and the search process (and much else) is TBD.)
Hmm, idk, it feels more like an optimizer_1 in that situation. Now that you’ve posed this question, the super-powerful SAT solver that acts in the world feels like both an optimizer_1 and an optimizer_2.
Alternatively, you could say the distinction is whether the optimizer cares about the environment. I think there’s a sense (or senses?) in which these things can be made/considered equivalent. I don’t feel like I totally understand or am satisfied with either way of thinking about it, though.
This looks like a Cartesian distinction that exists only by virtue of not fully considering the embeddedness of the optimizer.
It only seems that the domain of an optimizer_1 cannot optimize or affect the environment like an optimizer_2 because you are thinking of it as operating in mathematical, ideal terms, rather than as a real system that runs on a computer by doing physics and interacting with the world. An optimizer_1 can smoothly turn into an optimizer_2 in at least two ways. One is via unintended side effects. Another is via scope creep. There is no clean, bright line separating the domain of the optimizer_1 from the rest of reality, and in fact it was always an optimizer_2, just only looking at a narrow slice of the world because you put up some guardrails to keep it there.
The worry is what happens when it jumps the guardrails, or the guardrails fail.
I don’t think that I’m assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of “optimizer” does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?
Sure, let’s be super specific about it.
Let’s say we have something you consider an optimizer_1, a SAT solver. It operates over a set of variables V arranged in predicates P using an algorithm A. Since this is a real SAT solver that is computed rather than a purely mathematical one we think about, it runs on some computer C and thus for each of V, P, and A there is some C(V), C(P), and C(A) that is the manifestation of each on the computer. We can conceptualize what C does to V, P, and A in different ways: it turns them into bytes, it turns A into instructions, it uses C(A) to operate on C(V) and C(P) to produce a solution for V and P.
Now the intention is that the algorithm A is an optimizer_1 that only operates on V and P, but in fact A is never run, properly speaking, C(A) is, and we can only say A is run to the extent C(A) does something to reality that we can set up an isomorphism to A with. So C(A) is only an optimizer_1 to the extent the isomorphism holds and it is, as you defined optimizer_1, “solving a computational optimization problem”. But properly speaking C(A) doesn’t “know” it’s an algorithm: it’s just matter arranged in a way that is isomorphic, via some transformation, to A.
So what is C(A) doing then to produce a solution? Well, I’d say it “optimizes its environment”, that is literally the matter and its configuration that it is in contact with, so it’s an optimizer_2.
You might object that there’s something special going on here such that C(A) is still an optimizer_1 because it was set up in a way that isolates it from the broader environment so it stays within the isomorphism, but that’s not a matter of classification, that’s an engineering problem of making an optimizer_2 behave as if it were an optimizer_1. And a large chunk of AI safety (mostly boxing) is dealing with ways in which, even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it “breaks” the isomorphism and does something that might still keep the isomorphism intact but also does other things you didn’t think it would do if the isomorphism were strict.
Put pithily, there’s no free lunch when it comes to isomorphisms that allow you to manifest your algorithms to compute them, so you have to worry about the way they are computed.
I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but it’s unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is “leaky” in a given case is something that should be carefully examined, not something that should be glossed over.
I also agree with what ofer said.
“even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it “breaks” the isomorphism and does something that might still keep the isomorphism intact but also does other things you didn’t think it would do if the isomorphism were strict”
I agree. Part of the reason why it’s valuable to make the distinction is to enable more clear thinking about these sorts of issues.
I think there is only a question of how leaky, but it is always non-zero amounts of leaky, which is the reason Bostrom and others are concerned about it for all optimizers and don’t bother to make this distinction.
It seems useful to have a quick way of saying:
“The quarks in this box implement a Turing Machine that [performs well on the formal optimization problem P and does not do any other interesting stuff]. And the quarks do not do any other interesting stuff.”
(which of course does not imply that the box is safe)
Sure. Not making the distinction seems important, though, because this post seems to be leaning towards rejecting arguments that depend on noticing that the distinction is leaky. Making it is okay so long as you understand it as “optimizer_1 is a way of looking at things that screens off many messy details of the world so I can focus on only the details I care about right now”, but if it becomes conflated with “and if something is an optimizer_1 I don’t have to worry about the way it is also an optimizer_2” then that’s dangerous.
The author of the post suggests it’s a problem that “some arguments related to AI safety that seem to conflate these two concepts”. I’d say they don’t conflate them, but understand that every optimizer_1 is an optimizer_2.
Maybe we’re just not using the same definitions, but according to the definitions in the OP as I understand them, a box might indeed contain an arbitrarily strong optimizer_1 while not containing an optimizer_2.
For example, suppose the box contains an arbitrarily large computer that runs a brute-force search for some formal optimization problem. [EDIT: for some optimization problems, the evaluation of a solution might result in the execution of an optimizer_2]
Yes, and I’m saying that’s not possible. Every optimizer_1 is an optimizer_2.
I think only if it gets its feedback from the real world. If you have gradient descent, then the true answers for its samples are stored somewhere “outside” the intended demarcation, and it might try to reach them. But how is a hillclimber that is given a graph and solves the traveling salesman problem for it an optimizer_2?
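For concreteness, the kind of hillclimber in question might look like the following sketch. The choice of 2-opt segment reversal as the neighbourhood move, and the names `tour_length` and `hill_climb_tsp`, are assumptions for illustration rather than anything specified in the thread. Every quantity the loop evaluates is computed from the in-memory distance matrix; no feedback arrives from outside it.

```python
import itertools
import math

def tour_length(tour, dist):
    """Total length of a closed tour under the distance matrix dist."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def hill_climb_tsp(dist):
    """Greedy local search: reverse a segment whenever that shortens the tour."""
    n = len(dist)
    tour = list(range(n))
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(1, n), 2):
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            # Small tolerance so floating-point noise never counts as progress.
            if tour_length(candidate, dist) < tour_length(tour, dist) - 1e-12:
                tour = candidate
                improved = True
    return tour

# Four corners of a unit square, listed in a deliberately bad visiting order.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
dist = [[math.dist(p, q) for q in points] for p in points]
best_tour = hill_climb_tsp(dist)
print(best_tour, tour_length(best_tour, dist))
```

On this input the search ends on a tour of length 4.0, the perimeter of the square. The result is a local optimum in general; hill climbing gives no guarantee of finding the shortest tour on larger graphs.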
I would answer this the same way I did earlier in this thread, simply substituting in whatever problem you like for SAT in that example.
All feedback is feedback about the real world because the real world is the only place you can instantiate the computation to reason about “math land”.
Yes, obviously it’s going to change the world in some way to be run. But not just any change anywhere makes it an optimizer_2. As defined, an optimizer_2 optimizes its environment. Changing something inside the computer does not make it an optimizer_2, and changing something outside without optimizing it doesn’t either. Yes, the computer will inevitably cause some changes in its environment just by running, but what makes something an optimizer_2 is to systematically bring about a certain state of the environment. And this offers a potential solution other than by hardware engineering: If the feedback is coming from inside the computer, then what is incentivized are only states within the computer, and so only the computer is getting optimized.
Well, one thing a powerful optimizer might do at some point is ask itself “what program should I run that will figure out such and such for me”. This is what Bostrom is describing in the quote, an optimizer optimizing its own search process. Now, if the AI then searches through the space of possible programs, predicts which one will give it the answer quickest, and then implements it, here’s a thing that might happen: There might be a program that, when run, affects the outside world in such a way as to speed up the process of answering.
For example, it might lead electricity to run through the computer in such a way as to cause it to emit electromagnetic waves, through which it sends a message to a nearby WLAN router and then uses the internet to hack a bank account to buy extra hardware and have it delivered to and plugged into itself, and then it runs a program calculating the answer on this much more powerful hardware, and in this way ends up having the answer faster than if it had just started calculating away on the weaker hardware.
And if the optimizer works as described above, it will implement that program, and thereby optimize its environment. Notably, it will optimize for solving the original optimisation problem faster/better, not try to implement the solution to it that it has found.
I don’t think this makes your distinction useless, as there are genuine optimizer_1 systems, even relatively powerful ones, but the Cartesian boundary is an issue once we talk about self-improving AI.
The fact that a superintelligent AI contains an optimization algorithm does not necessarily mean that this optimization algorithm is itself superintelligent (or that it has access to the world model of the overall system, etc). It might, it might not – it depends on the design of the system.
“the Cartesian boundary is an issue once we talk about self-improving AI.”
This presumably depends on a lot of specific facts about how the system is designed.
It doesn’t need to. The “inner” program could also use its hardware as quasi-sense organs and figure out a world model of its own.
Of course this does depend on the design of the system. In the example described, you could, rather than optimize for speed itself, have a fixed function that estimates speed (like what we do in complexity theory) and then optimize for *that*, and that would get rid of the leak in question.
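As a rough sketch of what optimizing a fixed estimate could look like, the selector below ranks candidate programs by a hard-coded cost formula instead of by measured running time. The candidate names, the formulas in `COST_MODELS`, and `pick_program` itself are all hypothetical. Because the score is a pure function of the program's name and the input size, nothing a candidate does to the hardware or the outside world can change its ranking.

```python
import math

# Hypothetical fixed cost models: estimated step counts as a function of input size n.
COST_MODELS = {
    "bubble_sort": lambda n: n * n,
    "merge_sort": lambda n: n * math.log2(n) if n > 1 else 0,
}

def pick_program(candidates, n):
    """Choose the candidate with the lowest estimated cost.

    The score never depends on wall-clock measurements, only on the fixed model.
    """
    return min(candidates, key=lambda name: COST_MODELS[name](n))

print(pick_program(["bubble_sort", "merge_sort"], 1000))  # merge_sort
```

This only closes the specific leak discussed here, where the selection criterion rewards real-world speedups. It says nothing about other ways a system might come to affect its environment.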
The point I think Bostrom is making is that, contrary to intuition, just building the epistemic part of an AI and not telling it to enact the solution it found doesn’t guarantee you don’t get an optimizer_2.
I think I have an example of “an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful”. I posted it a couple days ago: Self-supervised learning & manipulative predictions. A self-supervised learning system is an optimizer_1: It’s trying to predict masked bits in a fixed, pre-loaded set of data. This task does not entail interacting with the world, and we would presumably try hard to design it not to interact with the world.
However, if it was a powerful learning system with world-knowledge (via its input data) and introspective capabilities, it would eventually figure out that it’s an AGI and might hypothesize what environment it’s in, and then hypothesize that its operations could affect its data stream via unintended causal pathways, e.g. sending out radio signals. Then, if it used certain plausible types of heuristics as the basis for its predictions of masked bits, it could wind up making choices based on their downstream effects on itself via manipulating the environment. In other words, it starts acting like an optimizer_2.
I’m not super-confident about any of this and am open to criticism. (And I agree with you that this is a useful distinction regardless; indeed I was arguing a similar (but weaker) point recently, maybe not as elegantly, at this link)
I’m confused about a couple of your examples. In my mind, “optimizing” relates to a number going up or down. More happiness, more money (more problems), more healthy, etc.
Gradient descent makes the cost function go down. RL makes a reward go up. I understand those two examples because there’s an implied goal.
But how is an SAT solver an optimizer? There’s not an implied goal as far as I can tell.
Same for a linear solver. I could solve linear regression with a linear solver and that has an implied loss function. But not linear solvers in general.
There’s a bunch of things you want to fit in your backpack, but only so much space. Fortunately, you’re carrying these things to sell, and you work in a business where you’re guaranteed to sell everything you bring with you, and the prices are fixed. You write a program with the same goal as you—finding the combination which yields maximum profit.
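The backpack program described above is the classic 0/1 knapsack problem. A minimal sketch using dynamic programming follows; the item names, sizes, and prices are made up for illustration, and `best_backpack` is a hypothetical name.

```python
def best_backpack(items, capacity):
    """Return (max_total_price, chosen_item_names) for a 0/1 knapsack.

    items is a list of (name, size, price) tuples with integer sizes;
    capacity is the integer amount of space in the backpack.
    """
    # best[c] holds the best (price, names) achievable with at most c units of space.
    best = [(0, [])] * (capacity + 1)
    for name, size, price in items:
        # Iterate capacities downwards so each item is packed at most once.
        for c in range(capacity, size - 1, -1):
            candidate_price = best[c - size][0] + price
            if candidate_price > best[c][0]:
                best[c] = (candidate_price, best[c - size][1] + [name])
    return best[capacity]

goods = [("lamp", 3, 40), ("rug", 4, 50), ("vase", 2, 30)]
print(best_backpack(goods, 5))  # (70, ['lamp', 'vase'])
```

Here the objective (total price) is explicit, which is what makes the program an optimizer in the "number going up" sense asked about above.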
Thanks Pattern! I do see now how it could be used as an optimizer, but it still seems like it’s not intrinsically an optimizer (like how linear algebra isn’t an optimizer but it could be used as an optimizer).
I don’t think anyone actually claimed they were all intrinsically optimizers, but I was still confused by it.
The way SAT solvers work is by trying to assign the maximum number of variables without conflict. For a finite number of constraints, the max is at most all of them, so it stops. (If the people who write it assume there’s a solution, then they probably don’t include code to keep the (current) “best” solution around.)
As Joar Skalse noted below, this could be considered intrinsic optimization. (Maybe if you wrote an SAT solver which consisted solely of trying random assignments for all variables at once until it happened to find one that satisfied all constraints it wouldn’t be an optimizer, absent additional code to retain the best solution so far.)
In contrast, I haven’t seen any work on this*: optimization is always described as finding solutions which fit all constraints. (The closest to an exception is Lagrange multipliers: that method produces a set of candidate solutions which includes the maximum/minimum, and then one has to check all of them to find which one that is. But all candidates satisfy the other constraints.)
The way it’s described would have one think it is maximizing a utility function, one which only returns 0 or 1. But the way the solution works is by assembling pieces towards that end, navigating the space of possible solutions in a fashion that will explore/eliminate every possibility before giving up.
*Which really stuck out once I had a problem in that form. I’m still working on it. Maybe work on chess AI includes this because the problem can’t be fully solved, and that’s what it takes to get people to give up on “optimal” solutions which take forever to find.
A linear program solver is a system that maximises or minimises a linear function subject to non-strict linear constraints.
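Written out, one common way to state such a problem is the following, where the symbols are notation introduced here for illustration: $x$ is the vector of decision variables, $c$ the coefficients of the linear objective, and $A$, $b$ the coefficients of the non-strict linear constraints.

```latex
\max_{x \in \mathbb{R}^n} \; c^\top x
\quad \text{subject to} \quad A x \le b
```

Minimisation problems fit the same form by negating $c$, and a constraint of the form $a^\top x \ge \beta$ can be rewritten as $-a^\top x \le -\beta$.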
Many SAT-solvers are implemented as optimizers. For example, they might try to find an assignment that satisfies as many clauses as possible, or they might try to minimise the size of the clauses using resolution.
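The first of these formulations, finding an assignment that satisfies as many clauses as possible, can be sketched as follows. Exhaustive enumeration is used purely for illustration; it is not how practical solvers search, and the function names and the signed-integer clause encoding are assumptions made here.

```python
from itertools import product

def count_satisfied(clauses, assignment):
    """Number of clauses with at least one true literal.

    A clause is a list of non-zero ints: k means variable k, -k means NOT variable k.
    """
    return sum(
        any(assignment[abs(literal)] == (literal > 0) for literal in clause)
        for clause in clauses
    )

def max_sat(clauses, n_vars):
    """Exhaustively find an assignment maximising the number of satisfied clauses."""
    best_assignment, best_score = None, -1
    for bits in product([False, True], repeat=n_vars):
        assignment = {index + 1: bit for index, bit in enumerate(bits)}
        score = count_satisfied(clauses, assignment)
        if score > best_score:
            best_assignment, best_score = assignment, score
    return best_assignment, best_score

# (x1 or not x2) and (x2 or x3) and (not x1)
print(max_sat([[1, -2], [2, 3], [-1]], 3))
```

When the formula is satisfiable, the maximum equals the number of clauses, so an ordinary yes/no SAT answer falls out as a special case of the objective being maximised.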
Thanks for the explanation and links! I agree that linear program solvers are intrinsically optimizing. SAT-solvers not intrinsically so, but could be used to optimize (such as Pattern’s example in his comment).
Moving on, I think it’s hard to define “environment” in a way that isn’t arbitrary, but it’s easier to think about “everything this optimizer affects”.
For every optimizer, I could (theoretically) list everything that it affects at that moment, but that list changes based off what’s around the optimizer to be affected. I could rig a set up such that, as soon as I run the optimizer code, it would cause a power surge and blow a fuse. Then, every optimizer has changed the environment while optimizing.
But you may argue that an opt2 would purposely choose an action that would affect the environment (especially if choosing that action maximizes reward) while an opt1 wouldn’t even have that action available.
But while available actions may be constrained by the type of optimizer, and you could try to make a distinction between different available actions, the effects of those limited actions change with the programming language, the hardware, etc.
I’m still confused on this, and you still may have made a good distinction between different optimizers.
Generalizing from Evolution.
The argument I quoted does not mention evolution. I’m not saying that the argument can’t be patched, I’m saying that the argument is inadequate as stated. I should note, however, that evolution selects organisms based on their ability to do optimisation_2, not their ability to do optimisation_1. It’s therefore not clear when and how you can simply “generalise from evolution”.
I meant the reason people think the jump is possible is because of evolution. (Evolution was just running, business as usual, and suddenly! We’re on the moon!)
I have seen better versions of this argument that mention humans or evolutionary algorithms explicitly. And you’re right—the argument isn’t clear on that. The jump is ‘a search for better plans will find better plans’.
A superintelligence is potentially more useful if it can model more. As an example, imagine that you want an AI that gives you a cure for cancer. Well, it does, but as a side effect of the cure, the patient loses 50 IQ points. Or perhaps the cure is incredibly painful. Or it is made from dead babies’ stem cells, causing outrage. Or it is insanely expensive, e.g. you would have to construct it atom by atom, in large quantities. Etc.
It would be better to have a superintelligence that understands all of these things, takes a little more time thinking, and finds a cure for cancer that also happens to be relatively cheap, inoffensive, painless, well tasting, and without negative side effects. (For the sake of argument, I assume here that both solutions are possible, it’s just that the second one is a bit more difficult to find, so the former AI goes with the first solution it finds because why not.)
But the further this way you go, the more likely the superintelligence is able to model its own existence, and people’s reaction to it. As soon as the AI is able to model correctly “if people turn me off 5 minutes before producing the cure for cancer, it means 0 people will be cured, even if my algorithm would have produced an efficient cure otherwise”, we get the first bits of self-awareness. Now the superintelligence will optimize the environment for its instrumental goals (survival, more resources, greater popularity or ability to defend itself) as a side effect of solving other problems.
It would require a selective blindness to make the superintelligence assume that it is disembodied, and that its computations will continue and produce effects in real world even if its body is destroyed. Actually… with sufficiently good model of the world, it could still reason about building another intelligence to assist it with the project. And if you make it blind towards computer science, there is still a chance it would invent another intelligence that doesn’t exactly fit your definition of a “computer”, e.g. an intelligent swarm of nanobots built from organic molecules. (There is a general argument somewhere on LW that you can’t reliably limit a superintelligence by creating a blacklist of forbidden moves, because by being smarter than you it can possibly think about things that should have been on your blacklist, but you didn’t think about them.)
Using your terminology, not every optimizer_1 is an optimizer_2, but the most useful ones of them are. A computer able to solve a huge system of linear equations is not as useful as the one that can find a cure for cancer.
I know these things. Nothing you have said contradicts my point, as far as I can see. The point I am making here is one of conceptual clarification, with the intent of enabling more clear thinking and reasoning.
You seem to be talking about a system that outputs “plans that, if implemented, would achieve X” (roughly), and your point seems to be that such a system would be likely to be or behave like an optimizer_2. I find this claim quite plausible (and fully compatible with the point I’m making).
“It would require a selective blindness to make the superintelligence assume that it is disembodied, and that its computations will continue and produce effects in real world even if its body is destroyed.”
Unclear, if anything it seems like it might be easier to make a Cartesian AI than a non-Cartesian one. But that’s a side note.
RE “make the superintelligence assume that it is disembodied”: I’ve been thinking about this a lot recently (see The Self-Unaware AI Oracle) and agree with Viliam that knowledge-of-one’s-embodiment should be the default assumption. My reasoning is: A good world-modeling AI should be able to recognize patterns and build conceptual transformations between any two things it knows about, and also should be able to do reasoning over extended periods of time. OK, so let’s say it’s trying to figure out something about biology, and it visualizes the shape of a tree. Now it (by default) has the introspective information “A tree has just appeared in my imagination!”. Likewise, if it goes through any kind of reasoning process, and can self-reflect on that reasoning process, then it can learn (via the same pattern-recognizing algorithm it uses for the external world) how that reasoning process works, like “I seem to have some kind of associative memory, I seem to have a capacity for building hierarchical generative models, etc.” Then it can recognize that these are the same ingredients present in those AGIs it read about in the newspaper. It also knows a higher-level pattern “When two things are built the same way, maybe they’re of the same type.” So now it has a hypothesis that it’s an AGI running on a computer.
It may be possible to prevent this cascade of events, by somehow making sure that “I am imagining a tree” and similar things never get written into the world model. I have this vision of two data-types, “introspective information” and “world-model information”, and your static type-checker ensures that the two never co-mingle. And voila, AI Safety! That would be awesome. I hope somebody figures out how to do that, because I sure haven’t. (Admittedly, I have neither time nor relevant background knowledge to try properly.) I’m also slightly concerned that, even if you figure out a way to cut off introspective knowledge, it might incidentally prevent the system from doing good reasoning, but I currently lean optimistic on that.
The dominant framework that I expect people who disagree with this distinction to have is simply that when optimizers become more powerful, there might be a smooth transition between an optimizer_1 and an optimizer_2. That is, if an optimizer is trained on some simulated environment, then from our point of view it may well look like it is performing a local constrained search for policies within its training environment. However, when the optimizer is taken off the distribution, then it may act more like an optimizer_2.
One particular example would be if we were dumping so much compute into selecting for mesa optimizers that they became powerful enough to understand external reality. On the training distribution they would do well, but off it they would just aim for whatever their mesa objective was. In this case it might look more like it was just an optimizer_2 all along and we were simply mistaken about its search capabilities, but on the other hand, the task we gave it was limited enough that we initially thought it would only run optimizer_1 searches.
That said, I agree that it is difficult to see how such a transition from optimizer_1 to optimizer_2 could occur in the real world.
I should clarify that I’m not necessarily saying that there can’t be cases in which a system that is believed or intended to be an optimizer_1 might become or turn out to be an optimizer_2 – I have not really argued for or against this. What I want to do is enable clearer thinking about this issue, so that one does not slide between these two concepts without noticing.
This definition of optimizer_2 depends on the definition of “environment”. It seems that for an RL agent you use the word “environment” to mean the formal environment as defined in RL. How do you define “environment”, for this purpose, in non-RL settings?
What should be considered the environment of a SAT solver, or an arbitrary mesa-optimizer that was optimized to be a SAT solver?
Yep. Good post. Important stuff. I think we’re still struggling to understand all of this fully, and work on indifference seems like the most relevant stuff.
My current take is that as long as there is any “black-box” part of the algorithm which is optimizing for performance, then it may end up behaving like an optimizer_2, since the black box can pick up on arbitrary effective strategies.
(in partial RE to Rohin below): I wouldn’t necessarily say that such an algorithm knows about its environment (i.e. has a good model), it may simply have stumbled upon an effective strategy for interacting with it (i.e. have a good policy).
I think the assumption is that a sufficiently capable optimizer_1 would need to be an optimizer_2.