Subtle point: the key question is not how certain we are, but how certain the predictor system (e.g. GPT) is. Presumably if it’s able to generalize that far out of distribution at all, it’s likely to have enough understanding to make a pretty high-confidence guess as to whether AGI will take over or not. We humans might not know the answer very confidently, but an AI capable enough to apply the human mimicry strategy usefully is more likely to know the answer very confidently, whatever that answer is.
I would love to see more people write this sort of thing; it seems very high-value for newcomers to orient, for existing researchers to see how people understand/misunderstand their arguments, for the writers to accelerate the process of orienting, and generally for people to understand the generators behind each others’ work and communicate better.
The best way to solve this problem is to specify a utility function that, for the most part, avoids instrumentally convergent goals (power seeking, preventing being turned off).
I’m not sure that a good formulation of corrigibility would look like a utility function at all.
Same with the “crux” later on:
The best way to make progress on alignment is to write down a utility function for an AI that:
Generalizes
Is robust to large optimization pressure
Specifies precisely what we want
Outer alignment need not necessarily look like a utility function. (There are good arguments that it will behave like a utility function, but that should probably be a derived fact rather than an assumed fact, at the very least.) And even if it is, there’s a classic failure mode in which someone says “well, it should maximize utility, so we want a function of the form u(X)...” and they don’t notice that most of the interesting work is in figuring out what the utility function is over (i.e. what “X” is), not the actual utility function.
Also, while we’re talking about a “crux”...
In each section, we’ve laid out some cruxes, which are statements that support that frame on the core of the alignment problem. These cruxes are not necessary or sufficient conditions for a problem to be central.
Terminological nitpick: the term “crux” was introduced to mean something such that, if you changed your mind about that thing, it would also probably change your mind about whatever thing we’re talking about (in this case, centrality of a problem). A crux is not just a statement which supports a frame.
We can get a soft-optimization proposal that works to solve this problem (instead of having the AGI hard-optimize something safe).
Not sure if this is already something you know, but “soft-optimization” is a thing we know how to do. The catch, of course, is that mild optimization can only do things which are “not very hard” in the sense that they don’t require very much optimization pressure.
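For concreteness, here is a minimal sketch of one known soft-optimization method, quantilization. This assumes a finite action set and a uniform base distribution, and the names are illustrative rather than from any particular implementation:

```python
import random

def quantilize(actions, utility, q=0.1):
    """Quantilization: rank actions by utility, then sample uniformly from
    the top-q fraction rather than taking the argmax. This caps the amount
    of optimization pressure applied to the utility function."""
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)
```

The catch shows up directly in the code: with q = 0.1, the chosen action is only guaranteed to be in the top decile of the base distribution, so tasks requiring actions far out in the tail (i.e. lots of optimization pressure) just won’t get done.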
This problem being tractable relies on some form of the Natural Abstractions Hypothesis.
There is, ultimately, going to end up being a thing like “Human Values,” that can be pointed to and holds up under strong optimization pressure.
Note that whether human values or corrigibility or “what I mean” or some other direct alignment target is a natural abstraction is not strictly part of the pointers problem. Pointers problem is about pointing to latent concepts in general; whether a given system has an internal latent variable corresponding to “human values” specifically is a separate question.
Also, while tractability of the pointers problem does depend heavily on NAH, note that it’s still a problem (and probably an even more core and urgent one!) even if NAH turns out to be relatively weak.
Overall, the problem-summaries were quite good IMO.
Name suggestion: “The Craig Venter Principle”. Back in ’98, the Human Genome Project was scheduled to finish sequencing the first full human genome in another 5 years (having started in 1990). Venter started a company to do it in two years with more modern tech (specifically shotgun sequencing). That basically forced the HGP to also switch to shotgun sequencing in order to avoid public embarrassment, and the two projects finished a first draft sequence at basically the same time.
Lazy data structures.
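For readers unfamiliar with the term, a one-line illustration in Python (generators are one way to get laziness; the canonical examples live in Haskell, where all data structures are lazy by default):

```python
from itertools import islice

def naturals():
    """An infinite sequence whose elements are only computed on demand."""
    n = 0
    while True:
        yield n
        n += 1

# Only the elements actually requested ever get computed:
first_five = list(islice(naturals(), 5))  # [0, 1, 2, 3, 4]
```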
This post is super useful, thank you for writing it!
One thing I’m unclear on: does the transferrable utility assumption require that utility be linear with respect to money (or money-analogue)? It seems like we need that in order to split the game into a sum of two games in the CoCo derivation, but it makes the argument a lot weaker once we introduce uncertainty.
I really like the question this post asks. The post itself is pretty scattered, but that’s not necessarily a bad thing given that it’s obviously a dump of in-progress research.
One place where I might direct effort differently: it seems like the “ideas that differ between people” in Appendix A offer more total bits-of-evidence than all the ML experiments combined, despite (I would guess) the ML experiments taking far more effort. This is a common pattern: the existing world offers an enormous number of bits-of-evidence at much lower cost than experiments. Often the experiments are necessary, in order to check how things behave outside the distribution offered by the real world, or in order to check something difficult to directly observe in the real world. But before investing lots of effort in an experiment, it’s worth stopping to ask whether you can get a lot more bits for your buck by looking at existing real-world instances. That’s especially true in the very early stages, when we’re not really sure whether outside-of-real-world-distribution experiments will actually tell us anything useful.
You’re right, that was missing. Very good and important point.
In the case of human planning, I know that there are lots of things which will cause other humans to “turn me off”, like e.g. going on a murder spree. So I mostly use search methods such that those things aren’t in my search space in the first place.
An AI using search methods such that things-humans-find-obviously-bad-and-will-punish just aren’t in the search space probably looks, at first glance, like an AI actually working as intended (even given interpretability tools). The problem is that there’s also a bunch of stuff humans would consider bad but either wouldn’t notice or wouldn’t punish (most likely because they wouldn’t easily notice/understand why it’s bad, at least until much later). And the AI has no particular reason to leave that stuff out of its search space, nor any particular reason to deceive humans about it; from the AI’s perspective, that stuff is strategically isomorphic to stuff humans don’t care about at all.
So, part of the reason we expect a retargetable search process in the first place is that it’s useful for the AI to recursively call it with new subproblems on the fly; recursive search on subproblems is a useful search technique. What we actually want to retarget is not every instance of the search process, but just the “outermost call”; we still want it to be able to make recursive calls to the search process while solving our chosen problem.
Ok, let’s try to disentangle a bit. There are roughly three separate failure modes involved here:
Nobody asks things like “If we take the action you just proposed, will we be happy with the outcome?” in the first place (mainly because organizations of >10 people are dysfunctional by default).
The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
(Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize “what questions should I ask?” such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.
Zooming in on the last bullet in more detail (because that’s the one closest to the OP): one of Buck’s proposed questions upthread was “If we take the action you just proposed, will we be happy with the outcome?”. That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all.
(This is pretty closely analogous to e.g. my example with the protesters: the protesters just don’t really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)
Now, it’s totally reasonable to say “but that’s just one random question Buck made up on the spot, obviously in practice we’ll put a lot more effort into it”. The problem is, when overseeing plans made by things smarter than ourselves, there will by very strong default be questions we don’t think to ask. Sure, we may catch the particular problem I just highlighted with the particular question, but what about the problems which we don’t think of? When there’s an intelligence differential even just as large as an IQ −2 sd vs IQ +2 sd human, the lower intelligence agent usually just does not know what the actually-important parts are to pay attention to. And we can’t get the AI to tell us what the actually-important parts are in an overseeable way without already knowing what we need to pay attention to when it’s answering that question.
I’m just very skeptical that the kinds of neural nets we’re training right now can learn to do anything remotely like that; I think it’s much more likely that people will hard code this type of reasoning into the compute graph with stuff like MCTS. This has already been pretty useful for e.g. MuZero. Once we’re hard coding search it’s less scary because it’s more interpretable and we can see exactly where the mesaobjective is.
I hope that you’re right; that would make Retargeting The Search very easy, and basically eliminates the inner alignment problem. Assuming, of course, that we can somehow confidently rule out the rest of the net doing any search in more subtle ways.
Indeed, ELK is very much asking the right questions, and I do expect people would use it if a robust and reasonably-performant solution were found. (Alignment is in fact economically valuable; it would be worth a lot.)
I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it.
That assumes humans are, in fact, likely to meaningfully resist a takeover attempt. My guess is that humans are not likely to meaningfully resist a takeover attempt, and the AI will (implicitly) know that.
I mean, if the AI tries to change who’s at the top of society’s status hierarchy (e.g. the President), then sure, the humans will freak out. But what does an AI care about the status hierarchy? It’s not like being at the top of the status hierarchy conveys much real power. It’s like your “total horse takeover” thing; what the AI actually wants is to be able to control outcomes at a relatively low level. Humans, by and large, don’t even bother to track all those low-level outcomes, they mostly pay attention to purely symbolic status stuff.
Now, it is still true that humans are a major influence on the world and source of resources. An AI will very plausibly want to work with the humans, use them in various ways. But that doesn’t need to parse to human social instincts as a “takeover”.
I’m imagining roughly (1), though with some caveats:
Of course it probably wouldn’t literally be A* search
Either the heuristic-generation is internal to the search subroutine, or it’s using a standard library of general-purpose heuristics for everything (or some combination of the two).
A lot of the subgoal formulation is itself internal to the search (i.e. recursively searching on subproblems is a standard search technique).
I do indeed expect that the major alignment work is in formulating problem specification, and possibly subgoals/heuristics (depending on how much of that is automagically handled by instrumental convergence/natural abstraction). That’s basically the conclusion of the OP: outer alignment is still hard, but we can totally eliminate the inner alignment problem by retargeting the search.
Separately: do you think we could easily “retarget the search” for an adult human, if we had mechanistic interpretability + edit access for the human’s brain? I’d expect “no”.
I expect basically “yes”, although the result would be something quite different from a human.
We can already give humans quite arbitrary tasks/jobs/objectives, and the humans will go figure out how to do it. I’m currently working on a post on this, and my opening example is Benito’s job; here are some things he’s had to do over the past couple years:
build a prototype of an office
resolve neighbor complaints at a party
find housing for 13 people with 2 days notice
figure out an invite list for 100+ people for an office
deal with people emailing a funder trying to get him defunded
set moderation policies for LessWrong
write public explanations of grantmaking decisions
organize weekly online zoom events
ship books internationally by Christmas
moderate online debates
do April Fools’ jokes on LessWrong
figure out which of 100s of applicants to do trial hires with
So there’s clearly a retargetable search subprocess in there, and we do in fact retarget it on different tasks all the time.
That said, in practice most humans seem to spend most of their time not really using the retargetable search process much; most people mostly just operate out of cache, and if pressed they’re unsure what to point the retargetable search process at. If we were to hardwire a human’s search process to a particular target, they’d single-mindedly pursue that one target (and subgoals thereof); that’s quite different from normal humans.
Do you mean “hardcoded reward circuit”
I’m not that committed to the RL frame, but roughly speaking yes. Whatever values we have are probably generated by ~tens of hardcoded things. Anyway, on to the meat of the discussion...
It seems like a whole bunch of people are completely thrown off by use of the word “search”. So let’s taboo that and talk about what’s actually relevant here.
We should expect compression, and we should expect general-purpose problem solving (i.e. the ability to take a fairly arbitrary problem in the training environment and solve it reasonably well). The general-purpose part comes from a combination of (a) variation in what the system needs to do to achieve good performance in training, and (b) the recursive nature of problem solving, i.e. solving one problem involves solving a wide variety of subproblems. Compactness means that it probably won’t be a whole boatload of case-specific heuristics; lookup tables are not compact. A subroutine for reasonably-general planning or problem-solving (i.e. take a problem statement, figure out a plan or solution) is the key thing we’re talking about here. Possibly a small number of such subroutines for a few different problem-classes, but not a large number of such subroutines, because compactness. My guess would be basically just one.
That probably will not look like babble and prune. It may look like a general-purpose heuristic-generator (like e.g. relaxation-based heuristic generation). Or it may look like general-purpose efficiency tricks, like caching solutions to common subproblems. Or it may look like hardcoded heuristics which are environment-specific but reasonably goal-agnostic (like e.g. the sort of thing in Mazes and Duality, which yields a maze-specific heuristic, but one which applies to a wide variety of path-finding problems within that maze). Or it may look like hardcoded strategies for achieving instrumentally convergent goals in the training environment (really this is another frame on caching solutions to common subproblems). Or it may look like learning instrumentally convergent concepts and heuristics from the training environment (i.e. natural abstractions; really this is another frame on environment-specific but goal-agnostic heuristics). Probably it’s a combination of all of those, and others too.
The important point is that it’s a problem-solving subroutine which is goal-agnostic (though possibly environment-specific). Pass in a goal, it figures out how to achieve that goal. And we do see this with humans: you can give humans pretty arbitrary goals, pretty arbitrary jobs to do, pretty arbitrary problems to solve, and they’ll go figure out how to do it.
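The goal-agnostic type signature can be sketched with a toy planner. The breadth-first implementation is purely illustrative (a real learned subroutine would use the efficiency tricks above, not brute enumeration); what matters is that the goal is an argument, not baked in:

```python
from collections import deque

def plan(start, is_goal, successors):
    """A goal-agnostic planning subroutine (toy breadth-first version).
    Because the goal is just a parameter, the same subroutine can be
    pointed at arbitrary targets within the same environment."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if is_goal(state):
            return path  # shortest action sequence reaching the goal
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None

# Same subroutine, different goals: only the is_goal argument changes.
# succ = lambda s: [("+1", s + 1), ("*2", s * 2)] if s <= 10 else []
# plan(1, lambda s: s == 10, succ)  -> a shortest action sequence to 10
```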
My ontology indeed has search and a narrow notion of optimization as approximately synonyms; they differ only somewhat in type signature and are easily interchangeable. Conceptually, both take in an objective, and return something which scores highly on the objective. (This is narrower than e.g. Flint’s notion of “optimization”; in that ontology it might be called a “general-purpose optimizer” instead.)
Anyway, insofar as any of this is relevant to the arguments for mesa-optimization, it’s the notion of search/optimization as general problem solving which applies there.
It excludes methods specific to a small number of problems. Search is about general problem solving.
Anyway, IIUC this is how the term “search” has historically been used in AI, it is also the notion of “search” which is relevant to the arguments for mesaoptimization in Risks From Learned Optimization, it is also the notion of search which is relevant to general intelligence being A Thing, it is the notion of search which is relevant to the possibility of an ML system suddenly grokking “general search” and thereby undergoing a rapid increase in capabilities, etc.
Yeah, that’s definitely not what I mean by search (nor what I think others mean by search, in the context of AI and inner agents).
Roughly speaking, a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective. For instance, a gradient descent algorithm takes in an objective, and returns a point which scores well on the objective, for a very broad class of possible objectives; gradient descent is therefore a search method.
Enumerating possible actions and evaluating their consequences is one way to do general search, but it’s wildly inefficient; I would typically refer to that as “brute force search”. Gradient descent does better by leveraging backprop and gradients; approximately none of the algorithmic work done by gradient descent comes from direct evaluation of the consequences of actions. And there are many other tricks one can use too—like memoization on subsearches, or A*-style heuristic search, or (one meta-level up from A*) relaxation-based methods to discover heuristics. The key point is that these tricks are all very general purpose: they work on a very wide variety of search problems, and therefore produce general-purpose search algorithms which are more efficient than brute force (at least on realistic problems).
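To make the type signature concrete, here is a toy one-dimensional version. This is a sketch, not a claim about internals; the notable feature is that the objective itself is never evaluated on candidate points, illustrating how a general search method can avoid enumerate-and-evaluate entirely:

```python
def grad_descent(objective, grad, x0, lr=0.1, steps=200):
    """A general-purpose search method in the above sense: takes in an
    objective (from a broad class of differentiable objectives) and
    returns a point which scores well on it. Note that it never
    enumerates candidate points and evaluates their consequences; all
    the algorithmic work goes through the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# grad_descent(lambda x: (x - 3) ** 2, lambda x: 2 * (x - 3), 0.0)  ~ 3.0
```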
More advanced general-purpose search methods seem to rely relatively little on enumerating possible actions and evaluating their consequences. By the time we get to human-level search capabilities, we see human problem-solvers spend most of their effort on nontrivial problems thinking about subproblems, abstractions and analogies rather than thinking directly about particular solutions.