Why I’m not working on {debate, RRM, ELK, natural abstractions}

[For background & spelling out the acronyms in the title, see: Debate (AI safety technique), Recursive Reward Modeling, Eliciting Latent Knowledge, Natural Abstractions.]

When I say “Why I’m not working on X”, I am NOT trying to find a polite & diplomatic way to say “Nobody should work on X because X is unhelpful for AGI safety”. Hmm, OK, well, maybe it’s just a little bit that. But really, I don’t feel strongly. Instead, I think:

  1. A lot of disagreement about what a solution to technical AGI safety looks like is really downstream of disagreements about questions like “How will AGI be built? What will it look like? How will it work?”

  2. Nobody really knows the answers to those questions.

  3. So we should probably be contingency-planning, by going through any possible answers to those questions that at least some reasonable person finds plausible, and doing AGI safety research conditional on those answers being correct.

  4. But still, I have my own opinions about the answers to those questions, and obviously I think my opinions are right, and I am not going to work on something unless it makes sense on my own models. And since people ask me from time to time, it seems worth explaining why the various research programs in the post title do not seem to be a good use of my time, given my own models of how AGI will be developed and what AGI will look like.

I wrote this post quickly and did not run it by the people I’m (sorta) criticizing. Do not assume that I described anything fairly and correctly. Please leave comments, and I’ll endeavor to update this post or write a follow-up in the case of major errors / misunderstandings / mind-changes.

(By the way: If I’m not working on any of those research programs, then what am I working on? See here. I listed six other projects that seem particularly great to me here, and there are many others besides.)

1. Background

1.1 “Trying” to figure something out seems both necessary & dangerous

(Partly self-plagiarized from here.)

Let’s compare two things: “trying to get a good understanding of some domain by building up a vocabulary of new concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common!

  • In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there’s some analogy between this and that?”, or you could think the thought “that’s a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities.

  • In both cases, you’ll often get some early hints about whether moves were wise, but you won’t really know that you’re on the right track except in hindsight.

  • And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful.

Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded.
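(To make that a little more concrete, here is a deliberately toy sketch. Everything in it is made up for illustration and is not a claim about how a real system would be built; the only point is that the RL “actions” are candidate edits to a world-model, and the reward is how much an edit improves prediction.)

```python
import random

# Toy "world model": a single parameter w for the predictor y ≈ w*x, plus a
# running value estimate for each kind of "thought" (edit) the system can make.
DATA = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in range(1, 21)]

def prediction_loss(w):
    return sum((y - w * x) ** 2 for x, y in DATA) / len(DATA)

# "Thoughts" are candidate edits to the world-model. The RL is over these
# edits, not over any external action.
EDIT_TYPES = {
    "small_tweak": lambda w: w + random.gauss(0, 0.05),
    "big_jump":    lambda w: w + random.gauss(0, 1.0),
    "fresh_guess": lambda w: random.gauss(0, 3.0),
}
edit_value = {name: 0.0 for name in EDIT_TYPES}  # learned value of each thought-type

w = 0.0
for step in range(500):
    # Epsilon-greedy choice of which kind of thought to think next.
    if random.random() < 0.1:
        name = random.choice(list(EDIT_TYPES))
    else:
        name = max(edit_value, key=edit_value.get)
    candidate = EDIT_TYPES[name](w)
    reward = prediction_loss(w) - prediction_loss(candidate)  # improvement = reward
    edit_value[name] += 0.1 * (reward - edit_value[name])     # reinforce useful thought-types
    if reward > 0:
        w = candidate  # keep edits that actually improved the model

print(f"learned w ≈ {w:.2f}")
print("value of each thought-type:", edit_value)
```

(Everything that would make this hard in a real system, like how edits get proposed and how credit gets assigned across long chains of thoughts, is of course missing here.)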

Next, consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level:

  • Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transformations. For this to happen, we need “understand symplectic transformations” to be flagged as a temporary sub-goal, and to be pursued, and we want it to be able to spawn further sub-sub-goals and so on.

  • Consequentialist planning—Relatedly, we want the AI to be able to summon and re-read a textbook on linear algebra, or mentally work through an example problem, because it anticipates that these activities will lead to better understanding of the target domain.

  • Meta-cognition—We want the AI to be able to learn patterns in which of its own “thoughts” lead to better understanding and which don’t, and to apply that knowledge towards having more productive thoughts.

Putting all these things together, it seems to me that the default for this kind of AI would be to figure out that “seizing control of its off-switch” would be instrumentally useful for it to do what it’s trying to do (e.g. develop a better understanding of the target domain), and then to come up with a clever scheme to do so, and then to do it.

So “trying” to figure something out seems to me to be both necessary and dangerous.

(Really, there are two problems: (A) “trying to figure out X” spawns dangerous power-seeking instrumental subgoals by default; and (B) we don’t know how to make an AGI that is definitely “trying to figure out X” in the first place, as opposed to “trying to make paperclips” or whatever.)

1.2 The “follow-the-trying game”

Just like Eliezer’s “follow-the-improbability game”, I often find myself playing the “follow-the-trying game” when evaluating AGI safety proposals.

As above, I don’t think an AI can develop new useful concepts or come up with new plans (at least, not very well) without “trying to figure out [something]”, and I think that “trying” inevitably comes along with x-risk. Thus, for example:

  • I often see proposals like: “The AI comes up with a plan, and the human evaluates the plan, and the human implements the plan if it seems good”. The proposers want to focus the narrative on the plan-evaluation step, with the suggestion that if humans are capable of evaluating the plan, then all is well, and if not, maybe the humans can have AI assistance, etc. But to me, the more dangerous part is the step where the AI is coming up with the plan—that’s where the “trying” would be! And we should be thinking about things like “when the AI is supposedly ‘trying’ to come up with a plan, what if it’s actually ‘trying’ to hack its way out of the box?”, or “what if the AI is actually ‘trying’ to find a plan which will trick the humans?”, or (as an extreme version of that) “what if the AI outputs a (so-called) plan that’s just a text file saying ‘Help, I’m trapped in a box…’?”.

  • Likewise, my criticism of Vanessa Kosoy’s research agenda significantly centered around my impression that she tends to presuppose that the AI already has a superhumanly-capable world-model, and that the safety risks come from having it choose outputs based on that knowledge. But I want to talk about the safety risks that happen during the process of building up that superhuman understanding in the first place. Again, I claim that this process necessarily involves “trying to figure things out”, and wherever there’s “trying”, there’s x-risk.

  • In the case of “vanilla LLMs” (trained 100% by self-supervised learning): I’m oversimplifying here, but basically I think the “trying” was performed by humans, in the training data. This is a good thing insofar as it makes vanilla LLMs safer, but it’s a bad thing insofar as it makes me expect that vanilla LLMs won’t scale to AGI, and thus that sooner or later people will either depart from the vanilla-LLM paradigm (in a way that makes it far more dangerous), or else make AGI in a different (and far more dangerous) way.

1.3 Why I want to move the goalposts on “AGI”

Two different perspectives are:

  • AGI is about already knowing how to do lots of things.

  • AGI is about not knowing how to do something, and then being able to figure it out.

I’m strongly in the second camp. That’s why I’ve previously commented that the Metaculus criterion for so-called “Human/Machine Intelligence Parity” is no such thing. It’s based on grad-school-level technical exam questions, and exam questions are inherently heavily weighted towards already knowing things rather than towards not knowing something but then figuring it out. Or, rather, if you’re going to get an “A+” on an exam, there’s a spectrum of ways to do so, where one end of the spectrum has relatively little “already knowing” and a whole lot of “figuring things out”, and the opposite end of the spectrum has a whole lot of “already knowing” and relatively little “figuring things out”. I’m much more interested in the “figuring things out” part, so I’m not too interested in protocols where that part of the story is to some extent optional.

(Instead, I’ve more recently started talking about “AGI that can develop innovative science at a John von Neumann level”, and things like that. Seems harder to game by “brute-force massive amounts of preexisting knowledge (both object-level and procedural)”.)

(Some people will probably object here, on the theory that “figuring things out” is not fundamentally different from “already knowing”, but rather is a special case of “already knowing”, wherein the “knowledge” is related to meta-learning, plus better generalizations that stem from diverse real-world training data, etc. My response is: that’s a reasonable hypothesis to entertain, and it is undoubtedly true to some extent, but I still think it’s mostly wrong, and I stand by what I wrote. However, I’m not going to try to convince you of that, because my opinion is coming from “inside view” considerations that I don’t want to get into here.)

OK, that was background, now let’s jump into the main part of the post.

2. Why I’m not working on debate or recursive reward modeling

Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment questions: if there’s an AGI “trying” to do something, how do we know that it’s not also “trying” to hack its way out of the box, seize power, and so on? And if we can control the AGI’s motivations well enough to answer those questions, why not throw out the whole “debate” idea and use those same techniques (whatever they are) to simply make an AGI that is “trying” to figure out the correct answer and tell it to us?

(One possible answer is that there are two AGIs working at cross-purposes, and they will prevent each other from breaking out of the box. But the two AGIs are only actually working at cross-purposes if we solve the alignment problem!! And even if we somehow knew that each AGI was definitely motivated to stop the other AGI from escaping the box, who’s to say that one couldn’t spontaneously come up with a new good idea for escaping the box that the other didn’t think of defending against? Even if they start from the same base model, they’re not thinking the same thoughts all the time, I presume.)

As for recursive reward modeling, I already covered it in Section 1.2 above.

3. Why I’m not working on ELK

Let’s play the “follow-the-trying game” again, this time on ELK. When I open the original ELK document, I immediately find an AI that is simply assumed to already have a superhuman understanding of what’s going on.

So if I’m right that superhuman understanding requires an AI that was “trying” to figure things out, and that this “trying” is where [part of] the danger is, then the dangerous part [that I’m interested in] is over before the ELK document has even gotten started.

In other words:

  • There’s an open question of how to make a model that is “trying” to figure out what the sensor will say under different conditions, and doing so without also “trying” to escape the box and seize power etc. This safety-critical problem is outside the scope of ELK.

  • …And if we solve that problem, then maybe we could use those same techniques (whatever they are) to just directly make a model that is “trying” to be helpful. And then ELK would be unnecessary.

So I don’t find myself motivated to think about ELK directly.

(Update: See discussion with Paul in the comments.)

4. Why I’m not working on John Wentworth’s “natural abstractions” stuff

4.1 The parts of the plan that John is thinking hard about seem less pressing to me

I think John is mostly focused on building a mapping between “things we care about” (e.g. corrigibility, human flourishing) and “the internal language of neural nets”. I mostly see that as some combination of “plausibly kinda straightforward” and “will happen by default in the course of capabilities research”.

For example, if I want to know which neurons in the AGI are related to apples, I can catch it looking at apples (maybe show it some YouTube videos), see which neurons in the network light up when it does, flag those neurons, and I’m done. That’s not a great solution, but it’s a start—more nuanced discussion here.
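As a rough illustration of that procedure, here is a sketch assuming PyTorch; the network and the “frames” below are random placeholders standing in for the AGI’s actual network and video inputs:

```python
import torch
import torch.nn as nn

# Placeholder network and inputs: in reality these would be the AGI's actual
# network and frames from the apple videos vs. unrelated frames.
net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
apple_frames = torch.randn(200, 64)   # stand-in for "watching apples"
other_frames = torch.randn(200, 64)   # stand-in for everything else

captured = {}
def save_hidden(module, inputs, output):
    captured["hidden"] = output.detach()

net[1].register_forward_hook(save_hidden)  # record the hidden-layer activations

def mean_hidden_activation(frames):
    with torch.no_grad():
        net(frames)
    return captured["hidden"].mean(dim=0)

# "Apple neurons" = hidden units that are, on average, much more active on
# apple frames than on other frames.
diff = mean_hidden_activation(apple_frames) - mean_hidden_activation(other_frames)
apple_neurons = torch.topk(diff, k=5).indices
print("candidate apple-related neurons:", apple_neurons.tolist())
```

A real version would need something much more careful than a mean-activation difference, but that’s the flavor of “see which neurons light up and flag them”.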

As another example, in “Just Retarget the Search”, John talks about the mesa-optimizer scenario where a search algorithm emerges organically inside a giant deep neural net, and we have to find it. But I’m expecting that AGI will look like model-based RL, in which case we don’t have to “search for search”: the search is right there in the human-written source code. The analog of “Just Retarget the Search” would instead look like: go looking for things that we care about in the world-model, and manually set their “value” (in RL jargon) / “valence” (in psych jargon) to very positive, or very negative, or neutral, depending on what we’re trying to do. Incidentally, I see that as an excellent idea, and it’s the kind of thing I discuss here.
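Here is a toy sketch of what “manually set their value / valence” could look like; the concept names and numbers are made up, and the real thing would involve the AGI’s actual learned world-model and planner:

```python
# Toy sketch of "set the valence of concepts we've located in the world-model".
# Concept names, numbers, and the two hand-written "imagined outcomes" are all
# made up; a real version would involve an actual learned world-model and planner.
valence = {
    "human_flourishing": +10.0,  # manually painted very positive
    "self_preservation":   0.0,  # manually neutralized
    "paperclips":          0.0,
}

def evaluate(imagined_outcome):
    """Score a rolled-out plan by the valence of the concepts it activates."""
    return sum(valence.get(concept, 0.0) * strength
               for concept, strength in imagined_outcome.items())

plan_a = {"human_flourishing": 0.8, "paperclips": 0.3}
plan_b = {"self_preservation": 0.9, "paperclips": 0.9}
best_plan = max([plan_a, plan_b], key=evaluate)
print("chosen plan activates:", best_plan)
```

The design choice being illustrated: the planner’s preferences live entirely in that valence table, so “retargeting” is just editing the table rather than searching the network for an emergent optimizer.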

4.2 The parts of the plan that seem very difficult to me, John doesn’t seem to be working on

So my impression is that the things that John is working on are things that I’m kinda not too worried about. And conversely, I’m worried about different things that John does not seem too focused on. Specifically:

  • Dealing with “concept extrapolation”—Let’s say an AGI has an idea of how to invent “mind-meld technology” (whatever that is), and is deciding whether doing so is a good idea or not. Or maybe the AGI figures out that someone else might invent “mind-meld technology”, and needs to decide what if anything to do about that. Either way, the AGI had a suite of abstractions that worked well in the world of its training, but it’s now forced to have preferences about a different world, a world where many of its existing concepts / abstractions related to humanity & personhood etc. are broken. (“Mind-meld technology” is an extreme example, for clarity, but I think micro-versions of this same dynamic are inevitable and ubiquitous.) There isn’t any good way (that I know of) either to extrapolate its existing preferences into this new [hypothetical future] world containing new natural abstractions, or to query a human for clarification. Again see discussion here, and a bit more in my back-and-forth with John here. (One possible solution is to load up the AGI with the full suite of human social and moral instincts, such that it will tend to do concept-extrapolation in a human-like way for human-like reasons—see here. As it turns out, I am very interested in that, but it looks pretty different from what John is doing.)

    • Relatedly, I suspect that something like “corrigibility” would actually turn out to be a huge number of different concepts / abstractions that are strongly statistically correlated in the real world. But they could come apart out of distribution, so we need to decide which one(s) we really care about. Related SSC post.

  • Interpretability specifically around self-concept and meta-preferences, a.k.a. “the first-person problem”, seems especially hard and especially important—see discussion at that link.

  • Figuring out exactly what value / valence to paint onto exactly what concepts (my impression is that John wants to put off that question until later).