The Plan - 2022 Update
So, how’s The Plan going?
In last year’s writeup of The Plan, I gave “better than a 50/50 chance” that it would work before AGI kills us all (and my median AI timelines were around 10-15 years). That was an outside view, accounting for planning fallacy and the inevitable negative surprises. My inside view was faster—just based on extrapolating my gut feel of the rate of progress, I privately estimated that The Plan would take around 8 years. (Of those 8, I expected about 3 would be needed to nail down the core conceptual pieces of agent foundations, and the other 5 would be to cross the theory-practice gap. Of course those would be intermingled, though with the theory part probably somewhat more front-loaded.)
Over the past year, my current gut feel is that progress has been basically in line with the inside-view 8 year estimate (now down to 7, since a year has passed), and maybe even a little bit faster than that.
So, relative to my outside-view expectation that things always go worse than my gut expects, things are actually going somewhat better than expected! I’m overall somewhat more optimistic now, although the delta is pretty small. It’s only been a year; there’s still lots of time for negative surprises to appear.
Any high-level changes to The Plan?
There have been two main high-level changes over the past year.
First: The Plan predicted that, sometime over the next 5 (now 4) years, the field of alignment would “go from a basically-preparadigmatic state, where we don’t even know what questions to ask or what tools to use to answer them, to a basically-paradigmatic state, where we have a general roadmap and toolset”. Over the past year, I tentatively think the general shape of that paradigm has become visible, as researchers converge from different directions towards a common set of subproblems.
Second: I’ve updated away from thinking about ambitious value learning as the primary alignment target. Ambitious value learning remains the main long-term target, but I’ve been convinced that e.g. corrigibility is worth paying attention to as a target for early superhuman AGI. Overall, I’ve updated from “just aim for ambitious value learning” to “empirically figure out what potential medium-term alignment targets (e.g. human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AGI’s internal concept-language”.
Convergence towards a paradigm sounds exciting! So what does it look like?
Exciting indeed! Gradual convergence toward a technical alignment paradigm has probably been the most important update from the past year.
On the theoretical side, Paul Christiano, Scott Garrabrant, and I had all basically converged to working on roughly the same problem (abstraction, ontology identification, whatever you want to call it) by early 2022. That kind of convergence is a standard hallmark of a proto-paradigm.
Meanwhile, within the past year-and-a-half or so, interpretability work has really taken off; Chris Olah’s lab is no longer head-and-shoulders stronger than everyone else. And it looks to me like the interpretability crowd is also quickly converging on the same core problem of abstraction/ontology-identification/whatever-you-want-to-call-it, but from the empirical side rather than the theoretical side.
That convergence isn’t complete yet—I think a lot of the interpretability crowd hasn’t yet fully internalized the framing of “interpretability is primarily about mapping net-internal structures to corresponding high-level interpretable structures in the environment”. In particular I think a lot of interpretability researchers have not yet internalized that mathematically understanding what kinds of high-level interpretable structures appear in the environment is a core part of the problem of interpretability. You have to interpret the stuff-in-the-net as something, and it’s approximately-useless if the thing-you-interpret-stuff-in-the-net-as is e.g. a natural-language string without any legible mathematical structure attached, or an ad-hoc mathematical structure which doesn’t particularly cut reality at the joints. But interpretability researchers have a very strong feedback loop in place, so I expect they’ll iterate toward absorbing that frame relatively quickly. (Though of course there will inevitably be debate about the frame along the way; I wouldn’t be surprised if it’s a hot topic over the next 1-2 years. And also in the comment section of this post.)
Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.
As that shift occurs, I expect we’ll also see more discussion of end-to-end alignment strategies based on directly reading and writing the internal language of neural nets. (Retargeting The Search is one example, though it makes some relatively strong assumptions which could probably be relaxed quite a bit.) Since such strategies very directly handle/sidestep the issues of inner alignment, and mostly do not rely on a reward signal as the main mechanism to incentivize intended behavior/internal structure, I expect we’ll see a shift of focus away from convoluted training schemes in alignment proposals. On the flip side, I expect we’ll see more discussion about which potential alignment targets (like human values, corrigibility, Do What I Mean, etc) are likely to be naturally expressible in the internal language of neural nets, and how to express them.
Assuming this paradigm formation extrapolation is roughly correct, it’s great news! This sort of paradigm formation is exactly why The Plan was so optimistic about being able to solve alignment in the next 10-15 (well, now 9-14) years. And, if anything, it currently looks like the paradigm is coming together somewhat faster than expected.
Why the update about corrigibility?
Let’s start with why I mostly ignored corrigibility before. Mainly, I wasn’t convinced that “corrigibility” was even a coherent concept. Lists of desiderata for corrigibility sounded more like a grab-bag of tricks than like a set of criteria all coherently pointing at the same underlying concept. And MIRI’s attempts to formalize corrigibility had found that it was incompatible with expected utility maximization. That sounds to me like corrigibility not really being “a thing”.
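The flavor of the incompatibility result can be shown with a toy “off-switch” decision (a hedged sketch, not MIRI’s actual formalism; all numbers and action names are hypothetical): a generic expected-utility maximizer strictly prefers either disabling its off-switch or exploiting it, and the corrigible “indifferent” case only occurs on a knife edge where the utilities exactly balance.

```python
def expected_utility(action, u_task, u_shutdown, p_shutdown):
    # Toy model: if the switch is left intact, the humans press it with
    # probability p_shutdown; disabling it forgoes shutdown entirely.
    if action == "disable switch":
        return u_task
    return p_shutdown * u_shutdown + (1 - p_shutdown) * u_task

def best_action(u_task, u_shutdown, p_shutdown=0.5):
    eu = {a: expected_utility(a, u_task, u_shutdown, p_shutdown)
          for a in ("disable switch", "leave switch intact")}
    if eu["disable switch"] > eu["leave switch intact"]:
        return "disable switch"
    if eu["disable switch"] < eu["leave switch intact"]:
        return "leave switch intact"
    return "indifferent"
```

For almost any (u_task, u_shutdown) pair one of the incorrigible actions strictly wins; indifference requires exact equality, which is why ad-hoc utility tweaks kept failing.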
Conversely, I expect that some of the major benefits which people want from corrigibility would naturally come from value learning. Insofar as humans want their AGI to empower humans to solve their own problems, or try to help humans do what the humans think is best even if it seems foolish to the AGI, and so forth, a value-aligned AI will do those things. In other words: value learning will produce some amount of corrigibility, because humans want their AGI to be corrigible. Therefore presumably there’s a basin of attraction in which we get values “right enough” along the corrigibility-relevant axes.
The most interesting update for me was when Eliezer reframed the values-include-some-corrigibility argument from the opposite direction (in an in-person discussion): insofar as humans value corrigibility (or particular aspects of corrigibility), the same challenges of expressing corrigibility mathematically also need to be solved in order to target values. In other words, the key mathematical challenges of corrigibility are themselves robust subproblems of alignment, which need to be solved even for value learning. (Note: this is my takeaway from that discussion, not necessarily the point Eliezer intended.)
That argument convinced me to think some more about MIRI’s old corrigibility results. And… they’re not very impressive? Like, people tried a few hacks, and the hacks didn’t work. Fully Updated Deference is the only real barrier they found, and I don’t think it’s that much of a barrier—it mostly just shows that something is wrong with the assumed type-signature of the child agent, which isn’t exactly shocking.
(Side note: fully updated deference doesn’t seem like that much of a barrier in the grand scheme of things, but it is still a barrier which will probably block whatever your first idea is for achieving corrigibility. There are probably ways around it, but you need to actually find and use those ways around.)
While digging around old writing on the topic, I also found an argument from Eliezer that “corrigibility” is a natural concept:
The “hard problem of corrigibility” is interesting because of the possibility that it has a relatively simple core or central principle—rather than being value-laden on the details of exactly what humans value, there may be some compact core of corrigibility that would be the same if aliens were trying to build a corrigible AI, or if an AI were trying to build another AI.
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
Now that sounds like the sort of thing which is potentially useful! Shame that previous attempts to formulate corrigibility started with kinda-ad-hoc desiderata, rather than from an AI building a sub-AI while being prone to various sorts of errors. (Pro tip for theory work: when you’re formalizing a concept, and you have some intuitive argument for why it’s maybe a natural concept, start from that argument!)
So my overall takeaway here is:
There’s at least a plausible intuitive argument that corrigibility is A Thing.
Previous work on formalizing/operationalizing corrigibility was pretty weak.
So are you targeting corrigibility now?
No. I’ve been convinced that corrigibility is maybe A Thing; my previous reasons for mostly-ignoring it were wrong. I have not been convinced that it is A Thing; it could still turn out not to be.
But the generalizable takeaway is that there are potentially-useful alignment targets which might turn out to be natural concepts (of which corrigibility is one). Which of those targets actually turn out to be natural concepts is partially a mathematical question (i.e. if we can robustly formulate it mathematically then it’s definitely natural), and partially empirical (i.e. if it ends up being a natural concept in an AI’s internal ontology then that works too).
So my new main position is: which potential alignment targets (human values, corrigibility, Do What I Mean, human mimicry, etc) are naturally expressible in an AI’s internal language (which itself probably includes a lot of mathematics) is an empirical question, and that’s the main question which determines what we should target.
How has broadening the alignment target changed your day-to-day research?
It hasn’t. The reason is explained in Plans Are Predictions, Not Optimization Targets. Briefly: the main thing I’m working on is becoming generally less confused about how agents work. While doing that, I mostly aim for robust bottlenecks—understanding abstraction, for instance, is robustly a bottleneck for many different approaches (which is why researchers converge on it from many different directions). Because it’s robust, it’s still likely to be a bottleneck even when the target shifts, and indeed that is what happened.
What high-level progress have you personally made in the past year? Any mistakes made or things to change going forward?
In my own work, theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
As of The Plan, I was hoping to have efficient algorithms for computing natural abstractions in simulated environments by six months ago, and that basically didn’t happen. I did do a couple of interesting experiments (which haven’t been written up):
Jeffery Andrade and I both tried to calculate natural abstractions in the Game of Life, which basically did not work.
I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.
… but mostly I ended up allocating time to other things. The outputs of those experiments gave me what I need for now; I’m back to being bottlenecked on theory. (Which is normal—running a computational experiment and exploring the results in detail takes a few days or maybe a couple of weeks at most, which is far faster than an iteration cycle on theory development, so of course I spend most of my time bottlenecked on theory.)
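For concreteness about the substrate of the first experiment: the Game of Life evolves under a step rule like the following minimal implementation (just the environment dynamics as a stand-in; the natural-abstraction computations themselves aren’t reproduced here).

```python
from collections import Counter

def life_step(live):
    """One Game of Life step; `live` is a set of (x, y) live-cell coordinates."""
    # Count, for every cell adjacent to a live cell, how many live neighbors it has.
    counts = Counter((x + dx, y + dy)
                     for x, y in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # A cell is live next step if it has exactly 3 live neighbors,
    # or is currently live with exactly 2.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}
```

The appeal as a test case is exactly this: the low-level rule is trivial, so any difficulty in finding natural abstractions is attributable to the abstraction machinery rather than the environment.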
On the theory side, progress has zoomed along surprisingly quickly despite spending less time on it than I expected as of late last year. The Basic Foundations sequence is the main publicly-visible artifact of that progress so far; behind the scenes I’ve also continued to streamline the math of natural abstraction, and lately I’ve been working to better unify it with thermodynamic-style arguments and phase changes. (In particular, my current working hypothesis is that grokking is literally a phase change in the thermodynamic sense, induced by coupling to the environment via SGD. On that hypothesis, understanding how such coupling-induced phase changes work is the main next step to mapping net-internal structures to natural abstractions in the environment. But that’s the sort of hypothesis which could easily go out the window in another few weeks.) The main high-level update from the theory work is that, while getting abstraction across the theory-practice gap continues to be difficult, basically everything else about agent foundations is indeed way easier once we have a decent working operationalization of abstraction.
So I’ve spent less time than previously expected both on theory and on crossing the theory-practice gap. Where did all that time go?
First, conferences and workshops. I said “yes” to basically everything in the first half of 2022, and in hindsight that was a mistake. Now I’m saying “no” to most conferences/workshops by default.
Second, training people (mostly in the MATS program), and writing up what I’d consider relatively basic intro-level arguments about alignment strategies which didn’t have good canonical sources. In the coming year, I’m hoping to hand off most of the training work; at this point I think we have a scalable technical alignment research training program which at least picks the low-hanging fruit (relative to my current ability to train people). In particular, I continue to be optimistic that (my version of) the MATS program shaves at least 3 years off the time it takes participants to get past the same first few bad ideas which everyone has and on to doing potentially-useful work.
What’s the current status of your work on natural abstractions?
In need of a writeup. I did finally work out a satisfying proof of the maxent form for natural abstractions on Bayes nets, and it seems like every week or two I have an interesting new idea for a way to use it. Writing up the proofs as a paper is currently on my todo list; I’m hoping to nerd-snipe some researchers from the complex systems crowd.
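For readers unfamiliar with the term: the generic maximum-entropy result in the background here is that, among all distributions matching given feature expectations, entropy is maximized by an exponential-family distribution. (This is only the standard textbook fact; the specific maxent theorem for natural abstractions on Bayes nets is the thing awaiting writeup.)

```latex
% Among all distributions P satisfying E_P[f_j(X)] = c_j for features
% f_1, \dots, f_m, entropy H(P) is maximized by the exponential-family form
P^*(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big( \sum_{j} \lambda_j f_j(x) \Big),
\qquad
Z(\lambda) \;=\; \sum_{x} \exp\!\Big( \sum_{j} \lambda_j f_j(x) \Big),
% with the multipliers \lambda_j chosen so that E_{P^*}[f_j(X)] = c_j holds.
```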
Getting it across the theory-practice gap remains the next major high-level step. The immediate next step is to work out and implement the algorithms implied by the maxent form.