Most People Start With The Same Few Bad Ideas

Epistemic status: lots of highly subjective and tentative personal impressions.

Occasionally people say “hey, alignment research has lots of money behind it now, why not fund basically everyone who wants to try it?”. Often this involves an analogy to venture capital: alignment funding is hits-based (i.e. the best few people are much more productive than everyone else combined), and funders aren’t actually that good at picking the future hits in advance, so what we want is a whole bunch of uncorrelated bets.

The main place where this fails, in practice, is the “uncorrelated” part. It turns out that most newcomers to alignment have the same few Bad Ideas.

The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.

People who are less aware of standard alignment arguments tend to start with “train an AI on human feedback” or “iterate until the problems go away”. In the old days, pre-Sequences, people started from even worse ideas; at least the waterline has risen somewhat.

People with more of a theory bent or an old-school AI background tend to reinvent IRL (inverse reinforcement learning) or CIRL (cooperative inverse reinforcement learning) variants. (A CIRL variant was my own starting Bad Idea—you can read about it in this post from 2020, although the notes from which that post was written were from about 2016-2017.)

My impression (based on very limited data) is that it takes most newcomers ~5 years to go from their initial Bad Idea to actually working on something plausibly useful. For lack of a better name, let’s call that process the Path of Alignment Maturity.

My impression is that progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans—e.g. the builder/breaker framework from the Eliciting Latent Knowledge doc, or some version of the Alignment Game Tree exercise, or having a group of people who argue and poke holes in each others’ plans. (Of course these all first require not being too emotionally attached to your own plan; it helps a lot if you can come up with a second or third line of attack, thereby building confidence that there’s something else to move on to.)

It can also be accelerated by starting with some background knowledge of difficult problems adjacent to alignment/agency—I notice philosophers tend to make unusually fast progress down the Path that way, and I think prior experience with adjacent problems also cut about 3-4 years off the Path for me. (To be clear, I don’t necessarily recommend that as a strategy for a newcomer—I spent ~5 years working on agency-adjacent problems before working on alignment, and that only cut ~3-4 years off my Path of Alignment Maturity. That wasn’t the only alignment-related value I gained from my background knowledge, but the faster progress down the Path was not worthwhile on its own.)

General background experience/knowledge about the world also helps a lot—e.g. I expect someone who’s founded and worked at a few startups will make faster progress than someone who’s only worked at one big company, and either of those will make faster progress than someone who’s never been outside of academia.

On the flip side, I expect that progress down the Path of Alignment Maturity is slower for people who spend their time heads-down in the technical details of a particular approach, and less time reflecting on whether it’s the right approach at all or arguing with people who have very different models. I’d guess this is especially a problem for people at orgs whose alignment work is focused on specific agendas—e.g. I’d guess progress down the Path is slower at Redwood or OpenAI, but faster at Conjecture or DeepMind (because those orgs have a relatively high variety of alignment models internally, as I understand it).

I think accelerating newcomers’ progress down the Path of Alignment Maturity is one of the most tractable places where community builders and training programs can add a lot of value. I’ve been training about a dozen people through the MATS program this summer, and I currently think accelerating participants’ progress down the Path has been the biggest success. We had a lot of content aimed at that: the Alignment Game Tree, two days of the “train a shoulder John” exercise plus a third day of the same exercise with Eliezer, the less formal process of people organized into teams kicking ideas around and arguing with each other, and of course general encouragement to pivot to new problems and strategies (which most people did multiple times).

Overall, my very tentative and subjective impression is that the program shaved ~3 years off the median participant’s Path of Alignment Maturity; they seem to me to be coming up with project ideas about on par with those of a typical person 3 years further along. The shoulder John/Eliezer exercises were relatively costly and I don’t think most groups should try to duplicate them, but other than those I expect most of the MATS content can scale quite well, so in principle it should be possible to do this with a lot more people.