Most People Start With The Same Few Bad Ideas

Epistemic status: lots of highly subjective and tentative personal impressions.

Occasionally people say “hey, alignment research has lots of money behind it now, why not fund basically everyone who wants to try it?”. Often this involves an analogy to venture capital: alignment funding is hits-based (i.e. the best few people are much more productive than everyone else combined), and funders aren’t actually that good at picking the future hits in advance, so what we want is a whole bunch of uncorrelated bets.

The main place where this fails, in practice, is the “uncorrelated” part. It turns out that most newcomers to alignment have the same few Bad Ideas.

The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.

People who are less aware of standard alignment arguments tend to start with “train an AI on human feedback” or “iterate until the problems go away”. In the old days, pre-Sequences, people started from even worse ideas; at least the waterline has risen somewhat.

People with more of a theory bent or an old-school AI background tend to reinvent IRL (inverse reinforcement learning) or CIRL (cooperative inverse reinforcement learning) variants. (A CIRL variant was my own starting Bad Idea—you can read about it in this post from 2020, although the notes from which that post was written were from about 2016-2017.)

My impression (based on very limited data) is that it takes most newcomers ~5 years to go from their initial Bad Idea to actually working on something plausibly useful. For lack of a better name, let’s call that process the Path of Alignment Maturity.

My impression is that progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans—e.g. the builder/breaker framework from the Eliciting Latent Knowledge doc, or some version of the Alignment Game Tree exercise, or having a group of people who argue and poke holes in each others’ plans. (Of course these all first require not being too emotionally attached to your own plan; it helps a lot if you can come up with a second or third line of attack, thereby building confidence that there’s something else to move on to.)

It can also be accelerated by starting with some background knowledge of difficult problems adjacent to alignment/agency—I notice philosophers tend to make unusually fast progress down the Path that way, and I think prior experience with adjacent problems also cut about 3-4 years off the Path for me. (To be clear, I don’t necessarily recommend that as a strategy for a newcomer—I spent ~5 years working on agency-adjacent problems before working on alignment, and that only cut ~3-4 years off my Path of Alignment Maturity. That wasn’t the only alignment-related value I gained from my background knowledge, but the faster progress down the Path was not worthwhile on its own.)

General background experience/knowledge about the world also helps a lot—e.g. I expect someone who’s founded and worked at a few startups will make faster progress than someone who’s only worked at one big company, and either of those will make faster progress than someone who’s never been outside of academia.

On the flip side, I expect that progress down the Path of Alignment Maturity is slower for people who spend their time heads-down in the technical details of a particular approach, and less time reflecting on whether it’s the right approach at all or arguing with people who have very different models. I’d guess this is especially a problem for people at orgs whose alignment work is focused on specific agendas—e.g. I’d guess progress down the Path is slower at Redwood or OpenAI, but faster at Conjecture or DeepMind (because those orgs have a relatively high variety of alignment models internally, as I understand it).

I think accelerating newcomers’ progress down the Path of Alignment Maturity is one of the most tractable places where community builders and training programs can add a lot of value. I’ve been training about a dozen people through the MATS program this summer, and I currently think accelerating participants’ progress down the Path has been the biggest success. We had a lot of content aimed at that: the Alignment Game Tree, two days of the “train a shoulder John” exercise plus a third day of the same exercise with Eliezer, the less formal process of people organized into teams kicking ideas around and arguing with each other, and of course general encouragement to pivot to new problems and strategies (which most people did multiple times).

Overall, my very tentative and subjective impression is that the program shaved ~3 years off the median participant’s Path of Alignment Maturity; they seem to me to be coming up with project ideas about on par with those of a typical person 3 years further along. The shoulder John/Eliezer exercises were relatively costly and I don’t think most groups should try to duplicate them, but other than those I expect most of the MATS content can scale quite well, so in principle it should be possible to do this with a lot more people.