Looking back on my alignment PhD

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, Libsyn, and more.

My dissertation. It’s long, so if you’re going to read anything from it, read Chapter 0 (Introduction).

The funny thing about long periods of time is that they do, eventually, come to an end. I’m proud of what I accomplished during my PhD. That said, I’m going to first focus on mistakes I’ve made over the past four[1] years.


I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD.

Social dynamics distracted me from my core mission

I focused on “catching up” to other thinkers

I figured this point out by summer 2021.

I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge.

But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself: it is how you weigh evidence, and it is how you decide where to look next when puzzling out a new area.

If I had been more honest with myself, I could have nipped the “catching up with other thinkers” mistake in the bud in 2018. I could have removed the bad mental habits using certain introspective techniques, or at least been aware of the badness.

But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.

But there is not yet any “rationality tech tree”, no succession of well-defined rationality skills such that you can learn them in order and grow way stronger. Like, you can’t just do the calibration exercises, and then the noticing-confusion exercises, and then other things. Those tools help, but they aren’t enough. There won’t be a clear and immediate direction of improvement, at first. But you may want to get stronger anyways.

I focused on seeming smart and defensible

I figured this point out this spring.

When I started working on alignment, I didn’t know what to do at first, and I felt insecure about my credentials. As far as I remember, I figured I’d start off by becoming respected, since other people’s feedback was initially a better guide than my own taste. Unfortunately, I didn’t realize how deeply and subtly this goal would grow its roots.

I worried about upvotes, I worried about winning arguments, I worried about being defensible against criticism. I was so worried that someone would comment on one of my posts and tear everything down, because I hadn’t been careful enough, because I had left myself open by not dotting all my ’i’s. (Not that anyone has ever done that on LessWrong before...)

I think it was this year that I had my (second) “oh man, don’t forget the part where everyone is allowed to die to AI” moment. To illustrate the new mindset this gut-realization gave me, I’ll detail a recent decision with social consequences, and then compare the old and the new mindsets.

A few months back, Quintin Pope approached me with (what he claimed to be) a new alignment paradigm, which blossomed from asking the following kind of questions:

We clearly prefer future AIs to generalize in the way that neuroscientists generalize, so it seems worthwhile to ask: “why don’t neuroscientists wirehead themselves?”

It’s clearly not because humans evolved away from wireheading, specifically. There are somewhat similar situations to wireheading in the ancestral environment: psychoactive drugs, masturbation, etc. Is the reason we don’t wirehead because evolution instilled us with an aversion to manipulating our reward function, which then zero-shot generalized to wireheading, despite wireheading being so wildly dissimilar to the contents of the ancestral environment? How could evolution have developed an alignment approach that generalized so well?

After a few days, I realized my gut expectations were that he was broadly correct and that this theory of alignment could actually be right. However, I realized I wasn’t consciously letting myself think that because it would be Insufficiently Skeptical to actually think the alignment problem is solvable. This seemed obviously stupid to me, so I quickly shut that line of thinking down and second-order updated towards optimism so that I would stop predictably getting more optimistic about Quintin’s theory.[2]

I realized I assigned about 5% credence to “this line of thinking marks a direct and reasonably short path to solving alignment.” Thus, on any calculation of benefits and harms, I should be willing to stake some reputation to quickly get more eyeballs on the theory, even though I expected to end up looking a little silly (with about 95% probability). With my new attitude, I decided “whatever, let’s just get on with it and stop wasting time.”
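The arithmetic behind that decision can be sketched as a simple expected-value comparison. This is a minimal illustration only: the function names and the payoff units below are hypothetical, and the only number taken from the text is the 5% credence.

```python
def expected_value(p_right, benefit_if_right, cost_if_wrong):
    """Expected net payoff of staking reputation on an idea."""
    return p_right * benefit_if_right - (1 - p_right) * cost_if_wrong


def break_even_ratio(p_right):
    """How many times larger the benefit must be than the cost to break even."""
    return (1 - p_right) / p_right


p = 0.05  # credence from the text
# Hypothetical units: looking silly costs 1, being right is worth 1000.
print(expected_value(p, benefit_if_right=1000, cost_if_wrong=1))
print(break_even_ratio(p))
```

At 5% credence, the bet is worth taking whenever the payoff from being right is more than about 19 times the cost of looking silly.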

The old “don’t leave any avenue of being criticized!” attitude would have been less loyal to my true beliefs: “This could work, but there are so many parts I don’t understand yet. If I figure those parts out first, I can explain it better and avoid having to go out on a limb in the process.” Cowardice and social anxiety, dressed up as prudence and skepticism.

I still get anxious around disagreements with people I respect. I am still working on fully expunging the “defensibility” urges, because they suck. But I’ve already made a lot of progress.[3]

Too much deference, too little thinking for myself

I realized and started fixing this mistake this spring. (Seeing a pattern?)

I filtered the world through a status lens. If I read a comment from a high-status person, I would gloss over confusing parts, because I was probably the one reading it wrong. Sure, I would verbally agree that modest epistemology is unproductive. I just happened to not think thoughts like “[high-status person]’s claim seems obviously dumb and wrong.”

Now I let myself think thoughts like that, and it’s great. For example, last week I was reading about Pavlov’s conditioning experiments with dogs. I read the following:

Pavlov (1902) started from the idea that there are some things that a dog does not need to learn. For example, dogs don’t learn to salivate whenever they see food. This reflex is ‘hard-wired’ into the dog.

I thought, “that seems like bullshit. Really, the dogs are hard-wired to salivate when they see food? Doesn’t that require hard-wiring a food-classifier into the dog’s brain?!”

And you know what? It was bullshit. I searched for about 8 minutes before finding references to the original lectures Pavlov gave:

Dr. Zitovich took several young puppies away from their mother and fed them for considerable time only on milk. When the puppies were a few months old he established fistulae of their salivary ducts, and was thus able to measure accurately the secretory activity of the glands. He now showed these puppies some solid food—bread or meat—but no secretion of saliva was evoked.

Our world is so inadequate that seminal psychology experiments are described in mangled, misleading ways. Inadequacy abounds, and status only weakly tracks adequacy. Even if the high-status person belongs to your in-group. Even if all your smart friends are nodding along.

Would you notice if this very post were inadequate and misleading? Would it be bullshit for the dog-genome to hardwire a food-classifier? Think for yourself. Constant vigilance!

Non-social mistakes

I thought about comfortable, familiar problems

I figured this point out this spring, because I bumped into Quintin as described above.

I remember a sunny summer day in 2019, sitting in the grass with Daniel Filan at UC Berkeley. He recommended putting together an end-to-end picture of the alignment problem. I remember feeling pretty uncomfortable about that, feeling that I wouldn’t understand which alignment problems go where in my diagram (“do embedded agency failures crop up here, or there?”). Wouldn’t it just make more sense to read more alignment papers and naturally refine those views over time?

This was a rationalization, plain and simple. There is no point where you feel ready to put all the pieces together. If you feel totally comfortable about how alignment fits together such that Daniel’s exercise does not push you on some level, we have either already solved the alignment problem, or you are deluded.

I did not feel ready, and I was not ready, and I should have done it anyways. But I focused on more comfortable work with well-defined boundaries, because it felt good to knock out new theorems, and so I stayed in my alignment comfort zone. Whether or not those theorems were useful and important to alignment, that was a mistake. I should have stopped working on impact measures and power-seeking way earlier than I did, even though I did end up doing some cool work.

Not admitting to myself that I thought alignment was doomed

Figured this out this spring. I’m not sure if I’ve fixed the general error yet.

After I became more optimistic about alignment due to having a sharper understanding of the overall problem and of how human values formed to begin with, I also became more pessimistic about other approaches, like IDA/ELK/RRM/AUP/[anything else with a three-letter acronym]. But my new understanding didn’t seem to present any specific objections. So why did I suddenly feel worse about these older ideas?

I suspect that part of the explanation is: I hadn’t wanted to admit how confused I was about alignment, and I (implicitly) clung to “but it could work”-style hopefulness. But once I had a different reason to hope, resting upon a more solid and mechanistic understanding, it was apparently emotionally safe for me to admit I didn’t have much hope at all for the older approaches.


If that’s what happened, I was seriously deluding myself. I will do better next time.

I viewed my life through narratives

I probably figured this point out in 2021.

Back in 2018, I had the “upstart alignment researcher” narrative—starting off bright-eyed and earnest, learning a lot, making friends. But then I hurt my hands and couldn’t type anymore, which broke the narrative. I felt dejected—to slightly exaggerate, I felt I had fallen off of the sunlit path, and now nothing was going to go as it should.

Another example of narrative-thinking is when people say “I’m just not a math person.” This is an inference and a story they tell themselves. Strictly speaking, they may not know much math, and they may not enjoy math, and they may not see how to change either of those facts. But the narrative is that they are not a math person. Their discomfort and their aversion-to-trying stem not just from their best-guess assessment of their own weaknesses, but from a story they are living in.

Every moment is an opportunity for newly-directed action. Keep your identity small and keep the narratives in the story-books. At least, if you want to use narratives, carefully introspect to make sure you’re using them, and they aren’t using you.

Other helpful habits I picked up

I’m not really sure where these two habits go, so I’ll put them here. I wish I’d had these skills in 2018.

  • Distinguish between observations and inferences. When people speak to you, mark their arguments as observations or as inferences. Keep the types separate. I’ve gained so much from this simple practice.

    Here are two cases I’ve recently found where people seem to mistake the folk wisdom for observation:

    • “People often say they’re afraid to die” is an observation, and “people are hard-wired to be afraid of death” is an inference.

    • “I often feel ‘curiosity’ and some kind of exploration-impulse” is an observation, and “people are innately curious” is an inference.

  • Be concrete. My friend Kurt remarks that I constantly ask for examples.

    • If a friend comes to me for advice and says “I’m terrible at dating, I just feel so shy!”, I could say “You’re really fun to be around, you’re probably just in your head too much”, and then they could say “Agh, maybe, but it’s just so frustrating.” Wouldn’t that just be such a useful conversation for them? That’ll definitely solve their awkwardness!

      • Alternatively, if I ask for an example, we can both analyze an event which actually happened. Perhaps they say, “I met a girl named Alice at the party, but I somehow ran out of things to say, and it got quiet, and we found excuses to part ways.” Then I can help my friend introspect and figure out why they didn’t have anything to say, which is in fact a question with a real answer.

    • The general rhythm is: Bind your thinking to coherent scenarios (preferably ones which actually happened, like meeting a girl named Alice), so that you (and possibly other people) can explore the details together (like why it got quiet) in order to figure out what to change (like running mock encounters to shoo away the social anxiety).

    • On the other hand, if you can’t think of a concrete example to ground your airy words, maybe your thinking is totally untethered from reality. Maybe your assumptions are contradictory and you can’t even see it.

      • Here’s something I recently said on Discord:

        “If there are some circuits who can defer to the market prediction, then each circuit can get their coalitional contribution as their fixed weight. This lets some relatively simpler circuits retain weight. At least, those are the abstract words I want to say, but now I feel confused about how to apply that to a concrete example for how e.g. a shallow but broad ‘don’t steal’ value negotiates via Critch-bargaining. Not being able to give a concrete example means I don’t really know what I’m talking about here.”

    • Don’t tell me how your alignment strategy will e.g. “faithfully reproduce human judgments.” Explain what concrete benefits you hope to realize, and why “faithful reproduction of human judgments” will realize those benefits.

      • If the actual answer is that you don’t know, then just say it, because it’s the truth. Be aware that you don’t know.

To close out the “Mistakes” section, I mostly wish I’d expected more from myself. I wish I’d believed myself capable of building an end-to-end picture of the alignment problem, of admitting what I didn’t know and what I hadn’t thought about, of being able to survive/ignore the harsh winds of criticism and skepticism.

I did these things eventually, though, and I’m proud of that.

What I’m proud of

  1. I didn’t keep working on computational chemistry. Boy howdy, would that have been awful for me. Thank you, TurnTrout!

    1. I remember thinking “You know what, I’d rather get expelled than not do [the 2018 CHAI internship].” This thought gave me the courage to find a new advisor who would let me work on AI safety, funding be damned.

    2. I’m not a natural nonconformist. Conflict makes me nervous. I’ve had to work for it.

  2. I learned a lot of math, even though I felt sheepish and insecure about it at first.

  3. I think I ended up achieving rationality escape velocity.

    1. When I get stuck / feel depressed, errors get thrown, exception-handling activates, I start thinking “these thoughts seem unreasonably dark; my cognition is compromised; have I eaten enough food today, have I drunk enough water, should I call a friend...”.

    2. When I get stuck on a problem (e.g. what is the type signature of human values?), I do not stay stuck. I notice I am stuck, I run down a list of tactics, I explicitly note what works, I upweight that for next time.

    3. When I realize I’ve been an idiot about something (e.g. nicking my hand with a knife, missing a deadline), I stop and think “wow, that was stupid, what’s the more general error I’m making?”

    4. The general rhythm is: I feel agentic and capable and self-improving, and these traits are strengthening over time, as is the rate of strengthening.

    5. This definitely didn’t have to happen, but I made it happen (with the help of some friends and resources).

  4. Research achievements:

    1. I think Reframing Impact correctly inferred our intuitions around what “impact” means, and also that sequence was beautiful and I loved making it.

    2. My dissertation is also beautiful. I painstakingly wrote and formatted and edited it, even hiring a professional to help out. I fought to keep its tone focused on what matters: the sharp dangers of AGI.

    3. I likewise poured myself into Optimal Policies Tend To Seek Power, and its follow-up, Parametrically Retargetable Decision-Makers Tend To Seek Power.

      1. First, I had felt instrumental convergence should be provable and formally understandable. It was a mystery to me in 2019, and now it’s not.

      2. Second, I used to suck at writing academic papers, but I managed to get two NeurIPS spotlights by the end of my program. NeurIPS spotlights might not save the world, but that was tough and I did a good job with it.

    4. Attainable utility preservation is pointless for AGI alignment, but damn is it cool that we could do unsupervised learning to get a reward function, preserve the agent’s ability to optimize that single random objective, and just get cautious behavior in complicated environments.

Looking forward

Leaving Oregon was a bit sad, but coming to Berkeley is exciting. I’ll be starting my CHAI postdoc soon. I’m working with lots of cool, smart, loyal friends. I’m feeling strong and confident and relatively optimistic, both about alignment and about my personal future.

Here’s to winning. 🥂

  1. ^

    My PhD was six years long (it started in the fall of 2016). However, I’m not even going to critique the first two years, because that would make the “Mistakes” section far too long.

  2. ^

    If you’re interested in reading about the theory now, see this recent comment. I’m currently putting together some prerequisite posts to bridge the inferential gap.

  3. ^

    Sometimes I feel the urge to defend myself just a little more, to which some part of me internally replies “are you serious, this defensibility thing again?! Are you ever going to let me actually think?”

    I like that part of me a lot.