Swimming Upstream: A Case Study in Instrumental Rationality

One data point for careful planning, the unapologetic pursuit of fulfillment, and success. Of particular interest to up-and-coming AI safety researchers, this post chronicles how I made a change in my PhD program to work more directly on AI safety, overcoming significant institutional pressure in the process.

It’s hard to believe how much I’ve grown and grown up in these last few months, and how nearly every change was born of deliberate application of the Sequences.

  • I left a relationship that wasn’t right.

  • I met reality without flinching: the specter of an impossible, unfair challenge; the idea that everything and everyone I care about could actually be in serious trouble should no one act; the realization that people should do something [1], and that I am one of those people (are you?).

  • I attended a CFAR workshop and experienced incredible new ways of interfacing with myself and others. This granted me such superpowers as (in ascending order): permanent insecurity resolution, figuring out what I want from major parts of my life and finding a way to reap the benefits with minimal downside, and having awesome CFAR friends.

  • I ventured into the depths of my discomfort zone, returning with the bounty of a new love: a new career.

  • I followed that love, even at risk of my graduate career and tens of thousands of dollars of loans. Although the decision was calculated, you better believe it was still scary.

I didn’t sacrifice my grades, work performance, physical health, or my social life to do this. I sacrificed something else.

CHAI For At Least Five Minutes

January-Trout had finished the Sequences and was curious about getting involved with AI safety. Not soon, of course—at the time, I had a narrative in which I had to labor and study for long years before becoming worthy. To be sure, I would never endorse such a narrative—Something to Protect, after all—but I had it.

I came across several openings, including a summer internship at Berkeley’s Center for Human-Compatible AI. Unfortunately, the posting indicated that applicants should have a strong mathematical background (uh) and that a research proposal would be required (having come to terms with the problem mere weeks before, I had yet to read a single result in AI safety).

OK, I’m really skeptical that I can plausibly compete this year, but applying would be a valuable information-gathering move with respect to where I should most focus my efforts.

I opened Concrete Problems in AI Safety, saw 29 pages of reading, had less than 29 pages of ego to deplete, and sat down.

This is ridiculous. I’m not going to get it.
… You know, this would be a great opportunity to try for five minutes.

At that moment, I lost all respect for these problems and set myself to work on the one I found most interesting. I felt the contours of the challenge take shape in my mind, sensing murky uncertainties and slight tugs of intuition. I concentrated, compressed, and compacted my understanding until I realized what success would actually look like. The idea then followed trivially [2].

Reaching the porch of my home, I turned to the sky made iridescent by the setting sun.

I’m going to write a post about this at some point, aren’t I?


This idea is cool, but it’s probably secretly terrible. I have limited familiarity with the field and came up with it after literally twenty minutes of thinking? My priors say that it’s either already been done, or that it’s obviously flawed.

Terrified that this idea would become my baby, I immediately plotted its murder. Starting from the premise that it was insufficient even for short-term applications (not even in the limit), I tried to break it with all the viciousness I could muster. Not trusting my mind to judge sans rose-color, I coded and conducted experiments; the results supported my idea.

I was still suspicious, and from this suspicion came many an insight; from these insights, newfound invigoration. Being the first to view the world in a certain way isn’t just a rush—it’s pure joie de vivre.

Risk Tolerance

I’m taking an Uber with Anna Salamon back to her residence, and we’re discussing my preparations for technical work in AI safety. With one question, she changes the trajectory of my professional life:

Why are you working on molecules, then?

There’s the question I dare not pose, hanging exposed, in the air. It scares me. I acknowledge a potential status quo bias, but express uncertainty about my ability to do anything about it. To be sure, that work is important and conducted by good people whom I respect. But it wasn’t right for me.

We reach her house and part ways; I now find myself in an unfamiliar Berkeley neighborhood, the darkness and rain pressing down on me. There’s barely a bar of reception on my phone, and Lyft won’t take my credit card. I just want to get back to the CFAR house. I calm my nerves (really, would Anna live somewhere dangerous?), absent-mindedly searching for transportation as I reflect. In hindsight, I felt a distinct sense of avoiding-looking-at-the-problem, but I was not yet strong enough to admit even that.

A week later, I get around to goal factoring and internal double cruxing this dilemma.

Litany of Tarski, OK? There’s nothing wrong with considering how I actually feel. Actually, it’s a dominant strategy, since the value of information is never negative [3]. Look at the thing.

I realize that I’m out of alignment with what I truly want—and will continue to be for four years if I do nothing. On the other hand, my advisor disagrees about the importance of preparing safety measures for more advanced agents, and I suspect that they would be unlikely to support a change of research areas. I also don’t want to just abandon my current lab.

I’m a second-year student. Am I even able to do this? What if no professor is receptive to this kind of work? If I don’t land after I leap, I might have to end my studies and/or accumulate serious debt, as I would be leaving a paid research position without any promise whatsoever of funding after the summer. What if I’m wrong, or being impulsive and short-sighted?

Soon after, I receive CHAI’s acceptance email, surprise and elation washing over me. I feel uneasy; it’s very easy to be reckless in this kind of situation.

Information Gathering

I knew the importance of navigating this situation optimally, so I worked to use every resource at my disposal. There were complex political and interpersonal dynamics at play here; although I consider myself competent in these considerations, I wanted to avoid even a single preventable error.

Who comes to mind as having experience and/or insight on navigating this kind of situation? This list is incomplete; whom can I contact to expand it?

I contacted friends on the CFAR staff, interfaced with my university’s confidential resources, and reached out to contacts I had made in the rationality community. I posted to the CFAR alumni Google group, receiving input from AI safety researchers around the world, both at universities and at organizations like FLI and MIRI [4].

What obvious moves can I make to improve my decision-making process? What would I wish I’d done if I just went through with the switch now?
  • I continued a habit I have cultivated since beginning the Sequences: gravitating towards the arguments of intelligent people who disagree with me, and determining whether they have new information or perspectives I have yet to properly consider. What would it feel like to be me in a world in which I am totally wrong?

    • Example: while reading the perspectives of attendees of the ’17 Asilomar conference, I noticed that Dan Weld said something I didn’t agree with. You would not believe how quickly I clicked his interview.

  • I carefully read the chapter summaries of Decisive: How to Make Better Choices in Life and Work (having read the book in full earlier this year in anticipation of this kind of scenario).

  • I did a pre-mortem: “I’ve switched my research to AI safety. It’s one year later, and I now realize this was a terrible move—why?”, taking care of the few reasons which surfaced.

  • I internal double cruxed fundamental emotional conflicts about what could happen, about the importance of my degree to my identity, and about the kind of person I want to become.

    • I prepared myself to lose, mindful that the objective is not to satisfy that part of me which longs to win debates. I also kept idea inoculation and status differentials in mind.

  • I weighed the risks in my mind, squaring my jaw and mentally staring at each potential negative outcome.

Gears Integrity

At the reader’s remove, this choice may seem easy. Obviously, I meet with my advisor (whom I still admire, despite this specific disagreement), tell them what I want to pursue, and then make the transition.

Sure, gears-level models take precedence over expert opinion. I have a detailed model of why AI safety is important; if I listen carefully and then verify the model’s integrity against the expert’s objections, I should have no compunctions about acting.

I noticed a yawning gulf between privately disagreeing with an expert, disagreeing with an expert in person, and disagreeing with an expert in person in a way that sets back my career if I’m wrong. Clearly, the outside view is that most graduate students who have this kind of professional disagreement with an advisor are mistaken and, later, regretful [5]. Yet, argument screens off authority, and

You have the right to think.
You have the right to disagree with people where your model of the world disagrees.
You have the right to decide which experts are probably right when they disagree.
You have the right to disagree with real experts that all agree, given sufficient evidence.
You have the right to disagree with real honest, hardworking, doing-the-best-they-can experts that all agree, even if they wouldn’t listen to you, because it’s not about whether they’re messing up.


Many harrowing days and nights later, we arrive at the present, concluding this chapter of my story. This summer, I will be collaborating with CHAI, working under Dylan Hadfield-Menell and my new advisor to extend both Inverse Reward Design and Whitelist Learning (the latter being my proposal to CHAI; I plan to make a top-level post in the near future) [6].


I sacrificed some of my tethering to the social web, working my way free of irrelevant external considerations, affirming to myself that I will look out for my interests. When I first made that affirmation, I felt a palpable sense of relief. Truly, if we examine our lives with seriousness, what pressures and expectations bind us to arbitrary social scripts, to arbitrary identities—to arbitrary lives?

[1] My secret to being able to continuously soak up math is that I enjoy it. However, it wasn’t immediately obvious that this would be the case, and only the intensity of my desire to step up actually got me to start studying. Only then, after occupying myself in earnest with those pages of Greek glyphs, did I realize that it’s fun.

[2] This event marked my discovery of the mental movement detailed in How to Dissolve It; it has since paid further dividends in both novel ideas and clarity of thought.

[3] I’ve since updated away from this being true for humans in practice, but I felt it would be dishonest to edit my thought process after the fact.

Additionally, I did not fit any aspect of this story to the Sequences post factum; every reference was explicitly considered at the time (e.g., remembering that specific post on how people don’t usually give a serious effort even when everything may be at stake).

[4] I am so thankful to everyone who gave me advice. Summarizing for future readers:

If you’re navigating this situation, are interested in AI safety but want some direction, or are looking for a community to work with, please feel free to contact me.

[5] I’d like to emphasize that support for AI safety research is quickly becoming more mainstream in the professional AI community, and may soon become the majority position (if it is not already).

Even though ideas are best judged by their merits and not by their popular support, it can be emotionally important in these situations to remember that if you are concerned, you are not on the fringe. For example, 1,273 AI researchers have publicly declared their support for the Future of Life Institute’s AI principles.

A survey of AI researchers (Muller & Bostrom, 2014) finds that on average they expect a 50% chance of human-level AI by 2040 and 90% chance of human-level AI by 2075. On average, 75% believe that superintelligence (“machine intelligence that greatly surpasses the performance of every human in most professions”) will follow within thirty years of human-level AI. There are some reasons to worry about sampling bias based on e.g. people who take the idea of human-level AI seriously being more likely to respond (though see the attempts made to control for such in the survey) but taken seriously it suggests that most AI researchers think there’s a good chance this is something we’ll have to worry about within a generation or two.
AI Researchers on AI Risk (2015)

[6] Objectives are subject to change.