Swimming Upstream: A Case Study in Instrumental Rationality

One data point for careful planning, the unapologetic pursuit of fulfillment, and success. Of particular interest to up-and-coming AI safety researchers, this post chronicles how I made a change in my PhD program to work more directly on AI safety, overcoming significant institutional pressure in the process.

It’s hard to believe how much I’ve grown and grown up in these last few months, and how nearly every change was born of deliberate application of the Sequences.

  • I left a relationship that wasn’t right.

  • I met reality without flinching: the specter of an impossible, unfair challenge; the idea that everything and everyone I care about could actually be in serious trouble should no one act; the realization that people should do something [1], and that I am one of those people (are you?).

  • I attended a CFAR workshop and experienced incredible new ways of interfacing with myself and others. This granted me such superpowers as (in ascending order): permanent insecurity resolution, figuring out what I want from major parts of my life and finding a way to reap the benefits with minimal downside, and having awesome CFAR friends.

  • I ventured into the depths of my discomfort zone, returning with the bounty of a new love: a new career.

  • I followed that love, even at risk of my graduate career and tens of thousands of dollars of loans. Although the decision was calculated, you better believe it was still scary.

I didn’t sacrifice my grades, work performance, physical health, or my social life to do this. I sacrificed something else.

CHAI For At Least Five Minutes

January-Trout had finished the Sequences and was curious about getting involved with AI safety. Not soon, of course—at the time, I had a narrative in which I had to labor and study for long years before becoming worthy. To be sure, I would never endorse such a narrative—Something to Protect, after all—but I had it.

I came across several openings, including a summer internship at Berkeley’s Center for Human-Compatible AI. Unfortunately, the posting indicated that applicants should have a strong mathematical background (uh) and that a research proposal would be required (having come to terms with the problem mere weeks before, I had yet to read a single result in AI safety).

OK, I’m really skeptical that I can plausibly compete this year, but applying would be a valuable information-gathering move with respect to where I should most focus my efforts.

I opened Concrete Problems in AI Safety, saw 29 pages of reading, had less than 29 pages of ego to deplete, and sat down.

This is ridiculous. I’m not going to get it.
… You know, this would be a great opportunity to try for five minutes.

At that moment, I lost all respect for these problems and set myself to work on the one I found most interesting. I felt the contours of the challenge take shape in my mind, sensing murky uncertainties and slight tugs of intuition. I concentrated, compressed, and compacted my understanding until I realized what success would actually look like. The idea then followed trivially [2].

Reaching the porch of my home, I turned to the sky made iridescent by the setting sun.

I’m going to write a post about this at some point, aren’t I?


This idea is cool, but it’s probably secretly terrible. I have limited familiarity with the field and came up with it after literally twenty minutes of thinking? My priors say that it’s either already been done, or that it’s obviously flawed.

Terrified that this idea would become my baby, I immediately plotted its murder. Starting from the premise that it was insufficient even for short-term applications (not even in the limit), I tried to break it with all the viciousness I could muster. Not trusting my mind to judge sans rose-color, I coded and conducted experiments; the results supported my idea.

I was still suspicious, and from this suspicion came many an insight; from these insights, newfound invigoration. Being the first to view the world in a certain way isn’t just a rush—it’s pure joie de vivre.

Risk Tolerance

I’m taking an Uber with Anna Salamon back to her residence, and we’re discussing my preparations for technical work in AI safety. With one question, she changes the trajectory of my professional life:

Why are you working on molecules, then?

There’s the question I dare not pose, hanging exposed in the air. It scares me. I acknowledge a potential status quo bias, but express uncertainty about my ability to do anything about it. To be sure, that work is important and conducted by good people whom I respect. But it wasn’t right for me.

We reach her house and part ways; I now find myself in an unfamiliar Berkeley neighborhood, the darkness and rain pressing down on me. There’s barely a bar of reception on my phone, and Lyft won’t take my credit card. I just want to get back to the CFAR house. I calm my nerves (really, would Anna live somewhere dangerous?), absent-mindedly searching for transportation as I reflect. In hindsight, I felt a distinct sense of avoiding-looking-at-the-problem, but I was not yet strong enough to admit even that.

A week later, I get around to goal factoring and internal double cruxing this dilemma.

Litany of Tarski, OK? There’s nothing wrong with considering how I actually feel. Actually, it’s a dominant strategy, since the value of information is never negative [3]. Look at the thing.

I realize that I’m out of alignment with what I truly want—and will continue to be for four years if I do nothing. On the other hand, my advisor disagrees about the importance of preparing safety measures for more advanced agents, and I suspect that they would be unlikely to support a change of research areas. I also don’t want to just abandon my current lab.

I’m a second-year student—am I even able to do this? What if no professor is receptive to this kind of work? If I don’t land after I leap, I might have to end my studies and/or accumulate serious debt, as I would be leaving a paid research position without any promise whatsoever of funding after the summer. What if I’m wrong, or being impulsive and short-sighted?

Soon after, I receive CHAI’s acceptance email, surprise and elation washing over me. I feel uneasy; it’s very easy to be reckless in this kind of situation.

Information Gathering

I knew the importance of navigating this situation optimally, so I worked to use every resource at my disposal. There were complex political and interpersonal dynamics at play; although I consider myself competent in these matters, I wanted to avoid even a single preventable error.

Who comes to mind as having experience and/or insight on navigating this kind of situation? This list is incomplete—whom can I contact to expand it?

I contacted friends on the CFAR staff, interfaced with my university’s confidential resources, and reached out to contacts I had made in the rationality community. I posted to the CFAR alumni Google group, receiving input from AI safety researchers around the world, both at universities and at organizations like FLI and MIRI [4].

What obvious moves can I make to improve my decision-making process? What would I wish I’d done if I just went through with the switch now?
  • I continued a habit I have cultivated since beginning the Sequences: gravitating towards the arguments of intelligent people who disagree with me, and determining whether they have new information or perspectives I have yet to properly consider. What would it feel like to be me in a world in which I am totally wrong?

    • Example: while reading the perspectives of attendees of the ’17 Asilomar conference, I noticed that Dan Weld said something I didn’t agree with. You would not believe how quickly I clicked his interview.

  • I carefully read the chapter summaries of Decisive: How to Make Better Choices in Life and Work (having read the book in full earlier this year in anticipation of this kind of scenario).

  • I did a pre-mortem: “I’ve switched my research to AI safety. It’s one year later, and I now realize this was a terrible move—why?”, taking care of the few reasons which surfaced.

  • I internal double cruxed fundamental emotional conflicts about what could happen, about the importance of my degree to my identity, and about the kind of person I want to become.

    • I prepared myself to lose, mindful that the objective is not to satisfy that part of me which longs to win debates. Also, idea inoculation and status differentials.

  • I weighed the risks in my mind, squaring my jaw and mentally staring at each potential negative outcome.

Gears Integrity

At the reader’s remove, this choice may seem easy. Obviously, I meet with my advisor (whom I still admire, despite this specific disagreement), tell them what I want to pursue, and then make the transition.

Sure, gears-level models take precedence over expert opinion. I have a detailed model of why AI safety is important; if I listen carefully and then verify the model’s integrity against the expert’s objections, I should have no compunctions about acting.

I noticed a yawning gulf between privately disagreeing with an expert, disagreeing with an expert in person, and disagreeing with an expert in person in a way that sets back my career if I’m wrong. Clearly, the outside view is that most graduate students who have this kind of professional disagreement with an advisor are mistaken and, later, regretful [5]. Yet argument screens off authority, and

You have the right to think.
You have the right to disagree with people where your model of the world disagrees.
You have the right to decide which experts are probably right when they disagree.
You have the right to disagree with real experts that all agree, given sufficient evidence.
You have the right to disagree with real honest, hardworking, doing-the-best-they-can experts that all agree, even if they wouldn’t listen to you, because it’s not about whether they’re messing up.


Many harrowing days and nights later, we arrive at the present, concluding this chapter of my story. This summer, I will be collaborating with CHAI, working under Dylan Hadfield-Menell and my new advisor to extend both Inverse Reward Design and Whitelist Learning (the latter being my proposal to CHAI; I plan to make a top-level post in the near future) [6].


I sacrificed some of my tethering to the social web, working my way free of irrelevant external considerations, affirming to myself that I will look out for my interests. When I first made that affirmation, I felt a palpable sense of relief. Truly, if we examine our lives with seriousness, what pressures and expectations bind us to arbitrary social scripts, to arbitrary identities—to arbitrary lives?

[1] My secret to being able to continuously soak up math is that I enjoy it. However, it wasn’t immediately obvious that this would be the case, and only the intensity of my desire to step up actually got me to start studying. Only then, after occupying myself in earnest with those pages of Greek glyphs, did I realize that it’s fun.

[2] This event marked my discovery of the mental movement detailed in How to Dissolve It; it has since paid further dividends in both novel ideas and clarity of thought.

[3] I’ve since updated away from this being true for humans in practice, but I felt it would be dishonest to edit my thought process after the fact.

Additionally, I did not fit any aspect of this story to the Sequences post factum; every reference was explicitly considered at the time (e.g., remembering that specific post on how people don’t usually give a serious effort even when everything may be at stake).

[4] I am so thankful to everyone who gave me advice. Summarizing for future readers:

If you’re navigating this situation, are interested in AI safety but want some direction, or are looking for a community to work with, please feel free to contact me.

[5] I’d like to emphasize that support for AI safety research is quickly becoming more mainstream in the professional AI community, and may soon become the majority position (if it is not already).

Even though ideas are best judged by their merits and not by their popular support, it can be emotionally important in these situations to remember that if you are concerned, you are not on the fringe. For example, 1,273 AI researchers have publicly declared their support for the Future of Life Institute’s AI principles.

A survey of AI researchers (Muller & Bostrom, 2014) finds that on average they expect a 50% chance of human-level AI by 2040 and 90% chance of human-level AI by 2075. On average, 75% believe that superintelligence (“machine intelligence that greatly surpasses the performance of every human in most professions”) will follow within thirty years of human-level AI. There are some reasons to worry about sampling bias based on e.g. people who take the idea of human-level AI seriously being more likely to respond (though see the attempts made to control for such in the survey) but taken seriously it suggests that most AI researchers think there’s a good chance this is something we’ll have to worry about within a generation or two.
AI Researchers on AI Risk (2015)

[6] Objectives are subject to change.
