Thomas Kwa’s MIRI research experience

  • Moderator note: the following is a dialogue using LessWrong’s new dialogue feature. The exchange is not completed: new replies might be added continuously, the way a comment thread might work. If you’d also be excited about finding an interlocutor to debate with, dialogue with, or be interviewed by, fill in this dialogue matchmaking form.

    Hi Thomas, I’m quite curious to hear about your research experience working with MIRI. To get us started: When were you at MIRI? Who did you work with? And what problem were you working on?

  • I was at MIRI Sept 2022 to Sept 2023, which was full-time from Sept 2022 to March 2023 and 1/4 time afterwards. The team was Vivek Hebbar, Peter Barnett and me initially; we brought on James Lucassen and Thomas Larsen from MATS around January, and Jeremy Gillen a bit after that. We were loosely mentored by Nate Soares, and we spent 1 week out of 6 talking to him for like 4 hours a day, which I think was a pretty strange format.

    We were working on a bunch of problems, but the overall theme/​goal started off as “characterize the sharp left turn” and evolved into getting fundamental insights about idealized forms of consequentialist cognition. For most of the time, we would examine a problem or angle, realize that it was intractable or not what we really wanted, and abandon it after a few days to months.

  • What’s your overall take on how it went? How happy are you with it? Were there any major takeaways?

  • I’m pretty underwhelmed with our output, and the research process was also frustrating. But I don’t want to say I got zero output, because I definitely learned to think more concretely about cognition—this was the main skill Nate was trying to teach.

    I think the project did not generate enough concrete problems to tackle, and also had an unclear theory of change. (I realized the first one around March, which is why I went part-time.) This meant we had a severe lack of feedback loops. The fault probably lies with a combination of our inexperience, Nate’s somewhat eccentric research methodology, and difficulty communicating with Nate.

  • Is there an obvious thing you wished was different?

  • Also, some more specific questions that come up are:

    • What were some difficulties communicating with Nate?

    • Is there anything you can say about Nate’s research methodology or is that mostly private?

  • Oh, a bunch of things I wish were different, I’ll just give a list. (After we wrote this, Nate gave some important context which you can find below).


    • Nate is very disagreeable and basically changes nothing about his behavior for social reasons. This made it overall difficult to interact with him.

    • There’s this thing Nate and Eliezer do where they proclaim some extremely nonobvious take about alignment, say it in the same tone they would use to declare that grass is green, and don’t really explain it. (I didn’t talk to Eliezer much but I’ve heard this from others too.)

    • Nate thinks in a different ontology from everyone else, and often communicates using weird analogies.

      • e.g. he pointed out a flaw in some of our early alignment proposals, where we train the AI to have some property relevant to alignment, but it doesn’t learn it well. He communicated this using the IMO unhelpful analogy that you could train me with the outer objective of predicting what chess move Magnus Carlsen would make, and I wouldn’t be able to actually beat chess grandmasters.

    • When Nate thinks you don’t understand something or have a mistaken approach, he gets visibly distressed and sad. I think this conditioned us to express less disagreement with him. I have a bunch of disagreements with his world model, and could probably be convinced of his position on like 1/3 of them, but I’m too afraid to bring them all up, and if I did he’d probably stop talking to me out of despair anyway.

    • The structure where we would talk to Nate 4h/​day for one out of every ~6 weeks was pretty bad for feedback loops. A short meeting every week would have been better, but Nate said this would be more costly for him.

    • In my frustration at the lack of concrete problems I asked Nate what research he would approve of outside of the main direction. We thought of two ideas: getting a white-box system to solve Winograd schemas, and understanding style transfer in neural nets. I worked on these on and off for a few months without much progress, then went back to Nate to ask for advice. Nate clarified that he was not actually very excited about these directions himself, and it was more like “I don’t see the relevance here, but if you feel excited by these, I could see this not being totally useless”. (Nate gives his perspective on this below).


    • Someone on the project should have had more research experience. Peter has an MS, but it’s in physics; no one else had more than 1.5 years and many of us had only done SERI MATS.

    • I wish we were like, spoon-fed concrete computer science problems. This is not very compatible with the goal of the project, though, which was deconfusion.

    • Nate wouldn’t tell us about most of his object-level models, because they were adjacent to capabilities insights. Nate thought that there were some benefits to sharing with us, but if we were good at research we’d be able to do good work without it, and the probability of our succeeding was low enough that the risk-reward calculation didn’t pan out.

    • I think we were overly cautious with infosec. The model was something like: Nate and Eliezer have a mindset that’s good for both capabilities and alignment, and so if we talk to other alignment researchers about our work, the mindset will diffuse into the alignment community, and thence to OpenAI, where it would speed up capabilities. I think we didn’t have enough evidence to believe this, and should have shared more.

      • Because we were stuck, the infohazards are probably not severe, and the value of mentorship increases.

      • I think it’s unlikely that we’d produce a breakthrough all at once. Therefore object level infohazards could be limited by being public until we start producing impressive results, then going private.

      • We talked to Buck recently and he pointed out enough problems with our research and theory of change that I would have left the project earlier if I’d heard them earlier. (The main piece of information was that our ideas weren’t sufficiently different from other CS people’s that there would be lots of low-hanging fruit to pluck.)

    • Some of our work was about GOFAI (symbolic reasoning systems) and we could have engaged with the modern descendants of the field. We looked at SAT solvers enough to implement CDCL from scratch and read a few papers, but could have looked at other areas like logic. I had the impression that most efforts at GOFAI failed, but I didn’t and still don’t have a good idea why.
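    For readers who haven’t seen it: the backbone of CDCL is the older DPLL procedure (backtracking search interleaved with unit propagation); CDCL adds conflict-driven clause learning on top. Here is a minimal DPLL-style sketch in Python, purely illustrative and not the implementation mentioned above:

```python
# Minimal DPLL SAT solver: the backtracking-plus-unit-propagation core
# that CDCL extends with conflict-driven clause learning.
# Clauses are lists of nonzero ints (DIMACS style: -3 means "not x3").

def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses.
    Returns simplified clauses, or None on conflict."""
    changed = True
    while changed:
        changed = False
        simplified = []
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue  # clause already satisfied
            remaining = [lit for lit in clause if -lit not in assignment]
            if not remaining:
                return None  # every literal falsified: conflict
            if len(remaining) == 1:
                assignment.add(remaining[0])  # forced assignment
                changed = True
            simplified.append(remaining)
        clauses = simplified
    return clauses

def dpll(clauses, assignment=None):
    """Return a satisfying set of literals, or None if unsatisfiable."""
    assignment = set() if assignment is None else set(assignment)
    clauses = unit_propagate(clauses, assignment)
    if clauses is None:
        return None
    if not clauses:
        return assignment  # every clause satisfied
    lit = clauses[0][0]  # naive branching choice
    for choice in (lit, -lit):
        result = dpll(clauses, assignment | {choice})
        if result is not None:
            return result
    return None
```

    A real CDCL solver replaces the naive branching and chronological backtracking here with learned conflict clauses, watched-literal propagation, and non-chronological backjumping.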

  • Gotcha.

    I have more object-level questions, but maybe the immediate next question is something like “how are you feeling about this?”

  • I’m also curious about: why did you join MIRI in the first place? To what extent were these things a surprise to you, and to what extent did you expect them?

  • I feel annoyed or angry I guess? When I reflect I don’t have much resentment towards Nate though, he just has different models and a different style than most researchers and was acting rationally given those. I hope I’m not misrepresenting anything for emotional reasons, and if I am we can go back and fix afterwards.

    Edit 9/​26: I still endorse everything above.

  • I maybe have two lines of thoughts when I read this.

    One is something like “yeah, that kinda matches other things I’ve heard about Nate/​MIRI, and it seems moderately likely to me that something should change there”. But, all the conversations I’ve had about it had a similar confidentiality-cloud around them, so I don’t know many details, and wouldn’t be very confident about what sort of changes would actually be net beneficial.

    Another train of thought is something like “what if you guys had just… not had Nate at all? Would that have been better or worse? Can you imagine plans where actually you guys just struck out on your own as an independent alignment research team?”

  • One of my models is that “Whelp, people are spiky. Often, the things that are inconvenient about people are tightly entwined with the things that are valuable about them. Often people can change, but not in many of the obvious ways you might naively think.” So I’m sort of modeling Nate (or Eliezer, although he doesn’t sound as relevant) as sort of a fixed cognitive resource, without too much flexibility on how they can deploy that resource.

  • I joined MIRI because I was impressed with Vivek, and was betting on a long tail outcome from his collaboration with Nate. This could have been a few things...

    • Maybe everyone trying to understand cognition (capabilities and alignment) is screwing up, and we can just do it properly and build an aligned AGI, or get >20% of the way there.

    • Maybe everyone in alignment is neglecting certain failure modes Nate understood, e.g. what would eventually become Nate’s post on Deep Deceptiveness, or the sharp left turn. It turned out that ARC was already thinking about these (ontology translation problems, inner search, etc.).

    • Maybe we could publish a thing distilling Nate’s views, like Shard Theory in Nine Theses. This would be because Nate is bad at communication (see e.g. the MIRI dialogues failing to resolve a bunch of disagreements).

    • Maybe working with Nate would make us much better at alignment research (empirical or conceptual). This is the closest to being true, and I feel like I have a much better view of the problem than a year ago (although there is the huge confounder of the whole field having more concrete research directions than ever before)

    Basically I wanted to speed up whatever Vivek was doing here, and early on I had lots of good conversations with him. (I was pretty excited about this despite Nate saying his availability for mentorship would be limited, which he also touches on below).

    As for whether we’d have done better without Nate: ex ante, definitely no. Understanding Nate’s research style and views, and making a bet on it, was the entire point of the project. There would have been nothing separating us from other junior researchers, and agent foundations is hard enough that we would have been really unlikely to succeed.

  • Ok, maybe a few conversational tracks that seem interesting to me are:

    • Is there more stuff you generally wish people knew about your experience, or about MIRI?

    • …something like “what life lessons can you learn here, that don’t depend so much on changing MIRI/​Nate?”

    • Any object level meta-research-process stuff you learned that seems interesting?

    • Any thoughts on… I dunno, basically anything not covered by the first three categories, but maybe a more specific option is “is there a problem here that you feel sort of confused about that you could use help thinking through?”

  • One big takeaway is that general research experience, that you gain over time, is a real thing. I’d heard this from others like Mark Xu, but I actually believe it in my heart now.

    Furthermore, I think there are some research domains that are more difficult than others just based on how hard it is to come up with a line of attack on a problem that’s both tractable and relevant. If you fail at either of these then your research is useless. Several times we abandoned directions because they would be intractable or irrelevant—and I feel we learned very little from these experiences. It was super unforgiving and I now think this is a domain where you need way more experience to have a hope of succeeding.

  • I’m reminded about a scene in Oppenheimer. General Groves is pushing hard for compartmentalisation—a strategy of, as I understood it, splitting the project into pieces where a person could work hard on one piece, but still lack enough other clues to piece together dangerous details about the plan as a whole (that could then presumably have been leaked to an adversary).

    But at one point the researchers rebel against this by hosting cross-team seminars. They claim they can’t think clearly about the problem if they’re not able to wrangle the whole situation in their heads. Groves reluctantly accedes, but imposes limits on how many researchers can attend those seminars. (If I recall correctly.)

    This is similar to complaints I’ve heard from past MIRI employees. It seemed pretty sensible to me for management to want to try a structure similar to the Manhattan Project’s (not sure if directly inspired) to help with info-hazards, but I also imagine that, as a direct report, it was pretty tricky to be asked to build Widget X without an understanding of what Widget X would ultimately be used for, and the tradeoff space surrounding how it was supposed to work.

    I’m curious whether this dynamic also matches experiences you had at MIRI, and if you have any thoughts on it.

  • I have also heard this kind of thing from past MIRI employees. But I think our situation was different; we didn’t have concrete pieces of the project that management had sliced off for us. Instead we had to, basically from scratch, come up with a solution to some chunk of the problem, without any object-level models.

  • Right. So given that, and insofar as you can share this without violating any confidentiality agreements, I am interested in how infohazard-prevention was implemented in your case?

  • There was one info bubble for the project. Ideas that Nate shared with us would not go outside this bubble without his approval. Our own ideas would be shared at our discretion. I’m not sure that the project would have been more successful if we had had full access to all of Nate’s models, but it seems more likely.

  • So things Nate shared with you would not go outside the bubble without his permission. But given how open-ended the domain was, I’m confused how you made that delineation. If Nate said a thing that was a kind of a weird, abstract frame, and you then used that frame to have a derivative insight about agent foundations stuff… some of the insight was due to Nate and some due to your own reasoning… How do you know what you can and cannot share?

  • Yeah this was a problem. Since the entire point of the project was absorbing Nate’s mindset, and all the info-hazards would come from this mindset diffusing towards frontier labs, the sharing situation was really difficult. I think this really disincentivized us from sharing anything, because before doing so we would have to have a long conversation about exactly how much of the insights touched on which ideas, and which other ideas they were descended from, and so on. This was pretty bad.

  • Okay, so you say in the past tense that this disincentivized you from sharing anything. How, if at all, do you feel it’s affecting you right now? Do parts of your cognition still feel tied up in promises of confidentiality, in handicapping ways? Or do the promises feel siloed to what you worked on back then, with fewer implications these days?

  • Since we tried for a year and haven’t produced any impressive research I’m less concerned about sharing most things now.

  • Switching tracks a bit… is there a question you kind of wish we would ask?

  • Yes. Here are some:

    What are you doing next?

    • Trying to find projects with better feedback loops. I’m most excited about interpretability and control, where there has been an explosion of really good work lately. I think my time at MIRI gave me a desire to work on problems that are clean in a computer science way and also tractable, and not as conceptually fucked as “abstractions”, “terminal values”, etc. I want to work my way up towards mildly confused concepts like “faithfulness”, “mild optimization”, or “lost purposes”.

      I’m actively looking for projects especially with people who have some amount of experience (a PhD in something computer science related, or two papers at top ML conferences). If this is you and you want to work with me, send me a DM.

    Who should work at MIRI? (already touched on but still could answer)

    • As far as I know MIRI is not hiring for the specific kind of theory work we were doing. But in a similar situation to mine, you should have a few years of research experience, have read the MIRI curriculum, and ideally start off with a better understanding of Nate’s worldview and failure modes than I did. It might help to have deep experience in other areas that aren’t in the MIRI curriculum, like modern ML or RL or microeconomics or something, for the chance to bring in new ideas. I don’t know what the qualifications are for other roles like engineering.

    Any object level insights about alignment from the last year?

    • A couple of recurring themes were dealing with reflection /​ metacognition, and ensuring that a system’s ontology continues to mean what you want it to mean. I wasn’t super involved but I think these killed a bunch of proposals.

  • Do you endorse them killing those proposals? Did they feel like important dealbreakers, or just tangential fixations?

  • They’re key to the problem of creating an agent that’s robustly pointed at something, which I think is important (in the EA sense) but maybe not tractable. Superintelligences will surely have some kind of metacognition, and it seems likely that early AGIs will too, and metacognition adds so many potential failure modes. I think we should solve as many subproblems as tractable, even if in practice some won’t be an issue or we’ll be able to deal with them on the fly.

    Solving the full problem despite reflection /​ metacognition seems pretty out of reach for now. In the worst case, if an agent reflects, it can be taken over by subagents, refactor all of its concepts into a more efficient language, invent a new branch of moral philosophy that changes its priorities, or a dozen other things. There’s just way too much to worry about, and the ability to do these things is—at least in humans—possibly tightly connected to why we’re good at science. Maybe in the future, we discover empirically that we can make assumptions about what form reflection takes, and get better results.

  • One of my major thought-processes right now is “How can we improve feedback loops for ‘confusing research’?”. I’m interested both in improving feedback for ‘training’ (i.e. so you’re gaining skills faster/more reliably, but not necessarily outputting an object-level finished product that’s that valuable), and in improving feedback for your outputs, so your on-the-job work can veer closer to your ultimate goal.

    I think it’s useful to ask “how can we make feedback-loops a) faster, b) less noisy, c) richer, i.e. give you more data per unit-time.”

    A few possible followup questions from here are:

    • Do you have any general ideas of how to improve research feedback loops? Is there anything you think you could have done differently last year in this department?

    • What subskills do you think Alignment Research requires? Are there different subskills for Agent Foundations research?

    • It seems right now you’re interested in “let’s work on some concrete computer science problems, where the ‘deconfusion/specification’ part is already done.” A) Double-checking: is that a good summary? B) While I have some guesses, I thought it’d be helpful to spell out why this seems helpful, or “the right next step”.

    Do you have any thoughts on any of those?

  • Yeah this seems a bit meta, but I have two thoughts:

    First, I have a take on how people should approach preparadigmatic research, which I wrote up in a shortform here. When I read Kuhn’s book, my takeaway was that paradigms are kind of fake. They’re just approaches that catch on because they have been proven to solve lots of people’s problems—including people with initially different approaches. So I think that people should try to solve the problems that they think are important based on their view, and if this succeeds just try to solve other people’s problems, and not worry too much about “confusing” research or “preparadigmatic” research.

    A big component of Nate’s deconfusion research methodology was to take two intuitions in tension—an intuition that X is possible, and an intuition that X is impossible—and explore concrete cases and try to prove theorems until you resolve this tension. I talked to Lawrence Chan, and he said that CHAI people did something similar with inverse RL. The tensions there were the impossibility theorems about inferring utility functions from behavior, and the fact that in practice I could look at your demonstrations and get a good idea of your preferences. Now inverse RL didn’t go anywhere, but I feel this validates the basic idea. Still, in CHAI’s case, they exited the deconfusion stage fairly quickly and got to doing actual CS theory and empirical work. At MIRI we never got there.

  • Is that because CHAI was better at deconfusion, or the problem was simpler, or they just… moved on faster (and maybe were still somewhat confused?)

  • I don’t know the details, but my guess is the problem was simpler.

  • My current line-of-inquiry, these past 2 months, is about “okay, what do you do when you have an opaque, intractable problem? Are there generalizable skills on how to break impossible problems down?”

    A worry I have is that people see alignment theoretical work and go “well this just seems fucked”, and then end up doing machine learning research that’s basically just capabilities (or, less ‘differentially favoring alignment’) because it feels easier to get traction on.

  • I don’t have much actual machine learning experience, but from talking to dozens of people at ICML, my view is that if the average ICML author reached this kind of opaque intractable problem, they would just give up and do something else that would yield a paper. Obviously in alignment our projects are constrained by our end goal, but I think it’s still not crazy to give up and work on a different angle. Sometimes the problem is actually too hard, and even if the actual problem that you need to solve to align the AI is easier, we don’t know what simplifying assumptions to make; this might depend on theoretical advances we don’t yet have, or on knowing more about the AGI architecture than we currently do.

    As for people doing easier and more capabilities relevant work because the more alignment-relevant problems are too hard, this could just be a product of their different worldviews; maybe they think alignment is easier.

    Here’s a mental model I have about this:

    AGI capabilities are on the x-axis, and alignment progress on the y-axis. Some research agendas proceed like the purple arrow (lots of capabilities externalities), whereas others proceed like the tiny black arrow.

    If your view is that the difficulty of alignment is the dotted line, pursuing the purple arrow will look much better, as it will reach alignment before it reaches AGI. However, if you believe the difficulty is the upper solid black line, the purple approach is net negative, since it differentially moves us closer to pure capability than to alignment.
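    One way to make this toy picture concrete (all numbers below are invented for illustration): treat each agenda as a per-unit-effort vector of (capability gained, alignment gained), and ask which threshold it crosses first.

```python
# Toy numeric version of the capabilities-vs-alignment picture.
# An agenda "wins" if it accumulates the needed alignment progress
# before pushing capabilities past the AGI threshold.

def agenda_outcome(step, alignment_needed, agi_capability=10.0):
    """step = (capability, alignment) gained per unit of research effort."""
    cap = align = 0.0
    while True:
        cap += step[0]
        align += step[1]
        if align >= alignment_needed:
            return "aligned first"
        if cap >= agi_capability:
            return "AGI first"

purple = (1.0, 0.8)   # steep arrow: large capability externalities
black  = (0.1, 0.3)   # shallow arrow: slow but low-externality

# If alignment is easy (low dotted line), the purple agenda looks fine...
assert agenda_outcome(purple, alignment_needed=5.0) == "aligned first"
# ...but if alignment is hard (upper line), it mostly buys capabilities.
assert agenda_outcome(purple, alignment_needed=20.0) == "AGI first"
assert agenda_outcome(black, alignment_needed=20.0) == "aligned first"
```

    The disagreement between the two worldviews is entirely about where the alignment-difficulty bar sits, not about the arrows themselves.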

    If it’s ok, I’ll now add other members of the team so they can share their experiences.

  • I’d like to push back somewhat against the vibe of Thomas Kwa’s experience.

    I joined the team in May 2023, via working on similar topics in a project with Thomas Larsen. My experience was strongly positive; I’ve learned more in the last six months than in any other part of my life. On the other hand, by the standard of actually solving any part of alignment, the project was a failure. More research experience would have been valuable for us, but this is always true. It does seem like communication problems between Nate and the team slowed down the project by ~2x, but from my perspective it looked like Nate did a good job given his goals.

    The infosec requirements seemed basically reasonable to me. The mentorship Nate gave us on research methodology in the last few months was incredibly valuable (basically looking at our research and pointing out when it was going off the rails, different flags and pointers for avoiding similar mistakes, etc.).

    I found Thomas Kwa’s frustration at the lack of concrete problems to be odd, because a large part of the skill we were trying to learn was the skill of turning vague conceptual problems/​confusion into concrete problems. IMO we had quite a number of successes at this (although of course the vast majority of the time is not spent doing the fun work of solving the concrete problems, and also the majority of the time the direction doesn’t lead somewhere useful).

    I agree with Thomas that we did make a number of research mistakes along the way that slowed us down, and one of them was sometimes not looking at past literature enough. But there were also times where I spent too much time going through irrelevant papers and should have just worked on the problem from scratch, so I think at least later on I mostly got the balance right.

    My plan for the next couple of months is a) gaining research experience on as-similar-as-possible problems b) writing up and posting parts of my models I consider valuable (and safe), and c) exploring two research problems that came up last month that still seem promising to me.

    The problems in alignment that seem highest value for me to continue to work on seem to be the “conceptually fucked” ones, mostly because they still seem neglected (where one central example IMO is formalizing abstractions). I am wary of getting stuck on intractable problems forever and think it’s important to recognize when no progress is being made (by me) and move on to something more valuable. I don’t think I’ve reached that point yet for the problems I’m working on, but the secondary goal of (a) is to test my own competence at a particular class of problems so I can better know whether to give up.

  • Thanks, appreciate you adding your perspective here @Jeremy. I found this bit pretty interesting:

    My plan for the next couple of months is a) gaining research experience on as-similar-as-possible problems b) writing up and posting parts of my model I consider valuable, and c) exploring two research problems that came up last month that seem promising to me.

    The problems in alignment that seem highest value for me to continue to work on seem to be the “conceptually fucked” ones, mostly because they still seem neglected. I am wary of getting stuck on intractable problems forever and think it’s important to recognize when no progress is being made (by me) and move on to something more valuable. I don’t think I’ve reached that point yet for the problems I’m working on, but the secondary goals of (a) is to test my own competence at a particular class of problems so I can better know whether to give up.

    I’d be interested in hearing more details about this. A few specific things:

    My plan [...] is gaining research experience on as-similar-as-possible problems.

    Could you say more about what sort of similar problems you’re thinking of?

    And here:

    is to test my own competence at a particular class of problems so I can better know whether to give up.

    Do you have a sense of how you’d figure out when it’s time to give up?

  • Could you say more about what sort of similar problems you’re thinking of?

    I only made a list of these the other day, so I don’t have much detail on this. But here are some categories from that list:

    • Soundness guarantees for optimization and inference algorithms, and an understanding of the assumptions needed to prove these.

    • Toy environments where we have to specify difficult-to-observe goals for a toy-model-agent inside that environment, without cheating by putting a known-to-me-correct-ontology into the agent and specifying the goal directly in terms of that ontology.

    • Problems similar to formalizing heuristic arguments.

    • Toy problems that involve relatively easy types of self reference.

    I think experience with these sorts of problems would have been useful for lots of sub-problems we ran into. Some of them are a little idiosyncratic to me, others on the team disagree about their relevance. I haven’t turned these categories into specific problems yet.

    Do you have a sense of how you’d figure out when it’s time to give up?

    I think mostly I judge by how excited I am about the value of the research even after internally simulating outside view perspectives, or arguing with people who disagree, and trying to sum it all together. This is similar to the advice Nate gave us a couple of weeks ago, to pursue whatever source of hope feels the most real.

  • I found Thomas Kwa’s frustration at the lack of concrete problems to be odd, because a large part of the skill we were trying to learn was the skill of turning vague conceptual problems/​confusion into concrete problems.

    I think the lack of concrete problems bothers me for three reasons:

    • If our goal is to get better at deconfusion, we’re choosing problems where the deconfusion is too hard. When trying to gain skill at anything, you should probably match the difficulty to your skill level such that you succeed >=50% of the time, and we weren’t hitting that.

    • It indicates we’re not making progress fast enough, and so the project is less good than we expected. Maybe this is due to inexperience, maybe there wasn’t anything to find.

    • It’s less fun for me.

  • If our goal is to get better at deconfusion, we’re choosing problems where the deconfusion is too hard. When trying to gain skill at anything, you should probably match the difficulty to your skill level such that you succeed >=50% of the time, and we weren’t hitting that.

    This point feels pretty central to what I was getting at with “what are the subskills, and what are the feedback loops”, btw.

    I think “actually practice the deconfusion step” would be a good thing to develop better practices around (and to, like, design a curriculum for that actually gets people to a satisfactory level on it).

  • IMO we had quite a number of successes at this (although of course the vast majority of the time is not spent doing the fun work of solving the concrete problems, and also the majority of the time the direction doesn’t lead somewhere useful).

    @Jeremy Gillen I’d be interested in hearing a couple details about some of the more successful instances according to you (in particular where you feel like you successfully deconfused yourself on a topic. And, maybe then went on to solve the concrete problem that resulted?).

    A thing I’m specifically interested in here is “What are the subskills that go into the deconfusion → concrete-problem-solution pipeline? And how can we train those subskills?”.

    The actual topics might be confidential, but curious if you could share more metadata about how the process worked and which bits felt hard.

  • I’m a bit skeptical that early-stage deconfusion is worth investing lots of resources in. There are two questions here.

    • Is getting much better at early stage deconfusion possible?

    • Is it a bottleneck for alignment research?

    I just want to get at the first question in this comment. I think deconfusion is basically an iterative process where you go back and forth between two steps, until you get enough clarity that you slowly generate theory and testable hypotheses:

    1. Generate a pair of intuitions that are in tension. [1]

    2. Poke at these intuitions by doing philosophy and math and computer science.

    I never got step 1 and don’t really know how Nate does it. Maybe Jeremy or Vivek are better at it, since they had more meetings with Nate. But it’s pretty mysterious to me what mental moves you do to generate intuitions, and when other people on the team try to share them they either seem dumb or inscrutable. The problem is that your intuitions have to be amenable to rigorous analysis and development. But I do feel a lot more competent at step 2.

One exercise we worked through was resolving free will, which is apparently an ancient rationalist tradition. Suppose that Asher drove by a cliff on their way to work but didn’t swerve. The conflicting intuitions are “Asher felt like they could have chosen to swerve off the cliff”, and “the universe is deterministic, so Asher could not have swerved off the cliff”. But to me, this felt like a confusion about the definition of the word “could”, and not some exciting conflict—it’s probably only exciting when you’re in a certain mindset. [edit: I elaborate in a comment]

    I’m not sure how logical decision theory was initially developed, but it might fit this pattern of two intuitions in tension. Early on, people like Eliezer and Wei Dai realized some problems with non-logical decision theories. The intuitions here are that desiderata like the dominance principle imply a decision theory like CDT, but CDT loses in Newcomb’s problem, which shouldn’t happen predictably if you’re actually rational. Eventually you become deconfused enough to say that “maximize utility under your priors” is a more important desideratum than “develop decision procedures that follow principles thought to be rational”. Then you just need to generate more test cases and be concrete enough to notice subtle problems. [2]
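As a toy illustration of the tension described above (my own sketch, not from the original discussion): in Newcomb’s problem with a highly accurate predictor, the dominance principle says two-boxing gains you $1,000 no matter what is in the opaque box, yet the expected-payoff arithmetic shows the predictable loss. The 0.99 accuracy and payoff numbers are the standard illustrative choices, not anything canonical.

```python
# Toy Newcomb's problem payoff calculation (illustrative numbers).
# Box A is opaque: $1,000,000 if the predictor predicted one-boxing, else $0.
# Box B is transparent: always contains $1,000.
ACCURACY = 0.99  # assumed predictor accuracy

def expected_payoff(one_box: bool) -> float:
    if one_box:
        # With prob ACCURACY the predictor foresaw one-boxing, so box A is full.
        return ACCURACY * 1_000_000 + (1 - ACCURACY) * 0
    else:
        # With prob ACCURACY the predictor foresaw two-boxing, so box A is empty.
        return ACCURACY * 1_000 + (1 - ACCURACY) * (1_000_000 + 1_000)

print(expected_payoff(True))   # one-boxing
print(expected_payoff(False))  # two-boxing
```

The two intuitions in tension are visible in the code: conditional on the predictor’s guess, two-boxing is always $1,000 better (dominance), yet one-boxing has the vastly higher expected payoff because the agent’s decision procedure is correlated with the prediction.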

IMO having some sort of exciting confusion to resolve is necessary for early-stage deconfusion research; not having one implies you have no line of attack. But it’s really unclear to me how to reliably get one. Also, the framing of “two intuitions in tension” might be overly specific.

I think step two—examining these intuitions to get observations and eventually turn them into theory—is basically just normal research, but doing this in the domain of agent foundations (AF) is harder than average because we have few examples of the systems we’re trying to study. I’m being vague and leaving things out due to uncertainty about exactly how much Nate thinks is ok to share, but my guess is that standard research advice is good: Polya’s book, Mark Xu’s post, et cetera.

    Overall it seems reasonable that people could get way better at early stage deconfusion, though there’s a kind of modesty argument to overcome.

  • Now the second question: is early-stage deconfusion a bottleneck for alignment research? I think the answer is no in AF but maybe yes in interpretability. AF is just not progressing fast enough, and I’d guess it’s not necessary for AF theory to race ahead of our ML knowledge in order to succeed at alignment.

    But even in interpretability, maybe we need to focus on getting more data (empirical feedback loops) rather than squeezing more insights out of our existing data (early-stage deconfusion). I have several reasons for this.

    • The field of ML at large is very empirical.

    • The neuroscientists I’ve talked to say that a new scanning technology that could measure individual neurons would revolutionize neuroscience, much more than a theoretical breakthrough. But in interpretability we already have this, and we’re just missing the software.

    • Eliezer has written about how Einstein cleverly used very limited data to discover relativity. But we could have discovered relativity easily if we observed not only the precession of Mercury, but also the drifting of GPS clocks, gravitational lensing of distant galaxies, gravitational waves, etc. [edit: aysja’s comment changed my mind on this]

    • I heard David Bau say something interesting at the ICML safety workshop: in the 1940s and 1950s lots of people were trying to unlock the basic mysteries of life from first principles. How was hereditary information transmitted? Von Neumann designed a universal constructor in a cellular automaton, and even managed to reason that hereditary information was transmitted digitally for error correction, but didn’t get further. But it was Crick, Franklin, and Watson who used crystallography data to discover the structure of DNA, unraveling far more mysteries. Since then basically all advances in biochemistry have been empirical. Biochemistry is a case study where philosophy and theory failed to solve the problems but empirical work succeeded, and maybe interpretability and intelligence are similar. [edit: aysja’s comment adds important context; I suggest reading this and my reply]

      • It’s a compelling idea that there are simple principles behind intelligence that we could discover using theory, but it probably would have been equally compelling to Von Neumann that there are simple principles behind biochemistry that could be discovered using theory. In reality theory did not get us nearly far enough to even start designing artificial bacteria whose descendants can only survive on cellulose (or whatever), so probably theory won’t get us far in designing artificial agents that will only ever want to maximize diamond.

  • I don’t feel like writing a long comment for this dialogue right now, but want to note that:

    • I’m still somewhat excited about the general direction I started on around the end of June (which definitely wouldn’t have happened without Nate’s mentorship), and it’s a live direction I’m working on.

    • I don’t share the sentiment of not having enough of a concrete problem. I don’t think the thing I’m working on is close to being a formally defined problem, but it still feels concrete enough to work on. I think I’m naturally more comfortable than Thomas with open-ended problems where there’s some intuitive “spirit of the problem” that’s load-bearing. It’s also the case that non-formalized problems depend on having a particular intuition, so it’s possible for one person to have enough sense of the problem to work on it but not be able to transfer that to anyone else.

  • I think my feelings about the project fall somewhere between Thomas Kwa’s and Jeremy Gillen’s. I’m pretty disappointed by our lack of progress on object-level things, but I do feel like we managed to learn some stuff that would be hard to learn elsewhere.

    Our research didn’t initially focus on trying to understand cognition, although that is where we ended up. At the start we were nominally trying to do something like “understand the Sharp Left Turn”, maybe eventually with the goal of writing up the equivalent of Risks from Learned Optimization. Initially we were mainly asking the question “in what ways do various alignment agendas break (usually due to the SLT), and how can we get around these failures?”. I think we weren’t entirely unsuccessful here, but we also didn’t understand the SLT or Nate’s models at this point and so I don’t think our work was exceptional. Here we were still thinking of things from a more “mainstream alignment” perspective.

    Following this, we started focusing on trying to understand Nate’s worldview and models. This pushed the project more in the direction of trying to understand cognition. This is because Nate (or Peter’s Straw-Nate) thinks that you need to do this in order to see failure modes, and that this is the ~only way to build a safe AGI.

    Later it seemed like Nate thought that the goal of the project from the beginning was to understand cognition, but I don’t remember this being the case. I think if we had had this frame from the start then we would have probably focused on different (and in my opinion better) things early in the project. I guess that (Straw-)Nate didn’t want to say this explicitly because he either wanted us to arrive at good directions independently as a test of our research taste or because he thinks that telling people to do things doesn’t go well. I think the research taste thing is partially fair, although it would have been good to be explicit about this. I also think that you can tell people to do things and sometimes they will do them.

    Later in the project (around June 2023) I felt like there was something of a step change, where we understood what Nate wanted from us and also started trying to learn various specific research methods. (We had also done a fair bit of learning Nate’s research methods before this.) At this point it was too little, too late though.

    I think we often lost multiple months by working on things that Nate didn’t see as promising, and I think this should have been avoidable.

    I also want to echo Thomas’s points about communication difficulties. Nate is certainly brilliant and able to think in a pretty unique way, but this often led to us talking past each other using different ontologies. I think these conversations were also just pretty uncomfortable and difficult in the ways Thomas described.

    I just made a lot of negative comments, but I do think that we (or some of us) got a lot out of the project and working with Nate. There is a pretty specific worldview and picture of the alignment problem that I don’t think you can really get from just reading MIRI’s public outputs. I think I am just much better at thinking about alignment than I was at the start, and I can (somewhat) put on my Nate-hat and be like “ahh yes, this alignment proposal will break for these reasons”. I really feel like much less of an idiot when it comes to thinking about alignment, and looking back on what I thought a year ago I can see ways in which my thinking was confused or fuzzy.

    I’m currently in the process of stepping back and trying to work out what I actually believe. I have some complicated feelings like “Yep, the project does seem to be a failure, but we do (somewhat) understand Nate’s picture of the alignment problem, and if it is correct then this is maybe the ~only kind of technical research that helps”. There’s some vibe that almost everyone else is working in some confused ontology that just can’t see the important parts. If you buy the MIRI worldview, this does mean that it is high value to work on problems that are “conceptually fucked”; but if you don’t, then I think working on these is way less valuable.

    I’m also much less pessimistic about the possibility of communicating with people than Nate/MIRI seem to be. It really does seem like if you’re correct about something, you should be able to talk to people and convince them. (“People” here being other alignment people; I’m a bit less optimistic about policy people, but not totally hopeless there.) I am currently trying to work out whether, even if we fully buy the MIRI worldview on alignment, the technical side is too hard and we should just do communications and politics instead.

    I’m also excited for all the things Jeremy said he was planning on doing for the next couple of months.

  • @Jeremy Gillen I’d be interested in hearing a couple details about some of the more successful instances according to you (in particular where you feel like you successfully deconfused yourself on a topic. And, maybe then went on to solve the concrete problem that resulted?).

    Here’s one example that I think was fairly typical, which helped me think more clearly about several things but didn’t lead to any insights that were new (for other people). I initially wanted a simple-as-possible model of an agent that could learn more accurate beliefs about the world while doing something like “keeping the same ontology” or at least “not changing what the goal means”. One model I started with was an agent built around a Bayes net model of the world. This lets it overwrite some parts of the model with a more detailed model, without affecting the goal (kinda, sometimes; there is some detail about how it helps that I’m omitting because it could get long. The simple version: in Solomonoff induction each hypothesis is completely separate, while in a Bayes net you can construct it such that models overlap and share variables).
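To make the flavor of this concrete, here is a minimal sketch of my own (the variable names and numbers are hypothetical illustrations, not the project’s actual model): a tiny two-node Bayes net where the agent can overwrite one node’s distribution with a more informed one, while the goal, a utility function over a fixed variable, is left untouched.

```python
# Hypothetical toy sketch (my construction, not the project's actual model):
# a two-node Bayes net, weather -> crop_yield, where refining the world model
# leaves the goal intact because the goal variable keeps its identity.
from itertools import product

# Coarse model: P(weather) and P(crop_yield | weather).
p_weather = {"sun": 0.6, "rain": 0.4}
p_yield_given_weather = {
    "sun": {"high": 0.8, "low": 0.2},
    "rain": {"high": 0.3, "low": 0.7},
}
# The "goal" is defined over crop_yield only, not over the upstream model.
utility = {"high": 1.0, "low": 0.0}

def expected_utility(p_w, p_y_given_w):
    # Enumerate all (weather, yield) joint outcomes and sum utility-weighted mass.
    return sum(
        p_w[w] * p_y_given_w[w][y] * utility[y]
        for w, y in product(p_w, utility)
    )

coarse = expected_utility(p_weather, p_yield_given_weather)

# "Learning": overwrite the weather node with a more informed distribution
# (say, after seeing a forecast). The utility function is untouched, so the
# goal still means the same thing in the refined model.
p_weather_refined = {"sun": 0.9, "rain": 0.1}
refined = expected_utility(p_weather_refined, p_yield_given_weather)
```

The point of the sketch is only the structural one from the paragraph above: because the nodes overlap and share variables across model versions, a sub-model can be swapped out without the goal’s referent changing, unlike the Solomonoff setting where each hypothesis is a completely separate world model.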

    There are a couple of directions we went from there, but one that Peter pointed out was “it looks like this agent can’t plan to acquire evidence”. So the next step was working through a few examples and modifying the model until it looked like it could work. This led us directly to the problem of the agent having to model itself planning in the future. So there were further iterations of proposing some patch, working through the implications in concrete problems, noticing ways the new model was broken, and iterating further. Working through that gave me a lot more understanding of and respect for the work on Vingean reflection and logical induction, because we ran into a motivation for some of that work without really intending to.

    This was typical in that it didn’t feel completely successful (there are still plenty of unpatched problems), but we did patch a few and gained a lot of understanding along the way. And the work meandered through several iterations/​branches of “vague confusion → basic model that tries to solve just that problem → more concrete model → noticing issue with the concrete model (i.e. problems that it can’t solve or ways that it isn’t clean) → modifying to satisfy one of the other constraints”.

    The main bottleneck for me was making a concrete model. Usually potential problems felt small and patchable until I made a tiny, super-concrete version of the problem and tried to work through patching it. This was hard because the model had to be simple enough to work through examples on paper while still capturing the intuitions I wanted. I expect it would have helped if I’d had a bigger library of toy models and toy problems in my head, along with experience messing around with them on paper. And building up my experience working with different types of toy models was a really valuable part of the work.

  • That’s cool.

    I’m not sure how easy this is to answer without getting extremely complicated, but I’d be interested in understanding what the model here actually “was”. Like, was it a literal computer program that you could literally run that could do simple tasks, or like a sketch of a model that could hypothetically run with infinite compute that you reasoned about, or some third thing?

  • Mostly the second thing, although most versions of it had a small world model (because that makes it easier for me to think about) so we could have actually programmed it.

  • Gotcha.

    Following up on Peter’s comment:

    At the start we were nominally trying to do something like “understand the Sharp Left Turn”, maybe eventually with the goal of writing up the equivalent of Risks from Learned Optimization. Initially we were mainly asking the question “in what ways do various alignment agendas break (usually due to the SLT), and how can we get around these failures?”.

    Did you ever write up anything close to “Risks from Learned Optimization [for ‘the Sharp Left Turn’]”?

    (It sounds like the answer is mostly “no”, but curious if there’s like an 80/20 that you could do fairly easily that’d make for a better explanation of the concept than what’s currently out there, even if it’s not as comprehensive as Risks from Learned Optimization)

  • I think the sharp left turn is not really a well-defined failure mode. It’s just the observation that under some conditions (e.g. the presence of internal feedback, or when alignment is enforced by some shallow overseer), capabilities will sometimes generalize farther than alignment, and the claim that practical AGI designs will very likely have such conditions. If generality of capabilities, dangerous capability levels, and something that breaks your safety invariants all come at the same time, alignment is harder. Fast takeoff makes it worse but is not strictly required. As for why this happens, Nate has models but I may or may not believe them, and haven’t even heard some of them due to infosec.

    As for whether we have a “Risks from Learned Optimization” for the sharp left turn, the answer is no. We thought a bit about the circumstances under which capabilities generalize farther than alignment, but I don’t think our thoughts are groundbreaking here. There are also probably points related to the sharp left turn proper that I feel we have a better appreciation of. I think we’ll write these up if we have any idea how to, or maybe it’ll be another dialogue.

    I will say that Nate’s original post on the sharp left turn now feels perfectly natural to me rather than vague and confusing, even if I do have <80% that it is basically true. (I disagree with the part where Nate claims alignment is of comparable difficulty to other scientific problems humanity has solved because I think problem difficulties are ~power-law distributed, but it’s not crazy if we have the ability to iterate on AGI designs and just need to solve the engineering problem.)

  • We wrote this dialogue without input from Nate Soares then reached out to him afterwards. I think his comment adds important context:

    at risk of giving the false impression that i’ve done more than skim the beginning of this conversation plus a smattering of the replies from different people:

    on an extremely sparse skim, the main thing that feels missing to me is context—the context was not that i was like “hey, want some mentorship?”, the context (iirc, which i may not at a year’s remove) is that vivek was like “hey, what do you think of all these attempts to think concretely about sharp left turn?” and i was like “i’m not particularly sold on any of the arguments but it’s more of an attempt at concrete thinking than i usually see” and vivek was like “maybe you should mentor me and some of my friends” and i was like “i’m happy to throw money at you but am reluctant to invest significant personal attention” and vivek was like “what if we settle for intermittent personal attention (while being sad about this)”, and we gave it a sad/​grudging shot. (as, from my perspective, explains a bunch of the suboptimal stuff. where, without that context, a bunch of the suboptimal stuff looks more like unforced errors.)

    the other note i’d maybe make is that something feels off to me also about the context of the “no concrete problems” stuff, in a way that it’s harder for me to quickly put my finger on. an inarticulate attempt to gesture at it is that it seemed to me like thomas kwa was often like “but how can i learn these abstract skills you speak of by studying GPT-2.5 small’s weights” and i was like “boy wouldn’t that be nice” and he was like “would this fake project work?” and i was like “i don’t see how, but if it feels to you like you have something to learn from doing that project then go for it (part of what i’m trying to convey here is a method of following your own curiosity and learning from the results)” and thomas was like “ok i tried it but it felt very fake” and i was like “look dude if i saw a way to solve alignment by concrete personal experiments on GPT-2.5 small i’d be doing them”

    a related inarticulate attempt is that the parts i have skimmed have caused me to want to say something like “*i’m* not the ones whose hopes were nominally driving this operation”.

    maybe that’s enough to biangulate my sense that something was missing here, but probably not /​shrug.