Alignment Research Field Guide
This field guide was written by the MIRI team with MIRIx groups in mind, though the advice may be relevant to others working on AI alignment research.
Preamble I: Decision Theory
Hello! You may notice that you are reading a document.
This fact comes with certain implications. For instance, why are you reading this? Will you finish it? What decisions will you come to as a result? What will you do next?
Notice that, whatever you end up doing, it’s likely that there are dozens or even hundreds of other people, quite similar to you and in quite similar positions, who will follow reasoning which strongly resembles yours, and make choices which correspondingly match.
Given that, it’s our recommendation that you make your next few decisions by asking the question “What policy, if followed by all agents similar to me, would result in the most good, and what does that policy suggest in my particular case?” It’s less of a question of trying to decide for all agents sufficiently-similar-to-you (which might cause you to make the wrong choice out of guilt or pressure) and more something like “if I were in charge of all agents in my reference class, how would I treat instances of that class with my specific characteristics?”
If that kind of thinking leads you to read further, great. If it leads you to set up a MIRIx chapter, even better. In the meantime, we will proceed as if the only people reading this document are those who justifiably expect to find it reasonably useful.
Preamble II: Surface Area
Imagine that you have been tasked with moving a cube of solid iron that is one meter on a side. Given that such a cube weighs ~17000 pounds, and that an average human can lift ~100 pounds, a naïve estimation tells you that you can solve this problem with ~170 willing friends.
But of course, a meter cube can fit at most something like 10 people around it. It doesn’t matter if you have the theoretical power to move the cube if you can’t bring that power to bear in an effective manner. The problem is constrained by its surface area.
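The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration: the iron density (about 7,870 kg per cubic meter) is a standard approximate figure, which puts the cube near 17,000 pounds, while the 2.5 people per meter of edge is purely an assumption chosen to match the "something like 10 people" estimate. All names are hypothetical.

```python
import math

IRON_DENSITY_KG_PER_M3 = 7870    # approximate density of iron
KG_TO_LB = 2.20462               # pounds per kilogram
LIFT_PER_PERSON_LB = 100         # figure used in the text
PEOPLE_PER_METER_OF_EDGE = 2.5   # assumption: how tightly lifters can stand


def lifters_needed(side_m: float) -> int:
    """People needed if raw lifting power were the only constraint."""
    weight_lb = IRON_DENSITY_KG_PER_M3 * side_m ** 3 * KG_TO_LB
    return math.ceil(weight_lb / LIFT_PER_PERSON_LB)


def lifters_that_fit(side_m: float) -> int:
    """People who can actually stand around the cube's four edges."""
    return math.floor(4 * side_m * PEOPLE_PER_METER_OF_EDGE)


if __name__ == "__main__":
    for side in (0.5, 1.0, 2.0):
        print(side, lifters_needed(side), lifters_that_fit(side))
```

The number of lifters needed grows with the cube of the side length, while the number who can reach the cube grows only linearly, so the gap widens as the problem gets bigger.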
MIRIx chapters are one of the best ways to increase the surface area of people thinking about and working on the technical problem of AI alignment. And just as it would be a bad idea to decree “the 10 people who happen to currently be closest to the metal cube are the only ones allowed to think about how to think about this problem”, we don’t want MIRI to become the bottleneck or authority on what kinds of thinking can and should be done in the realm of embedded agency and other relevant fields of research.
The hope is that you and others like you will help actually solve the problem, not just follow directions or read what’s already been written. This document is designed to support people who are interested in doing real groundbreaking research themselves.
You and your research
Logistics of getting started
Models of social dynamics
Other useful thoughts and questions
1. You and your research
We sometimes hear questions of the form “Even a summer internship feels too short to make meaningful progress on real problems. How can anyone expect to meet and do real research in a single afternoon?”
There’s a Zeno-esque sense in which you can’t make research progress in a million years if you can’t also do it in five minutes. It’s easy to fall into a trap of (either implicitly or explicitly) conceptualizing “research” as “first studying and learning what’s already been figured out, and then attempting to push the boundaries and contribute new content.”
The problem with this frame (according to us) is that it leads people to optimize for absorbing information, rather than seeking it instrumentally, as a precursor to understanding. (Be mindful of what you’re optimizing in your research!)
There’s always going to be more pre-existing, learnable content out there. It’s hard to predict, in advance, how much you need to know before you’re qualified to do your own original thinking and seeing, and it’s easy to Dunning-Kruger or impostor-syndrome yourself into endless hesitation or an over-reliance on existing authority.
Instead, we recommend throwing out the whole question of authority. Just follow the threads that feel alive and interesting. Don’t think of research as “study, then contribute.” Focus on your own understanding, and let the questions themselves determine how often you need to go back and read papers or study proofs.
Approaching research with that attitude makes the question “How can meaningful research be done in an afternoon?” dissolve. Meaningful progress seems very difficult if you try to measure yourself by objective external metrics. It is much easier when your own taste drives you forward.
No procedure for doing research will fit for everyone. However, what follows are steps which you can try either on your own or in a group setting (such as MIRIx) in order to practice the kind of curiosity-driven research just described.
1. Write a list of questions.
If you are doing this as a group, put the list on a whiteboard or other place where everyone can see.
Focus on what you don’t know how to do, or what you feel confused about.
If no questions come to mind, say to yourself (or the group), “excellent, I must know how to solve the whole problem” and try to give details of the solution until you get stuck.
It’s also OK for things on the list to be ideas you’d like to develop further, or thoughts you’d like the group to critique, rather than questions.
2. Choose one of the questions to focus on, based on what feels most interesting.
If you are in a group of more than three people, consider splitting the group up. Each group can discuss its own question, or have parallel discussions on the same overall question. Agree on a time to come back together and discuss what you thought about.
It can be good to keep the whole list of questions somewhere visible, so that you have a reminder of other interesting topics to switch to if thoughts peter out on the question originally chosen.
3. Clarify your curiosity. What is desired? What do you think might be possible?
In a group, usually the person who proposed a topic will have some things to say in order to get everyone on the same page.
Working on your own, it can be useful to just start writing down everything you think you know, and what you think you don’t know. Write down anything potentially relevant which comes to mind. Don’t worry initially about whether your claims are true or whether your questions are meaningful. Then, go back and try to make sense of it. Try to formalize your claims and questions until they turn into something which is definitely either true or false.
4. Keep clarifying.
Keep stating sub-questions and making claims which may or may not be true, starting informally and working towards formal rigor.
Notice where your curiosity waxes and wanes, and avoid dutiful completeness. Look for the simplest possible cases that you are still confused about, and try to work through them.
Allow yourself to get sidetracked. Allow yourself to play. So long as everyone in the discussion is curious and engaged, it’s working to build understanding. Be open to getting nerd-sniped by “irrelevant” math questions; they may eventually turn out to be more relevant than they seem. You’re building your own capability, even if it isn’t directly useful to the problem you’re working on.
If you do arrive at a concrete mathematical result which captures something interesting, or even a concrete mathematical question, write it up properly. A good write-up often adds a lot to your own understanding, besides the value of communicating your ideas to others.
This resembles how much of the progress at MIRI happens. It’s very different from the attractor of “just read lots of papers,” and it’s very different from the attractor of “try to figure out top-down what the field as a whole needs.”
An easy mistake is to think of yourself as trying to contribute to the world’s collective knowledge, and thereby neglecting to prioritize your own knowledge and understanding. “Just read papers” may sound like it’s prioritizing your own knowledge, but it often reflects a mindset that’s tacitly assuming that others know exactly what you need to know. “Optimize for your own understanding” is a mindset with a faster feedback loop.
There’s nothing inherently wrong with reading papers, even if only because you want a broad overview of the field. But throughout, you should be trying to form a picture of what you personally do and don’t know how to do, and what you’d need to know how to do in order to solve the problem. That’s hard, and maybe you’re sure that the first five ideas you write down will be wrong. Write them down anyway, and try to get them to work, so you can see what happens and discover what goes wrong.
We don’t want a hundred bright minds all asking the exact same questions, and making the exact same set of assumptions. We want a field full of explorers, not exploiters. Put another way, the best way to become a researcher is to practice the skill of independent thought right from the beginning, rather than exercising your “sit back and absorb information for its own sake” muscles.
So don’t ask “What are the open questions in this field?” Ask: “What are my questions in this field?”
2. Logistics of getting started
Let’s say you’ve tried some things that resemble the above, you enjoyed them, and you want to move forward on starting your own MIRIx chapter.
Our first recommendation is that you find ONE or TWO other people (not three+), and try doing research together once or a few times. There’s more detail below in the social dynamics section about how exactly that might look, but the idea is that you want to establish a tone and flow with a small number of people first. Negotiating a direction for the group tends to be much harder if you start with a larger number of people.
Another important choice which can be difficult to negotiate with a large number of people is schedule. Finding a time and place which is good for everyone can become intractable, and changing it from meeting to meeting to try to make it work for everyone can be de-motivating. Choose a schedule which is good for the founding core of the group. What day of the week is good for you? How often do you want to meet? How long do you want meetings to be? We recommend that meetings be weekly, every other week, or monthly. Meeting length can be anywhere from an hour to a whole day, depending on what makes sense for you.
Once you find a partner or two that you genuinely enjoy making progress with, your next step is to plan and advertise for a first large meetup (where “large” means something like “three to six new people” and definitely doesn’t mean “twenty or thirty attendees”).
Try to find a venue that is private and sound-isolated, has flat surfaces and comfortable seating, and has whiteboards on the walls. Universities often have spaces like this, as do public libraries, but someone’s living room is fine if you can minimize the number of intrusions and interruptions. If you can’t find a space with whiteboards, look for easels and easel pads, and in either case be sure to bring your own markers. Also bring along spare paper, pens, and clipboards, and assign someone to make sure that there are snacks and drinks.
(A note about snacks and drinks: people almost always underestimate the importance of the quality and quantity of food, anchoring on something like “I dunno, maybe just spend ten bucks on some chips or something?” Instead, ask yourself: what dollar value would I put on a 15% increase in the group’s ability to think, overall mood, and ultimate satisfaction with the event? That’s how much you should consider spending (or asking MIRI to spend) on snacks, especially for the first meeting. Don’t buy only junk food. It may give you more energy temporarily, but it will make you worse at thinking later. So, especially for longer meetings, healthy snacks are critical. Longer meetings should also include a meal, perhaps at a nearby restaurant. This also serves as a good break.)
At that first large meeting, you’ll want to start by formally electing a president. This is an important piece of common-knowledge culture—many times, the president won’t do much, but it’s extremely useful to have a single person with the moral authority to set agendas, choose between various good options, and keep the group on track. You may also end up electing a secretary/record-keeper, or possibly a coordinator to handle venue and food, or other offices (or you could do this after a few meetings).
Next, you’ll want to model the process that has already been working for you. Perhaps this means sharing a list of pre-existing questions, and seeing which capture the interest of your participants. Perhaps it means discussing the broader thrust of your research thus far before brainstorming some topics. Regardless, you’ll want to get down to actual thinking, writing, proving, and discussing as soon as you can. Breaking into smaller groups is often helpful if more than four people are at a meeting. If you do this, schedule a time to come back and share ideas.
Try to include breaks in your structure to keep everyone fresh. It can be difficult to remember to take a break when things get going, so it’s worth setting the intention ahead of time. Short breaks every hour in which people get up and walk around are very helpful.
It can be helpful to keep a public list (on a whiteboard or shared Google doc) of questions you have, needed concepts, and promising ideas. This is an easy source of new topics if a conversation runs dry.
One possible structure incorporating the above advice and the research procedure from the previous section:
At the beginning of each meeting, everyone lists questions/topics/confusions, which are written on a public list.
People make bids to start groups on topics they’re excited about, and split off.
Groups talk for 45 minutes.
Everyone re-gathers, and discusses what happened in the smaller groups for a few minutes.
Five or ten minute break, depending on how people are feeling.
New questions/ideas are added to the board, and the process repeats as desired. (If you plan to do several cycles, also include a longer break such as a meal somewhere.)
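For groups that like to plan on paper, the structure above can be laid out as a timed agenda. The sketch below is one possible way to do that, not a prescribed format: the 45-minute group block comes from the list above, while the opening, debrief, and break lengths are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

GROUP_MIN = 45     # from the structure described above
DEBRIEF_MIN = 10   # assumption: "a few minutes" of regrouping
BREAK_MIN = 10     # assumption: upper end of "five or ten minutes"


def build_agenda(start: datetime, cycles: int, opening_min: int = 15):
    """Return a list of (start_time, activity) pairs for one meeting."""
    agenda = []
    now = start

    def add(minutes: int, activity: str) -> None:
        nonlocal now
        agenda.append((now, activity))
        now += timedelta(minutes=minutes)

    add(opening_min, "List questions, topics, and confusions")
    for i in range(1, cycles + 1):
        add(GROUP_MIN, f"Cycle {i}: small-group discussion")
        add(DEBRIEF_MIN, f"Cycle {i}: regroup and share")
        if i < cycles:
            add(BREAK_MIN, "Break, then add new questions to the list")
    # The final entry marks when the closing item begins.
    agenda.append((now, "Schedule the next meeting"))
    return agenda


if __name__ == "__main__":
    # The date and time here are arbitrary placeholders.
    for when, what in build_agenda(datetime(2024, 1, 6, 14, 0), cycles=2):
        print(when.strftime("%H:%M"), what)
```

A longer meeting with several cycles would also want a meal break somewhere in the middle, which this sketch leaves out for brevity.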
At the end of the meeting, schedule the next event. You may have settled on a rough schedule which works for the core of the group, but you’ll still be adjusting it meeting-to-meeting to account for holidays and other absences. Confirming the next meeting time with everyone present is also important for attendance, even if the meeting times are set in stone. Make sure to establish at the outset that you’re not going to try to optimize for everyone’s availability at once; it’s good to have meetups that people feel okay skipping from time to time, as long as there’s something like 70-90% consistency in the group. If one or two people can’t make it to the second meeting, be sure to get information from them so that you can prioritize their schedules a little more when planning the third.
3. Models of social dynamics
What follows are some half-baked, ad-hoc models of what makes for a good research group, or a good collaborative enterprise in general. You should consider all of the following to be true in spirit but false in detail, and should try to derive your own value rather than treating these as actual suggestions to follow.
3A. Transmitters and receivers
We’ve found in our own research that conversations tend to go better when they are primarily between two people. This is not to say that you shouldn’t have three or more people involved in the conversation, but in any given five-minute span of time, there should mostly be just two people talking—one who is currently trying to convey something, and another who is trying to understand (and whose understanding the first is specifically optimizing for; discussing a topic at a level such that four or five different people can all follow everything is usually worse on net).
Call these two roles the “transmitter” and the “receiver.” Things you might transmit:
A specific question or confusion
A model or chain of reasoning
A piece of relevant background information that needs to be deeply understood in order for the conversation to proceed
Things the receiver might do:
Mirror back to the transmitter what the transmitter just said, in different words. This lets the transmitter check where the transmission has succeeded or failed.
Take notes on a whiteboard, or attempt to draw diagrams, and have the transmitter verify or correct them. Do this as formally as you can. Try to write down statements in logic and turn informal arguments into proofs. Type theory is good for this kind of receiving; just writing down precise data types corresponding to what’s being discussed can be very helpful.
Resist the impulse to round off what the transmitter is saying to something you already understand. A good way to guard against this: attempt to find at least two interpretations, and ask questions which differentiate between them.
Hold tight to the assumption that the thing the transmitter is trying to convey is interesting. Avoid “critic” mode that will tend to make it harder for the transmitter to think and express freely. Even if there is a fatal flaw in what the transmitter is explicitly saying, your job is to help them dig up the spark of intuition which made them go down that path, so that they can turn it into a useful idea if possible.
Stay closely in touch with confusion, and speak up where things don’t seem to make sense. Ask clarifying questions. Your job as receiver isn’t to just nod along or make the transmitter feel understood. Be gentle when necessary to help the transmitter get in touch with what they’re trying to convey; but once they’re in touch, your job is to really get it out of them, in detail!
If the transmitter’s idea seems quite clear, the receiver can start red-teaming it, which means looking for attacks to make the approach fail. Being the critic when an idea isn’t properly out yet blocks things up, but once there’s a firm proposal which seems to make sense, it’s open season.
Look for implications of what the transmitter is saying. (“Ah, so then X!”; or, “Would that mean that X?”, etc.) This serves at least three purposes. First, it lets the transmitter know that you see why their idea would be totally awesome if it worked. After all, you’re doing all these useful things with the idea. This helps keep things going. Second, it tests whether you see what they’re getting at. Third, a totally absurd implication can suggest that you’re down a wrong track and should back up to see where you took a wrong turn.
White-hat trolling or gadfly-ing. Sometimes there’s not much doing with the transmitter (or there’s no active transmitter; no one having ideas). Play the role of a mischievous Socrates. Ask questions about seemingly basic things and try to show why nothing anybody thinks makes any sense. Or, defend an absurd position. (A troll may sometimes seem like a transmitter, but is actually a receiver.)
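As a small illustration of the "write down precise data types" technique mentioned in the list above, here is what a receiver's notes might look like in Python. The scenario (a transmitter talking loosely about "an agent picking the best action given what it has seen") and every name below are hypothetical; the point is only that writing the types forces questions like "best according to what?" and "given how much history?" into the open.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Observation:
    label: str


@dataclass(frozen=True)
class Action:
    label: str


# "What it has seen" becomes an explicit sequence of observations.
History = Sequence[Observation]

# "Best" needs a scoring function; the transmitter must say what it is.
Utility = Callable[[History, Action], float]


def best_action(history: History,
                actions: Sequence[Action],
                utility: Utility) -> Action:
    """Pick the action with the highest utility given the history."""
    return max(actions, key=lambda a: utility(history, a))
```

Even this much raises useful questions for the transmitter: does utility depend on the history at all, and what happens when two actions tie?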
The transmitter should feel as free as possible to just make claims, including “totally fake” claims, as long as they are keeping in touch with their intuitions; try to establish a norm where you can ask receivers to collaborate with you in uncovering the kernel of truth in what you’re saying rather than shooting down half-formed ideas because they’re still half-wrong. No matter how nonjudgemental the receivers are, it may help the transmitter to say things like “everything I’m about to say is totally wrong, but” every so often.
The transmitter should also remain in touch with their intuition and curiosity, steering the conversation to what they think is most interesting rather than trying to perform or entertain. The transmitter is under no obligation to answer the receiver’s questions; feel free to say “that’s not what I want to think about right now.”
The key idea is that the receiver is helping midwife what the transmitter is saying. In that moment, it is the transmitter’s thinking that should take priority, and the receiver is acting as a sounding board, a living intuition pump, and a source of confusion and (minor) chaos.
Meanwhile, any third parties in the audience should be trying to serve as facilitators/translators. They should be watching both the transmitter and the receiver and seeking to model what’s going on for those people. Where are they missing each other, and talking past each other? Where are they running up against confirmation bias, or the double illusion of transparency? Where are they both agreeing that something makes sense without actually understanding it?
The audience members should speak up from time to time (probably less than 10% of the total words) to inject relevant thoughts or models or questions. Sometimes, such an interjection will be the cause of a role switch, with an audience member taking on a new role as either transmitter or receiver, and one of the other parties rotating out.
3B. High standards for membership
It’s awkward to not-invite someone or to turn them away after one or two meetings, but it’s even more awkward to wreck your entire MIRIx chapter because you were too shy or too uncertain to protect it.
Have a clear distinction between “welcome to come to a meeting” and “is now a full part of the group.” Make sure that there is a known decision-maker or set of decision-makers, and empower them to make calls by fiat, without having to justify or explain. (If you don’t trust their judgment without explanation, don’t have them be part of the decision-making.) Trust your own instincts; if you don’t feel like someone is a good match for the vibe you have going, then don’t invite them in. Consider requiring multiple recommendations, or having an interview process. These may seem unnecessary, but it can be difficult to turn people away, and a formal process makes it feel more fair.
Also consider having formal ethical guidelines, or a group pledge or set of commitments, which people sign at the moment that they fully join. Make sure that any standards you set are ones you are willing to actually enforce (e.g. “you must come to half of all meetings” or “content discussed here is confidential unless otherwise stated”).
3C. Escalating asks and rewards
Consider the model of a martial arts academy. When you first arrive, the instructors ask a few small things of you (e.g. kick this target, yell out loudly when you do so). Soon, they reward you for these things with a belt and some status.
At that point, the asks escalate. Perhaps now, as a yellow belt, you are put in charge of watching some white belts for a few minutes, and correcting their form. In return, they are told to bow to you and call you “sir” or “ma’am.”
As time goes on, the asks increase, and the rewards increase commensurately. This cycle fosters commitment and investment—it’s a process of slowly proving to the individual “if I put something into this system, I will get something out of it, and the more I put in, the more I’ll get out.” Eventually, you will receive a black belt, and possibly be asked to join as a paid instructor or found your own branch of the school.
There is a similar dynamic in most groups and organizations. Groups which ask little or nothing of their members tend not to receive loyalty in return. Individuals feel bought-in to a group to the extent that that group allows them to tell positive or epic stories about themselves.
The same will be true of your MIRIx chapter. Consider having some small, early asks that are the same for most newcomers (e.g., read such-and-such paper, or give a ten-minute talk on a topic of interest at your third meeting). Try to build a pipeline of greater asks and rewards over time (e.g., on your fifth-ish meeting, we’d like you to take charge of setting the agenda and dividing up the groups).
3D. Structure and elbow room
Related to the previous, it’s important that you balance top-down and bottom-up structure in your MIRIx. If there’s no clear sense of “how we do things,” then newcomers will flounder and have a bad time. You want there to be a pre-existing structure that people can evaluate, to determine whether or not they feel like they fit into it. You want the “what’s this like?” of your group to be clearly visible, right from the get-go, so that both people who are well-suited to it and people who aren’t can (for the most part) accurately self-assort.
At the same time, you don’t want that structure to feel limiting or confining in the long run. Just as martial artists eventually earn the right to determine some of their own training and the ability to contribute to the agenda-setting and curriculum of newer students, so too do you want the “pie” of your MIRIx to grow as time goes on. Otherwise, people will grow frustrated by their inability to bring the fullness of their own interests and priorities, and will leave to find a better context for their own growth and research.
3E. Social norms
That which is normal and accepted is that which goes unchallenged. If there is behavior that you want to discourage, you need to make sure not only that you challenge it when it occurs, but also that you openly, vocally, and publicly support others who are challenging it. It is the job of the group to ensure that someone who is following the rules/trying to do it right is never alone when they are in conflict with someone who isn’t.
Consider in advance, and be explicit about, things like the acceptability of interruptions or off-topic discussion. Cultivate a culture of disagreement, but be deliberate about building in politeness and support so that disagreement is net-positive and doesn’t turn into abuse or delegitimization. Protect whatever decision-making structures you decide to put in place, and be consistent about what constitutes each person’s domain and what marks the end of discussion.
4. Other useful thoughts and questions
Try to have a mix of topics or activities, so that every meeting doesn’t follow the exact same pattern. Read papers, give presentations, hold discussions, write formal proofs or essays, etc. Try to have fewer than 50% of your meetings center on reading and/or discussing pre-existing material. (Ideally, fewer than 33%.)
Consider setting long-term agendas, i.e., six months or a year of meetings that stay near a particular swath of the territory and allow for the group to build up a body of knowledge and progress.
If setting a long-term agenda, build in wiggle room for things that aren’t part of that swath (e.g., every third meeting is deliberately not consistent with the overall arc).
Consider assigning someone to take minutes and collate them in a permanent place, such that you can look back over the arc of a given season or year. Consider whether or not it feels valuable to go over minutes from the previous meeting at the start of each meeting.
At the end of a meeting, assign someone to collect and email out questions that people intend to mull over, or thoughts that will lead into the next meeting. Consider deciding in advance who will be leading what at the next meeting, so they have a reason to prepare and to show up.
Ensure that you have up-to-date contact information for all full members and associated/interested parties. Think in advance about whether you want to do email messages, FB groups, individual texts, etc.
Consider what relationship you want to have with other MIRIx groups, such as sharing minutes or questions or occasionally sending or receiving ambassadors. Take agentic action in causing such things to happen, if you want them—remember that you’re part of a class, and if you want it but never take steps to bring it about, this is probably true of lots of other people as well.
Consider whether you want to run events for the general public or potential new recruits (e.g., in math or CS departments). Consider whether you want to try more ambitious projects, like the Human-Aligned AI Summer School, and reach out to people with knowledge and resources to do it well rather than reinventing the wheel.
If your MIRIx chapter is in an academic setting, be sure to figure out what sort of pipeline you want to form, so that you have underclassmen who are invested and ready to take over when the older students graduate. If not in an academic setting, consider how you want to go about recruiting new members. Note that a large influx of new members is rarely useful, and compounds the culture problem; it’s better by far to add new people one or two at a time, with plenty of time to acculturate.
Remember that the quality of the research and discussions and the MIRIx chapter as a whole is dependent on the actions of individuals, and how those actions combine. Be sure to impress this upon every member—your MIRIx is only as good as each of you individually chooses to make it.
You’ve nearly reached the end of the document! Hopefully, this contained non-zero useful information, as well as a healthy amount of food-for-thought. Before you go, we recommend that you take 30 seconds or so to ponder each of the following questions:
Why did we choose to write this document? What were we expecting from it, and what caused us to select this particular format and content, out of all of the possibilities?
Where are you still hungry or frustrated or dissatisfied? What’s missing from this document, that we failed to address? How did you come to be aware of this/these thing/s that we missed?
What sort of document would you write? How would you know if it was a good idea to write one, or not? How would you decide what to put into it?
How the hell does progress even get made?
- The MIRI research team