Feedbackloop-first Rationality

I’ve been workshopping a new rationality training paradigm. (By “rationality training paradigm”, I mean an approach to learning/teaching the skill of “noticing what cognitive strategies are useful, and getting better at them.”)

I think the paradigm has promise. I’ve beta-tested it for a couple weeks. It’s too early to tell if it actually works, but one of my primary goals is to figure out relatively quickly whether it works, and give up if it isn’t delivering.

The goal of this post is to:

  • Convey the framework

  • See if people find it compelling in its current form

  • Solicit ideas for improvements, before I decide whether to invest heavily in a larger experiment around it.


Rationality needs better feedback loops

Claim: Feedback loops are the most important thing ever. Hard things are hard because they have bad feedback loops. Some of the most important things (e.g. x-risk mitigation research) have the worst feedback loops.

Bold prediction: You can learn to think better, even about confusing, poor-feedback domains. This requires developing the art of inventing feedback loops. And then, actually putting in a lot of deliberate practice effort.

I’ve long been haunted by this Romeo Stevens comment (slightly paraphrased)[1]:

Deliberate practice deliberate practice until you get really good at identifying good feedback loops, and working with them.

People have a really hard time with interventions, often because they literally do not have a functioning causal model of the skill in question. People who apply deliberate practice to a working causal model often level up astonishingly quickly. Don’t know if you have the appropriate causal model? Well, when you apply deliberate practice, do you get better? If not, you’re pulling on fake levers.

In the past, I’ve tried to practice thinking. I’ve done explicit puzzle-solving exercises, and I have a day job that forces me to think about challenging questions on a regular basis. I’ve sometimes tried to refactor my day job into something deliberate-practice-shaped, but it never gelled.

I think I’ve gotten better at thinking in the past 12 years. But I haven’t gotten overwhelmingly obviously better at thinking. I recently decided to deliberately practice “solving confusing problems” until I was demonstrably better at it, and to host some workshops where I tried helping other people practice too.

I ended up settling into a paradigm of rationality training with five elements:

  • Deliberate Practice. Do challenging cognitive exercises, at the edge of your ability, in a variety of domains, where it’s obvious how well you’re doing (i.e. clear-cut answers, or you’re making a metric go up).

  • Metacognition. After deciding on the final answer for the exercise and finding out if you got it right, reflect on what you could have done better. Try to extract as much insight/wisdom/tools as you can from each exercise.

  • Improve your practice feedback loop. Then, find or design better exercises that cut more closely to your ultimate goals. Optimize exercises both for being concrete (i.e. you can tell if you succeeded), and for yielding as much insight/tools as possible during the metacognition step (i.e. they’re at a good difficulty, in a domain you haven’t already exhausted for insight).

  • Improve your real-life feedback loop. Think about what sort of cognitive challenges you run into in your day job or main project, where you’re bottlenecked in your ability to reason. How can you do better meta-reflection in those fuzzier, longer-timescale domains?

  • Illegible goodness. In addition to the formal structure implied by the previous four bullets, also try random stuff that feels vaguely relevant and helpful, even if you can’t explain why. (I think some previous rationality training approaches leaned too much in this direction, but you still need some illegible goodness, IMO.)

I think the two biggest concepts here are:

1. Actually do the goddamn practice

2. The feedback loop is the primary product.

This last point is really important so I’m going to say it again in big letters:

The feedback loop is the primary product

In “Feedbackloop-first Rationality Training” you’re trying to get the right answers to concrete questions, yes. And you’re trying to learn from those answers. But more importantly, you’re trying to build an overall better feedback loop for yourself that reliably helps you improve at thinking over time.

Rationality training has some feedback mechanisms, but IMO they kinda suck. They are either short/reliable (but don’t super track the things we ultimately care about), or hella long, in a noisy world where random non-rationality effects often dominate.

So I think, at least for me, it makes sense to think of my primary goal as inventing good, short feedback loops that help me point in the right direction. It seems important both individually, as someone aspiring to learn to think better, and collectively, as someone hoping to contribute to the overall Art of Human Rationality project.

I don’t know that everyone focused on applied rationality should prioritize this, but I do think it’s the most important open problem in the field. I think it’d be worthwhile for everyone working on rationality training to think about this for at least a few days. And, my hope with this post is to excite you and give you some traction on making this a major focus of your rationality-training approach, for a while, until good feedback loops are no longer the bottleneck.

My starting loop: Thinking Physics

A cognitive feedback loop looks something like this:

[Diagram: a cognitive feedback loop.]

(I notice, drawing this out, that it’s basically an OODA loop.)

I may write a followup post that delves into “what exactly is a feedback loop, really?” if that seems helpful. But I figured I’d start by just laying out what I concretely did, and seeing what questions or concerns people had.

I started out with the goal of getting better at Thinking Physics (a collection of physics puzzles, designed such that you can solve them without much formal math or physics background). This involved three nested feedback loops, sketched in code below.

The basic loop here was:

  • Solve a puzzle. Aim to get the answer right with 95% confidence.

  • Reflect on how I could have solved it better.

  • Grade myself partly on whether I got the question right, but primarily on whether I learned new stuff about how to think from it.

  • Choose a new puzzle to solve (optimizing in part for a challenge that I expect to learn a lot from).

Then, there’s an outer loop of checking:

  • Am I getting better at solving Thinking Physics puzzles?

Then, a further outer loop of:

  • Do I seem to be getting better at my day-job? What other exercises do I expect to most help me on the fuzzy real world problems I work on each day?
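To make the nesting concrete, here’s a minimal sketch of the inner loop plus the first outer-loop check, in Python. All the scaffolding is hypothetical and mine (the function names, the puzzle dict shape, and the use of a Brier score as the calibration check); the outermost day-job loop doesn’t compress into code this neatly.

```python
import statistics

def practice_loop(puzzles, solve_fn, reflect_fn):
    """Inner loop: commit to a final answer (aiming for ~95% confidence),
    check it, then reflect before choosing the next puzzle."""
    log = []
    for puzzle in puzzles:
        answer, confidence = solve_fn(puzzle)     # commit to a final answer first
        correct = answer == puzzle["solution"]    # only then check it
        lessons = reflect_fn(puzzle, correct)     # metacognition: what would have helped?
        log.append({
            "puzzle": puzzle["id"],
            "confidence": confidence,             # stated confidence (target: 0.95)
            "correct": correct,
            "lessons": lessons,                   # the primary "grade" is what you learned
        })
    return log

def outer_loop_check(log):
    """Outer loop: am I getting better, and is my 95% actually a 95%?
    Brier score = mean squared gap between confidence and outcome (lower is better)."""
    accuracy = statistics.mean(1.0 if a["correct"] else 0.0 for a in log)
    brier = statistics.mean((a["confidence"] - a["correct"]) ** 2 for a in log)
    return accuracy, brier
```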

A good outcome from feedbackloop-first rationality would be to find a better introductory loop than “random-ish Thinking Physics puzzles”. I’m choosing this because it was easily available and I could get started right away, not because I think it’ll ultimately be the best thing.

A Spectrum of “Concreteness”

Much of my prior exposure to rationality training came from observing CFAR and Leverage Research. I have an impression that they both approached cognitive training with a mindset of: “Our actual goals are very messy/nuanced. We want good feedback loops, but we don’t want to goodhart ourselves and lose sight of our real goal. So let’s take our ultimate fuzzy goal, and step backwards towards ‘more concreteness’, until we reach something tractable enough to work on.”

[Diagram: a spectrum from concrete-but-less-relevant exercises on the left to fuzzy-but-highly-relevant goals on the right.]

That is, they started on the right side of this diagram, and took one or two steps to the left.

Leverage’s goal was to solve psychology, as part of a broader plan to effect large social change. In the process, they found it was hard to really know anything about what was going on in people’s heads, so they worked on developing a better methodology of introspection training to get better data.

I think CFAR initially had a goal of “generally train formidable, clear thinking people”, and tracked progress with some combination of “do the participants seem to understand and find value in the workshop?” and “do they go on to do impressive things later?”

I learned a bunch from both orgs. I use a lot of explicit techniques I learned from each of them pretty frequently. But I feel unsatisfied with how their approaches played out, and I decided this time I wanted to start from the opposite direction.

So I started with the goal of “get good at Thinking Physics to prove I can get good at anything at all”. Thinking Physics exercises are towards the left side of the diagram (maybe slot 3). They’re closer to my ultimate goal than, say, “memorizing cognitive biases” or “practicing OpenPhil’s calibration game”, but still many steps away from my day job of “design forum software that helps with intellectual progress or existential risk somehow.”

I figured I’d work upwards, towards fuzzier but still semi-grounded exercises. Maybe I’d do short research projects that lasted ~a week, something like that.

Bridging Concreteness from Both Directions

But along the way, I noticed it felt a lot more promising if I also worked backwards from the fuzzier goals.

This came up in particular when talking with people who came to my beta-test workshops: people who, unlike me, weren’t necessarily ready to invest months into a speculative training program, but liked the idea of putting in a few days of work.

After solving some physics puzzles, it seemed a useful exercise to ask:

  • “How does this connect back to your day job?”

  • “What skills transfer or generalize?”

  • “What specific skills do you wish you were better at for your day-job, and what exercises would help you get better at them?”

  • “If you can’t design an explicit exercise, can you at least find a way to integrate more meta-reflection into your day job?”

My current guess is that cognitive training works best if you approach from both ends of the concreteness spectrum. This seems better for reaching a state where we have a smooth feedback gradient from “concrete but less relevant” to “highly relevant but less concrete” (and, ideally, pushing the Pareto frontier forward – finding exercises that are strictly more relevant and concrete than the ones currently available).

And it also seems better for hitting a medium-term goal of “develop a workshop or school that helps people more immediately.”

A Sketch of an Expensive Plan

I feel pretty confident that the core hypotheses (“actually do the goddamn practice” and “build better feedback loops”) are at least pretty useful, and that it’d be silly not to try them seriously at least once.

Here is a more speculative plan, to give you some sense of where this is going. I might not stick to this exact plan but it seems like a good starting place.

I’m interested in getting to a place where we can train “research taste in novel domains.”

I’m particularly motivated by alignment research. The state of the field is that there are tons of arguments and counterarguments, but the most respected researchers still disagree deeply about many foundational issues. They disagree about what parts will be hard, and what counts as progress, and in many cases “what does good thinking about this look like?”

A central question is whether something that has never happened before will one day be really important, such that we have to get it right on the first try. A particular disagreement in the field is: how valuable are the quick-ish feedback loops of “align or study current ML systems” for preparing for that thing that might-or-might-not-happen in the future?

Here is a first draft of a fairly expensive plan that seems workable-in-theory, as an “upper level feedback loop to aspire to”. (I’ll sketch the randomization step in code after the list.)

  • Find a few different domains that involve difficult-problem-solving, with concrete exercises. They should be as different from each other as possible while satisfying that requirement. Exercises should take at least a couple hours to solve, and possibly up to a couple days.

  • Have a large group of people work through practice problems from each domain, randomizing the order in which each person tackles the problems. (The ideal version of this takes a few months.)

  • While the participants work on each problem, they record predictions about whether their current approach is likely to pan out, or turn out to be a dead end.

  • As part of each problem, they do meta-reflection on “how to think better”, aiming specifically to extract general insights and intuitions. They check what processes seemed to actually lead to the answer, even when they switch to a new domain they haven’t studied before.
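Here’s a minimal sketch of the randomization step. The domains and problem names are placeholders of mine, not proposed domains; the point is just that independently shuffling the same problem pool per participant is what makes an early-vs-late comparison meaningful.

```python
import random

# Placeholder domains and problems (illustrative only, not proposed domains).
DOMAINS = {
    "thinking_physics": ["tp1", "tp2", "tp3"],
    "cryptic_crosswords": ["cc1", "cc2", "cc3"],
    "algorithm_puzzles": ["ap1", "ap2", "ap3"],
}

def assign_schedules(participants, seed=0):
    """Give every participant the same pool of problems in an independently
    shuffled order, so each problem lands early in some people's programs
    and late in others'."""
    rng = random.Random(seed)
    pool = [(domain, p) for domain, problems in DOMAINS.items() for p in problems]
    return {pid: rng.sample(pool, k=len(pool)) for pid in participants}

schedules = assign_schedules(["participant_1", "participant_2", "participant_3"])
```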

A primary question I’d want to investigate is whether you can gain a clear sense of which of your intuitions transfer between domains, and then see if you can do better-than-average on new domains that you haven’t specifically trained on.

(I might start by having everyone do an IQ test or similar standardized measurement, and repeat it at the end, partly because that just seems generally useful, and partly to allow for some comparisons with the broader educational literature.)

Building a laboratory for rationality training

The overall metric here is “do people who get randomly assigned a problem later in the program do better on it than people who got assigned that problem earlier in the program?” (Meanwhile, there’d also be some control group that doesn’t do the whole program.)
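A minimal sketch of computing that metric, with made-up record fields (my own scaffolding, not an actual analysis plan):

```python
from statistics import mean

def early_vs_late(records):
    """records: dicts with hypothetical fields
    {"problem": str, "position": int, "program_length": int, "score": float}.
    For each problem, compare the mean score of people who hit it in the
    first half of their (randomized) program vs. the second half."""
    buckets = {}
    for r in records:
        half = "late" if r["position"] >= r["program_length"] / 2 else "early"
        buckets.setdefault(r["problem"], {"early": [], "late": []})[half].append(r["score"])
    return {
        problem: {"early": mean(b["early"]), "late": mean(b["late"])}
        for problem, b in buckets.items()
        if b["early"] and b["late"]   # skip problems missing one bucket
    }
```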

My hope is that the baseline program turns out to be pretty valuable on its own (if for no other reason than it’s ~a semester of deliberate practice on novel/confusing problems where you can tell how well you did), enough that students can come through the program and actively gain something from it… and also, that it can make for a good laboratory for aspiring rationality instructors. If you have a technique that you think should help researchers, you can try teaching it to people in this program and see if they do better than baseline.

Reminder: First draft

I don’t think this current plan is necessarily great, or exactly what I’d do next. The point of the “Feedbackloop Rationality” focus is to find better (and cheaper) feedback loops.

My next steps would probably include “do some lit-reviews of the education literature” and “try a second domain after Thinking Physics and see how it goes.” But I list this comprehensive plan to give a sense of where this might be going.

What’s next? Questions?

I haven’t decided yet whether I’m continuing on this. It’ll depend on what other projects the Lightcone team is considering and whether they seem more promising or time-sensitive. It’ll also depend on how excited other people are for the project, and whether anyone wants to join in as either a cofounder, or a “serious test subject who puts in full-time effort.”

I have some specific ideas for what to do next, but I think I want to start by asking: “Does this make sense? Do you have any questions? Can you think of cheaper ways to test the hypotheses here without spending multiple weeks/months of a bunch of smart people’s time?”

  1. ^

    This is actually one “definitely real” comment combined with another comment I vaguely remember, which he responded positively to when I sort-of-quoted him later in the thread.