Comment reply: my low-quality thoughts on why CFAR didn’t get farther with a “real/​efficacious art of rationality”

Hi! I was writing this originally as a comment-reply to this thread, but my reply is long, so I am factoring it out into its own post for easier reading/​critique.

This is more comment-reply-quality than blog post quality, so read at your own risk. I do think the topic is interesting.

Short version of my thesis: It seems to me that CFAR got less far with “make a real art of rationality, that helps people actually make progress on tricky issues such as AI risk” than one might have hoped. My lead guess is that the barriers and tricky spots we ran into are somewhat similar to those that lots of efforts at self-help /​ human potential movement /​ etc. things have run into, and are basically “it’s easy and locally reinforcing to follow gradients toward what one might call ‘guessing the student’s password’, and much harder and much less locally reinforcing to reason/​test/​whatever one’s way toward a real art of rationality. Also, the process of following these gradients tends to corrupt one’s ability to reason/​care/​build real stuff, as does assimilation into many parts of wider society.”

Epistemic status: “personal guesswork”. In some sense, ~every sentence in the post deserves repeated hedge-wording and caveats; but I’m skipping most of those hedges in an effort to make my hypotheses clear and readable, so please note that everything below this is guesswork and might be wrong. I am sharing only my own personal opinions here; others from past or current CFAR, or elsewhere, have other views.

Conversational context, leading up to this post-length comment-reply:

I wrote:

In terms of whether there is some interesting thing we [at CFAR] discovered that caused us to abandon e.g. the mainline [workshops, that we at CFAR used to run]: I can’t speak for more than myself here either. But for my own take, I think we ran to some extent into the same problem that something-like-every self-help /​ hippy /​ human potential movement since the 60′s or so has run into, which e.g. the documentary (read: 4-hour somewhat intense propaganda film) Century of the Self is a pretty good introduction to. I separately or also think the old mainline workshops provided a pretty good amount of real value to a lot of people, both directly (via the way folks encountered the workshop) and via networks (by introducing a bunch of people to each other who then hit it off and had a good time and good collaborations later). But there’s a thing near “self-help” that I’ll be trying to dodge in later iterations of mainline-esque workshops, if there are later iterations. I think. If you like, you can think with some accuracy of the small workshop we’re running this week, and its predecessor workshop a couple months ago, as experiments toward having a workshop where people stay outward-directed (stay focused on inquiring into outside things, or building stuff, or otherwise staring at the world outside their own heads) rather than focusing on e.g. acquiring “rationality habits” that involve a conforming of one’s own habits/​internal mental states with some premade plan. [1]

gjm replied:

You refer to “the same problem that something like every self-help /​ hippy /​ human potential movement since the 60s has run into”, but then don’t say what that problem is (beyond gesturing to a “4-hour-long propaganda film”).

I can think of a number of possible problems that all such movements might have run into (or might credibly be thought to have run into) but it’s not obvious to me which of them, if any, you’re referring to.

Could you either clarify or be explicit that you intended not to say explicitly what you meant? Thanks!

And later, gjm again:

I’ll list my leading hypotheses so you have something concrete to point at and say “no, not that” about.

  • It turns out that actually it’s incredibly difficult to improve any of the things that actually stop people fulfilling what it seems should be their potential; whatever is getting in the way isn’t very fixable by training.

  • “Every cause wants to be a cult”, and self-help-y causes are particularly vulnerable to this and tend to get dangerously culty dangerously quickly.

  • Regardless of what’s happening to the cause as a whole, there are dangerously many opportunities for individuals to behave badly and ruin things for everyone.

  • In this space it is difficult to distinguish effective organizations from ineffective ones, and/​or responsible ones from cultish/​abusive ones, which means that if you’re trying to run an effective, responsible one you’re liable to find that your potential clients get seduced by the ineffective irresponsible ones that put more of their efforts into marketing.

  • In this space it is difficult to distinguish effective from ineffective interventions, which means that individuals and organizations are at risk of drifting into unfalsifiable woo.

So, that’s the prior conversational context. Now for my long-winded attempt to reply, and to explain my best current guess at why CFAR didn’t make more progress toward an actually-useful-for-understanding-AI-or-other-outside-things art of rationality.

I’ll write it by quoting some of gjm’s hypotheses, with some of my own added, in an order that is convenient to me, and with my own numbering added. I’ll skip the hypotheses that seem inapplicable/​inaccurate to me, and just quote the ones that I think are at least partially descriptive of what happened.

Re: “Hypothesis 1 (from gjm): In [the self-help/​rationality] space it is difficult to distinguish effective from ineffective interventions, which means that individuals and organizations are at risk of drifting into unfalsifiable woo.”

Yes. It is hard (beyond my skill level, and beyond the skill level of others I know AFAICT) to figure out the full intended functions of various parts of the psyche.

So, when people try to re-order their own or other people’s psyches based on theories of what’s useful, it’s easy to mess things up.

For example, I’ve heard several stories from adults who, as kids, decided to e.g. “never get angry” (read: “to dissociate from their anger”), in an effort not to be like an angry parent or similar.

Most people would not make that particular mistake as adults, but IMO there are a lot of other questions that are tricky even as an adult, including for me (e.g.: what is suffering, is it good for anything, is it okay to mostly avoid states of mind that seem to induce it, what’s up with denial and mental flinches, is it okay to remove that, does a particular thing that looks like ‘removing’ it remove it all the way or just dissociate things, what’s up with the many places where humans don’t seem very goal-pursuing/​very agent-like, is it okay to become able to ‘do my work’ a lot more often, is it good/​okay to become poly, is it workable to avoid having children despite really wanting to or does this risk something like introducing a sign error deep in your psychology …)

So, IMO, the history of efforts at self-improvement or rationality or the human potential movement or similar is full of efforts to rewire the psyche into molds that seem like a good idea initially, and sometimes seem like a bad idea in hindsight. And this sort of error is a bit tricky to recover from, because, if you’re changing how your mind works or how your social scene works, you are thereby messing with the faculties you’ll later need to use to evaluate the change, and to notice and recover from errors.

I think this is a significant piece of how we got stuck.

Re: “[Hypothesis 2] (from me): The motives used in rewiring parts of the psyche (e.g. at ‘self-help’ programs) are often impure. This impurity leads to additional errors in how to rewire oneself/​others, beyond those that would otherwise be made by someone making a simple/​honest best guess at what’s good.”

I suspect that “impure motives” (motives aimed at some local goal, and not simply at “help this mind be free and rational”) were also a major contributor to what kept us from getting farther at CFAR, and that this interacted with and exacerbated the “model gaps” I was listing in hypothesis 1.

Some examples of the “impure” motives I have in mind:

In groups:

  • Participants may want to learn “rationality”/​“CFAR techniques”/​etc. so that they can feel cool, so others will think they’re cool, so they can be part of the group, so they can gain the favor of a “teacher” or other power structure, etc.

  • Instructor-y types may want people to think our techniques are cool so that we’ll be in positions of power/influence/control (or so that we can think we’re cool, or so we can keep covering up some blind spot of our own by getting everyone else to have it too).

  • Organizations may want to select people who appear to have a certain sort of psyche so they can predict or control those people, and this can itself cause would-be recruits for that organization to want to appear to be that way, and so on. (CS Lewis’s “inner ring” concept covers some of this, but it can get extra wonky once you bring in explicit techniques for modeling and changing the psyche, IMO.)

(“Wanting”, here, doesn’t need to mean “conscious, endorsed wanting”; it can mean “doing gradient-descent from these motives without consciously realizing what you’re doing.”)

Things get more easily wonky in groups, but even in the simpler case of a single individual there is IMO lots of opportunity for “impure” motives:

  • Often there’s a “subsystem” of the psyche that can try to “grab hold of the steering wheel” in ways that mess things up. For example:

    • Egos/​plans like to be in control of the psyche as a whole, even where they sort of “shouldn’t” be. For example, I see many people request help with an aim to act more hours of the day/​year on their current goals, and to shortcut those pesky depressions/​burnouts/​downtimes. Similarly, I see many people asking for help “convincing” themselves of things they “already know” are true, trying to get their motivations and beliefs not to waver, etc. (From my POV, this is iffy — it’s the current plan scheming to retain/​extend control, even though downtime, boredom, unplanned shifts in goals and views in contact with new experiences, etc. often allow a kind of change that seems to me to serve good in the long run.)

    • People sometimes try to block out fear, pain, uncertainty, etc., even where this blocking-out seems to me to be worse for overall functioning. (I am again tempted to see this as a Goodhart-esque problem, where a portion of the psyche is locking in increased control for the local goals it is pursuing, such as local-pain-avoidance, in ways that trample some of the structures in the larger psyche.)

So, in summary: people’s desires to control one another, or fool one another, can combine poorly with techniques for psychological self- or other- modification. So, too, with people’s desires to control their own psyches, or to fool themselves, or to dodge uncertainty. Such failure modes are particularly easy because we do not have good models of how the psyche, or the sociology, ought to work, and it is relatively easy to manage to be “honestly mistaken” in convenient ways in view of that ignorance.

Re: Hypothesis 3: (from me): “Insufficient feedback loops between our ‘rationality’ and real traction on puzzles about the physical world”

In Something to Protect, Eliezer argues that the real power in rationality will come when it is developed for the sake of some outside thing-worth-caring-about that a person cares deeply about, rather than developed for the sake of “being very rational” or some such.

Relatedly, in Mandatory Secret Identities, Eliezer advocates requiring that teachers of rationality have a serious day job / hobby / non-rationality-teacher engagement with how to do something difficult, and that they do enough real accomplishment to warrant respect in this other domain, and that no one be respected more as a teacher of rationality than as an accomplisher of other real stuff. That is, he suggested we try to get respect for real traction on real, non-“rationality” tasks into any rationality dojo’s social incentives.

Unfortunately, our CFAR curriculum development efforts mostly had no such strong outside mooring. That is, CFAR units rose or fell based on e.g. how much we were personally convinced they were useful, how much the students seemed to like them and seemed to be changed by them, how much we liked the resultant changes in our students (both immediately and at follow-ups months or years later), etc., but not based (much/enough) on whether those units helped us/them/whoever make real, long-term progress on outside problems.

In hindsight, I wish we had tried harder to tether our art-development to “does it help us with real outside puzzles/​work/​investigations/​building tasks of some sort.” This seems like the sort of factor that could in principle keep an “art of rationality” on a path to being about the outside world.

At the same time, taking such outside traction seriously seems quite difficult to pull off, and in ways I expect would also have made it difficult for most other groups in our shoes to pull off (and that I suspect also affected e.g. most self-help / human potential movement / etc. efforts). So I’d like to sketch why this is hard.

a) Taking “does this ‘rationality technique’ help with the real world?” seriously pulls against local incentive gradients.

Doing things with the real world “slows you down” and makes your efforts less predictable-to-yourself (which I and others often experience as threatening/unpleasant, vs being more able to ‘make up’ which things can be viewed as successful). Relatedly, such outside “check how this works in real tasks” steps are unlikely to “feel rewarding”, or to cause others to think you’re cool, or to cause your units to feel more compelling locally to the social group. (Appearing to have done real-world checks might make your units more socially compelling in some groups. But unfortunately this creates a pull toward “feeling as though you’ve done it” or “causing others to feel as though you’ve done it,” not toward the difficult, hard-to-track work of having actually sussed out what helps in a puzzling real-world domain.)

Thus, it’s easy for those strands within an organization/effort that attempt to take real-world traction seriously to be locally outcompeted by strands not attempting such safeguards.

That is, a CFAR instructor / curriculum-developer who initially has some interest in both approaches will “naturally” find their attention occupied more and more by curriculum-development efforts that skip the slow/unpredictable loop of “check whether this helps with real-world problem-solving.” Similarly, an ecosystem involving several “rationality developers,” some of whom do the one and some the other, will “naturally” find more of its attention heading to the person who is more like “guessing the students’ passwords”, and less like “tracking whether this helps with building real-world maps that match the territory, in slow, real-world, messy domains.”

b) In practice, many/​most domains locally incentivize social manipulation, rather than rationality.

Lots of people who came to CFAR’s past workshops (like people ~everywhere) wanted to succeed at lots of different things-in-their-lives. E.g. they wanted to do well in grad school or in careers, or to have good relationships with particular people, or get better at public speaking, or get more done at their EA job, or etc.

One might have hoped (I originally did hope) that folks’ varied personal goals would provide lots of fodder for developing rationality skill, and that this process would provide lots of fodder for developing an art of rationality.

However, I now like to ask, about a person’s notion of doing “well” in a domain, whether the local signals-they-will-interpret-as-progress are more easily obtained by:

  1. “marketing skill”—causing other people to see things particular ways (e.g., by persuasive speaking/​writing), causing themselves to do particular things that they wouldn’t by default do, etc.; or by

  2. “skill at predicting, or directly building things in, the physical world”—for example, skill at doing math, or programming, or wood-working, or doing alignment research that directly improves your or others’ understanding of alignment, or etc.

It unfortunately seems to me that for most of the goals people come in with, and for most of the ways that people tend initially to evaluate whether they are making progress on that goal, the “help them feel as though they’re making progress on this goal” gradient points more toward skill at manipulating themselves and/or others, and less toward skill at predicting and manipulating the physical world.

So, if a person is to take “does this so-called ‘rationality technique’ actually help with real-world stuff?” seriously as a feed-in to the development of a real and grounded art of rationality, they’ll need to carefully pick domains of real-world stuff that are actually about the ability to model the physical world, which on my model are unfortunately relatively rare. (E.g., “doing science” works, but “being regarded as having done good science” only sort-of works; some parts of finance seem to me to work, but some parts really don’t; etc.)

c) In practice, my caring about AI safety (plus the way I was conceptualizing it) often pulled me toward patterns for influencing people, rather than toward a real art of rationality

I might have hoped that “solve AI, allow human survival” would be an instance of “something to protect” for some of us, and that our caring about AI safety would help ground/​safeguard the rationality curriculum. I.e., I might have hoped that my/​our desire to have humanity survive, would lead us to want to get real rationality techniques that really work, and would lead us away from dynamics such as those in Schools Proliferating without Evidence, and toward something grounded and real.

But, no. Or at least, not nearly as much as was needed. AI safety was indeed highly motivating for some of us (at minimum, for me), but the feedback loops were too long for “is X actually helping with AI safety?” to give the “but does it actually work in reality?” tether to our art. (Though we got some of that sometimes; the programs attempting to aid MIRI research were sometimes pretty fruitful and interesting, with the thoughts on AI feeding back into better understandings of how to reason, and with some techniques, e.g. “Gendlin’s ‘Focusing’ for research” gaining standing as a result of their role in concrete research progress sometimes.)

And in addition to the paucity of data as to whether our techniques were helping with research, there was a presence of lots and lots of data and incentives as to whether our techniques were e.g. moving people to take up careers in AI safety, moving people to think we were cool, moving people to seem like they were likely to defer to MIRI or others I thought were basically good or on-path, etc. On my model, these other incentives, and my responses to them, made my and some others’ efforts worse.

I did ‘try’ to be virtuous. But reality is a harsh place, with standards that may seem unfair

I… did try to have my efforts to influence people on AI risk be based in “epistemic rationality,” as I saw it. That is, I had a model in which folks’ difficulty taking AI risk seriously was downstream of gaps in their epistemic rationality, and in which it was crucial that persuasion toward AI safety work be done via improving folks’ general-purpose epistemic rationality, and not through e.g. causing people to like the disconnected phrase “AI safety.”

I endorsed many of the correct keywords (“help people think, don’t try to persuade people of anything”).

Nevertheless: the feedbacks that in-practice shaped which techniques I liked/​used/​taught were often feedbacks from “does this cause people to look like people who will help with AI risk as I see it, i.e. does it help with my desire for a certain ideology to be in control of people,” and less feedbacks from “are they making real research progress now” or other grounded-in-the-physical-world successes/​failures. (Although, again, there was some of the good/​researchy kind of feedback, and I value this quite a bit.)

Case study: the “internal double crux” technique, used on AI risk

To give an example of the somewhat-wonky way my rationality development often went: I developed a technique called “internal double crux,” and ran a lot of people through a ~90-minute exercise called “internal double crux on AI risk.” The basic idea in this technique is that you have a conversation with yourself about whether AI risk is real, and e.g. whether the component words such as “AI” even refer to anything real/sensible/grounded, and you thereby try to pool the knowledge possessed by your visceral “what do I actually expect to see happen” self with the knowledge you hold more abstractly, and to hash things out, until you have a view that all of yourself signs onto and that is hopefully more likely to be correct.

I developed the “internal double crux” technique in part by thinking about the process that I and many ‘math people’ naturally do when reading a math textbook, where a person reads a claim, asks themselves if it is true, finds something like “okay, the proof is solid, so the claim is true, but still, it is not viscerally obvious yet that it is true, how do I see at a glance that it has to be this way?”, and then does something like dialogue with themselves, back and forth, until they can see why the theorem holds. (Aka, I developed the technique at least partly by trying to be virtuous, and to ‘boost epistemic rationality’ rather than persuade.)

Still, the feedbacks that led to me putting the technique in a prominent place in the curriculum of CFAR’s “AI risk for computer scientists” and “MIRI summer fellows” workshops were significantly that it seemed to often persuade people to take AI risk seriously.

And… my guess in hindsight is that the “internal double crux” technique often led, in practice, to people confusing/​overpowering less verbal parts of their mind with more-verbal reasoning, even in cases where the more-verbal reasoning was mistaken. For example, I once used the “internal double crux” technique with a person I will here call “Bob”, who had been badly burnt out by his past attempts to do direct work on AI safety. After our internal double crux session, Bob happily reported that he was no longer very worried about this, proceeded to go into direct AI safety work again, and… got badly burnt out by the work a year or so later. I have a number of other stories a bit like this one (though with different people and different topics of internal disagreement) that, as a cluster, lead me to believe that “internal double crux” in practice often worked as a tool for a person to convince themselves of things they had some ulterior motive for wanting to convince themselves of. Which… makes some sense from the feedbacks that led me to elevate the technique, independent of what I told myself I was ‘trying’ to do.

A couple other pulls from my goal of “try not to die of AI” to my promotion of social fuckery in the workshops

A related problem was that, in practice, it was too tempting to approach “aid AI safety” via social fuckery, and social fuckery is bad for making a real art of rationality.

For example, during the “AI risk for computer scientists” workshops that we ran partly as a MIRI recruiting aid in 2018-2020, I aimed to make the workshops impressive, and to make them showcase our thinking skill.

My reasoning at the time was that, since we could not talk directly about MIRI’s non-public research programs, it was important that participants be able to see MIRI’s/​our thinking skill in other ways, so that they could have some shot at evaluating-by-proxy whether MIRI had a shot at being quite good at research.

(That is: I phrased the thing to myself as being about exposing people to true evidence, but I backchained it from wanting to convince them to trust me and to trust the structures I was working with.)

In practice, this goal led me to such actions as:

  • Not asking dumb/​ignorant questions about neural nets;

  • Venturing opinions where I could see how to back (my guess at Eliezer’s views /​ my guess at correct views) on AI alignment questions, but often refraining from sharing my inside views in cases where those seemed probably-dumb-looking or in cases where those pulled against the MIRI orthodoxy and seemed likely to be incorrect;

  • Choosing which classes to include, and other aspects of how to structure the workshop, based partly on creating a workshop structure that would lead participants to think we were cool/​competent/​etc.

I suspect these features made the workshop worse than it would otherwise have been at allowing real conversations, allowing workshop participants, me, other staff, etc. to develop a real/​cooperative art of rationality, etc. (Even though these sorts of “minor deceptiveness” are pretty “normal”; doing something at the standard most people hit doesn’t necessarily mean doing it well enough not to get bitten by bad effects.)

In Summary:

Thus, AI safety did not end up serving a “reality tether” function for us, or at least not sufficiently. (Feedbacks from “can people do research” did help in some ways, even though feedbacks from “try to be persuasive/​impressive to people, or to get into a social configuration that will allow influencing their future actions” harmed in other ways.) Nor did anything else tether us adequately, although I suspect that mundane tasks such as workshop logistics were at least a bit helpful.

A caveat, about something missing from my write-up here and in many places:

There were lots of other people at CFAR, or outside of CFAR but aiding our efforts in important ways, including lots who were agentic and interesting and developed a lot of interesting content and changed our directions and outcomes. I’m mostly leaving them out of my writing here, but this seems bad, because they were a lot of what happened, and a lot of the agency behind what happened. At the same time, I’m focusing here on a lot of things that… weren’t the best decisions, and didn’t end up with great outcomes, and I’m doing this without having consulted most of the people who played roles at CFAR and its programs in the past, and it seems a lot less socially complicated to talk about my own role in sad outcomes than to attempt to talk about anyone else’s role in such things, especially when my guesses are low-confidence and others are likely to disagree. So I’m mostly sticking to describing my own roles in past stuff at CFAR, while adding this note to try to make that less confusing.

Re: “Hypothesis 4 (from gjm): ‘Every cause wants to be a cult’, and self-help-y causes are particularly vulnerable to this and tend to get dangerously culty dangerously quickly.”

Yes. This hypothesis seems right to me. My draft of some of the mechanisms is above.

An additional mechanism worth naming:

  • Once you have a norm of “people in class X can solve your psychological bugs, and if it’s getting in the way of ‘the cause’ it’s probably a psychological bug”, it’s easy to also get pressure on the detailed psychology of individuals from people who are in positions of social or organizational power over them in ways that are not good.

(I suspect this list of mechanisms is still quite partial. There are probably just lots of Goodhart-like dynamics, whereby groups of people who are initially pursuing X may trend, over time, to pursuing “things that kind of look like X” and “things that give power/resources/control to those who seem likely to pursue things that kind of look like X” and so on.)

One reason the “every cause wants to be a cult” thing is harder to dodge well than one might think (though perhaps I am being defensive here):

IMO, large parts of “the mainstream” are also cults in the sense of “entities that restrict others’ thinking in ways that are not accurate, and that have been optimized over time for the survival of the system of thought-restrictions rather than for the good of the individual, and that make it difficult or impossible for those within these systems to reason freely/well.”

For example, academia is optimized to keep people in academia. Also, the mainstream culture among college-educated /​ middle-class Americans seems to me to be optimized to keep people believing “normal” things and shunning weird/​dangerous-sounding opinions and ideas, i.e. to keep people deferring to the culture of middle-class America and shunning influences that might disrupt this deferral, even in cases where this makes it hard for people to reason, to notice what they care about, or to choose freely. More generally, it seems to me there are lots of mainstream institutions and practices that condition people against thinking freely and speaking their minds. (cf “reason as memetic immune disorder” and “moral mazes”).

I bring this up partly because I suspect the “true spirit of rationality” was more alive in the rationality community of 2008-2010 than it was in the CFAR of 2018 (say), or in a lot of parts of the EA community, and I further suspect that mimicry of some mainstream practices (e.g. management practices, PR practices) is one vehicle by which the “suppression of free individual caring and reasoning and self-direction, in favor of something like allowing groups to synchronize” occurred.

I bring this up also because in my head at least there are those who would respond to events thus far in parts of rationality-space with something like “gosh, your group ended up kinda culty, maybe you should avoid deviating from mainstream positions in the future,” and I’m not into that, because reasoning seems useful and necessary, and because in the long run I don’t trust mainstream institutions to allow the kind of epistemology we need to do anything real.

  1. ^

    Some of these experiments are run by a nascent group, rather than being “CFAR-directed” in a narrow sense. That group may fork off of CFAR as its own thing at some point; it is not ready for the internet yet, and may never become so. I mention it because I don’t mean to say these experiments are only CFAR in a classic sense.