Simulating the *rest* of the political disagreement

There’s a mistake I made a couple of times and didn’t really internalize the lesson from as fast as I’d like. Moreover, it wasn’t even a failure to generalize; it was basically a failure to have even a single update stick about a single situation.

The particular example was me saying, roughly:

Look, I’m 60%+ on “Alignment is quite hard, in a way that’s unlikely to be solved without a 6+ year pause.” I can imagine believing it was lower, but it feels crazy to me to think it’s lower than like 15%. And at 15%, it’s still horrendously more irresponsible to try to solve AI takeoff by rushing forward and winging it than by “everybody stop, and actually give yourselves time to think.”

The error mode here is something like “I was imagining what I’d think if you slid this one belief slider from ~60%+ to 15%, without imagining all the other beliefs that would probably be different if I earnestly believed the 15%.”

That error feels like a “reasonable honest mistake.”

But, the part where I was like “C’mon guys, even if you only, like, sorta-kinda agreed with me on this point, you’d still obviously be part of my political coalition for a global halt that is able to last 10+ years, right?”

...that feels like a more pernicious, political error. A desire to live in the world where my political coalition has more power, and a bit of an attempt to incept others into thinking it’s true.

(This is an epistemic error, not necessarily a strategic error. Political coalitions are often won by people believing in them harder than it made sense to. But, given that I’ve also staked my macrostrategy on “LessWrong is a place for shared mapmaking, and putting a lot of effort into holding onto that even as the incentives push towards political maneuvering,” I’d have to count it as a strategic error for me in this context.)

The specific counterarguments I heard were:

  • If “superalignment is really hard” is only like 15% likely, you might have primary threat models that are shaped pretty differently, and be focusing your efforts on reducing risk in the other 85% of worlds.

  • Relatedly, my phrasing made more sense if the goal was to cut risk down to something “acceptable” (like <5%). You might instead think it’s more useful to focus on strategies that are more likely to work, and which cut risk down from, say, 70% to 35% (which does seem like the more plausible approach to me if I believed alignment wouldn’t likely require a 6+ year pause to get right).

Now, I’m not arguing that those rejoinders are slam dunks. But I hadn’t thought of them when I was making the argument, and I don’t have a strong counter-counterargument at the moment. Upon reflection, I can see a little slippery-graspy move I was doing, where I was hoping to skip over the hard work of fully simulating another perspective and addressing all of their points.

(To spell out: the above arguments are specifically against “if AI alignment is only 15% likely to be difficult enough to require a substantial pause, you should be angling a bit to either pause or at least preserve the option-value to pause.” They’re not arguments against alignment likely requiring a pause.)

...

I do still overall think we need a long pause to have a decent chance of non-horrible things happening. And I still feel like something epistemically slippery is going on in the worldviews of most people who are hopeful about survival in a world where companies continue mostly rushing towards superintelligence.

But, it seems good for me to acknowledge when I did something epistemically slippery myself, in particular given that I think epistemic-slipperiness is a fairly central problem in the public conversation about AI, and it’d probably help to get better at public convos about it.

The Takeaways

Notice thoughts like “anyone who even believes a weak version of My Thing should end up agreeing with my ultimate conclusion,” and hold them with at least a bit of skepticism. (The exact TAP probably depends a bit on the situation.)

More generally, remember that variation in belief often doesn’t just turn on a single knob: if someone disagrees with one piece, they probably disagree about a bunch of other pieces. Disagreements are more frustratingly fractal than you might hope.

(See also: “You can’t possibly succeed without [My Pet Issue]”)

Appendix: The prior arguments

I first made this sort of claim in a conversation with Zac Hatfield-Dodds that I’d later recount in *Anthropic, and taking “technical philosophy” more seriously*. (I don’t think I actually made the error there, exactly.) But in the comments, Ryan Greenblatt replied with some counterarguments and I said “oh, yeah, that makes sense,” and later in *The Problem* I ended up running through the same loop with Buck.