Simulating the *rest* of the political disagreement
There’s a mistake I made a couple of times and didn’t internalize the lesson from as fast as I’d like. Moreover, it wasn’t even a failure to generalize; it was basically a failure to have a single update stick about a single situation.
The particular example was me saying, roughly:
Look, I’m 60%+ on “Alignment is quite hard, in a way that’s unlikely to be solved without a 6+ year pause.” I can imagine believing it was lower, but it feels crazy to me to think it’s lower than like 15%. And at 15%, it’s still horrendously irresponsible to try to solve AI takeoff by rushing forward and winging it, rather than “everybody stop, and actually give yourselves time to think.”
The error mode here is something like “I was imagining what I’d think if you slid this one belief slider from ~60%+ to 15%, without imagining all the other beliefs that would probably be different if I earnestly believed the 15%.”
That error feels like a “reasonable honest mistake.”
But, the part where I was like “C’mon guys, even if you only, like, sorta-kinda agreed with me on this point, you’d still obviously be part of my political coalition for a global halt that is able to last 10+ years, right?”
...that feels like a more pernicious, political error. A desire to live in the world where my political coalition has more power, and a bit of an attempt to incept others into thinking it’s true.
(This is an epistemic error, not necessarily a strategic error. Political coalitions are often won by people believing in them harder than it made sense to. But, given that I’ve also staked my macrostrategy on “LessWrong is a place for shared mapmaking, and putting a lot of effort into holding onto that even as the incentives push towards political maneuvering,” I’d have to count it as a strategic error for me in this context)
The specific counterarguments I heard were:
If “superalignment is really hard” is only like 15% likely, you might have primary threat models that are shaped quite differently, and be focusing your efforts on reducing risk in the other 85% of worlds.
Relatedly, my phrasing made more sense if the goal was to cut risk down to something “acceptable” (like <5%). You might think it’s more useful to focus on strategies that are more likely to work, and which cut risk down from, say, 70% to 35%. (Which does seem more plausible to me if I believed alignment likely wouldn’t require a 6+ year pause to get right.)
Now, I’m not arguing that those rejoinders are slam dunks. But I hadn’t thought of them when I was making the argument, and I don’t currently have a strong counter-counterargument. Upon reflection, I can see a little slippery-graspy move I was doing, where I was hoping to skip over the hard work of fully simulating another perspective and addressing all their points.
(To spell out: the above arguments are specifically against “if AI alignment is only 15% likely to be difficult enough to require a substantial pause, you should be angling a bit to either pause or at least preserve the option-value to pause.” It’s not an argument against alignment likely requiring a pause.)
...
I do still overall think we need a long pause to have a decent chance of non-horrible things happening. And I still feel like something epistemically slippery is going on in the worldviews of most people who are hopeful about survival in a world where companies continue mostly rushing towards superintelligence.
But, seems good for me to acknowledge when I did something epistemically slippery myself. In particular given that I think epistemic-slipperiness is a fairly central problem in the public conversation about AI, and it’d probably help to get better at public convos about it.
The Takeaways
Notice thoughts like “anyone who even believes a weak version of My Thing should end up agreeing with my ultimate conclusion”, and hold them with at least a bit of skepticism. (The exact TAP probably depends a bit on the situation)
More generally, remember that variation in belief often doesn’t just turn on a single knob; if someone disagrees with one piece, they probably disagree about a bunch of other pieces. Disagreements are more frustratingly fractal than you might hope.
(See also: “You can’t possibly succeed without [My Pet Issue]”)
Appendix: The prior arguments
I first made this sort of claim in a conversation with Zac Hatfield-Dodds that I’d later recount in “Anthropic, and taking ‘technical philosophy’ more seriously” (I don’t think I actually made the error there, exactly). But in the comments, Ryan Greenblatt replied with some counterarguments and I said “oh, yeah, that makes sense”, and later in “The Problem” I ended up running through the same loop with Buck.
On the epistemic point: yes, and this is something current-gen LLMs seem actually useful for, with little risk. This is where their idea generation is useful, and their poor taste and sycophancy don’t matter.
I’ve had success asking LLMs for counterarguments. Most of them are dumb and you can dismiss them. But they’re smart enough to come up with some good ones once you’ve steelmanned them and judged their worth for yourself.
This seems less helpful than getting pushback from informed people. But that’s hard to find; I’ve had experiences like yours with Zac HD, in which a conversation fails to surface pretty obvious-in-hindsight counterarguments, just because the conversation focused elsewhere. And I have gotten good pushback by asking LLMs repeatedly in different ways, as far back as o1.
On the object level, on your example: I assume a lot of us aren’t very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%. It seems very likely that we’re gonna barrel forward through any plausible pause movement, but not clear (even after trying to steelman every major alignment difficulty argument) that alignment is insoluble, if we can just collectively pull our shit halfway together while racing toward that cliff.
Nod. I just, like, don’t think that’s actually that great a strategy – it presupposes it’s actually easier to get from 70% to 35% than from 35% to 5%. I don’t see Anthropic et al actually getting ready to ask the sort of questions that would IMO be necessary to actually do-the-reducing.
I’m not getting your 35% to 5% reference? I just have no hope of getting as low as 5%, but a lot of hope for improving on just letting the labs take a swing.
I fully agree that Anthropic and the other labs don’t seem engaged with the relevant hard parts of the problem. That’s why I want to convince more people who actually understand the problem to identify and work like mad on the hard parts, like the world is on fire, instead of hoping it somehow isn’t or can be put out.
It may not be that great a strategy, but to me it seems way better than hoping for a pause. I think we can get a public freakout before gametime, but even that won’t produce a pause once the government and military are fully AGI-pilled.
This is a deep issue I’ve been wanting to write about, but haven’t figured out how to address without risking further polarization within the alignment community. I’m sure there’s a way to do it productively.
FYI, something similar to this was basically my “last year’s plan”, and it’s on hold because I think it’s plausible right now to meaningfully move the Overton window around pauses, or at least dramatic slowdowns. (This is based on seeing the amount of traffic AI 2027 got, and the number of NatSec endorsements that If Anyone Builds It got, and having recently gotten to read it and thinking it’s pretty good.)
I think if Yoshua Bengio, Geoffrey Hinton, or Dario actually really tried to move Overton windows, instead of sort of trying to maneuver within the current one, it’d make a huge difference. (I don’t think this means it’s necessarily tractable for most people to help. It’s a high-skill operation.)
(Another reason for me putting “increase the rate of people able to think seriously about the problem” on hold is that my plans there weren’t getting that much traction. I have some models of what I’d try next when/if I return to it but it wasn’t a slam dunk to keep going)
What I think would be really useful is more dialogue across “party lines” on strategies. I think I’m seeing nontrivial polarization, because attempted dialogues seem to usually end in frustration rather than progress.
I’m thinking of a slightly different plan than “increase the rate of people being able to think seriously about the problem”. I’d like to convince people who already understand the problem to accept that a pause is unlikely, and that alignment is not known to be impossibly hard even on short timelines. If they agreed with both of those, it seems like they’d want to work on aligning LLM-based AGI on what looks like the current default path. I think just a few more might help nontrivially. The number of people going “straight for the throat” is very small.
I’m interested in the opposite variant too, trying to convince people working on “aligning” current LLMs to focus more on the hard parts we haven’t encountered yet.
I do think shifting the Overton window is possible. Actually, I think it’s almost inevitable; I just don’t know if it happens soon enough to help. I think a pause is unlikely even if the public screams for it, but I’m not sure, particularly if that happens sooner than I expect. Public opinion can shift rapidly.
The Bengio/Hinton/Dario efforts seem like they are changing the Overton window, but cautiously. PR seems to require both skill and status.
Getting entirely new people to understand the hard parts of the problem, and then acquire all of the technical skills and theoretical subtleties, is another route. I haven’t thought as much about that one because I don’t have a public platform, but I do try to engage newcomers to LW in case they’re the type to actually figure things out enough to really help.
I think it’s useful to think of “rate of competent people thinking seriously about the right problems” as, like, the “units” of success for various flavors of plans here. There are different bottlenecks.
I currently think the rate-limiting reagent is “people who understand the problem”. And I think that’s in turn rate-limited on:
“the problem is sort of wonky and hard, with bad feedback loops, and there’s a cluster of attitudes and skills you need to have any traction sitting and grokking the problem.”
“we don’t have much ability to evaluate progress on the problem, which in turn means it’s harder to provide a good funding/management infrastructure for it.”
Better education can help with your first problem, although that pulls people who understand the problem away from working on it.
I agree that the difficulty of evaluating progress is a big problem. One solution is to just fund more alignment research. I am dismayed if it’s true that Open Phil is holding back available funding because they don’t see good projects. Just fund them and get more donations later, when the whole world is properly more freaked out. If it’s bad research now, at least those people will spend some time thinking about and debating what might be better research.
I’d also love to see funding aimed directly at people understanding the whole problem, including the several hard parts. It is a lot easier to evaluate whether someone is learning a curriculum than whether they’re doing good research. Exposing people to a lot of perspectives and arguments, and sort of paying and forcing them to think hard about it, should at least improve their choice of research and understanding of the problem.
I definitely agree that understanding the problem is the rate-limiting factor. I’d argue that it’s not just the technical problem you need to understand, but the surrounding factors, e.g. how likely a pause or slowdown is, and how soon we’re likely to reach AGI on the default path. I’m afraid some of our best technical thinkers understand the technical problem but are confused about how unlikely it is that any approach but direct LLM descendants will be the first critical attempt at aligning AGI. But arguments for or against that are quite complex.
I think “moving an Overton window” is a sort of different operation than what Bengio/Hinton/Dario are doing. (Or, like, yes, they are expanding an Overton window, but their entire strategy for doing so seems predicated on a certain kind of caution/incrementalness)
I think there are two pretty different workable strategies:
say things somewhat outside the window, picking your battles
make bold claims, while holding your convictions with enough strength, and without looking “attackable for misspeaking”.
Going halfway from one to the other doesn’t actually work, and the second one doesn’t really work unless you actually do have those convictions. There are a few people trying to do the latter, but most of them just don’t actually have the reputation that’d make anyone care (and also there’s a lot of skill to doing it right). I think if at least one of Yoshua/Geoffrey/Dario/Demis switched strategies, it’d make a big difference.
I.e., if basically anything other than a long pause is going to be insufficient anyway, you might as well swing for the pause.
Sure. If you’re confident of that.
It drives me a bit nuts that many of our otherwise best thinkers are confident that aligning LLM AGI is almost impossible, when the arguments they’re citing just don’t stack up to near-certainty, even with active steelmanning. I’ve been immersing myself in the arguments for inner misalignment as a strong default. They’re strong, and they should make you afraid, but they’re nowhere near certainty.
Few people who take that alignment difficulty seriously are even proposing ways around it for LLM AGI. We have barely begun working on the most relevant hard problem. Calling it hopeless without working on it is… questionable. I see why you might, for various reasons, but at this point I think it’s a huge mistake.
We can call for pause/slowdown and emphasize the difficulty, while also working on alignment on the default path. We’re in a bad situation, and that looks to me like our biggest potential out by an order of magnitude.
Deserves a name.
Agree. My first thought was something like “belief correlation” or “belief interconnectedness”, as in “I forgot that beliefs are correlated”. Also a vague reference to “everything is correlated”.
Do you mean like a short pithy name for the fallacy/failure-mode?
Yes.
I was at this weird party where everyone started drinking poison. I tried explaining it to them but I didn’t have any proof or anything and I said “look that’s poison sir” and “ma’am if you think there’s any chance I’m right you should stop drinking that” but no luck. They said “I’m thirsty” or “I already have this cup in my hand” or “maybe water is poison and this is antidote, you know more people die from drowning than random poisoning”. I realized I was failing to think about it from their perspective. If I had known all these people for years and this big party was basically what we’d been planning for a while and it was blowing up online then I would definitely drink it too.
After all, the poison-is-actually-antidote guy could’ve also said “if there’s any chance I’m right you have to drink it right now”, which would clearly just be political maneuvering.
Anyways I miss those guys they were fun