Theoretical AI alignment (and relevant upskilling) in my free time. My current view of the field is here (part 1) and here (part 2).
How many times has someone expressed “I’m worried about ‘goal-directed optimizers’, but I’m not sure what exactly they are, so I’m going to work on deconfusion.”? There’s something weird about this sentiment, don’t you think?
I disagree, and I will take you up on this!
“Optimization” is a real, meaningful thing to fear, because:
We don’t understand human values, or even necessarily meta-understand them.
Therefore, we should be highly open to the idea that a goal (or meta-goal) that we encode (or meta-encode) would be bad for anything powerful to base-level care about.
And most importantly, high optimization power breaks insufficiently-strong security assumptions. That, in itself, is why something like “security mindset” is useful without necessarily thinking of a powerful AI as an “enemy” in war-like terms.
Here “security assumptions” is used in a broad sense, the same way that “writing assumptions” (the ones needed to design a word-processor software) could include seemingly-trivial things like “there is an input device we can access” and “we have the right permissions on this OS”.
Ah, yeah that’s right.
If it helps clarify: I (and some others) break down the alignment problem into “being able to steer it at all” and “what to steer it at”. This post is about the danger of having the former solved, without the latter being solved well (e.g. through some kind of CEV).
Love this event series! Can’t come this week, but next one I can!
No worries! I make similar mistakes all the time (just check my comment history ;-;)
And I do think your comment is useful, in the same way that Rohin’s original comment (which my post is responding to) is useful :)
FWIW, I have an underlying intuition here that’s something like “if you’re going to go Dark Arts, then go big or go home”, but I don’t really know how to operationalize that in detail and am generally confused and sad. In general, I think people who have things like “logical connectives are relevant to the content of the text” threaded through enough of their mindset tend to fall into a trap analogous to the “Average Familiarity” xkcd or to Hofstadter’s Law when they try truly-mass communication unless they’re willing to wrench things around in what are often very painful ways to them, and (per the analogies) that this happens even when they’re specifically trying to correct for it.
I disagree with the first sentence, but agree strongly with the rest of it. My whole point is that it may be literally possible to make:
about extinction risk from AI
that don’t involve lying.
Maybe we mean different things by “Dark Arts” here? I don’t actually consider (going hard with messaging like) “This issue is complicated, but you [the audience member] understandably don’t want to deal with it, so we should go harder on preventing risk for now based on the everyday risk-avoidance you probably practice yourself.” as lying or manipulation. You could call it Dark Arts if you drew the “Dark Arts” cluster really wide, but I would disagree with that cluster-drawing.
Now, I do separately observe a subset of more normie-feeling/working-class people who don’t loudly profess the above lines and are willing to e.g. openly use some generative-model art here and there in a way that suggests they don’t have the same loud emotions about the current AI-technology explosion. I’m not as sure what main challenges we would run into with that crowd, and maybe that’s whom you mean to target.
That’s… basically what my proposal is? Yeah? People who aren’t already terminally-online about AI, but may still use ChatGPT and/or Stable Diffusion for fun or even work. Or (more common) those who don’t even have that much interaction, who just see AI as yet another random thingy in the headlines.
Yeah, mostly agreed. My main subquestion (that led me to write the review, besides this post being referenced in Leake’s work) was/sort-of-still-is “Where do the ratios in value-handshakes come from?”. The default (at least in the tag description quote from SSC) is uncertainty in war-winning, but that seems neither fully-principled nor nice-things-giving (small power differences can still lead to huge win-% chances, and superintelligences would presumably be interested in increasing accuracy). And I thought maybe ROSE bargaining could be related to that.
The relation in my mind was less ROSE --> DT, and more ROSE --?--> value-handshakes --> value-changes --?--> DT.
(On my beliefs, which I acknowledge not everyone shares, expecting something better than “mass delusion of incorrect beliefs that implies that AGI is risky” if you do wide-scale outreach now is assuming your way out of reality.)
I’m from the future, January 2024, and you get some Bayes Points for this!
The “educated savvy left-leaning online person” consensus (as far as I can gather) is something like: “AI art is bad, the real danger is capitalism, and the extinction danger is some kind of fake regulatory-capture hype techbro thing which (if we even bother to look at the LW/EA spaces at all) is adjacent to racists and cryptobros”.
Still seems too early to tell whether or not people are getting lots of false beliefs that are still pushing them towards believing-AGI-is-an-X-risk, especially since that case seems to be made (on the largest platform) indirectly in congressional hearings that nobody outside tech/politics actually watches.
But it really doesn’t seem great that my case for wide-scale outreach being good is “maybe if we create a mass delusion of incorrect beliefs that implies that AGI is risky, then we’ll slow down, and the extra years of time will help”. So overall my guess is that this is net negative.
To devil’s steelman some of this: I think there’s still an angle that few have tried in a really public way, namely ignorance and asymmetry. (There is definitely a better term or two for what I’m about to describe, but I forgot it. Probably from Taleb or something.)
A high percentage of voting-eligible people in the US… don’t vote. An even higher percentage vote in only the presidential elections, or only some presidential elections. I’d bet a lot of money that most of these people aren’t working under a Caplan-style non-voting logic, but instead under something like “I’m too busy” or “it doesn’t matter to me / either way / from just my vote”.
Many of these people, being politically disengaged, would not be well-informed about political issues (or even have strong and/or coherent values related to those issues). What I want to see is an empirical study that asks these people “are you aware of this?” and “does that awareness, in turn, factor into you not-voting?”.
I think there’s a world, which we might live in, where lots of non-voters believe something akin to “Why should I vote, if I’m clueless about it? Let the others handle this lmao, just like how the nice smart people somewhere make my bills come in.”
In a relevant sense, I think there’s an epistemically-legitimate and persuasive way to communicate “AGI labs are trying to build something smarter than humans, and you don’t have to be an expert (or have much of a gears-level view of what’s going on) to think this is scary. If our smartest experts still disagree on this, and the mistake-asymmetry is ‘unnecessary slowdown VS human extinction’, then it’s perfectly fine to say ‘shut it down until [someone/some group] figures out what’s going on’”.
To be clear, there’s still a ton of ways to get this wrong, and those who think otherwise are deluding themselves out of reality. I’m claiming that real-human-doable advocacy can get this right, and it’s been mostly left untried.
EXTRA RISK NOTE: Most persuasion, including digital, is one-to-many “broadcast”-style; “going viral” usually just means “some broadcast happened that nobody heard of”, like an algorithm suggesting a video to a lot of people at once. Given this, plus anchoring bias, you should expect and be very paranoid about the “first thing people hear = sets the conversation” thing. (Think of how many people’s opinions are copypasted from the first classy video essay or mass-market John Oliver video they saw about the subject, or the first Fox News commentary on it.)
Not only does the case for X-risk need to be made first, but it needs to be right (even in a restricted way like my above suggestion) the first time. Actually, that’s another reason why my restricted-version suggestion should be prioritized, since it’s more-explicitly robust to small issues.
(If somebody does this in real life, you need to clearly end on something like “Even if a minor detail like [name a specific X] or [name a specific Y] is wrong, it doesn’t change the underlying danger, because the labs are still working towards Earth’s next intelligent species, and there’s nothing remotely strong about the ‘safety’ currently in place.”)
So there’s a sorta-crux about how much DT alignment researchers would have to encode into the-AI-we-want-to-be-aligned, before that AI is turned on. Right now I’m leaning towards “an AI that implements CEV well, would either turn-out-to-have or quickly-develop good DT on its own”, but I can see it going either way. (This was especially true yesterday when I wrote this review.)
And I was trying to think through some of the “DT relevance to alignment” question, and I looked at relevant posts by [Tamsin Leake](https://www.lesswrong.com/users/tamsin-leake) (whose alignment research/thoughts I generally agree with). And that led me to thinking more about value-handshakes, son-of-CDT (see Arbital), and systems like ROSE bargaining. Any/all of which, under certain assumptions, could determine (or hint at) answers to the “DT relevance” thing.
Selection Bias Rules (Debatably Literally) Everything Around Us
Currently, I think this is a big crux in how to “do alignment research at all”. Debatably “the biggest” or even “the only real” crux.
(As you can tell, I’m still uncertain about it.)
Decision theory is hard. In trying to figure out why DT is useful (needed?) for AI alignment in the first place, I keep running into weirdness, including with bargaining.
Without getting too in-the-weeds: I’m pretty damn glad that some people out there are working on DT and bargaining.
Still seems too early to tell if this is right, but man is it a crux (explicit or implicit).
Terence Tao seems to have gotten some use out of the most recent LLMs.
if you take into account the 4-5 staff months these cost to make each year, we net lost money on these
For the record, if each book-set had cost $40 or even $50, I still would have bought them, right on release, every time. (This was before my financial situation improved, and before the present USD inflation.)
I can’t speak for everyone’s financial situation, though. But I (personally) mentally categorize these as “community-endorsement luxury-type goods”, since all the posts are already online anyway.
The rationality community is unusually good about not selling ingroup-merch when it doesn’t need or want to. These book sets are the perfect exceptions.
to quote a fiend, “your mind is a labyrinth of anti-cognitive-bias safeguards, huh?”
The implied context/story this is from sure sounds interesting. Mind telling it?
I don’t think of governments as being… among other things “unified” enough to be superintelligences.
Also, see “Things That Are Not Superintelligence” and “Maybe The Real Superintelligent AI Is Extremely Smart Computers”.
The alignment research that is done will be lower quality due to less access to compute, capability knowhow, and cutting edge AI systems.
I think this is false, though it’s a crux in any case.
Capabilities withdrawal is good because we don’t need big models to do the best alignment work, because that is theoretical work! Theoretical breakthroughs can make empirical research more efficient. It’s OK to stop doing capabilities-promoting empirical alignment, and focus on theory for a while.
(The overall idea of “if all alignment-knowledgeable capabilities people withdraw, then all capabilities will be done by people who don’t know/care about alignment” is still debatable, but distinct. One possible solution: safety-promoting AGI labs stop their capabilities work, but continue to hire capabilities people, partly to prevent them from working elsewhere. This is complicated, but not central to my objection above.)
I see this asymmetry a lot and may write a post on it:
If theoretical researchers are wrong, but you do follow their caution anyway, then empirical alignment goes slower… and capabilities research slows down even more. If theoretical researchers are right, but you don’t follow their caution, you continue or speed up AI capabilities to do less-useful alignment work.
Good catch! Most of it is hunches to be tested (and/or theorized on, but really tested) currently. Fixed.