A Quick List of Some Problems in AI Alignment As A Field

Nicholas Kross, 21 Jun 2022 23:23 UTC

75 points

1. MIRI as central point of failure for… a few things...

For the past decade or more, if you read an article saying “AI safety is important”, and you thought, “I need to donate or apply to work somewhere”, MIRI was the default option. If you looked at FLI or FHI or similar groups, you’d say “they seem helpful, but they’re not focused solely on AI safety/alignment, so I should go to MIRI for the best impact.”

2. MIRI as central point of failure for learning and secrecy.

MIRI’s secrecy (understandable) and their intelligent and creatively-thinking staff (good) have combined into a weird situation: for some research areas, nobody really knows what they’ve tried and failed/succeeded at, nor the details of how that came to be. Yudkowsky did link some corrigibility papers he labels as failed, but neither he nor MIRI has done similar (or more in-depth) autopsies of their approaches, to my knowledge.

As a result, nobody else can double-check those results or learn from MIRI’s mistakes. Sure, MIRI people write up their meta-mistakes, but that has limited usefulness, and people still (understandably) disbelieve their approaches anyway. This leads either to making the same meta-mistakes (bad), or to blindly trusting MIRI’s approach/meta-approach (bad because...)

3. We need more uncorrelated (“diverse”) approaches to alignment.

MIRI was the central point for anyone with any alignment approach, for a very long time. Recently-started alignment groups (Redwood, ARC, Anthropic, Ought, etc.) are different from MIRI, but their approaches are correlated with each other. They all relate to things like corrigibility, the current ML paradigm, IDA, and other approaches that e.g. Paul Christiano would be interested in.

I’m not saying these approaches are guaranteed to fail (or work). I am saying that surviving worlds would have, if not way more alignment groups, definitely way more uncorrelated approaches to alignment. This need not lead to extra risk as long as the approaches are theoretical in nature. Think early-1900s physics gedankenexperiments, and how diverse they may have been.

Or, if you want more hope and less hope at the same time, look at how many wildly incompatible theories have been proposed to explain quantum mechanics. A surviving world would have at least this much of a Cambrian explosion in theories, and would also be better at handling it than we are, in real life, at handling the actual list of quantum theories (in the absence of better experimental evidence).

Simply put, if evidence is dangerous to collect, and every existing theoretical approach is deeply flawed along some axis, then let schools proliferate with little evidence, dammit! This isn’t psych, where stuff fails to replicate and people keep doing it. AI alignment is somewhat better coordinated than other theoretical fields… we just overcorrected to putting all our eggs in a few approach baskets.

(Note: if MIRI is willing and able, it could continue being a/the central group for AI alignment, given the points in (1), but it would need to proliferate many schools of thought internally, as per (5) below.)

One problem with this ^[1] is that the AI alignment field as a whole may not have the resources (or the time) to pursue this hits-based strategy. In that case, AI alignment would appear to be bottlenecked on funding, rather than talent directly. That’s… news to me. In either case, this requires more fundraising and/or more money-efficient ways to get similar effects to what I’m talking about. (If we’re too talent-constrained to pursue a hits-based strategy, it’s even more imperative to fix the talent constraints first, as per (4) below.)

Another problem is whether the “winning” approach might come from deeper searching along the existing paths, rather than broader searching in weirder areas. In that case, it could maybe still make sense to proliferate sub-approaches under the existing paths. The rest of the points (especially (4) below) would still apply, and this still relies on the existing paths being… broken enough to call “doom”, but not broken enough to try anything too different. This is possible.

EDIT Sept. 9, 2022: John S Wentworth explains here why “just fund randos” is not the way to solve this, and how to do better.

4. How do people get good at this shit?

MIRI wants to hire the most competent people they can. People apply, and are turned away for not being smart/self-taught/security-mindset enough. So far so good.

But then… how do people get good at alignment skills before they’re good enough to work at MIRI, or whatever group has the best approach? How do they get good enough to recognize, choose, and/or create the best approaches (which, remember, we need more of)?

Academia is loaded with problems. Existing orgs are already small and selective. Independent research is promising, yet still relies on a patchwork of grants and stuff. By the time you get good enough to get a grant, you have to have spent a lot of time studying this stuff. Unpaid, mind you, and likely with another job/school/whatever taking up your brain cycles.

Here’s a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks in order to obtain the financial security to work on alignment full-time ^[2]. Anyone more risk-averse (good for alignment!) might just… work a normal job for years to save up, or modestly conclude they’re not good enough to work in alignment altogether. If security mindset can be taught at all, this is a shit equilibrium.

Yes, I know EA and the alignment community are both improving at noob-friendliness. I’m glad of this. I’d be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time. IQ or other tests are legal, certainly in a context like this. Work harder on screening for whatever’s unteachable, and teaching what is.

5. Secret good ideas + collaboration + more work needed = ???

The good thing about having a central org to coordinate around, is it solves the conflicting requirements of “intellectual sharing” and “infohazard secrecy”. One org where the best researchers go, open on the inside, closed to the outside. Good good.

But, as noted in (1), MIRI has not lived up to its potential in this regard ^[3]. MIRI could kill two birds with one stone: act as a secrecy/collaboration coordination point while also running multiple small internal teams on disparate approaches. That would give it a high absolute headcount (helping (5) and (4)) while avoiding many issues common to big gangly organizations.

Then again, Zvi and others have written extensively on why big organizations are doomed to cancer and maybe theoretically impossible to align. Okay. Not promising. Then maybe we need approaches that get similar benefits (secrecy, collaboration, coordination, many schools) without making a large group. Perhaps a big closed-door annual conference? More MIRIx chapters? Something?

6. The hard problem of smart people working on a hard problem.

Remember “The Bitter Lesson”? Where AI researchers go for approaches using human expertise and galaxy-brained solutions, instead of brute scale?

Sutton’s reasoning for this is (at least partly) that researchers have human vanity. “I’m a smart person, therefore my solution should be sufficiently-complicated.” ^[4]

I think similar reasons of vanity (and related social status) are holding back some AI alignment progress.

I think people are afraid to suggest sufficiently weird/far-out ideas (which, recall, need to be quite different from existing flawed approaches), because they have a mental model of semi-adequate MIRI trying and failing something, and then not prioritizing writing-up-the-failure (or keeping the failure secret for some reason).

Sure, there are good security-mindset and iffy-teachability reasons why many new ideas can and should be rejected on sight. But, as noted in (4), these problems should not be impossible to get around. And in actual cybersecurity and cryptography, where people are presumably selected at least a tad for having security mindset, there’s not exactly a shortage of creative ideas and moon math solutions. Given our field’s relatively high coordination and self-reflection, surely we can do better?

This relates to a point I’ve made elsewhere, that in the face of lots of things not working, we need to try more hokey, wacky, cheesy, low-hanging, “dumb” ideas. I’m disappointed that I couldn’t find any LessWrong post suggesting something like “Let’s divvy up team members where each one represents a cortex of the brain, then we can divide intellectual labor!”. The idea is dumb, it likely won’t work, but surviving worlds don’t leave that stone unturned. If famously-wacky early LessWrong didn’t have this lying around, how do I know MIRI hasn’t secretly tried and failed at it?

Related to division of intellectual labor: I also think Yudkowsky’s example of Einstein, in the Sequences, may make people afraid to offer incremental ideas, critiques, solutions, etc. “If I can’t solve all of alignment (or all of [big alignment subproblem]) in one or two groundbreaking papers, like Einstein did with Relativity, I’m not smart enough to work in alignment.” So, uh, don’t be afraid to take even half-baked ideas to the level of a LaTeX-formatted paper. (If you can solve alignment in one paper, obviously do that!)

7. Concluding paragraph because you have a crippling addiction to prose (ok, same, fair).

Here’s an example of something that combines many solution-ideas noted in (6). If it becomes more accepted to write ideas in bullet points, then:

- It lowers the barrier to entry for people who think better/more easily than they write.
- It lowers the mental “status-grab” barrier for people who are subtly intimidated by prose quality.
  - This, in turn, signals to more people who already don’t care about status, that their blunt ideas are welcome on alignment spaces.
- It makes prose quality less able to influence readers’ evaluations of idea quality, which is good for examining ideas’ truth values.
- It may be easier even for people who already have little problem writing prose.
- People can (and probably should) still write prose when they’re more comfortable with it / when needed for other purposes (explicitly persuading people?) anyway. Making bullet points more common does not necessarily entail forcibly limiting prose.

^[1] H/T my co-blogger Devin, as is the case with my articles’ editing in general, and noticing gaps in my logic in particular.
^[2] If you’re in this situation, DM me for moral support and untested advice.
^[3] Or maybe it has! We don’t know! See (2)!
^[4] See also, uh, that list of explanations of quantum mechanics.