But What’s Your *New Alignment Insight,* out of a Future-Textbook Paragraph?
This is something I’ve been thinking about a good amount while considering my model of Eliezer’s model of alignment. After a bunch of tweaking, that model sure looks like a messy retread of much of what Richard says here; I don’t claim to assemble any new, previously unassembled insights here.
Tl;dr: For impossibly difficult problems like AGI alignment, the worlds in which we solve the problem will be worlds that came up with some new, intuitively compelling insights. On our priors about impossibly difficult problems, worlds without new intuitive insights don’t survive AGI.
Eliezer has a classic account of this failure mode:

> I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this—which of course he couldn’t show us because he was still developing the system. In classical mechanics, violating Conservation of Momentum is provably impossible. So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists—unless your machine is complicated enough that you have made a mistake in the calculations.
>
> And similarly, when half-trained or tenth-trained rationalists abandon their art and try to believe without evidence just this once, they often build vast edifices of justification, confusing themselves just enough to conceal the magical steps.
>
> It can be quite a pain to nail down where the magic occurs—their structure of argument tends to morph and squirm away as you interrogate them. But there’s always some step where a tiny probability turns into a large one—where they try to believe without evidence—where they step into the unknown, thinking, “No one can prove me wrong”.
>
> Hey, maybe if you add enough wheels and gears to your argument, it’ll turn warm water into electricity and ice cubes! Or, rather, you will no longer see why this couldn’t be the case.
>
> “Right! I can’t see why it couldn’t be the case! So maybe it is!”
>
> Another gear? That just makes your machine even less efficient. It wasn’t a perpetual motion machine before, and each extra gear you add makes it even less efficient than that.
>
> Each extra detail in your argument necessarily decreases the joint probability. The probability that you’ve violated the Second Law of Thermodynamics without knowing exactly how, by guessing the exact state of boiling water without evidence, so that you can stick your finger in without getting burned, is, necessarily, even less than the probability of sticking your finger into boiling water without getting burned.
>
> I say all this, because people really do construct these huge edifices of argument in the course of believing without evidence. One must learn to see this as analogous to all the wheels and gears that fellow added onto his reactionless drive, until he finally collected enough complications to make a mistake in his Excel spreadsheet.
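The conjunction point (every extra load-bearing step multiplies in another probability at most 1, so the joint probability of the whole argument can only shrink) can be sketched numerically. The numbers below are toy numbers of my own, not from the text, and the steps are treated as independent for simplicity:

```python
# Toy illustration of the conjunction rule: the probability that an
# argument holds is the product of the probabilities of its load-bearing
# steps (assuming independence), so every extra "gear" can only lower it.

def joint_probability(step_probs):
    """Probability that every step of the argument holds at once."""
    p = 1.0
    for step_p in step_probs:
        p *= step_p
    return p

short_argument = [0.9]                     # one plausible-sounding step
long_argument = [0.9, 0.9, 0.9, 0.9, 0.9]  # four extra wheels and gears

print(joint_probability(short_argument))   # 0.9
print(joint_probability(long_argument))    # 0.9**5 ≈ 0.59
```

Correlated steps would weaken the multiplication, but the direction never flips: adding a detail can never raise the joint probability.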
Scott Aaronson makes a similar point about screening claimed solutions to famous open problems:

> If I read all such papers, then I wouldn’t have time for anything else. It’s an interesting question how you decide whether a given paper crosses the plausibility threshold or not … Suppose someone sends you a complicated solution to a famous decades-old math problem, like P vs. NP. How can you decide, in ten minutes or less, whether the solution is worth reading?
>
> The techniques just seem too wimpy for the problem at hand. Of all ten tests, this is the slipperiest and hardest to apply — but also the decisive one in many cases. As an analogy, suppose your friend in Boston blindfolded you, drove you around for twenty minutes, then took the blindfold off and claimed you were now in Beijing. Yes, you do see Chinese signs and pagoda roofs, and no, you can’t immediately disprove him — but based on your knowledge of both cars and geography, isn’t it more likely you’re just in Chinatown? I know it’s trite, but this is exactly how I feel when I see (for example) a paper that uses category theory to prove NL≠NP. We start in Boston, we end up in Beijing, and at no point is anything resembling an ocean ever crossed.
What’s going on in the above cases is argumentation from “genre savviness” about our physical world: knowing, based on the reference class that a purported feat would fall into, the probabilities of feat success conditional on its having or lacking various features. These meta-level arguments rely on knowledge about what belongs in which reference class, rather than on in-the-weeds object-level arguments about the proposed solution itself. It’s better to reason about things concretely, when possible, but in these cases the meta-level heuristic has a well-substantiated track record.
Successful feats will tend to share a certain superficial shape, so you can sometimes evaluate a purported feat on its superficial features alone. One case where we really care about doing this is when we only get one shot at a feat, such as aligning AGI, and if we fail our save, everyone dies. In that case, we won’t get lots of postmortem time to poke through how we failed and learn the object-level insights after the fact. We just die. We’ll have to evaluate our candidate feats in light of their non-hindsight-based features, then.
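This kind of reference-class reasoning is, in effect, a Bayes update from a base rate. A minimal sketch, where the prior and both conditional rates are made-up illustrative numbers of my own, not estimates from the text:

```python
# Bayes update from a reference-class base rate. All numbers here are
# hypothetical, chosen only to illustrate the shape of the reasoning.

def posterior(prior, p_feature_given_success, p_feature_given_failure):
    """P(success | feature) via Bayes' rule."""
    numerator = p_feature_given_success * prior
    denominator = numerator + p_feature_given_failure * (1.0 - prior)
    return numerator / denominator

# Base rate: claimed proofs of famous open problems almost never hold up.
base_rate = 0.001

# Superficial feature: "the techniques look too wimpy for the problem",
# assumed rare among genuine solutions and common among failed attempts.
p = posterior(base_rate,
              p_feature_given_success=0.05,
              p_feature_given_failure=0.8)
print(p)  # roughly 6e-5: the meta-level heuristic alone nearly settles it
```

The object-level content never enters the computation; the superficial feature plus the base rate already pushes the probability down by more than an order of magnitude, which is why the heuristic is decisive "in many cases" despite staying at the meta level.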
Let’s look at the same kind of argument, courtesy of Eliezer, about alignment schemes:
> I remark that this intuition matches what the wise might learn from Scott’s parable of K’th’ranga V: If you know how to do something then you know how to do it directly rather than by weird recursion, and what you imagine yourself doing by weird recursion you probably can’t really do at all. When you want an airplane you don’t obtain it by figuring out how to build birds and then aggregating lots of birds into a platform that can carry more weight than any one bird and then aggregating platforms into megaplatforms until you have an airplane; either you understand aerodynamics well enough to build an airplane, or you don’t, the weird recursion isn’t really doing the work.
>
> It is by no means clear that we would have a superior government free of exploitative politicians if all the voters elected representatives whom they believed to be only slightly smarter than themselves, until a chain of delegation reached up to the top level of government; either you know how to build a less corruptible relationship between voters and politicians, or you don’t, the weirdly recursive part doesn’t really help. It is no coincidence that modern ML systems do not work by weird recursion, because all the discoveries are of how to just do stuff, not how to do stuff using weird recursion. (Even with AlphaGo, which is arguably recursive if you squint at it hard enough, you’re looking at something that is not weirdly recursive the way I think Paul’s stuff is weirdly recursive, and for more on that see https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/.)
>
> It’s in this same sense that I intuit that if you could inspect the local elements of a modular system for properties that globally added to aligned corrigible intelligence, it would mean you had the knowledge to build an aligned corrigible AGI out of parts that worked like that, not that you could aggregate systems that corrigibly learned to put together sequences of corrigible thoughts into larger corrigible thoughts, starting from gradient descent on data humans have labeled with their own judgments of corrigibility.
Eliezer often asks, “Where’s your couple-paragraph-length insight from the Textbook from the Future?” Alignment schemes are purported solutions to problems in the reference class of impossibly difficult problems, in which we’re actually doing something new, like inventing mathematical physics for the very first time, and doing so while playing against emerging superintelligent optimizers. As far as I can tell, Eliezer’s worry is that proposed alignment schemes spin long arguments for success that just amount to burying the problem deep enough to fool yourself. That’s why any proposed solution to alignment has to yield a core insight or five that we didn’t have before: conditional on an alignment scheme looking good without a simple new insight, you’ve probably just buried the hard core of the problem deep enough in your arguments to fool your brain.
So it’s fair to ask any alignment scheme what its new central insight into AI is, in a paragraph or two. If those couple of paragraphs read like something from the Textbook from the Future, then the scheme might be in business. If they contain no brand-new, intuitively compelling insights, then the scheme probably doesn’t contain the necessary insights distributed across its whole body, either.
This doesn’t mean that pursuing such a line of research further couldn’t lead to the necessary insights. It’s just that the science has to arrive at those insights eventually, if alignment is to work.