Intuitions about solving hard problems
Solving hard scientific problems usually requires compelling insights
Here’s a heuristic which plays an important role in my reasoning about solving hard scientific problems: that when you’ve made an important breakthrough, you should be able to explain the key insight(s) behind that breakthrough in an intuitively compelling way. By “intuitively compelling” I don’t mean “listeners should be easily persuaded that the idea solves the problem”, but instead: “listeners should be easily persuaded that this is the type of idea which, if true, would constitute a big insight”.
The best examples are probably from Einstein: time being relative, and gravity being equivalent to acceleration, are both insights in this category. The same goes for Malthus and Darwin and Gödel; the same for Galileo and Newton and Shannon.
Another angle on this heuristic comes from Scott Aaronson’s list of signs that a claimed P≠NP proof is wrong. In particular, see:
#6: the paper lacks a coherent overview, clearly explaining how and why it overcomes the barriers that foiled previous attempts.
And #1: the author can’t immediately explain why the proof fails for 2SAT, XOR-SAT, or other slight variants of NP-complete problems that are known to be in P.
I read these as Aaronson claiming that a successful solution to this very hard problem is likely to contain big insights that can be clearly explained.
Perhaps the best counterexample is the invention of Turing machines. Even after Turing explained the whole construction, it seems reasonable to still be uncertain whether there’s actually something interesting there, or whether he’s just presented you with a complicated mess. I think that uncertainty would be particularly reasonable if we imagine trying to understand the formalism before Turing figures out how to implement any nontrivial algorithm (like prime factorisation) on a Turing machine, or how to prove any theorems about universal Turing machines.
Other counterexamples might include quantum mechanics, where quantization was originally seen as a hack to make the equations work; or formal logic, where I’m not sure if there were any big insights that could be grasped in advance of actually seeing the formalisms in action.
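To make the Turing machine counterexample a little more concrete, here is a minimal sketch of the bare formalism: a single-tape machine defined by nothing but a transition table. The simulator, the names `run_turing_machine` and `INCREMENT`, and the choice of binary increment as the example are all illustrative assumptions of mine, not Turing's own presentation. The point is only that, looking at the table alone, it is not obvious that anything deep is going on.

```python
from collections import defaultdict

BLANK = "_"  # symbol used for empty tape cells (an arbitrary choice)


def run_turing_machine(transitions, tape, state="start", halt_state="halt",
                       max_steps=10_000):
    """Run a single-tape Turing machine; return the non-blank tape as a string.

    transitions maps (state, symbol) -> (new_state, symbol_to_write, move),
    where move is "L", "R" or "N" (stay put).
    """
    cells = defaultdict(lambda: BLANK, enumerate(tape))
    head = 0
    steps = 0
    while state != halt_state:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        state, write, move = transitions[(state, cells[head])]
        cells[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
        steps += 1
    used = [i for i, s in cells.items() if s != BLANK]
    if not used:
        return ""
    return "".join(cells[i] for i in range(min(used), max(used) + 1))


# Hypothetical example machine: add one to a binary number.
# It scans right to the end of the input, then propagates a carry leftwards.
INCREMENT = {
    ("start", "0"): ("start", "0", "R"),
    ("start", "1"): ("start", "1", "R"),
    ("start", BLANK): ("carry", BLANK, "L"),
    ("carry", "1"): ("carry", "0", "L"),
    ("carry", "0"): ("halt", "1", "N"),
    ("carry", BLANK): ("halt", "1", "N"),
}

if __name__ == "__main__":
    print(run_turing_machine(INCREMENT, "1011"))  # prints 1100
```

Even this trivial machine takes six table entries, and nothing about the table itself signals that the same formalism can express every algorithm; that only becomes persuasive once nontrivial programs and universality results are in hand.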
Using the compelling insight heuristic to evaluate alignment research directions
It’s possible that alignment will in practice end up being more of an engineering problem than a scientific problem like the ones I described above. E.g. perhaps we’re in a world where, with sufficient caution about scaling up existing algorithms, we’ll produce aligned AIs capable of solving the full version of the problem for us. But suppose we’re trying to produce a fully scalable solution ourselves; are there existing insights which might be sufficient for that? Here are some candidates, which I’ll only discuss very briefly, and plan to discuss in more detail in a forthcoming post (I’d also welcome suggestions for any I’ve missed):
“Trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior.”
This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).
Decomposing supervision of complex tasks allows better human oversight.
Again, I’ve found this less compelling over time—in this case because I’ve realized that decomposition is the “default” approach we follow whenever we evaluate things, and so the real “work” of the insight needs to be in describing how we’ll decompose tasks, which I don’t think we’ve made much progress on (with techniques like cross-examination being possible exceptions).
Weight-sharing makes deception much harder.
I think this is the main argument pushing me towards optimism about ELK; thanks to Ajeya for articulating it to me.
Uncertainty about human preferences makes agents corrigible.
This is Stuart Russell’s claim about why assistance games are a good potential solution to alignment; I basically don’t buy it at all, for the same reasons as Yudkowsky (but kudos to Stuart for stating the proposed insight clearly enough that the disagreement is obvious).
Myopic agents can be capable while lacking incentives for long-term misbehavior.
This claim seems to drive a bunch of Evan Hubinger’s work, but I don’t buy it. In order for an agent’s behavior to be competent over long time horizons, it needs to be doing some kind of cognition aimed towards long time horizons, and we don’t know how to stop that cognition from being goal-directed.
Problems that arise in limited-data regimes (e.g. inner misalignment) go away when you have methods of procedurally generating realistic data (e.g. separable world-models).
This claim was made to me by Michael Cohen. It’s interesting, but I don’t think it solves the core alignment problem, because we don’t understand cognition well enough to efficiently factor out world-models from policies. E.g. training a world-model to predict observations step-by-step seems like it loses out on all the benefits of thinking in terms of abstractions; whereas training it just on long-term predictive accuracy makes the intermediate computations uninterpretable and therefore unusable.
By default we’ll train models to perform tasks of bounded scope, and then achieve more complex tasks by combining them.
This seems like the core claim motivating Eric Drexler’s CAIS framework. I think it dramatically underrates the importance of general intelligence, and the returns to scaling up single models, for reasons I explain further here.
Functional decision theory.
I don’t think this directly addresses the alignment problem, but it feels like the level of insight I’m looking for, in a related domain.
Note that I do think each of these claims gestures towards interesting research possibilities which might move the needle in worlds where the alignment problem is easy. But I don’t think any of them are sufficiently powerful insights to scalably solve the hard version of the alignment problem. Why do many of the smart people listed above think otherwise? I think it’s because they’re not accounting properly for the sense in which the alignment problem is an adversarial one: that by default, optimization which pushes towards general intelligence will also push towards misalignment, and we’ll need to do something unusual to be confident we’re separating them. In other words, the set of insights about consequentialism and optimization which made us worry about the alignment problem in the first place (along with closely-related insights like the orthogonality thesis and instrumental convergence) are sufficiently high-level, and sufficiently robust, that unless you’re guided by other powerful insights you’re less likely to find exceptions to those principles, and more likely to find proposals where you can no longer spot the flaw.
This claim is very counterintuitive from an ML perspective, where loosely-directed exploration of new algorithms often leads to measurable improvements. I don’t know how to persuade people that applying this approach to alignment leads to proposals which are deceptively appealing, except by getting them to analyze each of the proposals above until they convince themselves that the purported insights are insufficient to solve the problem. Unfortunately, this is very time-consuming. To save effort, I’d like to promote a norm for proposals for alignment techniques to be very explicit about where the hard work is done, i.e. which part is surprising or insightful or novel enough to make us think that it could solve alignment even in worlds where that’s quite difficult. Or, alternatively, if the proposal is only aimed at worlds where the problem is relatively easy, please tell me that explicitly. E.g. I spent quite a while being confused about which part of the ELK agenda was meant to do the hard work of solving ontology identification; after asking Paul about it, though, his main response was “maybe ontology identification is easy”, which I feel very skeptical about. (He also suggested something to do with the structure of explanations as a potential solving-ELK-level-insight; but I don’t understand this well enough to discuss it in detail.)
Using the compelling insight heuristic to generate alignment research directions
If we need more major insights about intelligence/consequentialism/goals to solve alignment, how might we get them? Getting more evidence from seeing more advanced systems will make this much easier, so one strategy is just to keep pushing on empirical and engineering work, while keeping an eye out for novel phenomena which might point us in the right directions.
But for those who want to try to generate those insights directly, some tentative options:
Studying how existing large models think, and trying to extrapolate from that
Understanding human minds well enough to identify insights about human intelligence (e.g. things like predictive coding, or multi-agent models of minds, or dual process theory) which can be applied to alignment
Understanding how groups think (e.g. how task decomposition occurs in cultural evolution, or in corporations, or…)
Agent foundations research
Of course, the whole point of major insights is that it’s hard to predict them; so I’d be excited about others pursuing potential insights that don’t fall into any of these categories (with the caveat that, as the work gets increasingly abstract, it’s necessary to be increasingly careful for it to have any chance of succeeding).
I model Eliezer as agreeing with most of the claims I make in this post, but strongly disagreeing with this sentence, because he thinks that the core problem is so hard that no amount of prosaic engineering effort could plausibly prevent catastrophe in the absence of major novel insights.
Some brief intuitions about why I’m skeptical that ontology identification is easy: I think the hardest part of human cognition is generating and merging different ontologies. Thinking “within” an ontology is like doing normal research in a scientific field; reasoning about different ontologies is like doing philosophy, or doing paradigm-breaking research, and so it seems like a particularly difficult thing to generate a training signal for.
Thanks to Nathan Helm-Burger for reminding me of this, with his comment.