Three Stories for How AGI Comes Before FAI
Epistemic status: fake framework
To do effective differential technological development for AI safety, we’d like to know which combinations of AI insights are more likely to lead to FAI vs UFAI. This is an overarching strategic consideration which feeds into questions like how to think about the value of AI capabilities research.
As far as I can tell, there are actually several different stories for how we may end up with a set of AI insights which makes UFAI more likely than FAI, and these stories aren’t entirely compatible with one another.
Note: In this document, when I say “FAI”, I mean any superintelligent system which does a good job of helping humans (so an “aligned Task AGI” also counts).
Story #1: The Roadblock Story
Nate Soares describes the roadblock story in this comment:
...if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are “your team runs into a capabilities roadblock and can’t achieve AGI” or “your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time.”
The roadblock story happens if there are key safety insights that FAI needs but AGI doesn’t need. In this story, the knowledge needed for FAI is a superset of the knowledge needed for AGI. If the safety insights are difficult to obtain, or no one is working to obtain them, we could find ourselves in a situation where we have all the AGI insights without having all the FAI insights.
There is subtlety here. In order to make a strong argument for the existence of insights like this, it’s not enough to point to failures of existing systems, or describe hypothetical failures of future systems. You also need to explain why the insights necessary to create AGI wouldn’t be sufficient to fix the problems.
Some possible ways the roadblock story could come about:
Maybe safety insights are more or less agnostic to the chosen AGI technology and can be discovered in parallel. (Stuart Russell has pushed back against this, saying that in the same way making sure bridges don’t fall down is part of civil engineering, safety should be part of mainstream AI research.)
Maybe safety insights require AGI insights as a prerequisite, leaving us in a precarious position where we will have acquired the capability to build an AGI before we begin critical FAI research.
This could be the case if the needed safety insights are mostly about how to safely assemble AGI insights into an FAI. It’s possible we could do a bit of this work in advance by developing “contingency plans” for how we would construct FAI in the event of combinations of capabilities advances that seem plausible.
Paul Christiano’s IDA framework could be considered a contingency plan for the case where we develop much more powerful imitation learning.
Contingency plans could also be helpful for directing differential technological development, since we’d get a sense of the difficulty of FAI under various tech development scenarios.
Maybe there will be multiple subsets of the insights needed for FAI which are sufficient for AGI.
In this case, we’d like to speed the discovery of whichever FAI insight will be discovered last.
Story #2: The Security Story
CORAL: You know, back in mainstream computer security, when you propose a new way of securing a system, it’s considered traditional and wise for everyone to gather around and try to come up with reasons why your idea might not work. It’s understood that no matter how smart you are, most seemingly bright ideas turn out to be flawed, and that you shouldn’t be touchy about people trying to shoot them down.
The main difference between the security story and the roadblock story is that in the security story, it’s not obvious that the system is misaligned.
We can subdivide the security story based on the ease of fixing a flaw if we’re able to detect it in advance. For example, vulnerability #1 on the OWASP Top 10 is injection, which is typically easy to patch once it’s discovered. Insecure systems are often right next to secure systems in program space.
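To make the "right next to in program space" point concrete, here is a toy sketch of an injection vulnerability and its patch (the `lookup_user_*` names are hypothetical; this is an illustration, not a claim about how AGI flaws would look):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def lookup_user_insecure(name):
    # Vulnerable: attacker-controlled `name` is spliced into the query string,
    # so input like "' OR '1'='1" changes the query's structure.
    return conn.execute(
        "SELECT role FROM users WHERE name = '%s'" % name
    ).fetchall()

def lookup_user_secure(name):
    # Patched: a parameterized query treats `name` as data, never as SQL.
    # Only one line differs; the secure and insecure systems are adjacent
    # in program space.
    return conn.execute(
        "SELECT role FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
print(lookup_user_insecure(payload))  # leaks every row
print(lookup_user_secure(payload))    # returns nothing
```

The ease of the one-line fix is exactly what makes early detection valuable in this story: if the flaw is found before deployment, the patch is cheap.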
If the security story is what we are worried about, it could be wise to try & develop the AI equivalent of OWASP’s Cheat Sheet Series, to make it easier for people to find security problems with AI systems. Of course, many items on the cheat sheet would be speculative, since AGI doesn’t actually exist yet. But it could still serve as a useful starting point for brainstorming flaws.
Differential technological development could be useful in the security story if we push for the development of AI tech that is easier to secure. However, it’s not clear how confident we can be in our intuitions about what will or won’t be easy to secure. In his book Thinking Fast and Slow, Daniel Kahneman describes his adversarial collaboration with expertise researcher Gary Klein. Kahneman was an expertise skeptic, and Klein an expertise booster:
We eventually concluded that our disagreement was due in part to the fact that we had different experts in mind. Klein had spent much time with fireground commanders, clinical nurses, and other professionals who have real expertise. I had spent more time thinking about clinicians, stock pickers, and political scientists trying to make unsupportable long-term forecasts. Not surprisingly, his default attitude was trust and respect; mine was skepticism.
When do judgments reflect true expertise? … The answer comes from the two basic conditions for acquiring a skill:
an environment that is sufficiently regular to be predictable
an opportunity to learn these regularities through prolonged practice
In a less regular, or low-validity, environment, the heuristics of judgment are invoked. System 1 is often able to produce quick answers to difficult questions by substitution, creating coherence where there is none. The question that is answered is not the one that was intended, but the answer is produced quickly and may be sufficiently plausible to pass the lax and lenient review of System 2. You may want to forecast the commercial future of a company, for example, and believe that this is what you are judging, while in fact your evaluation is dominated by your impressions of the energy and competence of its current executives. Because substitution occurs automatically, you often do not know the origin of a judgment that you (your System 2) endorse and adopt. If it is the only one that comes to mind, it may be subjectively undistinguishable from valid judgments that you make with expert confidence. This is why subjective confidence is not a good diagnostic of accuracy: judgments that answer the wrong question can also be made with high confidence.
Our intuitions are only as good as the data we’ve seen. “Gathering data” for an AI security cheat sheet could be helpful for developing security intuition. But I think we should be skeptical of intuition anyway, given the speculative nature of the topic.
Story #3: The Alchemy Story
Batch Norm is a technique that speeds up gradient descent on deep nets. You sprinkle it between your layers and gradient descent goes faster. I think it’s ok to use techniques we don’t understand. I only vaguely understand how an airplane works, and I was fine taking one to this conference. But it’s always better if we build systems on top of things we do understand deeply. This is what we know about why batch norm works well. But don’t you want to understand why reducing internal covariate shift speeds up gradient descent? Don’t you want to see evidence that Batch Norm reduces internal covariate shift? Don’t you want to know what internal covariate shift is? Batch Norm has become a foundational operation for machine learning. It works amazingly well. But we know almost nothing about it.
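Part of what makes Batch Norm a good example of alchemy is that the operation itself is trivial to state, while the question of why it helps remains contested. A minimal sketch of the forward pass in training mode (real Batch Norm also learns a scale and shift and tracks running statistics for inference; those are omitted here for simplicity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature to zero mean, unit variance across the batch.
    # x has shape (batch_size, num_features); eps avoids division by zero.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# "Sprinkle it between your layers": however badly scaled the incoming
# activations are, they come out roughly standardized.
x = np.random.randn(64, 16) * 50.0 + 3.0   # badly scaled activations
y = batch_norm(x)
print(y.mean(axis=0))  # approximately 0 for every feature
print(y.std(axis=0))   # approximately 1 for every feature
```

A few lines of arithmetic, empirically very effective, and yet the standard explanation (reducing "internal covariate shift") has itself been challenged. That gap between how easy it is to use and how poorly it is understood is the alchemy.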
The alchemy story has similarities to both the roadblock story and the security story.
From the perspective of the roadblock story, “alchemical” insights could be viewed as insights which could be useful if we only cared about creating AGI, but are too unreliable to use in an FAI. (It’s possible there are other insights which fall into the “usable for AGI but not FAI” category due to something other than their alchemical nature—if you can think of any, I’d be interested to hear.)
In some ways, alchemy could be worse than a clear roadblock: if people disagree about whether the systems are reliable enough to form the basis of an FAI, we’re looking at a unilateralist’s curse scenario.
Just like chemistry only came after alchemy, it’s possible that we’ll first develop the capability to create AGI via alchemical means, and only acquire the deeper understanding necessary to create a reliable FAI later. (This is a scenario from the roadblock section, where FAI insights require AGI insights as a prerequisite.) To prevent this, we could try & deepen our understanding of components we expect to fail in subtle ways, and retard the development of components we expect to “just work” without any surprises once invented.
From the perspective of the security story, “alchemical” insights could be viewed as components which are clearly prone to vulnerabilities. Alchemical components could produce failures which are hard to understand or summarize, let alone fix. From a differential technological development point of view, the best approach may be to differentially advance less alchemical, more interpretable AI paradigms, developing the AI equivalent of reliable cryptographic primitives. (Note that explainability, which generates post-hoc rationales for a model’s behavior, is a weaker property than interpretability, which requires understanding the model’s actual mechanics.)
Trying to create an FAI from alchemical components is obviously not the best idea. But it’s not totally clear how much of a risk these components pose, because if the components don’t work reliably, an AGI built from them may not work well enough to pose a threat. Such an AGI could work better over time if it’s able to improve its own components. In this case, we might be able to program it so it periodically re-evaluates its training data as its components get upgraded, so its understanding of human values improves as its components improve.
How plausible does each story seem?
What possibilities aren’t covered by the taxonomy provided?
What distinctions does this framework fail to capture?
Which claims are incorrect?