Technical Appendix: First safeguard?
This sequence is written to be broadly accessible, although its focus on capable AI systems perhaps assumes familiarity with the basic arguments for the importance of AI alignment. The technical appendices are an exception, targeting the technically inclined.
Why do I claim that an impact measure would be “the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things – without assuming anything about the objective”?
The safeguard proposal shouldn’t have to say “and here we solve this opaque, hard problem, and then it works”. If we have the impact measure, we have the math, and then we have the code.
So what about:
Quantilizers? This seems to be the most plausible alternative; mild optimization and impact measurement share many properties. But:
What happens if the agent is already powerful? A greater proportion of plans could be catastrophic, since the agent is in a better position to cause them.
Where does the base distribution come from (opaque, hard problem?), and how do we know it’s safe to sample from?
In the linked paper, Jessica Taylor suggests the idea of learning a human distribution over actions – how robustly would we need to learn this distribution? How numerous are catastrophic plans, and what is a catastrophe, defined without reference to our values in particular? (That definition requires understanding impact!)
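To make the quantilizer concerns concrete, here is a minimal sketch of a q-quantilizer over a finite action set, assuming we are simply handed a base distribution and a utility function. The names `quantilize`, `catastrophe_bound`, `base_probs`, and `utility` are illustrative and not taken from the linked paper. The sketch ranks actions by utility, keeps the top q fraction of base-probability mass, and samples from that slice in proportion to the base probabilities. The helper states the standard guarantee: if catastrophic actions make up a fraction c of the base distribution, the quantilizer selects one with probability at most c/q.

```python
import random


def quantilize(actions, base_probs, utility, q, rng=random):
    """Sample an action from the top-q fraction (by base probability mass)
    of actions ranked by utility. All names here are hypothetical."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")

    # Rank actions from highest to lowest utility.
    ranked = sorted(zip(actions, base_probs),
                    key=lambda pair: utility(pair[0]),
                    reverse=True)

    kept, weights, mass = [], [], 0.0
    for action, p in ranked:
        if mass >= q:
            break
        # The last action kept may be only partially inside the top-q slice.
        take = min(p, q - mass)
        kept.append(action)
        weights.append(take)
        mass += take

    # Sample within the slice, proportionally to base probability.
    return rng.choices(kept, weights=weights, k=1)[0]


def catastrophe_bound(base_catastrophe_mass, q):
    """Upper bound on the chance of sampling a catastrophic action,
    given the fraction of base mass that is catastrophic."""
    return min(1.0, base_catastrophe_mass / q)
```

Both worries above show up directly in the bound. It is only as good as the base distribution we are given, and it weakens as the catastrophic fraction grows: with q = 0.1, a base distribution that is 1% catastrophic gives a bound of about 10%, while one that is 10% catastrophic gives no guarantee at all.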
Value learning? But:
We only want this if our (human) values are learned!
Value learning is impossible without assumptions, and getting good enough assumptions could be really hard. If we don’t know if we can get value learning / reward specification right, we’d like safeguards which don’t fail because value learning goes wrong. The point of a safeguard is that it can catch you if the main thing falls through; if the safeguard fails because the main thing does, that’s pointless.
Corrigibility? At present, I’m excited about this property because I suspect it has a simple core principle. But:
Even if the system is responsive to correction (and non-manipulative, and whatever other properties we associate with corrigibility), what if we become unable to correct it as a result of early actions (if the agent “moves too quickly”, so to speak)?
Paul Christiano’s take on corrigibility is much broader and an exception to this critique.
What is the core principle?
The three sections of this sequence will respectively answer three questions:
Why do we think some things are big deals?
Why are capable goal-directed AIs incentivized to catastrophically affect us by default?
How can we build agents without these incentives?
The first part of this sequence focuses on foundational concepts crucial for understanding the deeper nature of impact. We will not yet be discussing what to implement.
I’m planning on releasing the first third of the posts over the next few weeks; the remainder will come after some delay.
The best way to use this book is NOT to simply read it or study it, but to read a question and STOP. Even close the book. Even put it away and THINK about the question. Only after you have formed a reasoned opinion should you read the solution. Why torture yourself thinking? Why jog? Why do push-ups?
If you are given a hammer with which to drive nails at the age of three you may think to yourself, “OK, nice.” But if you are given a hard rock with which to drive nails at the age of three, and at the age of four you are given a hammer, you think to yourself, “What a marvellous invention!” You see, you can’t really appreciate the solution until you first appreciate the problem.
Projects are shovel-ready ways to get your hands dirty doing novel and important work. Collaboration is encouraged – in particular, feel free to message me. The first to rise to the challenge will earn a permanent spot in the sequence.
My paperclip-Balrog illustration is metaphorical: a good impact measure would hold steadfast against the daunting challenge of formally asking for the right thing from a powerful agent. The illustration does not represent an internal conflict within that agent. As water flows downhill, an impact-penalizing Frank prefers low-impact plans.
The drawing is based on gonzalokenny’s amazing work.
Some of you may have a different conception of impact; I ask that you grasp the thing that I’m pointing to. In doing so, you might come to see your mental algorithm is the same. Ask not “is this what I initially had in mind?”, but rather “does this make sense as a thing-to-call-‘impact’?”.
H/T Rohin Shah for suggesting the three key properties. Alison Bowden contributed several small drawings and enormous help with earlier drafts.