Modeling versus Implementation
Epistemic status: I feel that naming this axis deconfuses me about agent foundations about as much as writing the rest of this sequence so far—so it is worth a post even though I have less to say about it.
I think my goal in studying agent foundations is a little atypical. I am usually trying to build an abstract model of superintelligent agents and make safety claims based on that model.
For instance, AIXI models a very intelligent agent pursuing a reward signal, and allows us to conclude that it probably seizes control of the reward mechanism by default. This is nice because it makes our assumptions fairly explicit. AIXI has epistemic uncertainty but no computational bounds, which seems like a roughly appropriate model for agents much smarter than anything they need to interact with. AIXI is explicitly planning to maximize its discounted reward sum, which is different from standard RL (which trains on a reward signal, but later executes learned behaviors). We can see these things from the math.
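For concreteness, this is (as I recall it) Hutter's standard one-line presentation of AIXI's action selection with a finite horizon m; the discounted-sum variant replaces the plain sum of rewards with a discounted one:

$$
a_t \;=\; \arg\max_{a_t}\,\sum_{o_t r_t}\cdots\,\max_{a_m}\,\sum_{o_m r_m}\big[\,r_t+\cdots+r_m\,\big]\sum_{q\,:\,U(q,\,a_1\ldots a_m)\,=\,o_1 r_1\ldots o_m r_m}2^{-\ell(q)}
$$

Here U is a universal monotone Turing machine, q ranges over environment programs, and the weight 2^{-ℓ(q)} is the Solomonoff-style prior. The point relevant to the paragraph above is that reward maximization appears explicitly in the definition, rather than as a behavior learned from a training signal.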
Reflective oracles are compelling to me because they seem like an appropriate model for agents at a similar level of intelligence mutually reasoning about each other, possibly including a single agent over time (in the absence of radical intelligence upgrades?).
I’m willing to use these models where I expect them to bear weight, even if they are not “the true theory of agency.” In fact (as is probably becoming clear over the course of this sequence) I am not sure that a true theory of agency applicable to all contexts exists. The problem is that agents have a nasty habit of figuring stuff out, and anything they figure out is (at least potentially) pulled into agent theory. Agent theory does not want to stay inside a little bubble in conceptual space; it wants to devour conceptual space.
I notice a different attitude among many agent foundations researchers. As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory. Probably as a result, it seems that MIRI-adjacent researchers tend to explicitly plan on actually implementing their theory; they want it to be executable. Someday. After a lot of math has been done. This isn’t to say that they currently write a lot of code—I am only discussing their theory of impact as I understand it. To be clear, this is not a criticism; it is fine for some people to focus on theory building with an eye towards implementation and others to focus on performing implementation.
For example, I believe @abramdemski really wants to implement a version of UDT and @Vanessa Kosoy really wants to implement an IBP agent. They are both working on a normative theory which they recognize is currently slightly idealized or incomplete, but I believe that their plan routes through developing that theory to the point that it can be translated into code. Another example is the program synthesis community in computational cognitive science (e.g. Josh Tenenbaum, Zenna Tavares). They are writing functional programs to compete with deep learning right now.
For a criticism of this mindset, see my (previous in this sequence) discussion of why glass-box learners are not necessarily safer. Also, (relatedly) I suspect it will be rather hard to invent a nice paradigm that takes the lead from deep learning. However, I am glad people are working on it and I hope they succeed; and I don’t mean that in an empty way. I dabble in this quest myself—I even have a computational cognitive science paper.
I think that my post on what makes a theory of intelligence useful suffers from a failure to make explicit this dichotomy between modeling and implementation. I mostly had the modeling perspective in mind, but sometimes made claims about implementation. These are inherently different concerns.
The modeling perspective has its own problems. It is possible that agent theory is particularly unfriendly to abstract models—superintelligences apply a lot of optimization pressure, and pointing that optimization pressure in almost the right direction is not good enough. However, I am at least pretty comfortable using abstract models to predict why alignment plans won’t work. To conclude that an alignment plan will work, you need to know that your abstract model is robust to vast increases in intelligence. That is why I like models similar to AIXI, which have already “taken the limit” of increasing intelligence—even if they (explicitly) leave out the initial conditions of intelligence-escalation trajectories.
I see modeling vs. implementation as a spectrum more than a dichotomy. Something like:
1. On the “implementation” extreme you prove theorems about the exact algorithm you implement in your AI, such that you can even use formal verification to prove these theorems about the actual code you wrote.
2. Marginally closer to “modeling”, you prove (or at least conjecture) theorems about some algorithm which is vaguely feasible in theory. Some civilization might have used that exact algorithm to build AI, but in our world it’s impractical, e.g. because it’s uncompetitive with other AI designs. However, your actual code is conceptually very close to the idealized algorithm, and you have good arguments why the differences don’t invalidate the safety properties of the idealized model.
3. Further along the spectrum, your actual algorithm is about as similar to the idealized algorithm as DQN is to vanilla Q-learning. Which is to say, it was “inspired” by the idealized algorithm, but there’s a lot of heavy lifting done by heuristics. Nevertheless, there is some reason to hope the heuristic aspects don’t change the safety properties.
4. On the “modeling” extreme, your idealized model is something like AIXI: completely infeasible and bearing little direct resemblance to the actual algorithm in your AI. However, there is still some reason to believe real AIs will have similar properties to the idealized model.
More precisely, rather than a 1-dimensional spectrum, there are at least two parameters involved:
1. How close the object you make formal statements about is to the actual code of your AI, where “closeness” is measured by the strength of the arguments you have for the analogy, on a scale from “they are literally the same”, through solid theoretical and/or empirical evidence, down to pure hand-waving/intuition.
2. How much evidence you have for the formal statements, on a scale from “I proved it within some widely accepted mathematical foundation (e.g. PA)” to “I proved vaguely related things, tried very hard but failed to disprove the thing, and/or accumulated some empirical evidence”.
[EDIT: And a 3rd parameter is how justified/testable the assumptions of your model are. Ideally, you want these assumptions to be grounded in science. Some will likely be philosophical assumptions which cannot be tested empirically, but at least they should fit into a coherent holistic philosophical view. At the very least, you want to make sure you’re not assuming away the core parts of the problem.]
For the purposes of safety, you want to be as close to the implementation end of the spectrum as you can get. However, the model side of the spectrum is still useful as:
- A backup plan which is better than nothing, more so if there is some combination of theoretical and empirical justification for the analogizing.
- A way to demonstrate threat models, as the OP suggests.
- An intermediate product that helps with checking that your theory is heading in the right direction, comparing different research agendas, and maybe even making empirical tests.
Btw, what are some ways we can incorporate heuristics into our algorithm while staying on level 1-2?
1. We don’t know how to prove the required desiderata about the heuristic, but we can still reasonably conjecture them and support the conjectures with empirical tests.
2. We can’t prove or even conjecture anything useful-in-itself about the heuristic, but the way the heuristic is incorporated into the overall algorithm makes it safe. For example, maybe the heuristic produces suggestions together with formal certificates of their validity. More generally, we can imagine an oracle-machine (where the heuristic is slotted into the oracle) about which we cannot necessarily prove something like a regret bound w.r.t. the optimal policy, but we can prove (or at least conjecture) a regret bound w.r.t. some fixed simple reference policy. That is, the safety guarantee shows that no matter what the oracle does, the overall system is not worse than “doing nothing”. Maybe modulo weak provable assumptions about the oracle, e.g. that it satisfies a particular computational complexity bound. (See the sketch after this list.)
3. [Epistemic status: very fresh idea, quite speculative but intriguing.] We can’t even find a guarantee like the one in the previous item for a worst-case computationally bounded oracle. However, we can prove (or at least conjecture) some kind of “average-case” guarantee. For example, maybe we have high probability of safety for a random oracle. However, assuming a uniformly random oracle is quite weak. More optimistically, maybe we can prove safety even for any oracle that is pseudorandom against some complexity class C1 (where we want C1 to be as small as possible). Even better, maybe we can prove safety for any oracle in some complexity class C2 (where we want C2 to be as large as possible) that has access to another oracle which is pseudorandom against C1. If our heuristic is not actually in this category (in particular, C2 is smaller than P and our heuristic doesn’t lie in C2), this doesn’t formally guarantee anything, but it does provide some evidence for the “robustness” of our high-level scheme.
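To make item 2 slightly more concrete, here is a minimal toy sketch in Python. Everything in it is a hypothetical illustration (the names, the placeholder certificate check, the fallback action); it is not IBP or any existing system, just the shape of "the heuristic only acts through verified suggestions, and otherwise the system falls back to a fixed reference policy, so any guarantee is stated relative to that reference policy."

```python
# Toy sketch of the "heuristic as untrusted oracle" pattern from item 2 above.
# All names are hypothetical illustrations, not any existing library or agenda.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    action: str
    certificate: Optional[str]  # e.g. a checkable witness of the suggestion's validity


def untrusted_oracle(observation: str) -> Suggestion:
    """Stand-in for an arbitrary heuristic (e.g. a learned model); no guarantees assumed."""
    return Suggestion(action="do_something_clever", certificate=None)


def check_certificate(observation: str, suggestion: Suggestion) -> bool:
    """Stand-in for a verifier we trust: cheap, and sound by assumption."""
    return suggestion.certificate is not None  # placeholder check


def reference_policy(observation: str) -> str:
    """Fixed simple policy the safety argument is stated relative to ("doing nothing")."""
    return "noop"


def overall_policy(observation: str) -> str:
    # The heuristic only influences behavior through suggestions that pass verification;
    # otherwise we fall back to the reference policy. Any regret-style guarantee is then
    # made w.r.t. reference_policy, regardless of what the oracle outputs.
    suggestion = untrusted_oracle(observation)
    if check_certificate(observation, suggestion):
        return suggestion.action
    return reference_policy(observation)


if __name__ == "__main__":
    print(overall_policy("some observation"))  # falls back to "noop" here
```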
I agree with this—and it seems unrealistic to me that we will do better than 3 or 4.
Hold my beer ;)
For what it’s worth, IBP avoids the issue of glass-box learners not necessarily being safe by focusing on desiderata rather than specifically focusing on algorithms.
In particular, you could in principle prove stuff about black boxes, so long as the black box satisfies some desiderata, rather than trying to white-box the algorithm and prove stuff about that.
@Steven Byrnes has talked about this before:
https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety
I read the post and I see what you’re pointing at, but naively I am still left with the same impression. Perhaps more serious study of IB will change my mind.
I think this misunderstands the general view of agent foundations by those who worked on it in the past. That is, “highly reliable agent design” was an eventual goal, in the same sense that someone taking high-school physics wants to use it to build rockets—they (hopefully) understand enough to know that they don’t know enough, and will need to learn more before even attempting to build anything.
That’s why Eliezer talked so much about deconfusion. The idea was to figure out what they didn’t know. This led to later talking about building safe AI as an eventual goal—not a plan, but an eventual possible outcome if they could figure out enough. They clarified this view. It was mostly understood by funders. And I helped Issa Rice write a paper laying out the different pathways that it could help—and only two of those involved building agents.
And why did they give it up? Largely because they found that the deconfusion work was so slow, and everyone was so fundamentally wrong about the basics, that as LLM-based systems were developed they didn’t think we could possibly build the reliable systems in time. They didn’t think that Bayesian decision theory or glass-box agents would necessarily work, and they didn’t know what would. So I think “MIRI intended to build principled glass-box agents based on Bayesian decision theory” is not just misleading, but wrong.
What’s your opinion of value learning? If the intelligence (unlike AIXI) understands that its current utility function is imperfect and should be improved, then it can intelligently prioritize improving the utility function against optimizing the current version of it, taking Goodhart’s Law and extrapolation outside the currently-known distribution into account. Then we have a dynamic situation, and we’re interested in what the utility function converges to under this optimization.
I meant what I said at a higher level of abstraction—optimization pressure may destroy leaky abstractions. I don’t think value learning immediately solves this.
I agree that optimization pressure can destroy leaky abstractions: that’s Goodhart’s Law. Value learning means that the optimization pressure applies on both sides of the Goodhart problem: improving the utility function as well as applying it. So then the optimization pressure can also identify the leak and improve the abstraction. The question then becomes how well the (possibly super) intelligence can manage that dynamic/iterated process: does the value learning process converge to alignment and stay stable, faster than the AI/its successors can do drastic harm due to partial misalignment?
What I find promising is that, for any valid argument, problem or objection we can come up with, there’s no a-priori reason why the AI wouldn’t also be able to grasp that and attempt to avoid or correct the problem, as long as its capabilities were sufficient and its current near-alignment was good enough that it wanted to do so. So it looks rather clear to me that there is a region of convergence to full alignment from partial alignment. The questions then become how large that region is, whether we can construct a first iteration that’s inside it, and what the process may converge to as the AI’s intelligence increases and human society evolves.
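As a purely illustrative caricature of that "region of convergence" picture (nothing here models real AI dynamics; the threshold and rates are invented), one can write down a toy iteration in which alignment error shrinks when the system is aligned enough to want to correct its own values and grows otherwise, giving a basin of initial conditions that converge:

```python
# Toy caricature of a basin of convergence for value learning.
# All dynamics are made up for illustration; nothing is claimed about real systems.

def step(e: float, correction_threshold: float = 0.5,
         correction_rate: float = 0.3, drift_rate: float = 0.1) -> float:
    """One round of the iterated process acting on alignment error e in [0, 1]."""
    if e < correction_threshold:
        # Aligned enough to invest in improving its own utility function.
        return max(0.0, e - correction_rate * e)
    # Too misaligned to self-correct; optimization pressure amplifies the error.
    return min(1.0, e + drift_rate * e)


def converges(e0: float, steps: int = 200, tol: float = 1e-3) -> bool:
    """Does an initial alignment error e0 shrink below tol under repeated steps?"""
    e = e0
    for _ in range(steps):
        e = step(e)
    return e < tol


if __name__ == "__main__":
    for e0 in (0.1, 0.3, 0.49, 0.51, 0.8):
        print(f"initial error {e0:.2f} -> converges to alignment: {converges(e0)}")
```

In this toy, initial errors below the (arbitrary) threshold converge and those above it diverge; the open questions in the comment above correspond to how large that basin is in reality and whether a first system can be placed inside it.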