I see modeling vs. implementation as a spectrum more than a dichotomy. Something like:
1. On the “implementation” extreme you prove theorems about the exact algorithm you implement in your AI, s.t. you can even use formal verification to prove these theorems about the actual code you wrote.
2. Marginally closer to “modeling”, you prove (or at least conjecture) theorems about some algorithm which is vaguely feasible in theory. Some civilization might have used that exact algorithm to build AI, but in our world it’s impractical, e.g. because it’s uncompetitive with other AI designs. However, your actual code is conceptually very close to the idealized algorithm, and you have good arguments why the differences don’t invalidate the safety properties of the idealized model.
3. Further along the spectrum, your actual algorithm is about as similar to the idealized algorithm as DQN is similar to vanilla Q-learning. Which is to say, it was “inspired” by the idealized algorithm but there’s a lot of heavy lifting done by heuristics. Nevertheless, there is some reason to hope the heuristic aspects don’t change the safety properties.
4. On the “modeling” extreme, your idealized model is something like AIXI: completely infeasible and bears little direct resemblance to the actual algorithm in your AI. However, there is still some reason to believe real AIs will have similar properties to the idealized model.
More precisely, rather than a 1-dimensional spectrum, there are at least two parameters involved:
How close the object you make formal statements about is to the actual code of your AI, where “closeness” is measured by the strength of the arguments you have for the analogy, on a scale from “they are literally the same”, through solid theoretical and/or empirical evidence, to pure hand-waving/intuition.
How much evidence you have for the formal statements, on a scale from “I proved it within some widely accepted mathematical foundation (e.g. PA)” to “I proved vaguely related things, tried very hard but failed to disprove the thing and/or accumulated some empirical evidence”.
[EDIT: And a 3rd parameter is how justified/testable the assumptions of your model are. Ideally, you want these assumptions to be grounded in science. Some will likely be philosophical assumptions which cannot be tested empirically, but at least they should fit into a coherent holistic philosophical view. At the very least, you want to make sure you’re not assuming away the core parts of the problem.]
For the purposes of safety, you want to be as close to the implementation end of the spectrum as you can get. However, the model side of the spectrum is still useful as:
A backup plan which is better than nothing, more so if there is some combination of theoretical and empirical justification for the analogizing
A way to demonstrate threat models, as the OP suggests
An intermediate product that helps check that your theory is heading in the right direction, compare different research agendas, and maybe even make empirical tests.
Btw, what are some ways we can incorporate heuristics into our algorithm while staying on level 1-2?
We don’t know how to prove the required desiderata about the heuristic, but we can still reasonably conjecture them and support the conjectures with empirical tests.
We can’t prove or even conjecture anything useful-in-itself about the heuristic, but the way the heuristic is incorporated into the overall algorithm makes it safe. For example, maybe the heuristic produces suggestions together with formal certificates of their validity. More generally, we can imagine an oracle-machine (where the heuristic is slotted into the oracle) about which we cannot necessarily prove something like a regret bound w.r.t. the optimal policy, but we can prove (or at least conjecture) a regret bound w.r.t. some fixed simple reference policy. That is, the safety guarantee shows that no matter what the oracle does, the overall system is not worse than “doing nothing”. Maybe, modulo weak provable assumptions about the oracle, e.g. that it satisfies a particular computational complexity bound.
[Epistemic status: very fresh idea, quite speculative but intriguing.] We can’t find even a guarantee like the above for a worst-case computationally bounded oracle. However, we can prove (or at least conjecture) some kind of an “average-case” guarantee. For example, maybe we have high probability of safety for a random oracle. However, assuming a uniformly random oracle is quite weak. More optimistically, maybe we can prove safety even for any oracle that is pseudorandom against some complexity class C1 (where we want C1 to be as small as possible). Even better, maybe we can prove safety for any oracle in some complexity class C2 (where we want C2 to be as large as possible) that has access to another oracle which is pseudorandom against C1. If our heuristic is not actually in this category (in particular, C2 is smaller than P and our heuristic doesn’t lie in C2), this doesn’t formally guarantee anything, but it does provide some evidence for the “robustness” of our high-level scheme.
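To make the second option above (a heuristic that emits suggestions with certificates, slotted in as an oracle) a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration and not part of any actual proposal: the task (finding a nontrivial factor of n), and the names `verify`, `guarded_step`, `trial_division_oracle` and `adversarial_oracle` are all hypothetical. The only point is the shape of the guarantee: the trusted part is the small checker, and whatever the oracle returns, the wrapper either outputs a verified suggestion or falls back to “doing nothing”.

```python
from typing import Callable, Optional, Tuple

# Toy assumption: a "suggestion" is a candidate factor of n, and its
# "certificate" is the cofactor, so validity can be checked by multiplying.
Suggestion = Tuple[int, int]  # (factor, cofactor)
Oracle = Callable[[int], Optional[Suggestion]]


def verify(n: int, suggestion: object) -> bool:
    """Small trusted checker: accept only well-formed, valid certificates."""
    if not (isinstance(suggestion, tuple) and len(suggestion) == 2):
        return False
    p, q = suggestion
    if not (isinstance(p, int) and isinstance(q, int)):
        return False
    return 1 < p < n and 1 < q < n and p * q == n


def guarded_step(n: int, oracle: Oracle) -> Optional[Suggestion]:
    """Query the untrusted oracle; return its output only if it verifies.

    Returning None plays the role of the fixed reference policy
    ("doing nothing"). An oracle that never halts is not handled here.
    """
    try:
        suggestion = oracle(n)
    except Exception:
        return None  # a crashing oracle is treated like an invalid suggestion
    return suggestion if verify(n, suggestion) else None


def trial_division_oracle(n: int) -> Optional[Suggestion]:
    """A benign stand-in heuristic: plain trial division."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return (p, n // p)
    return None


def adversarial_oracle(n: int) -> Suggestion:
    """A stand-in for a misbehaving heuristic: returns a bogus certificate."""
    return (1, n)


if __name__ == "__main__":
    print(guarded_step(15, trial_division_oracle))
    print(guarded_step(15, adversarial_oracle))
```

Note that this sketch says nothing about an oracle that runs forever or exhausts resources; that is exactly where a weak assumption such as a computational complexity bound on the oracle would have to enter.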
I agree with this—and it seems unrealistic to me that we will do better than 3 or 4.
Hold my beer ;)