In response to your linked post, I do have similar intuitions about “Microscope AI” as it is typically conceived (i.e., examining the AI for problems using mechanistic interpretability tools before deploying it). Here I propose two things that are a little bit like Microscope AI but that, in my view, both avoid the core problem you’re pointing at (i.e., that a useful neural network will always be larger than your understanding of it, and that this matters):
Model-checking policies for formal properties. A model checker (unlike a human interpreter) works with the entire network, not just its most interpretable parts. If it proves a property, that property is true of the actual neural network. The Model-Checking Feasibility Hypothesis says that this is feasible even though it is infeasible for a human to understand the policy or any details of the proof. (We would rely on a verified verifier to check the proof, and humans would understand the verifier’s details.)
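To give a concrete flavor of this (my own toy illustration, not part of the proposal), here is a minimal sketch of certifying a property over an entire network, using interval bound propagation, a sound but deliberately crude verification method; the weights, input box, and threshold are all hypothetical stand-ins:

```python
# A toy sketch of proving a formal property over an ENTIRE network, using
# interval bound propagation -- sound but deliberately crude. Real
# neural-network verifiers (e.g. Marabou, alpha-beta-CROWN) are far more
# precise; the point is only that the certificate covers the whole
# network, not just the parts a human happens to understand.
import numpy as np

def affine_bounds(W, b, lo, hi):
    """Sound bounds on y = W @ x + b over the input box lo <= x <= hi."""
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

def certify_output_below(layers, lo, hi, threshold):
    """True only if the output provably stays below `threshold` for
    every input in the box (False means "unknown", not "unsafe")."""
    *hidden, last = layers
    for W, b in hidden:
        lo, hi = affine_bounds(W, b, lo, hi)
        lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU
    _, hi = affine_bounds(*last, lo, hi)
    return bool(np.all(hi < threshold))

# Hypothetical weights standing in for a trained two-layer ReLU policy.
layers = [(np.array([[0.2, -0.1], [0.0, 0.3]]), np.zeros(2)),
          (np.array([[0.5, -0.4]]), np.zeros(1))]
print(certify_output_below(layers, np.array([-1.0, -1.0]),
                           np.array([1.0, 1.0]), threshold=1.0))  # True
```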
Factoring learned information through human understanding. If we denote learning by $L$, human understanding by $H$, and big effects on the world by an arrow $L \xrightarrow{e} W$, then “factoring” means that $e = L \xrightarrow{f} H \xrightarrow{g} W$ for some $f$ and $g$ (i.e., $e$ decomposes as $g \circ f$). This is in the same spirit as “human in the loop,” except not for the innermost loops of real-time action. Here, the Scientific Sufficiency Hypothesis implies that even though $L$ is “larger” than $H$ in the sense you point out, we can throw away the parts that don’t fit in $H$ and move forward with a fully understood world model. I believe this is likely feasible for world models, but not for policies (optimal policies for simple world models, like Go, can of course be much better than anything humans understand).
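As a loose toy illustration (again mine, and far simpler than what the hypothesis actually envisions): distillation gives one picture of what “throwing away the parts that don’t fit in $H$” could look like. Assuming scikit-learn, a depth-2 decision tree stands in for the fully understood model $H$, and the opaque model $L$ never touches the world directly:

```python
# A toy illustration of the factoring L --f--> H --g--> W: an opaque
# learned model L is distilled (f) into a model small enough to be fully
# human-understood (H), and only H is allowed to drive action in the
# world (g). Whatever in L doesn't survive distillation is thrown away.
# Assumes scikit-learn; the depth-2 tree is a stand-in for "fits in H".
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

L = GradientBoostingClassifier(random_state=0).fit(X, y)  # opaque learner

# f: compress L's learned behavior into a tiny, auditable surrogate H.
H = DecisionTreeClassifier(max_depth=2).fit(X, L.predict(X))
print(export_text(H))  # small enough for a human to read in full

# g: downstream effects on the world route only through H, never L.
def act(observation):
    return "intervene" if H.predict(observation.reshape(1, -1))[0] else "wait"

print(act(X[0]))
```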