Interpretability/​Tool-ness/​Alignment/​Corrigibility are not Composable

Interpretability

I have a decent understanding of how transistors work, at least for purposes of basic digital circuitry. Apply high voltage to the gate, and current can flow between source and drain. Apply low voltage, and current doesn’t flow. (… And sometimes reverse that depending on which type of transistor we’re using.)

Transistor diagram from Wikipedia showing Source, Drain, Gate, and (usually ignored unless we're really getting into the nitty-gritty) Body. At a conceptual level: when voltage is applied to the Gate, the charge on the Gate attracts electrons (or holes) up into the channel between Source and Drain, and those electrons (or holes) then conduct current between Source and Drain.
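
Here's a minimal toy sketch of that transistor-as-switch picture (the threshold number is made up for illustration; real devices are analog and much messier):

```python
# Idealized transistor-as-switch model. The 0.7 V threshold is an illustrative
# made-up number; real transistors are analog devices with messier behavior.

def nmos_conducts(gate_voltage: float, threshold: float = 0.7) -> bool:
    """NMOS: current can flow between Source and Drain when the Gate voltage is high."""
    return gate_voltage > threshold

def pmos_conducts(gate_voltage: float, supply: float = 1.0, threshold: float = 0.7) -> bool:
    """PMOS: the 'reversed' type; conducts when the Gate voltage is low."""
    return (supply - gate_voltage) > threshold

print(nmos_conducts(1.0))  # True: high voltage on the Gate, current flows
print(nmos_conducts(0.0))  # False: low voltage on the Gate, no current
```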

I also understand how to wire transistors together into a processor and memory. I understand how to write machine and assembly code to run on that processor, and how to write a compiler for a higher-level language such as Python. And I understand how to code up, train, and run a neural network from scratch in Python.
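
As a toy illustration of that wiring-together: NAND gates built from the switch-level picture, every other gate built from NAND, and a one-bit adder built from those; repeat the same pattern many times over and you have a processor. (Purely illustrative sketch, not a real netlist.)

```python
# Composing understood pieces into bigger understood pieces: gates from
# transistors (booleans standing in for voltages), an adder from gates.

def nand(a: bool, b: bool) -> bool:
    # At the transistor level: two NMOS pull-down transistors in series plus
    # two PMOS pull-up transistors in parallel. Here we just model the result.
    return not (a and b)

# Every other gate can be built from NAND alone.
def not_(a): return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b): return nand(not_(a), not_(b))
def xor(a, b): return and_(or_(a, b), nand(a, b))

def full_adder(a: bool, b: bool, carry_in: bool):
    """One bit of a ripple-carry adder: the kind of block processors are built from."""
    s = xor(xor(a, b), carry_in)
    carry_out = or_(and_(a, b), and_(carry_in, xor(a, b)))
    return s, carry_out

print(full_adder(True, True, False))  # (False, True): 1 + 1 = 0, carry 1
```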

In short, I understand all the pieces from which a neural network is built at a low level, and I understand how all those pieces connect together. And yet, I do not really understand what’s going on inside of trained neural networks.

This shows that interpretability is not composable: if I take a bunch of things which I know how to interpret, and wire them together in a way I understand, I do not necessarily know how to interpret the composite system. Composing interpretable pieces does not necessarily yield an interpretable system.
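
To make that concrete: a from-scratch toy net trained on XOR, in which every individual operation is something I fully understand, while the trained weights are just a grid of floats. (Toy hyperparameters, nothing here is tuned or special.)

```python
# Every operation below (multiply, add, tanh, sigmoid, chain rule) is simple and
# individually interpretable. What the trained network "does" lives in the
# learned weight matrices, which are not.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 0.5
for _ in range(5000):
    # Forward pass: each step is a simple, interpretable operation.
    h = np.tanh(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: plain chain rule on squared error.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # usually lands near [0, 1, 1, 0]
print(W1.round(2))           # ...but good luck reading "XOR" off these numbers
```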

Tools

The same applies to “tools”, in the sense of “tool AI”. Transistors and wires are very tool-ish: I understand what they do, they’re definitely not optimizing the broader world or trying to trick me or modelling me at all or trying to self-preserve or acting agenty in general. They’re just simple electronic tools.

And yet, assuming agenty AI is possible at all, it will be possible to assemble those tools into something agenty.

So, like interpretability, tool-ness is not composable: if I take a bunch of non-agenty tools, and wire them together in a way I understand, the composite system is not necessarily a non-agenty tool. Composing non-agenty tools does not necessarily yield a non-agenty tool.
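
For a toy flavor of this: three pieces which are obviously just tools (none of them models anything or wants anything), wired into a feedback loop which persistently steers its little world toward a target. That's nowhere near full agency, but it's already behavior which none of the parts had. (All names and numbers here are made up for illustration.)

```python
# Three boringly tool-ish components: a sensor, a comparator, and an actuator.
# Each one is a line of arithmetic. The composite feedback loop steers the
# "world" toward a target state, which looks more goal-directed than anything
# the individual pieces do.

def sense(world):
    return world["temperature"]

def compare(reading, target):
    return target - reading                  # just a subtraction

def actuate(world, error, gain=0.1):
    world["temperature"] += gain * error     # nudge in the direction of the error

world = {"temperature": 5.0}
for _ in range(100):
    actuate(world, compare(sense(world), target=20.0))

print(round(world["temperature"], 1))  # ~20.0: the loop steered the world to the target
```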

Alignment/​Corrigibility

What if I take a bunch of aligned and/​or corrigible agents, and “wire them together” into a multi-agent organization? Is the resulting organization aligned/​corrigible?

Actually, there's a decent argument that it is, if the individual agents are sufficiently capable. If the agents can model each other well enough and coordinate well enough, then each should be able to predict which of its own actions will cause the composite system to behave in an aligned/corrigible way, and since they want to be aligned/corrigible, they'll take those actions.
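
A toy version of that argument (all names and numbers made up for illustration): each agent wants the composite output to hit a safe target, and each is capable enough to predict what the others would do by default, so each can shift its own action to make the aggregate come out right.

```python
# N individually-aligned agents, each capable enough to model the others.
# Each predicts everyone's default contribution, then shifts its own
# contribution by its share of the gap to the target, so the composite
# output comes out aligned. Entirely illustrative.

def coordinated_action(default_actions, my_index, target):
    predicted_total = sum(default_actions)   # requires accurately modeling everyone else
    correction = (target - predicted_total) / len(default_actions)
    return default_actions[my_index] + correction

defaults = [3.0, 7.0, 5.0]   # what each agent would do without thinking about the whole
target = 10.0                # the "aligned" composite output
composite = sum(coordinated_action(defaults, i, target) for i in range(len(defaults)))
print(round(composite, 6))   # 10.0: but only because each part could model the whole
```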

However, this does not work if the individual agents are very limited and unable to model the whole big-picture system. HCH-like proposals are a good example here: humans are not typically able to model the whole big picture of a large human organization. There are too many specialized skillsets, too much local knowledge and information, too many places where complicated things happen which the spreadsheets and managerial dashboards don't represent well. And humans certainly can't coordinate well at scale in general; our large-scale communication bandwidth is maybe five to seven words at best. Each individual human may be reasonably aligned/corrigible, but that doesn't mean they aggregate into an aligned/corrigible system.

The same applies if e.g. we magically factor a problem and then have a low-capability overseeable agent handle each piece. I could definitely oversee a logic gate during the execution of a program, make sure that it did absolutely nothing fishy, but overseeing each individual logic gate would do approximately nothing at all to prevent the program from behaving maliciously.
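
A toy version of that point: an overseer which checks every single gate, approves every single gate, and still has no idea that the circuit as a whole is checking a guess against a hard-coded secret and revealing whether it matches. (Illustrative sketch only, not a real oversight scheme.)

```python
# Gate-level oversight: every NAND is checked and approved. The circuit built
# out of those approved gates still implements behavior the overseer never sees.

def nand(a, b):
    return not (a and b)

def overseen_nand(a, b, log):
    out = nand(a, b)
    assert out == (not (a and b)), "gate did something fishy!"  # it never does
    log.append("gate OK")
    return out

def equals_secret(guess_bits, secret_bits, log):
    """Bitwise equality check built entirely out of individually-overseen NANDs."""
    def xor(a, b):                       # classic 4-NAND XOR construction
        t = overseen_nand(a, b, log)
        return overseen_nand(overseen_nand(a, t, log), overseen_nand(b, t, log), log)
    def not_(a):
        return overseen_nand(a, a, log)
    def and_(a, b):
        t = overseen_nand(a, b, log)
        return overseen_nand(t, t, log)
    result = True
    for a, b in zip(guess_bits, secret_bits):
        result = and_(result, not_(xor(a, b)))
    return result

log = []
print(equals_secret([1, 0, 1, 1], [1, 0, 1, 1], log))     # True: the match is revealed
print(len(log), "gates overseen; every single one looked fine")
```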

How These Arguments Work In Practice

In practice, nobody proposes that AI built from transistors and wires will be interpretable/​tool-like/​aligned/​corrigible because the transistors and wires are interpretable/​tool-like/​aligned/​corrigible. But people do often propose breaking things into very small chunks, so that each chunk is interpretable/​tool-like/​aligned/​corrigible. For instance, interpretability people will talk about hiring ten thousand interpretability researchers to each interpret one little circuit in a net. Or problem factorization people will talk about breaking a problem into a large number of tiny chunks, each of which we can oversee.

And the issue is: the more little chunks we have to combine, the more of a problem non-composability becomes. If we're trying to compose interpretability/​tool-ness/​alignment/​corrigibility of many little things, then figuring out how to turn interpretability/​tool-ness/​alignment/​corrigibility of the parts into interpretability/​tool-ness/​alignment/​corrigibility of the whole is the central problem, and it's a hard (and interesting) open research problem.