Inner alignment for simulators
Broadly agreed. I’d written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.
My current best argument for it goes as follows:
The central issue — the reason why “naive” approaches to just training an ML model to make good predictions will likely result in a mesa-optimizer — is that all such setups are “outer-misaligned” by default. They don’t optimize AIs towards being good world-models; they optimize them for making specific good predictions and then channeling those predictions through a low-bandwidth communication channel. (Answering a question, predicting the second part of a video/text, etc.)
That is, they don’t just simulate a world: they simulate a world, then locate some specific data they need to extract from the simulation, and translate them into a format understandable to humans.
As simulation complexity grows, it seems likely that these last steps would require powerful general intelligence/GPS as well. And at that point, it’s entirely unclear what mesa-objectives/values/shards it would develop. (Seems like it almost fully depends on the structure of the goal-space. And imagine if e. g. “translate the output into humanese” and “convince humans of the output” are very nearby, and then the model starts superintelligently optimizing for the latter?)
In addition, we can’t just train simulators “not to be optimizers”, in terms of locating optimization processes/agents within them and penalizing such structures. It’s plausible that advanced world-modeling is impossible without general-purpose search, and it would certainly be necessary inasmuch as the world-model would need to model humans.
If such a system seemed intuitively universal but wasn’t exploding, what kind of observation would tell you that it isn’t universal after all, and therefore salvage your claim?
Given that we don’t understand how current LLMs work, or what the “space of problems” generally looks like, it’s difficult to come up with concrete tests that I’m confident I won’t goalpost-move on. A prospective one might be something like this:
If you invent or find a board game of similar complexity to chess that [the ML model] has never seen before and explain the rules using only text (and, if [the ML model] is multimodal, also images), [a pre-AGI model] will not be able to perform as well at the game as an average human who has never seen the game before and is learning it for the first time in the same way.
I. e., an AGI would be able to learn problem-solving in completely novel, “basically-off-distribution” domains. And if a system that has capabilities like this doesn’t explode (and it’s not deliberately trained to be myopic or something), that would falsify my view.
But for me to be confident in putting weight on that test, we’d need to clarify some specific details about the “minimum level of complexity” of the new board game, and that it’s “different enough” from all known board games for the AI to be unable to just generalize from them… And given that it’s unclear in which directions it’s easy to generalize, I expect I wouldn’t be confident in any metric we’d be able to come up with.
I guess a sufficient condition for AGI would be “is able to invent a new scientific field out of whole cloth, with no human steering, as an instrumental goal towards solving some other task”. But that’s obviously an overly high bar.
As far as empirical tests for AGI-ness go, I’m hoping for interpretability-based ones instead. I. e., that we’re able to formalize what “general intelligence” means, then search for search in our models.
As far as my epistemic position, I expect three scenarios here:
We develop powerful interpretability tools, and they directly show whether my claims about general intelligence hold.
We don’t develop powerful interpretability tools before an AGI explodes like I fear and kills us all.
AI capabilities gradually improve until we get to scientific-field-inventing AGIs, and my mind changes only then.
In scenarios where I’m wrong, I mostly don’t expect to ever encounter any black-boxy test like you’re suggesting which seems convincing to me, before I encounter overwhelming evidence that makes convincing me a moot point.
(Which is not to say my position on this can’t be moved at all — I’m open to mechanical arguments about why cognition doesn’t work how I’m saying it does, etc. But I don’t expect to ever observe an LLM whose mix of capabilities and incapabilities makes me go “oh, I guess that meets my minimal AGI standard but it doesn’t explode, guess I was wrong on that”.)
Yeah, I think there’s a sharp-ish discontinuity at the point where we get to AGI. “General intelligence” is, as the name suggests, general — it implements some cognition that can efficiently derive novel heuristics for solving any problem/navigating arbitrary novel problem domains. And a system that can’t do that is, well, not an AGI.
Conceptually, the distinction between an AGI and a pre-AGI system feels similar to the distinction between a system that’s Turing-complete and one that isn’t:
Any Turing-complete system implements a set of rules that suffices to represent any mathematical structure/run any program. A system that’s just “slightly below” Turing-completeness, however, is dramatically more limited.
Similarly, an AGI has a complete set of some cognitive features that make it truly universal — features it can use to bootstrap any other capability it needs, from scratch. By contrast, even a slightly “pre-AGI” system would be qualitatively inferior, not simply quantitatively so.
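To make the flavor of that cliff concrete — this is my own stock illustration from the Chomsky hierarchy, not part of the original argument — a finite-state machine, which sits just one rung below automata with unbounded memory, cannot recognize balanced parentheses at arbitrary nesting depth, while a single unbounded counter suffices:

```python
def balanced(s: str) -> bool:
    """Decides balanced parentheses using one unbounded counter — a capability
    no finite-state machine (fixed memory) has for arbitrary nesting depth."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing paren with nothing open
                return False
    return depth == 0

print(balanced("(()())"))  # True
print(balanced("(()"))     # False
```

The point of the analogy: the gap between “fixed memory” and “one unbounded counter” is qualitative, not quantitative — no amount of scaling the finite-state machine closes it.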
There’s still some fuzziness around the edges, like whether any significantly useful R&D capabilities only appear with post-AGI cognition, or to what extent being an AGI is a sufficient, and not merely a necessary, condition for an omnicidal explosion.
But I do think there’s a meaningful sense in which AGI-ness is a binary, not a continuum. (I’m also hopeful regarding nailing all of this down mathematically, instead of just vaguely gesturing at it like this.)
Oh, I think I should’ve made clearer that I wasn’t aiming that rant at you specifically. Just outlining my general impression of how the two views feel socially.
A lot of slow takeoff, gradual capabilities ramp-up, multipolar AGI world type of thinking. Personally, I agree with him this sort of scenario seems both more desirable and more likely.
I think the operative word in “seems more likely” here is “seems”. It seems like a more sophisticated, more realistic, more modern, and satisfyingly nuanced view, compared to “the very first AGI we train explodes like a nuclear bomb and unilaterally sets the atmosphere on fire, killing everyone instantly”. The latter seems like an old view, a boringly simplistic retrofuturistic plot. It feels like there’s a relationship between these two scenarios, and that the latter one is a rough first-order approximation someone lifted out of e. g. The Terminator to get people interested in the whole “AI apocalypse” idea at the onset of it all. Then we gained a better understanding, sketched out detailed possibilities that take into account how AI and AI research actually work in practice, and refined that rough scenario. As a result, we got that picture of a slower multipolar catastrophe.
A pleasingly complicated view! One that respectfully takes into account all of these complicated systems of society and stuff. It sure feels like how these things work in real life! “It’s not like the AI wakes up and decides to be evil,” perish the thought.
That seeming has very little to do with reality. The unilateral-explosion isn’t the old, outdated scenario — it’s simply a different scenario, operating on a different model of how intelligence explosions proceed. And as far as its proponents are concerned, its arguments haven’t been overturned at all, and nothing about how DL works rules it out.
But it sure seems like the rough naive view that the Real Experts have grown out of a while ago; and that those who refuse to update simply haven’t done that growing-up, haven’t realized there’s a world outside their chosen field with all these Complicated Factors you need to take into account.
It makes it pretty hard to argue against. It’s so low-status.
… At least, that’s how that argument feels to me, on a social level.
(Edit: Uh, to be clear, I’m not saying that there are no reasons to buy the multipolar scenario except “it seems shiny”, or that a reasonable person could not come to believe it for valid reasons. I think it’s incorrect, and that some properties unfairly advantage it in the social context, but I’m not saying it’s totally illegitimate.)
Goals are functions over the concepts in one’s internal ontology, yes. But having a concept for something doesn’t mean caring about it — your knowing what a “paperclip” is doesn’t make you a paperclip-maximizer.
The idea here isn’t to train an AI with the goals we want from scratch, it’s to train an advanced world-model that would instrumentally represent the concepts we care about, interpret that world-model, then use it as a foundation to train/build a different agent that would care about these concepts.
Now this is admittedly very different from the thesis that value is complex and fragile.
I disagree. The fact that some concept is very complicated doesn’t mean it won’t be represented in any advanced AGI’s ontology. Humans’ psychology, or the specific tools necessary to build nanomachines, or the agent-foundations theory necessary to design aligned successor agents, are all also “complex and fragile” concepts (in the sense that getting a small detail wrong would result in a grand failure of prediction/planning), but we can expect such concepts to be convergently learned.
Not that I necessarily expect “human values” specifically to actually be a natural abstraction — an indirect pointer at “moral philosophy”/DWIM/corrigibility seems much more plausible and much less complex.
Two agents with the same ontology and very different purposes would behave in very different ways.
I don’t understand this objection. I’m not making any claim isomorphic to “two agents with the same ontology would have the same goals”. It sounds like maybe you think I’m arguing that if we can make the AI’s world-model human-like, it would necessarily also be aligned? That’s not my point at all.
The motivation is outlined at the start of 1A: I’m saying that if we can learn how to interpret arbitrary advanced world-models, we’d be able to more precisely “aim” our AGI at any target we want, or even manually engineer some structures over its cognition that would ensure the AGI’s aligned/corrigible behavior.
I agree that the AI would only learn the abstraction layers it’d have a use for. But I wouldn’t take it as far as you do. With “human values” specifically, I agree the problem may be just that muddled — but not with any of the other nice targets: moral philosophy, corrigibility, and DWIM should be more concrete.
The alternative would be a straight-up failure of the NAH, I think; your assertion that “abstractions can be on a continuum” seems directly at odds with it. Which isn’t impossible, but this post is premised on the NAH working.
the opaque test is something like an obfuscated physics simulation
I think it’d need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can’t tell whether there’s etheric interference involved or not. The way Fermat’s test can’t tell a Carmichael number from a prime — it just doesn’t interact with the input number in a way that’d reveal the difference between their internal structures.
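For concreteness, here’s the standard number-theoretic fact being invoked (a well-known property of Carmichael numbers, not anything from the discussion itself): the smallest Carmichael number, 561 = 3 × 11 × 17, passes the Fermat test for every base coprime to it, exactly as a true prime does:

```python
import math

def fermat_test(n, bases):
    """n "passes" for base a iff a^(n-1) ≡ 1 (mod n) — necessary but not
    sufficient for primality."""
    return all(pow(a, n - 1, n) == 1 for a in bases)

# 561 = 3 * 11 * 17 is composite, yet it satisfies Fermat's congruence
# for every base coprime to it (Korselt's criterion).
bases = [a for a in range(2, 20) if math.gcd(a, 561) == 1]
print(fermat_test(561, bases))  # True — composite, but "prime-shaped" to this test
print(fermat_test(557, bases))  # True — 557 actually is prime
```

The test simply never interacts with the input in a way that exposes the structural difference between the two cases — which is the property the analogy leans on.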
By analogy, we’d need some “simulation” which doesn’t interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we’d be able to detect the undesirable behavior with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn’t fit the bill.
It’s a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it’s provably impossible, i. e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for “no interference” and “yes interference”.
Models of world-models is a research direction I’m currently very interested in, so hopefully we can just rule that scenario out, eventually.
It seems like there are plenty of hopes
Oh, I agree. I’m just saying that there don’t seem to be any approaches aside from “figure out whether this sort of worst case is even possible, and under what circumstances” and “figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you’re training the AI on”.
Lazy World Models
It seems like “generators” should just be simple functions over natural abstractions? But I see two different ways to go with this, inspired either by the minimal latents approach, or by the redundant-information one.
First, suppose I want to figure out a high-level model of some city, say Berlin. I already have a “city” abstraction, let’s call it P(Λ), which summarizes my general knowledge about cities in terms of a probability distribution over possible structures. I also know a bunch of facts about Berlin specifically; let’s call their sum F. Then my probability distribution over Berlin’s structure is just P(X_Berlin) = P(Λ|F).
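A deliberately toy sketch of that first approach — the features and probabilities below are made up purely for illustration — conditioning a generic “city” prior on Berlin-specific facts:

```python
from fractions import Fraction

# Prior P(Λ): a "city" abstraction as a joint distribution over two
# toy structural features (invented numbers, purely illustrative).
prior = {
    ("has_metro", "coastal"): Fraction(1, 10),
    ("has_metro", "inland"):  Fraction(3, 10),
    ("no_metro",  "coastal"): Fraction(2, 10),
    ("no_metro",  "inland"):  Fraction(4, 10),
}

def condition(dist, consistent_with_facts):
    """P(Λ|F): keep only structures consistent with the known facts, renormalize."""
    kept = {s: p for s, p in dist.items() if consistent_with_facts(s)}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}

# Facts F about "Berlin" (toy): it has a metro system.
p_berlin = condition(prior, lambda s: s[0] == "has_metro")
print(p_berlin[("has_metro", "inland")])  # 3/4
```

The point is only the shape of the operation: the general abstraction is reused wholesale, and the city-specific model is just the prior restricted to whatever F rules in.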
Alternatively, suppose I want to model the low-level dynamics of some object I have an abstract representation for — in this case, suppose it’s the business scene of Berlin. I condition my abstraction of a business, P(B), on everything I know about Berlin, getting P(B|X_Berlin), then sample from the resulting distribution several times until I get a “representative set”. Then I model its behavior directly.
This doesn’t seem quite right, though.
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context gets removed, leaving only subagent-y GPS-instances over time.
You’re not taking into account larger selection effects on agents, which select against agents that purge all those “myopic” GPS-instances. The advantage of shards and other quick-and-dirty heuristics is that they’re fast — they’re what you’re using in a fight, or when making quick logical leaps, etc. Agents which purge all of them, and keep only slow deliberative reasoning, don’t live long. Or, rather, agents which are dominated by strong deliberative reasoning tend not to do that to begin with, because they recognize the value of said quick heuristics.
In other words: not all shards/subagents are completely selfish and sociopathic — some (or most) want to keep select others around. So even those that don’t “defend themselves” can be protected by others, or not be targeted to begin with.
A “chips-are-tasty” shard is probably not so advanced as to have reflective capabilities, and e. g. a more powerful “health” shard might want it removed. But if you have some even more powerful preferences for “getting to enjoy things”, or a dislike of erasing your preferences for practical reasons, the health-shard’s attempts might be suppressed.
A shard which implements a bunch of highly effective heuristics for escaping certain death is probably not one that any other shard/GPS instance would want removed.
So the concern is that “the AI generates a random number, sees that it passes the Fermat test, and outputs it” is the same as “the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it”, right?
Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to… distinguish them.
Okay, that’s literally tautological. I’m not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the idea that “to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between those two system-states”.
What are your current thoughts on the exact type signature of abstractions? In the Telephone Theorem post, they’re described as distributions over the local deterministic constraints. The current post also mentions that the “core” part of an abstraction is the distribution P[Λ], and its ability to explain variance in individual instances of Xi.
Applying the deterministic-constraint framework to trees, I assume it says something like “given certain ground-truth conditions (e. g., the environment of a savannah + the genetic code of a given tree), the growth of tree branches of that tree species is constrained like so, the rate of mutation is constrained like so, the spread of saplings like so, and therefore we should expect to see such-and-such distribution of trees over the landscape, and they’ll have such-and-such forms”.
Is that roughly correct? Have you arrived at any different framework for thinking about type signatures?
I think there’s a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.
It’s basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI’s training data. E. g., suppose it’s done via “etheric interference”, which we don’t know about, and which never fails and therefore never leads to any discrepancies in the data, so the AI can’t learn it via SSL either. Then the AI just… can’t learn about it, period. It’s not that it could in theory pick up on it but instead throws out that data, rolling both “no tampering” and “etheric tampering” into the same mechanistic explanation — it’s simply given no data on it to begin with. Only if etheric tampering ever fails, in a way that impacts the visible data, can the AI notice it and internally represent it.
Same for the Fermat primality test. If, during the training, we feed the AI a dataset of numbers that happens not to include any Carmichael numbers, it’s basically equivalent to teaching the AI on a dataset that includes no tampering of some specific type. If then, at runtime, a Carmichael number shows up (someone tampers with the sensors in that OOD way), the AI just fails, because it hasn’t been given the necessary training data not to fail.
So, yeah, if our sensors are going to be interfered-with by magic/aliens/supercriminals, in novel ways we don’t know about and which don’t show up on our training data, our AI won’t be able to foresee that either. But that’s a pure capability problem, solved via capability methods (catch them in the act and train the AI on the new data).
Edit: And if our threat model is the AI itself doing the tampering — well, it can hardly do it via a method it doesn’t know about, can it? And if it’s generating ~random actions and accidentally tampers with the sensors in a way it didn’t intend, that also seems like a capability problem that’ll be solved by routine capability work.
Also, in this case the AI’s generate-random-actions habits probably result in many more unintended side-effects than just sensor tampering, so its error should be easily noticed and corrected. The corner case is where there’s some domain of actions we don’t know about and the AI doesn’t know about, whose only side-effect is sensor tampering. But that seems very unlikely, and — again — that problem has nothing to do with the AI’s decision-making process, alignment, or lack thereof, and everything to do with no-one in the situation (AI or human) knowing how the world works.
There’s a specialized tool for that, actually! Snactiv. I’ve tried it, it’s pretty easy to use + you don’t have to put it down and pick it up when switching between snacking and typing.
You are arguing about agents being behaviorally aligned in some way on distribution, not arguing about agents being structured as wrapper-minds
I think I’m doing both. I’m using behavioral arguments as a foundation, because they’re easier to form, and then arguing that specific behaviors at a given level of capabilities can only be caused by some specific internal structures.
You are arguing that the above holds in setups where the only source of parameter updates is episodic reward, not arguing that the above holds in general across autonomous learning setups
Yeah, that’s a legitimate difference from my initial position: I wasn’t considering alternate setups like this when I wrote the post.
The part I think I’m still fuzzy on is why the agent limits out to caring about some correlate(s) of G, rather than caring about some correlate(s) of R.
Mainly because I don’t want to associate my statements with “reward is the optimization target”, which I think is a rather wrong intuition. As long as we’re talking about the fuzzy category of “correlates”, I don’t think it matters much? Inasmuch as R and G are themselves each other’s close correlates, a close correlate of one is likely a close correlate of the other.
(1) Hmm I don’t understand how this works if we’re randomizing the environments, because aren’t we breaking those correlations so the agent doesn’t latch onto them instead of the real goal?
(2) Also, in what you’re describing, it doesn’t seem like this agent is actually pursuing one fixed goal across contexts, since in each context, the mechanistic reason why it makes the decisions it does is because it perceives this specific G-correlate in this context, and not because it represents that perceived thing as being a correlate of G.
Consider an agent that’s been trained on a large number of games, until it reached the point where it can be presented with a completely unfamiliar game and be seen to win at it. What’s likely happening, internally?
The agent looks at the unfamiliar environment. It engages in some general information-gathering activity, fine-tuning the heuristics for it as new rules are discovered, building a runtime world-model of this new environment.
Once that’s done, it needs to decide what to do in it. It feeds the world-model to some “goal generator” feature, and it spits out some goals over this environment (which are then fed to the heuristics generator, etc.).
The agent then pursues those goals (potentially branching them out into sub-goals, etc.), and its pursuit of these goals tends to lead to it winning the game.
To Q1: The agent doesn’t have hard-coded environment-specific correlates of G that it pursues; the agent has a procedure for deriving-at-runtime an environment-specific correlate of G.
To Q2: Doesn’t it? We’re prompting the agent with thousands of different games, and the goal generator reliably spits out a correlate of “win the game” in every one of them; and then the agent is primarily motivated by that correlate. Isn’t this basically the same as “pursuing a correlate of G independent of the environment”?
As to whether it’s motivated to pursue the G-correlate because it’s a G-correlate — to answer that, we need to speculate on the internals of the “goal generator”. If it reliably spits out local G-correlates, even in environments it never saw before, doesn’t that imply that it has a representation of a context-independent correlate of G, which it uses as a starting point for deriving local goals?
If we were prompting the agent only with games it has seen before, then the goal-generator might just be a compressed lookup table: the agent would’ve been able to just memorize a goal for every environment it’s seen, and this procedure just de-compresses them.
But if that works even in OOD environments — what alternative internal structure do you suggest the goal-generator might have, if not one that contains a context-independent G-correlate?
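To make the contrast concrete, here’s a deliberately toy sketch — entirely my own illustration, with made-up structure and names — of the two candidate goal-generator internals:

```python
# Toy contrast (hypothetical): a memorizing goal-generator vs. one that
# derives local goals from a context-independent representation of G ("win").

MEMORIZED_GOALS = {
    "chess": "checkmate the opponent",
    "go": "surround more territory",
}

def lookup_goal_generator(env_name: str) -> str:
    """Compressed lookup table: only covers environments seen in training."""
    return MEMORIZED_GOALS[env_name]  # raises KeyError on anything OOD

def derived_goal_generator(world_model: dict) -> str:
    """Starts from a context-independent G-correlate ("reach terminal states
    rated as wins") and specializes it using the runtime world-model."""
    win_states = world_model["terminal_states_favoring_self"]
    return f"steer the game toward: {win_states}"

# The derived generator handles a never-before-seen game; the lookup one can't.
novel_game = {"terminal_states_favoring_self": "opponent's flag captured"}
print(derived_goal_generator(novel_game))
```

The lookup table fails closed on anything outside its keys; the derived version only needs the runtime world-model to expose a win condition, which is the structural difference the question is pointing at.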
Well, you do address this:
there’s no requirement that the mechanistic cause of the heuristic-generator suggesting a particular goal in this environment is because it represented that environment-specific goal as subserving G or some other fixed goal, rather than because it recognized the decision-relevant factors in the environment from previously reinforced experiences (without the need for some fixed goal to motivate the recognition of those factors).
… I don’t see a meaningful difference, here. There’s some data structure internal to the goal generator, which it uses as a starting point when deriving a goal for a new environment. Reasoning from that data-structure reliably results in the goal generator spitting out a local G-correlate. What are the practical differences between describing that data structure as “a context-independent correlate of G” versus “decision-relevant factors in the environment”?
Or, perhaps a better question to ask is, what are some examples of these “decision-relevant factors in the environment”?
E. g., in the games example, I imagine something like:
The agent is exposed to the new environment; a multiplayer FPS, say.
It gathers data and incrementally builds a world-model, finding local natural abstractions. 3D space, playable characters, specific weapons, movements available, etc.
As it’s doing that, it also builds more abstract models. Eventually, it reduces the game to its pure mathematical game-theoretic representation, perhaps viewing it as a zero-sum game.
Then it recognizes some factors in that abstract representation, goes “in environments like this, I must behave like this”, and “behave like this” is some efficient strategy for scoring the highest.
Then that strategy is passed down the layers of abstraction, translated from the minimalist math representation to some functions/heuristics over the given FPS’ actual mechanics.
Do you have something significantly different in mind?
As I mentioned, bad exploration and deceptive alignment are names for the same phenomenon at different levels of cognitive sophistication
I still don’t see it. I imagine “deceptive alignment”, here, to mean something like:
“The agent knows G, and that scoring well at G reinforces its cognition, but it doesn’t care about G. Instead, it cares about some V. Whenever it notices its capabilities improve, it reasons that this’ll make it better at achieving V, so it attempts to do better at G because it wants the outer optimizer to preferentially reinforce said capabilities improvement.”
This lets it decouple its capabilities growth from G-caring: its reasoning starts from V, and only features G as an instrumental goal.
But what’s the bad-exploration low-sophistication equivalent of this, available before it can do such complicated reasoning, that still lets it couple capabilities growth with better performance on G?
Can you walk me through that spectrum, of “bad exploration” to “deceptive alignment”? How does one incrementally transform into the other?
For reference, I think you’ve formed a pretty accurate model of my model.
Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).
Yup. But this requires these GPS instances to be advanced enough to do gradient-hacking, and indeed be concerned with preventing their current values from being updated away. Two reasons not to expect that:
Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
Consider an impulse to escape from certain death. It’s an “instinctive” GPS instance; the GPS has been prompted with “escape certain death”, and that prompt is not downstream of abstract moral philosophy. It’s an instinctive reaction.
But this GPS instance doesn’t care about preventing future GPS instances from updating away the conditions that caused it to be initiated (i. e., the agent’s tendency to escape certain death). It’s just tasked with deriving a plan for the current situation.
It wouldn’t bother looking up what abstract-moral-philosophy is plotting and maybe try to counter-plot. It wouldn’t care about that.
(And even if it does care, it’d be far from its main priority, so it wouldn’t do effective counter-plotting.)
The hard-coded pointer to value compilation might simply be chiseled-in before the agent is advanced enough to do gradient-hacking. In that case, even if a given GPS instance would care to plot against abstract values, it just wouldn’t know how (or know that it needs to).
That said, you’re absolutely right that it does happen in real-life agents. Some humans are suspicious of abstract arguments for the greater good and refuse to e. g. go from deontologists to utilitarians. The strength of the drive for value compilation relative to the shards’ strength varies, and depending on it, the process of value compilation may be frozen at some arbitrary point.
It partly falls under the meta-cognition section. But in even more extreme cases, a person may simply refuse to engage in value-compilation at all, expressing a preference not to be coherent.
… Which is an interesting point, actually. We want there to be some value compilation, or agents just wouldn’t be able to generalize OOD at all. But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare. But maybe we can replicate what human deontologists are doing, and alter the “power distribution” among the AGI’s processes such that value compilation freezes just before this point?
I may be overlooking some reason this wouldn’t work, but seems promising at first glance.
I genuinely think it might be harder than the actual technical problem of alignment, and we ought to look for any path which isn’t that hopelessly doomed.