Aligned AI as a wrapper around an LLM

In my previous post, "Are extrapolation-based AIs alignable?", I argued that an AI trained only to extrapolate some dataset (like an LLM) can't really be aligned, because it has no notion of what information can be shared, when, and with whom. So to be used for good, it needs to be in the hands of a good operator.

This suggests that the "operator" of an LLM could itself be another, smaller AI, wrapped around it and trained for alignment. The wrapper would handle all interactions with the world and decide when and how to call the internal LLM, delegating most of the intelligence work to it.
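To make the shape of the proposal concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `call_llm`, `Request`, `Wrapper`, and the two decision methods are stand-ins for a trained wrapper model and a real LLM, not an implementation of either.

```python
# Toy sketch of the wrapper-around-LLM architecture described above.
# `call_llm` stands in for the big extrapolation-only model; `Wrapper`
# stands in for the small AI trained for alignment. All names are made up.

from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for the internal LLM: pure text-in, text-out extrapolation."""
    return f"<LLM completion for: {prompt!r}>"


@dataclass
class Request:
    sender: str  # who is asking
    text: str    # what they asked


class Wrapper:
    """Small aligned operator: owns all I/O and treats the LLM as a tool."""

    def should_consult_llm(self, request: Request) -> bool:
        # In the proposal this judgment would be learned, not hard-coded.
        return len(request.text.strip()) > 0

    def is_safe_to_share(self, request: Request, completion: str) -> bool:
        # The alignment-relevant decision: what information goes to whom.
        # A stub here; the wrapper would be trained to make this call.
        return "<secret>" not in completion

    def handle(self, request: Request) -> str:
        if not self.should_consult_llm(request):
            return "I can't help with that."
        completion = call_llm(request.text)
        if self.is_safe_to_share(request, completion):
            return completion
        return "I'd rather not share that."


if __name__ == "__main__":
    wrapper = Wrapper()
    print(wrapper.handle(Request(sender="user", text="Summarize this paper for me.")))
```

The point of the sketch is only the division of labor: the outer object makes the decisions about whom to talk to and what to reveal, while all actual "thinking" happens inside the opaque text-completion call.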

Q1: In this approach, do we still need to finetune the LLM for alignment?

A: Hopefully not. We would train it only for extrapolation, and train the wrapper AI for alignment.

Q2: How would we train the wrapper?

A: I don't know. For the moment I'll handwave it with: "the wrapper is smaller, and its interactions with the LLM are text-based, so training it for alignment should be simpler than training one big opaque AI for both intelligence and alignment at once." But this part is still very fuzzy to me.

Q3: If the LLM+wrapper combination is meant to be aligned, and the LLM isn’t aligned on its own, wouldn’t the wrapper need to know everything about human values?

A: Hopefully not, because information about human values can be coaxed out of the LLM (maybe by using magic words like “good”, “Bertrand Russell”, “CEV” and so on) and I’d expect the wrapper to learn to do just that.
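As a toy illustration of what "coaxing" could look like, the wrapper might phrase its internal queries so that the LLM's completion carries the value-laden judgment. This is purely illustrative; the prompt wording, the `value_query` helper, and the stub LLM below are all invented for this sketch.

```python
# Purely illustrative: one way the wrapper might elicit value-relevant
# judgments from the extrapolation-only LLM via a prompt, rather than
# containing a model of human values itself.

def value_query(llm, proposed_action: str) -> str:
    prompt = (
        "A wise and benevolent person, in the spirit of Bertrand Russell, "
        "is asked whether the following action would be good for humanity:\n"
        f"  {proposed_action}\n"
        "Their answer, with reasons, is:"
    )
    return llm(prompt)


if __name__ == "__main__":
    stub_llm = lambda p: f"<LLM completion for: {p!r}>"
    print(value_query(stub_llm, "share these lab results publicly"))
```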

Q4: Wouldn’t the wrapper become a powerful AI of its own?

A: Again, hopefully not. My hypothesis is that its intelligence growth would be "stunted" by the availability of the LLM: since it can offload hard cognitive work to the LLM, there's little training pressure for it to become very capable on its own.

Q5: Wouldn’t the wrapper be vulnerable to takeover by a mesa-optimizer in the LLM?

A: Yeah. I don’t know how real that danger is. We probably need to see such mesa-optimizers in the lab, so we can train the wrapper to avoid invoking them.

Anyway, I understand that putting an alignment proposal out there is kind of sticking my neck out. It's very possible that the whole idea is fatally incomplete or unworkable, like the examples Nate described. So please feel free to poke holes in it.