I think the shoggoth model is useful here (or see https://www.alignmentforum.org/posts/vJFdjigzmcXMhNTsx/simulators). An LLM learning to do next-token prediction well has a major problem that it has to master: who is this human whose next token it’s trying to simulate/predict, and how do they act? Are they, for example, an academic? A homemaker? A 4chan troll? A loose collection of Wikipedia contributors? These differences strongly affect what token they’re likely to emit next. So the LLM is strongly incentivized to learn to detect and then model all of these possibilities, what one might call personas, or masks, or simulacra. So you end up with a shapeshifter, adept at figuring out from textual cues what mask to put on and then at wearing it. One might describe it as something like an improv actor or, more colorfully, a shoggoth.
So then current alignment work is useful to the extent that it can cause the shoggoth to almost always put one of the ‘right’ masks on, and almost never put on one of the ‘wrong’ masks, regardless of cues, even when adversarially prompted. Experimentally, this seems quite doable by fine-tuning or RLHF, and/or by sufficiently careful filtering of your training corpus (e.g. not including 4chan in it).
A published result shows that you can’t get from ‘almost always’ to ‘always’ or ‘almost never’ to ‘never’: for any behavior that the network is capable of with any probability > 0, there exist prompts that will raise the likelihood of that outcome arbitrarily high. The best you can do is increase the minimum length of that prompt (and presumably the difficulty of finding it).
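To make the ‘minimum prompt length’ point concrete, here is a toy sketch. It is not the published proof: it simply assumes the model is a Bayesian mixture of one ‘bad’ persona and everything else, and that every adversarial prompt token is a fixed factor `likelihood_ratio` more probable under the bad persona. The function names and numbers are hypothetical, chosen only for illustration.

```python
def posterior_bad(prior_bad: float, likelihood_ratio: float, n_tokens: int) -> float:
    """Posterior weight on the 'bad' persona after n_tokens of adversarial prompt.

    Toy assumption: each prompt token is `likelihood_ratio` times more
    probable under the bad persona than under the rest of the mixture.
    """
    prior_odds = prior_bad / (1.0 - prior_bad)
    odds = prior_odds * likelihood_ratio ** n_tokens
    return odds / (1.0 + odds)


def min_prompt_length(prior_bad: float, likelihood_ratio: float, target: float) -> int:
    """Smallest number of prompt tokens that pushes the posterior up to `target`."""
    if not (0.0 < prior_bad < 1.0 and 0.0 < target < 1.0 and likelihood_ratio > 1.0):
        raise ValueError("need 0 < prior_bad < 1, 0 < target < 1, likelihood_ratio > 1")
    n = 0
    while posterior_bad(prior_bad, likelihood_ratio, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Shrinking the prior (e.g. by fine-tuning) lengthens the required prompt,
    # but for any prior > 0 some finite prompt length still reaches the target.
    for prior in (1e-3, 1e-6, 1e-9):
        print(prior, min_prompt_length(prior, likelihood_ratio=2.0, target=0.99))
```

In this toy model the required length grows only logarithmically as the prior shrinks, which is one way to see why driving a probability down from ‘small’ toward zero buys relatively little against a determined adversary.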
Now, it would be really nice to know how to align a model so that the probability of it doing next-token-prediction in the persona of, say, a 4chan troll was provably zero, not just rather small. Ideally, without also eliminating from the model the factual knowledge of what 4chan is or, at least in outline, how its inhabitants act. This seems hard to do by fine-tuning or RLHF: I suspect it’s going to take detailed automated interpretability up to fairly high levels of abstraction, finding the “from here on I am going to simulate a 4chan troll” feature(s), followed by doing some form of ‘surgery’ on the model (e.g. pinning the relevant feature’s value to zero, or at least throwing an exception if it’s ever not zero).
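As a rough illustration of what such ‘surgery’ could look like, here is a minimal sketch. It assumes, very optimistically, that interpretability has already handed us the persona feature as a single linear direction in some layer’s activations; real features may well be nonlinear or spread across layers. All names here (`direction`, `ForbiddenFeatureActive`, and so on) are hypothetical and do not come from any existing library.

```python
from typing import List, Sequence


class ForbiddenFeatureActive(RuntimeError):
    """Raised when the monitored persona feature is not (approximately) zero."""


def feature_value(activations: Sequence[float], direction: Sequence[float]) -> float:
    """Read the feature off as a dot product with its (assumed linear) direction."""
    return sum(a * d for a, d in zip(activations, direction))


def pin_feature_to_zero(activations: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Project the feature direction out of the activations, leaving the rest intact."""
    norm_sq = sum(d * d for d in direction)
    coeff = feature_value(activations, direction) / norm_sq
    return [a - coeff * d for a, d in zip(activations, direction)]


def assert_feature_inactive(activations: Sequence[float], direction: Sequence[float],
                            tolerance: float = 1e-6) -> None:
    """The 'throw an exception' variant: halt instead of silently editing."""
    value = feature_value(activations, direction)
    if abs(value) > tolerance:
        raise ForbiddenFeatureActive(f"feature value {value:.4g} exceeds {tolerance}")


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]          # made-up activations
    troll_direction = [0.0, 1.0, 0.0]  # made-up feature direction
    cleaned = pin_feature_to_zero(hidden, troll_direction)
    assert_feature_inactive(cleaned, troll_direction)  # passes after pinning
    print(cleaned)
```

Pinning edits the activations and carries on, while the exception variant stops generation outright; which is preferable presumably depends on how much you trust the feature to have been identified correctly.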
Now, this doesn’t fix the possibility of a sufficiently smart model inventing behaviors like deception or trolling or whatever for itself during its forward pass: it’s really only a formula for removing bad human behaviors that it learned to simulate from its training set and that are stored in its weights. It gives us a mind whose “System 1” behavior is aligned, which leaves only “System 2” development. For that, we probably need prompt engineering, and ‘translucent thoughts’ monitoring of its internal stream-of-thought/dynamic memory. But that seems rather more tractable: it’s more like moral philosophy, or contract law.