Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the “shoggoth” copy responsible for generating all the ‘reasoning’ or ‘internal’ CoT, and then we have the “face” copy responsible for the ‘actions’ or ‘external’ outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prompt + all that CoT and then output a message to pass to the user, and then the process would repeat.
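For concreteness, here is a minimal sketch of that loop; the `generate(prompt) -> str` interface and the two model handles are assumptions for illustration, not anything specified in the original proposal.

```python
# Minimal sketch of the two-model loop described above, assuming `shoggoth` and
# `face` are two copies of the base model exposed through a hypothetical
# `generate(prompt) -> str` interface.

def dialogue_turn(shoggoth, face, transcript: str) -> str:
    """One turn: the Shoggoth writes hidden CoT, the Face writes the user-visible reply."""
    cot = shoggoth.generate(transcript)                              # internal reasoning tokens
    reply = face.generate(transcript + "\n[internal CoT]\n" + cot)   # external message to the user
    return reply
```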
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
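A sketch of what the blinding amounts to, assuming trajectories are stored as per-turn records containing both the Shoggoth's CoT and the Face's reply; `evaluate` stands in for whatever reward model or rater pipeline is used.

```python
# Sketch of the blinded evaluation step: the reward process is shown only the
# Face's outputs, never the Shoggoth's CoT. `evaluate` is a placeholder.

def score_trajectory(trajectory: list[dict], evaluate) -> float:
    visible = [step["reply"] for step in trajectory]  # the Shoggoth's CoT is deliberately dropped
    return evaluate(visible)
```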
I don’t see how this proposal could get you AGI-level capabilities. You need the “shoggoth” part of the model to gain general-purpose capabilities somehow, and if you are completely shielding the shoggoth from RL, all of the capabilities have to come from pre-training, which seems unlikely to be enough. My guess is that a more realistic proposal involves having separate reward models for the two parts (sketched after this list):
The shoggoth RM, which only rewards capabilities
The face RM, which also includes alignment-relevant stuff
But it isn’t clear to me how to separate out the alignment/capabilities. IIUC the hope is that the capabilities training would naturally incentivize the shoggoth to have bad goals, and perhaps instrumentally deceive the humans, but that it would do this deception in the CoT, in English, so we can see it.
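A rough sketch of that two-RM split, assuming rewards are computed per trajectory; `capability_rm` and `alignment_rm` are hypothetical reward models, and whether the capability RM should also be shown the Shoggoth's CoT is exactly the unresolved separation question above.

```python
# Hypothetical two-reward-model variant: the Shoggoth is updated only on a
# capabilities reward, while the Face is also updated on an alignment reward.
# The RL update itself is elided; this only shows how the signals would split.

def compute_rewards(trajectory: list[dict], capability_rm, alignment_rm):
    replies = [step["reply"] for step in trajectory]  # Face outputs (externally visible)
    shoggoth_reward = capability_rm(replies)                       # task success / capabilities only
    face_reward = capability_rm(replies) + alignment_rm(replies)   # plus alignment-relevant terms
    return shoggoth_reward, face_reward
```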
You later write:
How it all (hopefully) fits together: First, because of the paraphraser, the Shoggoth will not learn to speak in some optimized alien language, or even to speak in jargon or euphemisms—because semantics not recognized by the paraphraser will be scrambled by it. So the Shoggoth will continue to ‘think in English’ even as it undergoes lots of agency training and learns substantial new skills.
Second, because the training process is blind to the outputs of the Shoggoth (and instead only sees what the Face outputs), there should be no pressure on the Shoggoth to learn to deceive, conceal, dissemble, doublethink, etc. Instead the Shoggoth will be ‘free’ to put all its important thoughts in plain English.
Which suggests we are training the shoggoth with agency training. But I don’t understand how this is consistent with “the training process is blind to the outputs of the Shoggoth”. Did you mean just some “alignment” subset of the training process?
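For what it’s worth, a minimal sketch of where the paraphraser mentioned in the quoted passage could sit, assuming it is just another model with a `generate` interface; the prompt wording and the exact placement are guesses, not the post’s specification.

```python
# Sketch: the Shoggoth's CoT is re-expressed by a separate paraphraser model
# before anything downstream (the Face, or saved training data) sees it, so
# meanings the paraphraser does not recognize get scrambled.

def paraphrased_cot(shoggoth, paraphraser, transcript: str) -> str:
    raw_cot = shoggoth.generate(transcript)
    return paraphraser.generate("Paraphrase the following reasoning:\n" + raw_cot)
```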
I think you may be misunderstanding the proposal; I should clarify, sorry: The proposal is to blind the evaluation process to the internal reasoning, NOT to leave the internal reasoning untrained! The internal reasoning will of course be trained; it’s just that the process that evaluates it will be blind to it, and instead just look at the external outputs + outcomes. (That is, it looks at the outputs of the face, not the outputs of the shoggoth. The outputs of the shoggoth are inputs to the face.) For example, you could generate a billion trajectories of agentic behavior doing various tasks and chatting with various users, evaluate them, and then train the model to imitate the top 20% of them, and then repeat.
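A sketch of that loop, assuming each trajectory is stored as step records plus an outcome; `generate_trajectory`, `evaluate`, and `finetune_on` are placeholders for whatever pipeline is actually used, and the 20% cutoff is just the number from the comment above.

```python
# Sketch of the iterated "filter and imitate" loop: evaluation sees only Face
# outputs and outcomes; the selected trajectories (CoT included) then serve as
# imitation targets, so the Shoggoth is trained without ever being judged on
# its CoT directly.

def training_round(models, tasks, generate_trajectory, evaluate, finetune_on, keep_frac=0.2):
    trajectories = [generate_trajectory(models, task) for task in tasks]
    scored = sorted(
        trajectories,
        key=lambda t: evaluate([s["reply"] for s in t["steps"]], t["outcome"]),
        reverse=True,
    )
    top = scored[: max(1, int(keep_frac * len(scored)))]
    return finetune_on(models, top)  # both Shoggoth and Face imitate the kept trajectories
```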
Thanks—I see, I was misunderstanding.