I think you may be misunderstanding the proposal; I should clarify, sorry: The proposal is to blind the evaluation process to the internal reasoning, NOT to avoid training the internal reasoning! The internal reasoning will of course be trained; it's just that the process that evaluates the model will be blind to it, and will instead look only at the external outputs + outcomes. (That is, it looks at the outputs of the face, not the outputs of the shoggoth. The outputs of the shoggoth are inputs to the face.) For example, you could generate a billion trajectories of agentic behavior doing various tasks and chatting with various users, evaluate them, train the model to imitate the top 20% of them, and then repeat.
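To make the blinding concrete, here's a minimal sketch of one iteration of that loop. The helpers (`generate_trajectory`, `score`, `finetune_on`) are hypothetical placeholders, not any particular codebase; the only point is what the evaluator gets to see versus what gets trained.

```python
# Sketch of one iteration of the blinded-evaluation loop described above.
# All helpers are hypothetical placeholders; the key detail is that scoring
# never sees the internal reasoning, but imitation training still includes it.

def blinded_iteration(model, tasks, n_per_task=4, keep_frac=0.2):
    scored = []
    for task in tasks:
        for _ in range(n_per_task):
            traj = generate_trajectory(model, task)
            # The evaluator is shown ONLY the visible output and the outcome,
            # never traj.internal_reasoning (the shoggoth's output).
            s = score(traj.visible_output, traj.outcome)
            scored.append((s, traj))
    # Keep the top fraction of trajectories by external score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = [traj for _, traj in scored[: int(len(scored) * keep_frac)]]
    # Imitation training uses the full trajectories, internal reasoning
    # included; the reasoning is trained, just never graded.
    return finetune_on(model, top)
```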
Thanks—I see, I was misunderstanding.