What they are doing here is:
Train a large language model unsupervised in the standard way
Repeat:
    Take some supervised learning problem, throw away the supervised outputs, and repeat:
        Pick a supervised input and use the language model to sample N different outputs
        Select a “consensus” output among the N sampled outputs
        Store this “consensus output” as the new “target output” for the selected input
    Fine-tune the language model in a supervised way using the input/output pairs generated above
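In code, the loop might look something like the sketch below. This is a minimal illustration under my own assumptions, not the paper's implementation: `sample_fn` and `finetune_fn` are hypothetical stand-ins for the actual sampling and fine-tuning machinery, and majority vote over exact strings is just one possible consensus rule (the paper's selection could instead vote over extracted final answers).

```python
from collections import Counter
from typing import Callable
import random

def self_improve(
    inputs: list[str],
    sample_fn: Callable[[str, int], list[str]],
    finetune_fn: Callable[[list[tuple[str, str]]], None],
    n_samples: int = 32,
    n_rounds: int = 3,
) -> None:
    """Consensus self-training: label each input with the model's own
    majority-vote output, then fine-tune on those self-generated labels."""
    for _ in range(n_rounds):
        pairs: list[tuple[str, str]] = []
        for x in inputs:
            # Sample N candidate outputs from the current model.
            candidates = sample_fn(x, n_samples)
            # "Consensus" = majority vote over exact strings (an assumption;
            # any rule that narrows the output distribution would fit here).
            consensus, _ = Counter(candidates).most_common(1)[0]
            pairs.append((x, consensus))
        # Standard supervised fine-tuning on (input, consensus output) pairs.
        finetune_fn(pairs)

# Toy run with stub functions, just to exercise the control flow:
self_improve(
    inputs=["2+2=?"],
    sample_fn=lambda x, n: random.choices(["4", "4", "5"], k=n),
    finetune_fn=lambda pairs: print("fine-tuning on", pairs),
    n_rounds=1,
)
```

What the sketch makes concrete is that nothing in the loop ever consults ground truth; the only training signal is agreement among the model's own samples.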
It’s quite reasonable for them to call this “self-improvement”, but it’s only vaguely related to the kind of self-improvement much discussed in 2010s-era AI safety circles, where the worry was an AI improving its own architecture. Nothing in this paper’s procedure touches the model’s architecture; only its weights change.
Still, it’s bizarre that the thing being done in this paper works at all. It’s as if merely narrowing the distribution of outputs on a new supervised learning task innately homes in on the truth… whatever that means. Strange.