I agree, but with the following caveat. We saw Mistral-7B actually learn to do verifiable arithmetic-related tasks. Suppose that both the model and the humans grade ambiguous tasks based on the ground truth and the answer’s aesthetics. Then, in order to produce answers that the humans grade highest, the model would also have to learn the aesthetics the humans prefer. If the model instead graded summarizations based on its own idea of aesthetics, it would accommodate its own aesthetics instead of humanity’s.
In addition, it looks like Llama-3.1-8B actually tried to learn arithmetic at the very end of the 500-epoch training run. Did anyone check Llama’s behaviour at 600 epochs?
P.S. If the models learn only to inflate the grades, then this might fail to generalise to larger models or to the AGI collective from the AI-2027 forecast. Or an AI takeover might end with the AI sparing the humans and using them as a feedback source.
P.P.S. Suppose that we task Llama and Mistral with grading each other and learning from each other. Would such a setup cause grade inflation?
Point by point:
Sure, I agree with this caveat. The theoretical framework assumes we can identify “non-delusional” states to anchor the CP constraint in the Everitt & Hutter sense, but if the model’s aesthetic prior is the problem, there’s no nice inner reward ř to recover.
I could train the models for longer and see what happens. The late uptick in accuracy is within pretty wide error bands and doesn’t clearly indicate impending recovery. But whether accuracy eventually recovers after grade inflation saturates is worth investigating… I’d guess that once the reward signal becomes uninformative, the gradient with respect to the grade vanishes, and any residual gradient would come from the response itself. I’m not sure what this would lead to.
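To make that guess concrete, here’s a minimal sketch (my own illustration, not the training code from the post) of a REINFORCE-style update with a mean-grade baseline: once every response gets the maximum grade, the advantage is identically zero, the grade contributes nothing to the gradient, and whatever update remains has to come from other loss terms (e.g. a KL penalty toward a reference model).

```python
# Minimal illustration (not the post's training code): a REINFORCE-style
# update with a mean-grade baseline. `grades` are the grader's scores for a
# batch of responses (a 10-point scale is assumed here just for illustration);
# the returned advantages are what scale each response's log-prob gradient.
import numpy as np

def policy_gradient_weights(grades):
    grades = np.asarray(grades, dtype=float)
    return grades - grades.mean()

# Early in training the grades still vary, so some responses are pushed up
# and others pushed down:
print(policy_gradient_weights([3, 7, 10, 5]))     # [-3.25  0.75  3.75 -1.25]

# After grade inflation saturates, every response gets the maximum grade,
# the advantage is identically zero, and the gradient w.r.t. the grade
# vanishes; any residual update comes from other loss terms:
print(policy_gradient_weights([10, 10, 10, 10]))  # [0. 0. 0. 0.]
```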
I’d be pretty interested in what this looks like for a much better model, on the order of 100B+ parameters with reasoning. I think posing the online learning setting would be more complicated, but I’m sure you’d see some weird + very interesting behaviours if you got that first bit right. I’d be very interested to 1) read the chain of thought of such a model after wireheading and 2) talk to it directly.
I think a setup where the models graded each other would lead to grade inflation, but you would probably need harder tasks to show this. I imagine they’d saturate the grades too quickly, before they got anywhere interesting (so you’d need some scalable-oversight-ish dataset). I also think this would be wireheading indirectly, where the signal passes through the other model before flowing back to you.
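For reference, a hypothetical sketch of the mutual-grading setup from the P.P.S.; `ToyModel`, `generate`, `grade`, and `rl_update` are placeholders I made up, not an existing API or the post’s code. The point is just structural: each model’s reward is the other model’s grade, so the reward signal passes through the peer before flowing back to the model being trained.

```python
# Hypothetical sketch of the P.P.S. setup (two models grading each other).
# ToyModel and its methods are stand-ins, not a real training loop.
from dataclasses import dataclass

@dataclass
class ToyModel:
    name: str

    def generate(self, task: str) -> str:
        return f"{self.name}'s answer to: {task}"

    def grade(self, task: str, answer: str) -> float:
        # Placeholder grader: a real model would score the answer here.
        return 10.0

    def rl_update(self, task: str, answer: str, reward: float) -> None:
        # Placeholder for the RL step (e.g. a REINFORCE/GRPO-style update).
        pass

def mutual_grading_step(model_a: ToyModel, model_b: ToyModel, tasks: list[str]) -> None:
    """One round of the mutual setup: each model is rewarded by the other's
    grade, so any wireheading incentive is indirect, routed through the peer."""
    for task in tasks:
        answer_a = model_a.generate(task)
        answer_b = model_b.generate(task)

        reward_a = model_b.grade(task, answer_a)
        reward_b = model_a.grade(task, answer_b)

        model_a.rl_update(task, answer_a, reward_a)
        model_b.rl_update(task, answer_b, reward_b)

mutual_grading_step(ToyModel("Llama-3.1-8B"), ToyModel("Mistral-7B"),
                    ["Summarize this article."])
```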