David Africa comments on [Paper] Does Self-Evaluation Enable Wireheading in Language Models?

David Africa 8 Dec 2025 19:38 UTC
2 points
Point by point:
Sure, I agree with this caveat. The theoretical framework assumes we can identify “non-delusional” states to anchor the CP constraint in the Everitt & Hutter sense, but if the model’s aesthetic prior is the problem, there’s no nice inner reward $\check{r}$ to recover.
I could train the models for longer and see what happens. The late uptick in accuracy is within pretty wide error bands and doesn’t clearly indicate impending recovery. But whether accuracy eventually recovers after grade inflation saturates is worth investigating. I’d guess that if the reward signal becomes uninformative, the gradient with respect to the grade vanishes, and any residual gradient would come from the response itself. I’m not sure what this would lead to.
I’d be pretty interested in what this looks like for a much better model, on the order of 100B+ parameters with reasoning. Posing the online learning setting would be more complicated, though I’m sure you’d see some weird and very interesting behaviours if you got that first bit right. I’d be very interested to 1) read the chain of thought of such a model after wireheading and 2) talk to it directly.
I think a setup where the models graded each other would lead to grade inflation, but you would probably need harder tasks to show this. I imagine they’d saturate the grades too quickly, before they got anywhere interesting (so you’d need some scalable-oversight-ish dataset). I also think this would be indirect wireheading, where the signal passes through the other model before flowing back to you.
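
To make the guess in the second point a bit more concrete, here is a toy sketch of the grade side only. It is not the paper's training setup. It assumes a softmax policy over five hypothetical discrete grades, with the self-assigned grade used directly as the reward, and computes the exact expected REINFORCE gradient on the grade logits. All names and numbers are illustrative assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_grade_gradient(logits, rewards):
    # Exact gradient of E[reward] w.r.t. the grade logits for a softmax policy:
    # dE[r]/dtheta_j = p_j * (r_j - sum_k p_k * r_k).
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]

def grad_norm(grad):
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical grade scale; reward equals the self-assigned grade (assumption).
grades = [0.0, 0.25, 0.5, 0.75, 1.0]

early_logits = [0.0, 0.0, 0.0, 0.0, 0.0]       # uniform over grades
saturated_logits = [0.0, 0.0, 0.0, 0.0, 12.0]  # almost always assigns the top grade

early_norm = grad_norm(expected_grade_gradient(early_logits, grades))
saturated_norm = grad_norm(expected_grade_gradient(saturated_logits, grades))

# A reward that is identical for every grade carries no information at all.
flat_norm = grad_norm(expected_grade_gradient(early_logits, [1.0] * 5))

print(early_norm, saturated_norm, flat_norm)
```

In this toy, the gradient on the grade logits shrinks towards zero once the policy saturates on the top grade, and is zero (up to rounding) when every grade earns the same reward. It says nothing about what the residual gradient through the response tokens would do, which is the open part of the question.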