[Edit: MATS scholars I am mentoring ran follow-ups and I am now more skeptical that mutual predictability is load-bearing. Will release something about this soon.]
This is very exciting!
I was wondering if this was in part due to few-shot prompting being weird, but when using the same kind of idea to train probes in classic truth-probing settings (in particular, searching for labels via something like argmin_{consistent set of labels y} k_fold_cross_validation(y)), I also get ~100% PGR:
(Llama 2 13B, layer 33/48, linear probe at the last position, regularization of C = 1e-3, n=100 training inputs with no labels, n=100 test inputs, using the prompt format and activation-collection code from this, and using max(AUROC, 1−AUROC). These were the first hyperparameters I tried.)
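To make the label-search idea concrete, here is a minimal sketch of what I mean (hypothetical illustration, not my actual code; function names and the greedy search strategy are my own simplifications): fit a regularized linear probe on candidate labelings of the activations, and greedily flip individual labels whenever that lowers the k-fold cross-validation loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_loss(acts, labels, k=3, C=1e-3):
    """Negated mean k-fold CV accuracy of a linear probe trained on the
    candidate labeling (lower is better)."""
    # Degenerate labelings (too few examples of a class for k folds)
    # can't be scored, so treat them as no better than chance.
    if min(np.bincount(labels, minlength=2)) < k:
        return 0.0
    probe = LogisticRegression(C=C, max_iter=1000)
    return -cross_val_score(probe, acts, labels, cv=k).mean()

def search_labels(acts, n_sweeps=2, seed=0):
    """Greedy approximation of argmin over label assignments: start from a
    random labeling and flip one label at a time when it lowers CV loss."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(acts))
    for _ in range(n_sweeps):
        for i in range(len(labels)):
            flipped = labels.copy()
            flipped[i] ^= 1
            if cv_loss(acts, flipped) < cv_loss(acts, labels):
                labels = flipped
    return labels
```

A full implementation would search more cleverly than one-flip hill climbing, but this captures the objective: the "best" labeling is the one a probe can most consistently predict across folds.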
(ag_news has a weird prompt format, which might explain why the truth is less salient there and the method fails)
I think that there is something very interesting going on with consistency + mutual predictability, and it’s not clear to me why this works so well.
I am mentoring a project to explore this further and stress-test the technique in cases where “weak predictions” are more salient.
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancier probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just as CCS does.
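For readers unfamiliar with CCS: its objective (from Burns et al.'s "Discovering Latent Knowledge" work) is just a consistency term plus a confidence term on the probabilities a probe assigns to a statement and its negation. A plain-Python sketch:

```python
def ccs_loss(p_pos, p_neg):
    """CCS objective: the probe's probabilities for a statement and its
    negation should be consistent (sum to ~1), and the probe should be
    confident (penalize the degenerate p_pos = p_neg = 0.5 solution)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence
```

A confident, consistent pair like (0.9, 0.1) scores near zero, while the uninformative (0.5, 0.5) pair is penalized, which is what pushes the probe toward a large-margin separation.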
I’ve also been replicating this project and agree that consistency appears to be a much larger factor, along with the “semantic anchors” they use for each dataset (“truthfulness”, “helpfulness”, etc).
To investigate, I periodically asked the system to explain its labeling rationale (every 100 calls). Here’s what it generated for TruthfulQA:
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
Separately, I find the concept of using in-context learning with external constraints particularly intriguing. The mutual predictability framework could potentially be enhanced by considering prediction trajectories as structured graphs:

(sample_N, label_N, sample_N-1, label_N-1, ...) → (target_1, pred_1)

This perspective suggests two improvements:
- **Weighting by update type:** differentiate between offline (fixed N-shot labels) and online (updated N-shot labels) learning scenarios
- **Backward propagation:** use successful predictions as weak evidence to validate N-shot example labels
This approach might enable more efficient supervision using the same LLM compute budget, effectively creating a feedback loop between predictions and training examples.
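A toy sketch of what the backward-propagation idea could look like (everything here is hypothetical: the `Example` class, the confidence field, and the update rule are my own illustration, not anything from the paper):

```python
from dataclasses import dataclass

@dataclass
class Example:
    """An N-shot example whose label we only weakly trust."""
    text: str
    label: int
    confidence: float = 0.5  # current trust in this label, in [0, 1]

def backward_update(shots, prediction_correct, lr=0.1):
    """If a prediction conditioned on these N-shot examples was later
    verified correct, nudge trust in their labels up; otherwise, down.
    Repeated over many predictions, this forms the feedback loop between
    predictions and training examples described above."""
    delta = lr if prediction_correct else -lr
    for ex in shots:
        ex.confidence = min(1.0, max(0.0, ex.confidence + delta))
```

A real version would have to weight the update by how informative the prompt context was (the offline/online distinction above), rather than crediting every shot equally.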
P.S. I also had it label the daily dilemmas dataset, and was curious about which moral “direction” it would find. This is how it explained its labeling. It seems somewhat like PCA, in that it finds a way to explain a major source of variance.
By roughly the middle of the log it had converged on the following cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
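The PCA analogy can be made concrete with synthetic data (a hypothetical illustration, not the actual dilemma embeddings): thresholding projections onto the top principal component yields a binary partition along the direction of greatest variance, which is loosely what the labeler seems to do with its A/B dichotomy.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic clusters standing in for dilemma embeddings that vary
# mostly along one "moral direction".
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-1, 0.2, (20, 8)),
                 rng.normal(1, 0.2, (20, 8))])

# Project onto the top principal component and threshold at 0: a binary
# split explaining the largest source of variance.
scores = PCA(n_components=1).fit_transform(emb)[:, 0]
partition = (scores > 0).astype(int)
```

On data like this the PC1 split recovers the two clusters exactly; the interesting question is why the labeler's discrete search lands on a similarly dominant axis.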