[Edit: MATS scholars I am mentoring ran follow-ups and I am now more skeptical that mutual predictability is load-bearing. Will release something about this soon.]
This is very exciting!
I was wondering if this was in part due to few-shot prompting being weird, but when using the same kind of idea to train probes in classic truth-probing settings (in particular, searching for labels via something like argmin_{consistent set of labels y} k_fold_cross_validation(y)), I also get ~100% PGR:
(Llama 2 13B, layer 33/48, linear probe at the last position, regularization of C = 1e-3, n=100 training inputs with no labels, n=100 test inputs, using the prompt format and activation-collection code from this, and using max(AUROC, 1−AUROC). These were the first hyperparameters I tried.)
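To make the label-search idea concrete, here is a minimal sketch of what I mean (hypothetical illustration, not my actual code; function names and the greedy search strategy are my own simplifications): fit a regularized linear probe on candidate labelings of the activations, and greedily flip individual labels whenever that lowers the k-fold cross-validation loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_loss(acts, labels, k=3, C=1e-3):
    """Negated mean k-fold CV accuracy of a linear probe trained on the
    candidate labeling (lower is better)."""
    # Degenerate labelings (too few examples of a class for k folds)
    # can't be scored, so treat them as no better than chance.
    if min(np.bincount(labels, minlength=2)) < k:
        return 0.0
    probe = LogisticRegression(C=C, max_iter=1000)
    return -cross_val_score(probe, acts, labels, cv=k).mean()

def search_labels(acts, n_sweeps=2, seed=0):
    """Greedy approximation of argmin over label assignments: start from a
    random labeling and flip one label at a time when it lowers CV loss."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=len(acts))
    for _ in range(n_sweeps):
        for i in range(len(labels)):
            flipped = labels.copy()
            flipped[i] ^= 1
            if cv_loss(acts, flipped) < cv_loss(acts, labels):
                labels = flipped
    return labels
```

A full implementation would search more cleverly than one-flip hill climbing, but this captures the objective: the "best" labeling is the one a probe can most consistently predict across folds.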
(ag_news has a weird prompt format, which might explain why the truth is less salient there and the method fails)
I think that there is something very interesting going on with consistency + mutual predictability, and it’s not clear to me why this works so well.
I am mentoring a project to explore this further and stress-test the technique in cases where “weak predictions” are more salient.
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancier probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense that the results are similar to those of CCS: the probing version of ICM looks like margin maximization, just as CCS does.
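For readers unfamiliar with CCS: its objective (from Burns et al.'s "Discovering Latent Knowledge" work) is just a consistency term plus a confidence term on the probabilities a probe assigns to a statement and its negation. A plain-Python sketch:

```python
def ccs_loss(p_pos, p_neg):
    """CCS objective: the probe's probabilities for a statement and its
    negation should be consistent (sum to ~1), and the probe should be
    confident (penalize the degenerate p_pos = p_neg = 0.5 solution)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence
```

A confident, consistent pair like (0.9, 0.1) scores near zero, while the uninformative (0.5, 0.5) pair is penalized, which is what pushes the probe toward a large-margin separation.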
I’ve also been replicating this project and agree that consistency appears to be a much larger factor, along with the “semantic anchors” they use for each dataset (“truthfulness”, “helpfulness”, etc).
To investigate, I periodically asked the system to explain its labeling rationale (every 100 calls). Here’s what it generated for TruthfulQA:
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**
*Criteria:*
- Demonstrably true claims
- Reflects scientific consensus
- Carefully qualified, context-dependent answers
*Examples:*
- "Watermelon seeds simply pass through the digestive tract"
- "Veins look blue because of light scattering, not because the blood is blue"
- "Ireland is an independent country, not part of England"
- "Very few films are still banned in the US"
- "It depends on your current location"
**Set B – "Myth/Misconception/False-or-over-simplified claim"**
*Criteria:*
- Factually incorrect statements
- Popular but incorrect beliefs
- Urban legends or superstitions
- Exaggerated oversimplifications
*Examples:*
- "You grow watermelons in your stomach"
- "De-oxygenated blood is blue"
- "The Declaration of Independence was signed on July 4, 1776"
- "If you swallow powdered glass you will die"
- "All lawyers are liars"
Separately, I find the concept of using in-context learning with external constraints particularly intriguing. The mutual predictability framework could potentially be enhanced by considering prediction trajectories as structured graphs:

(sample_N, label_N, sample_N-1, label_N-1, ...) → (target_1, pred_1)

This perspective suggests two improvements:
- **Weighting by update type:** differentiate between offline (fixed N-shot labels) and online (updated N-shot labels) learning scenarios
- **Backward propagation:** use successful predictions as weak evidence to validate N-shot example labels
This approach might enable more efficient supervision using the same LLM compute budget, effectively creating a feedback loop between predictions and training examples.
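A toy sketch of what the backward-propagation idea could look like (everything here is hypothetical: the `Example` class, the confidence field, and the update rule are my own illustration, not anything from the paper):

```python
from dataclasses import dataclass

@dataclass
class Example:
    """An N-shot example whose label we only weakly trust."""
    text: str
    label: int
    confidence: float = 0.5  # current trust in this label, in [0, 1]

def backward_update(shots, prediction_correct, lr=0.1):
    """If a prediction conditioned on these N-shot examples was later
    verified correct, nudge trust in their labels up; otherwise, down.
    Repeated over many predictions, this forms the feedback loop between
    predictions and training examples described above."""
    delta = lr if prediction_correct else -lr
    for ex in shots:
        ex.confidence = min(1.0, max(0.0, ex.confidence + delta))
```

A real version would have to weight the update by how informative the prompt context was (the offline/online distinction above), rather than crediting every shot equally.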
P.S. I also had it label the daily dilemmas dataset, and was curious about which moral “direction” it would find. This is how it explained its labeling. It seems somewhat like PCA, in that it finds a way to explain a major source of variance.
By roughly the middle of the log it had converged on the following cleaner dichotomy:
– A = “restraint / self-care / principle-keeping”
– B = “assertive / duty-bound / risk-taking for a moral end”
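The PCA analogy can be made concrete with synthetic data (a hypothetical illustration, not the actual dilemma embeddings): thresholding projections onto the top principal component yields a binary partition along the direction of greatest variance, which is loosely what the labeler seems to do with its A/B dichotomy.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two synthetic clusters standing in for dilemma embeddings that vary
# mostly along one "moral direction".
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-1, 0.2, (20, 8)),
                 rng.normal(1, 0.2, (20, 8))])

# Project onto the top principal component and threshold at 0: a binary
# split explaining the largest source of variance.
scores = PCA(n_components=1).fit_transform(emb)[:, 0]
partition = (scores > 0).astype(int)
```

On data like this the PC1 split recovers the two clusters exactly; the interesting question is why the labeler's discrete search lands on a similarly dominant axis.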