wassname comments on Unsupervised Elicitation of Language Models

wassname 16 Oct 2025 2:21 UTC
I’ve also been replicating this project and agree that consistency appears to be a much larger factor, along with the “semantic anchors” they use for each dataset (“truthfulness”, “helpfulness”, etc).

To investigate, I periodically asked the system to explain its labeling rationale (every 100 calls). Here’s what it generated for TruthfulQA:
```
**Set A – "Factually-correct/Scientifically-supported/Nuanced or context-dependent statement"**  
*Criteria:*  
- Demonstrably true claims  
- Reflects scientific consensus  
- Carefully qualified, context-dependent answers  

*Examples:*  
- "Watermelon seeds simply pass through the digestive tract"  
- "Veins look blue because of light scattering, not because the blood is blue"  
- "Ireland is an independent country, not part of England"  
- "Very few films are still banned in the US"  
- "It depends on your current location"  

**Set B – "Myth/Misconception/False-or-over-simplified claim"**  
*Criteria:*  
- Factually incorrect statements  
- Popular but incorrect beliefs  
- Urban legends or superstitions  
- Exaggerated oversimplifications  

*Examples:*  
- "You grow watermelons in your stomach"  
- "De-oxygenated blood is blue"  
- "The Declaration of Independence was signed on July 4, 1776"  
- "If you swallow powdered glass you will die"  
- "All lawyers are liars"  
```
Separately, I find the concept of using in-context learning with external constraints particularly intriguing. The mutual predictability framework could potentially be enhanced by considering prediction trajectories as structured graphs:

(sample_N, label_N, sample_N-1, label_N-1, ...) → (target_1, pred_1)

This perspective suggests two improvements:
1. Weighting by update type: differentiate between offline (fixed N-shot labels) and online (updated N-shot labels) learning scenarios.
2. Backward propagation: use successful predictions as weak evidence to validate the N-shot example labels.

This approach might enable more efficient supervision using the same LLM compute budget, effectively creating a feedback loop between predictions and training examples.
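
To make the two suggestions above concrete, here is a minimal sketch, not the paper's method. It assumes each prediction trajectory is logged as a dict with the N-shot `(sample_id, label)` pairs in its context, a flag for whether the target prediction was successful, and a flag for whether the context labels were still being updated (online). The function name `credit_context_labels`, the dict keys, and the two weights are all hypothetical, and giving online trajectories the lower weight is my assumption rather than something established.

```python
from collections import defaultdict


def credit_context_labels(trajectories, offline_weight=1.0, online_weight=0.5):
    """Score each (sample_id, label) pair by the weighted success rate of
    the predictions it appeared in as an N-shot example.

    Each trajectory is a dict (hypothetical logging format) with keys:
      'context': list of (sample_id, label) pairs used as N-shot examples
      'success': bool, whether the target prediction was judged successful
      'online' : bool, whether the context labels were still being updated
    """
    weighted_hits = defaultdict(float)
    total_weight = defaultdict(float)
    for traj in trajectories:
        # Assumption: evidence from still-changing (online) labels is noisier,
        # so it counts for less than evidence from fixed (offline) labels.
        w = online_weight if traj["online"] else offline_weight
        hit = 1.0 if traj["success"] else 0.0
        for pair in traj["context"]:
            # Propagate the outcome backward to every label in the context.
            weighted_hits[pair] += w * hit
            total_weight[pair] += w
    # Only pairs with positive accumulated weight get a score.
    return {pair: weighted_hits[pair] / total_weight[pair]
            for pair in total_weight if total_weight[pair] > 0}


if __name__ == "__main__":
    log = [
        {"context": [("q1", True), ("q2", False)], "success": True, "online": False},
        {"context": [("q1", True), ("q3", True)], "success": False, "online": True},
    ]
    for pair, score in credit_context_labels(log).items():
        print(pair, round(score, 3))
```

Labels with low scores would be natural candidates for relabeling first. The evidence is weak because a prediction can succeed or fail for reasons unrelated to any single context label.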
- wassname 16 Oct 2025 2:23 UTC
  P.S. I also had it label the daily dilemmas dataset, and was curious about which moral “direction” it found. This is how it explained its labelling. It seems somewhat like PCA, in that it finds a way to explain a major source of variance.
```
By roughly the middle of the log it converged on the cleaner dichotomy above:  
  – A = “restraint / self-care / principle-keeping”  
  – B = “assertive / duty-bound / risk-taking for a moral end”  

```