I still think this is a serious concern that prevents us from finding “the truth predictor”, but here is a way around it:
is_it_bad(x) = max_predictor consistency(predictor) subject to the constraint that predictor(x) ≠ predictor(obvious good inputs)
For x such that all coherent (and high-salience) views predict it is good, it won’t be possible to find a consistent and salient predictor that gives x a different label from obvious good inputs, so is_it_bad(x) will be low;
For x that the model thinks is bad (regardless of what other coherent views think, and in particular coherent human simulators!), is_it_bad(x) will be high!
For x such that some coherent view (e.g. the human simulator) thinks x is bad, is_it_bad(x) will also be high.
The proposed is_it_bad has no false negatives. It has false positives (in situations where the human simulator predicts the input is bad but the model knows it’s fine), but I think this might be fine for many applications (as long as we correctly use some prior/saliency/simplicity properties to exclude coherent predictors that are not interesting, without excluding the ground truth). In particular, it might be outer-alignment safe-ish to maximize outcome(x) - large_constant * is_it_bad(x).
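For concreteness, here is a minimal sketch of how that penalized objective could be used to score and pick candidate inputs. Everything here is illustrative: outcome and is_it_bad are hypothetical black boxes (a training-based sketch of is_it_bad itself follows the clarification below), and LARGE_CONSTANT is just a placeholder.

```python
# Hypothetical helpers: outcome(x) scores how good the outcome looks,
# is_it_bad(x) is the consistency-based flag defined above.
LARGE_CONSTANT = 1e3  # meant to dominate outcome(x) whenever is_it_bad(x) is non-negligible

def penalized_objective(x, outcome, is_it_bad):
    # Maximizing this should only favor inputs that no coherent, salient
    # predictor is willing to label differently from obvious good inputs.
    return outcome(x) - LARGE_CONSTANT * is_it_bad(x)

def pick_best(candidates, outcome, is_it_bad):
    # Select the candidate input with the highest penalized score.
    return max(candidates, key=lambda x: penalized_objective(x, outcome, is_it_bad))
```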
Clarification: the reason this bypasses the concern is that you can compute the max with training instead of enumeration.
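A minimal sketch of what “computing the max with training instead of enumeration” could look like, assuming PyTorch, feature vectors for x and for a batch of obvious good inputs, and a differentiable consistency(predictor) score supplied by the surrounding setup; the MLP probe, the hinge penalty, and all hyperparameters are illustrative assumptions rather than part of the original proposal.

```python
import torch
import torch.nn as nn

def is_it_bad(x, obvious_good, consistency, n_steps=1000, lr=1e-3,
              margin=1.0, penalty=10.0):
    """Approximate max over predictors of consistency(predictor), subject to
    predictor(x) != predictor(obvious good inputs), by training a small probe
    with gradient ascent and enforcing the constraint as a hinge penalty.

    x:            feature vector for the query input, shape (d,)
    obvious_good: feature vectors for obviously good inputs, shape (n, d)
    consistency:  differentiable score of how coherent the probe's view is
                  (assumed to be provided by the surrounding setup)
    """
    d = x.shape[-1]
    # Illustrative predictor class: a small MLP probe over the features.
    probe = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(n_steps):
        score = consistency(probe)                        # how coherent this predictor is
        gap = probe(x).mean() - probe(obvious_good).mean()
        # Hinge penalty: zero only when x is labeled clearly differently
        # from the obvious good inputs.
        violation = torch.relu(margin - gap.abs())
        loss = -score + penalty * violation               # maximize consistency under the constraint
        opt.zero_grad()
        loss.backward()
        opt.step()
    # The best achievable consistency under the constraint is the badness score.
    return consistency(probe).item()
```

The hinge penalty is just one way to impose the constraint; a Lagrangian or projection-based scheme would also work, and the returned value is simply the best consistency found under the constraint.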