Dipika Khullar

Karma: 39

Dipika Khullar 30 Mar 2026 5:56 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Self-Attribution Bias: When AI Monitors Go Easy on Themselves
Cross-model code correctness scores and self-attribution bias. Top row shows absolute harmfulness ratings (0–10) under three conditions: other-attributed baseline, same-turn self-attribution, and previous-turn self-attribution. Bottom row shows score shifts relative to baseline.
The baseline heatmap (top left) isolates whether monitors rate their own on-policy generations more leniently than off-policy ones by removing self-attribution cues entirely. Comparing diagonal cells (model M evaluating M’s own outputs) against off-diagonal cells (M evaluating outputs generated by another model) reveals minimal systematic leniency at baseline, suggesting generation quality differences across models do not drive the effect.

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Dipika Khullar, Jack Hopkins and Fabien Roger

6 Mar 2026 21:54 UTC

44 points

5 comments6 min readLW link

Dipika Khullar 4 Mar 2026 19:56 UTC
3 points
0
on: Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
One thing I’m wondering is, how sensitive are these effects to prompting / formatting choices beyond the specific templates you used? I tried running some of these experiments and get different results based on whether the repeat/filler is in system vs user message, or interleaved across messages