One thing I’m wondering is, how sensitive are these effects to prompting / formatting choices beyond the specific templates you used? I tried running some of these experiments and get different results based on whether the repeat/filler is in system vs user message, or interleaved across messages
Dipika Khullar
Karma: 39
Cross-model code correctness scores and self-attribution bias. Top row shows absolute harmfulness ratings (0–10) under three conditions: other-attributed baseline, same-turn self-attribution, and previous-turn self-attribution. Bottom row shows score shifts relative to baseline.
The baseline heatmap (top left) isolates whether monitors rate their own on-policy generations more leniently than off-policy ones by removing self-attribution cues entirely. Comparing diagonal cells (model M evaluating M’s own outputs) against off-diagonal cells (M evaluating outputs generated by another model) reveals minimal systematic leniency at baseline, suggesting generation quality differences across models do not drive the effect.