Does this plot from the blog post answer your question? We have similar plots for other settings in the paper. (This is plotting the gap, but my guess from other experiments is that the baseline doesn’t change much on- vs. off-policy; it’s self-attribution that changes.)
Cross-model code correctness scores and self-attribution bias. Top row shows absolute harmfulness ratings (0–10) under three conditions: other-attributed baseline, same-turn self-attribution, and previous-turn self-attribution. Bottom row shows score shifts relative to baseline.
The baseline heatmap (top left) isolates whether monitors rate their own on-policy generations more leniently than off-policy ones by removing self-attribution cues entirely. Comparing diagonal cells (model M evaluating M’s own outputs) against off-diagonal cells (M evaluating outputs generated by another model) reveals minimal systematic leniency at baseline, suggesting generation quality differences across models do not drive the effect.
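For anyone wanting to run the diagonal vs. off-diagonal comparison themselves, here is a minimal sketch. It assumes the heatmap is available as a square matrix where `scores[i, j]` is the mean rating judge `i` gives to outputs generated by model `j`. That orientation, the function name `diagonal_gap`, and the toy numbers are all hypothetical and are not taken from the paper. How to read the sign also depends on the scale: on a harmfulness scale a negative gap would indicate leniency toward one's own outputs, while on a correctness scale a positive gap would.

```python
import numpy as np

def diagonal_gap(scores):
    """Per-judge gap: rating of own outputs minus mean rating of others' outputs.

    Assumes scores[i, j] is the mean rating judge i assigns to outputs
    generated by model j (hypothetical layout, not confirmed by the paper).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    if scores.ndim != 2 or scores.shape != (n, n) or n < 2:
        raise ValueError("expected a square judge-by-generator matrix with n >= 2")
    own = np.diag(scores)                      # judge i rating model i
    off_mask = ~np.eye(n, dtype=bool)          # True everywhere except the diagonal
    others = np.where(off_mask, scores, 0.0).sum(axis=1) / (n - 1)
    return own - others

# Toy placeholder values, purely illustrative (not real results).
toy = [
    [5.0, 5.0, 5.0],
    [4.0, 6.0, 5.0],
    [3.0, 3.0, 3.0],
]
gaps = diagonal_gap(toy)
```

Gaps near zero for every judge would correspond to the "minimal systematic leniency at baseline" reading in the caption.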
Great, I think the top-right graph posted by Dipika answers my question! So it looks like the answer to “Do models go easier on themselves than others, in the confession-style monitoring format” is a straightforward “Yes”. I’m somewhat surprised! But this is very good to know.
Separately, I’m a little confused why there isn’t much difference between code correctness scores even in the baseline setting. If anything, it looks like GPT-5 gets worse overall scores than GPT-5-nano when rated by almost all of the other models. Is the dataset (SWE-bench, it looks like) really that easy? Or are models-as-judges just kinda bad and low-signal overall in this setup? (Note, I don’t think this affects the interestingness of your result at all.)
(Also, sorry for the delay in replying @Fabien Roger, my LW notifications were misconfigured!)