Great read! The refinement pipeline seems like a particularly good idea. Have you tried running it for multiple rounds, i.e., refining a trajectory twice? This might produce trajectories that are even harder for monitors to detect. That said, monitor models are already struggling, so further rounds may not be necessary for your purposes.
On the other hand, it might be interesting to check whether this approach can enable a smaller attacker model to outsmart a larger monitor model. If so, that has interesting implications for AI control, namely, that you aren't safe even if your monitor is the very best model available.