Great read! The refinement pipeline seems like a particularly good idea. Have you tried running it for multiple rounds, i.e., refining a trajectory twice? This might produce trajectories that are even harder for monitors to detect. That said, monitor models are already struggling, so further rounds may not be necessary for your purposes.
On the other hand, it might be interesting to check whether this approach can enable a smaller attacker model to outsmart a larger monitor model. If so, that has interesting implications for AI control, namely, that you aren't safe even if your monitor is the very best model available.