(but do not seem to contain hidden logic for malicious intent)
Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it’s role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.
Also I am curious whether you plan to make the attack policy backdoor coding scaffold publicly available? Maybe even as a contribution to ControlArena? I think this would be great as it is non-trivial to generate good backdoors.
Nitpick: I am concerned that this is an artifact of the artificiality of the setting, aka the monitor taking it’s role too literally by looking for malicious backdoors. Instead, the monitor should just be looking for whether there exists some input that results in failing the unit tests. Anecdotally I found that is helps not to mention the term backdoor at all in the monitoring prompt or very explicitly explaining in detail that we mean by backdoor.
Also I am curious whether you plan to make the attack policy backdoor coding scaffold publicly available? Maybe even as a contribution to ControlArena? I think this would be great as it is non-trivial to generate good backdoors.