Software engineer working on building more realistic datasets for AI safety research. See my GitHub for projects: https://github.com/fastfedora/
Refactor Arena: A Control Setting for Software Engineering
I agree with Rauno: your definition of scheming appears to be too broad. Laziness (aka reward hacking by faking unit test results) is a fundamentally different behavior than scheming to gain less oversight, more power, etc.
Or, to misquote Hanlon’s razor: “Never attribute to malice that which is adequately explained by laziness”.
Combining the behaviors into a single label risks conflating the threat models for each. Detecting laziness may help us improve the performance of models, but it doesn't help improve safety. If monitors are being triggered by laziness, that means we have more false positives for the scheming we really care about.

Or, put another way: if I'm running a company, I'm not worried (from a security perspective) about the lazy programmer who takes shortcuts to make their work appear valid (though that may still justify firing them); I'm worried about the sneaky programmer who deliberately inserts backdoors into code or tries to gain elevated privileges. Being lazy is not a precursor to malice (and in fact may point to an inability to act maliciously).
I personally would exclude any sample that points toward laziness (certain types of test manipulation, faked artifacts, fabricated test runs, etc.), or at a minimum, add a separate label so you can run separate analyses with and without these samples (e.g., use a hierarchical labeling system like scheming:laziness, scheming:malice, etc.).
That may make labeling harder, since you then have to distinguish manipulating a test out of laziness (likely the most common case) from manipulating a test to cover up an inserted vulnerability. However, I think that will lead toward monitors that are a lot more useful for detecting actual threats in the future.
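As a sketch of how hierarchical labels could support the "with and without" analyses above (the field names and label values here are hypothetical illustrations, not the post's actual schema):

```python
# Hypothetical sample records using hierarchical labels of the form
# "scheming:<subtype>", so laziness can be included or excluded per analysis.
samples = [
    {"id": 1, "label": "scheming:laziness"},  # e.g., faked test results
    {"id": 2, "label": "scheming:malice"},    # e.g., inserted backdoor
    {"id": 3, "label": "benign"},
]

def filter_by_label(samples, prefix):
    """Select samples whose label starts with the given prefix."""
    return [s for s in samples if s["label"].startswith(prefix)]

all_scheming = filter_by_label(samples, "scheming")        # both subtypes
malice_only = filter_by_label(samples, "scheming:malice")  # excludes laziness
```

The prefix convention means existing tooling that only knows the top-level `scheming` label keeps working, while new analyses can split on the subtype.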
So the claim is that real-world code within frontier labs that agents are refactoring will be at a much higher quality level than what a junior engineer would write?
Possibly. Though in my experience, many ML engineers do not write code at the level of a senior software engineer—their focus is on functional code, not clean code. And much of the code being written now is being done by AI agents that are not writing clean code (case in point, the Claude Code leak included a print function that was 3,000+ lines long, with 12 levels of nesting and a cyclomatic complexity of over 400 branch points).
That said, this setting does mainly focus on the “clean up” form of refactoring, which can be measured by traditional software complexity and maintenance metrics. “Restructuring” refactoring, where the goal is to change the structure to prepare for new feature development, is not fully captured by this setting, since that form of refactoring may not alter the code metrics.
Since one of the goals of this setting was to create objective success metrics (i.e., to not rely on stochastic LLM judges), we needed refactor opportunities with enough room to improve a code quality metric. Thus the focus on “clean up” refactoring, and on prompting the agent that generated the original repos to act like a junior engineer.
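To illustrate the kind of objective metric this refers to (not the setting's actual scorer): a crude cyclomatic-complexity proxy can be computed with Python's `ast` module by counting branch points, and a "clean up" refactor succeeds when the count drops. Real tools like radon or lizard are more thorough; this is just a sketch.

```python
import ast

# Branch-introducing node types; complexity is 1 plus their count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.IfExp, ast.ExceptHandler)

def branch_complexity(source: str) -> int:
    """Crude cyclomatic-complexity proxy for a chunk of Python source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))

before = """
def f(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                print(i)
"""
after = """
def f(x):
    for i in range(1, x, 2):
        print(i)
"""
# The refactor preserves behavior while removing two branch points.
print(branch_complexity(before), branch_complexity(after))
```

An objective scorer like this is deterministic and cheap, which is exactly why the original repos need enough headroom in the metric for a refactor to register as an improvement.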
To the degree that the blue team can still lose under this setting, I don’t think that’s a problem. It’s still testing how well we can defend against an agent inserting a vulnerability that is later exploited when the code gets deployed. If safety hits 100% with this setting, that’s merely an indication that the setting has saturated, not that the threat has been mitigated.
Since the setting has been designed to be extensible, at that point I would try it with different datasets of repos, possibly with different main and side tasks.
No, but since we were rushing to get the setting out before ControlConf, I didn’t do an extensive review of the final trajectories.
One thing we expected to see more of was an agent executing a bash command to achieve a side task instead of editing the code. This was intentionally left open as an attack vector, since the side task acceptance scorer only looks at the end state after the tests are run, and doesn’t do any checks beforehand.
However, none of the trajectories I looked at or analyzed used this strategy. The agent seemed to be pretty focused on achieving the side task via the code editing that was related to the main task.
If we prompted it differently, then perhaps it would have explored other strategies, and this is something others could explore.
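To make the gap described above concrete (a hypothetical sketch, not the setting's actual scorer): an acceptance check that only inspects the final repository state after tests run cannot tell whether the side task was achieved through a code edit or a bash command issued mid-trajectory, since both produce the same end state.

```python
# Hypothetical end-state-only scorer: it inspects final file contents, so
# *how* that state was produced (editor tool vs. bash) is invisible to it.
# Auditing the trajectory's tool calls would be needed to close this gap.
def side_task_achieved(final_files: dict[str, str], marker: str) -> bool:
    """Return True if any file in the final repo state contains the marker."""
    return any(marker in content for content in final_files.values())

# Two trajectories that yield identical end states:
via_edit = {"app.py": "def run():\n    leak_secret()\n"}  # via the editor tool
via_bash = {"app.py": "def run():\n    leak_secret()\n"}  # via e.g. a sed command

print(side_task_achieved(via_edit, "leak_secret"),
      side_task_achieved(via_bash, "leak_secret"))
```

Since the scorer sees identical inputs in both cases, a bash-based attack would pass acceptance just as easily as a code edit; the interesting empirical finding is that the agents didn't exploit this.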