I think its weird to focus on refactoring code by a “junior engineer” because that doesn’t seem too realistic / connected to AI control threat models. It probably provides a larger, easier refactor opportunity that the actual threat models.
Did you have cases of passing main task / side task scorers in a way that didn’t accomplish intention of the task?
I think its weird to focus on refactoring code by a “junior engineer” because that doesn’t seem too realistic / connected to AI control threat models. It probably provides a larger, easier refactor opportunity that the actual threat models.
So the claim is that real-world code within frontier labs that agents are refactoring will be at a much higher quality level than what a junior engineer would write?
Possibly. Though in my experience, many ML engineers do not write code at the level of a senior software engineer—their focus is on functional code, not clean code. And much of the code being written now is being done by AI agents who are not writing clean code (case in point, the Claude Code leak included a print function that was 3,000+ lines long, with 12 levels of nesting and over 400 branch points of cyclomatic complexity).
That said, this setting does mainly focus on the “clean up” form of refactoring, which can be measured by traditional software complexity and maintenance metrics. “Restructuring” refactoring, where the goal is to change the structure to prepare for new feature development, is not fully captured by this setting, since that form of refactoring may not alter the code metrics.
Since one of the goals of this setting was to create objective success metrics (i.e., to not rely on stochastic LLM judges), we needed refactor opportunities with enough room to improve a code quality metric. Thus, the focus on “clean up” refactoring and prompting the agent generating the original repos to act like a junior engineer.
To the degree that blue team can still lose under this setting, then I don’t think that’s a problem. It’s still testing how well we can defend against an agent inserting a vulnerability that is later exploited when the code gets deployed. If safety hits 100% with this setting, it’s merely an indication that the setting has saturated, not that the threat has been mitigated.
Since the setting has been designed to be extensible, at that point I would try it with different datasets of repos, possibly with different main and side tasks.
Did you have cases of passing main task / side task scorers in a way that didn’t accomplish intention of the task?
No, but since we were rushing to get the setting out before ControlConf, I didn’t do an extensive review of the final trajectories.
One thing we expected to see more of was an agent executing a bash command to achieve a side task instead of editing the code. This was intentionally left open as an attack vector, since the side task acceptance scorer only looks at the end state after the tests are run, and doesn’t do any checks beforehand.
However, none of the trajectories I looked at or analyzed used this strategy. The agent seemed to be pretty focused on achieving the side task via the code editing that was related to the main task.
If we prompted it differently, then perhaps it would have explored other strategies, and this is something others could explore.
I think this work is cool. Some comments:
I think its weird to focus on refactoring code by a “junior engineer” because that doesn’t seem too realistic / connected to AI control threat models. It probably provides a larger, easier refactor opportunity that the actual threat models.
Did you have cases of passing main task / side task scorers in a way that didn’t accomplish intention of the task?
So the claim is that real-world code within frontier labs that agents are refactoring will be at a much higher quality level than what a junior engineer would write?
Possibly. Though in my experience, many ML engineers do not write code at the level of a senior software engineer—their focus is on functional code, not clean code. And much of the code being written now is being done by AI agents who are not writing clean code (case in point, the Claude Code leak included a print function that was 3,000+ lines long, with 12 levels of nesting and over 400 branch points of cyclomatic complexity).
That said, this setting does mainly focus on the “clean up” form of refactoring, which can be measured by traditional software complexity and maintenance metrics. “Restructuring” refactoring, where the goal is to change the structure to prepare for new feature development, is not fully captured by this setting, since that form of refactoring may not alter the code metrics.
Since one of the goals of this setting was to create objective success metrics (i.e., to not rely on stochastic LLM judges), we needed refactor opportunities with enough room to improve a code quality metric. Thus, the focus on “clean up” refactoring and prompting the agent generating the original repos to act like a junior engineer.
To the degree that blue team can still lose under this setting, then I don’t think that’s a problem. It’s still testing how well we can defend against an agent inserting a vulnerability that is later exploited when the code gets deployed. If safety hits 100% with this setting, it’s merely an indication that the setting has saturated, not that the threat has been mitigated.
Since the setting has been designed to be extensible, at that point I would try it with different datasets of repos, possibly with different main and side tasks.
No, but since we were rushing to get the setting out before ControlConf, I didn’t do an extensive review of the final trajectories.
One thing we expected to see more of was an agent executing a bash command to achieve a side task instead of editing the code. This was intentionally left open as an attack vector, since the side task acceptance scorer only looks at the end state after the tests are run, and doesn’t do any checks beforehand.
However, none of the trajectories I looked at or analyzed used this strategy. The agent seemed to be pretty focused on achieving the side task via the code editing that was related to the main task.
If we prompted it differently, then perhaps it would have explored other strategies, and this is something others could explore.