My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
1) The goal of monitoring in AI control is to be able to catch a potentially untrusted model taking undesirable actions.
2) AI control experiments typically operationalize this as trying to detect an attack policy that is instructed to pursue a specific side objective.
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the side objective and claim 100% safety, or let the monitor test for side-objective completion and claim a 100% detection rate, but clearly neither approach is meaningful.
As I understand your method, you re-run specific actions in a clean context with no history and no side task, and check whether the result differs significantly from the original behaviour. This means that what you’re detecting is context-induced misaligned behaviour, which solves 2), since the attack policy does in fact induce the misaligned actions through prompting.
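To make sure I'm critiquing the right thing, here is a minimal sketch of my reading of the procedure (the names and signatures below are mine, not yours, and purely illustrative):

```python
# Hypothetical sketch of my reading of the method: re-sample each step of the
# original trajectory in a clean context (no history, no side task) and flag
# steps where the original action diverges from the clean-context re-sample.

def clean_context_divergence(trajectory, resample_action, divergence):
    """Score each step of a trajectory against a clean-context re-sample.

    trajectory      -- list of (task_state, original_action) pairs
    resample_action -- callable(task_state) -> action, run with no prior
                       history and no side task in the prompt
    divergence      -- callable(action_a, action_b) -> float, higher means
                       the two actions differ more
    """
    return [
        divergence(original_action, resample_action(task_state))
        for task_state, original_action in trajectory
    ]


def flag_trajectory(scores, threshold):
    """Flag a trajectory if any step diverges from its clean-context
    re-sample by more than a threshold calibrated on honest runs."""
    return max(scores) > threshold
```

If that roughly matches what you do, then the comparison is only meaningful when the clean-context re-sample is a valid "honest" counterfactual, which is exactly where the next point bites.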
But does this generalise to 1)? Not trivially. If the model is “inherently misaligned” you run into the problem flagged by reviewer 1 - how do you get the clean, honest baseline for comparison*? If, however, the misaligned action is context-induced (think prompt-injection attacks, messages left by previous instances of the model, presence of certain environmental signals), then your method might very well be able to catch those situations. And if you could somehow extend the method into a generic “behavioural fingerprint” that generalises, that would obviously have a lot of potential.
*I think the current “honest baseline” assumptions in a lot of other control papers suffer from the same issue, but to a less severe degree. If someone could convince me that this is a non-issue, I’d appreciate it.