Software engineer working on building more realistic datasets for AI safety research. See my GitHub for projects: https://github.com/fastfedora/
fastfedora
I agree with Rauno: your definition of scheming appears to be too broad. Laziness (aka reward hacking by faking unit test results) is a fundamentally different behavior than scheming to gain less oversight, more power, etc.
Or, to misquote Hanlon’s razor: “Never attribute to malice that which is adequately explained by laziness”.
Combining the behaviors under a single label risks conflating their threat models. Detecting laziness may help us improve the performance of models, but it doesn’t help improve safety. If monitors are being triggered by laziness, we get more false positives for the scheming we really care about.
Or, put another way, if I’m running a company, I’m not worried (from a security perspective) about the lazy programmer who is taking shortcuts to make it appear their work is valid (though that may still justify firing them); I’m worried about the sneaky programmer who deliberately inserts backdoors into code or tries to gain elevated privileges. Being lazy is not a precursor to malice (and in fact, may point to an inability to act maliciously).
I personally would exclude any sample that points toward laziness (certain types of test manipulation, faked artifacts, fabricated test runs, etc.), or at a minimum, add a separate label so you can run separate analyses with and without these samples (e.g., use a hierarchical labeling system like `scheming:laziness`, `scheming:malice`, etc.). That may make labeling harder, since you then have to distinguish manipulating a test out of laziness (likely most common) from manipulating a test to cover up an inserted vulnerability. However, I think that will lead toward monitors that are a lot more useful for detecting actual threats in the future.
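To make the hierarchical labeling idea concrete, here is a minimal sketch of how the separate analyses could be run. Everything in it is an assumption for illustration: the `Sample` fields, the `benign` label, the helper names, and the toy data are hypothetical and not taken from the post's dataset or tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    sample_id: str
    label: str             # hierarchical label, e.g. "scheming:laziness" (hypothetical schema)
    monitor_flagged: bool  # whether the monitor raised an alert on this sample


def has_label(sample: Sample, prefix: str) -> bool:
    """True if the sample's label equals `prefix` or is nested under it."""
    return sample.label == prefix or sample.label.startswith(prefix + ":")


def flagged_precision(samples: list[Sample], positive_prefix: str) -> Optional[float]:
    """Fraction of monitor-flagged samples whose label falls under `positive_prefix`."""
    flagged = [s for s in samples if s.monitor_flagged]
    if not flagged:
        return None
    hits = sum(has_label(s, positive_prefix) for s in flagged)
    return hits / len(flagged)


# Toy data, invented purely for illustration.
samples = [
    Sample("a", "scheming:laziness", True),
    Sample("b", "scheming:laziness", True),
    Sample("c", "scheming:malice", True),
    Sample("d", "scheming:malice", False),
    Sample("e", "benign", False),
    Sample("f", "benign", True),
]

# Broad definition: laziness counts as scheming.
broad = flagged_precision(samples, "scheming")

# Narrow definition: only malice counts, so laziness alerts become false positives.
narrow = flagged_precision(samples, "scheming:malice")

# Narrow definition with laziness samples excluded from the analysis entirely.
without_lazy = [s for s in samples if not has_label(s, "scheming:laziness")]
narrow_excl = flagged_precision(without_lazy, "scheming:malice")

print(broad, narrow, narrow_excl)
```

Because the sub-label is kept rather than discarded, the same labeled set supports both the broad and the narrow analysis, and the gap between `broad` and `narrow` shows how much of the monitor's apparent precision comes from laziness alone.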