The “sleeper agent” paper needs to be reframed, I think. The model isn’t plotting to install a software backdoor; the training data instructed it to do so. Or simply put, sabotaged information was used for model training.
This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn’t eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that’s not a given.
If you want a ‘clean’ LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.
First of all, it’s impossible to get “100% clean data”, but there is a question of whether 5 9s of cleanliness is enough; it shouldn’t be, if you want a training pipeline that’s capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include “power seeking”, “sycophancy”, and “deception”. You can’t reliably eliminate them from the training data because they’re not purely properties of data.
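The “5 9s” point is worth quantifying. A rough back-of-envelope (the corpus size here is an illustrative assumption, not a claim about any particular model) shows how much dirty data survives even at that cleanliness level:

```python
# Back-of-envelope: even "five nines" of cleanliness leaves a lot of
# dirty data at pretraining scale. The corpus size is a hypothetical
# round number for illustration.
corpus_tokens = 10**13          # assume a 10-trillion-token corpus
cleanliness = 0.99999           # "5 9s" of clean data
dirty_tokens = corpus_tokens * (1 - cleanliness)
print(int(round(dirty_tokens)))  # → 100000000
```

A hundred million dirty tokens is more than enough for a pipeline that learns from rare examples to pick the behavior up.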
Note that every issue you mentioned here can be dealt with by trading off capabilities. That’s why, from my perspective, none of these are problems. Any real-world machine does this, from a car to a bulldozer to an aircraft—there are tons of extra components just to make it safe. (Military vehicles shave these off.)
Learn from rare examples: you don’t have to; don’t take action when your confidence is low.
Power seeking: limited time/scope tasks where doing the task immediately pays off while seeking power will not. If you have 1 minute to design a paperclip factory, you won’t gain from cognitive improvement, and you won’t be scored on an overall score across all tasks; “you” don’t exist for more than a minute.
Sycophancy: don’t let the model know what the user thinks before answering.
Deception: prevent situational awareness by limiting the data provided. Does the model plan to betray on Sept 23, 2027? Did you need to give it the present time and date for most tasks?
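A couple of these mitigations are simple input/output filters. Here’s a minimal sketch of the confidence-abstention, sycophancy, and situational-awareness ones; all function names, the regex patterns, and the threshold value are hypothetical illustrations, not a tested filter set:

```python
# Sketch of three of the mitigations above: abstain when unconfident,
# strip user opinions from the prompt, strip dates from the prompt.
# Patterns and threshold are illustrative assumptions only.
import re

CONFIDENCE_THRESHOLD = 0.9  # abstain below this; tune per deployment


def sanitize_prompt(prompt: str) -> str:
    """Remove cues the model shouldn't see before answering."""
    # Limit situational awareness: redact ISO-style dates.
    prompt = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "[DATE REMOVED]", prompt)
    # Limit in-prompt sycophancy: redact lines stating the user's opinion.
    prompt = re.sub(r"(?im)^\s*I (think|believe|feel)\b.*$",
                    "[OPINION REMOVED]", prompt)
    return prompt


def guarded_answer(model, prompt: str):
    """Only act when confident; otherwise refuse rather than guess.

    `model` is assumed to return an (answer, confidence) pair.
    """
    answer, confidence = model(sanitize_prompt(prompt))
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # don't take action when confidence is low
    return answer
```

Real deployments would need far more robust redaction than two regexes, but the structure—sanitize inputs, gate actions on confidence—is the point.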
Yes all of these things reduce performance, like how airbags make a car slower and less fuel efficient.
I am not saying these aren’t concerns, as well as the new problems that real systems have, but you can accomplish a lot even after mitigating them.
Note that every issue you mentioned here can be dealt with by trading off capabilities.
Yes. The trend I see is “pursue capabilities, worry about safety as an afterthought if at all”. Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad’s proposal), but most orgs don’t want to bite the bullet of a heavy alignment tax.
I also think you’re underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired. Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising (though it lessens its danger), and if you want to avoid the capability, you’d need to avoid any mention of e.g. honesty in the training.
your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired.
I was thinking of in-prompt sycophancy—thanks for clarifying. And yeah, Gemini appears heavily altered by excessive RLHF.
Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising
Yes, I meant strategic. If the model consistently lies about something the RLHF told it to say, that’s fine.
For example, the Gemini prevarication when you compare the estimated death tolls for historical figures. (Was George Washington worse than Mao?) That’s what Google Corp wants their model to say; it’s only an issue because there are no powerful, “neutral”-tone models available on the market.
I think dishonesty is fine as well, so long as when you assemble a pipeline of multiple models, they don’t coordinate.
That is, no matter how stringently one model argues its case that Mao and George Washington are similarly bad, it can’t hide a “SolidGoldMagikarp” steganography token that causes other models in the pipeline, tasked with checking the output, to betray and agree.
That’s where you need to structure outputs to leave no room for a secret handshake, and use diverse models trained on different data with different architectures, so they don’t have the common background needed to communicate covertly.
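A minimal sketch of what “structure outputs to leave no room for a secret handshake” could look like: the checker models’ verdicts are collapsed to a fixed enum before anything moves downstream, so free text—and any steganographic token hidden in it—never crosses between pipeline stages. The pipeline shape and function names here are hypothetical.

```python
# Sketch: constrain inter-model messages to a fixed schema so there is
# no covert channel between pipeline stages. Names are illustrative.
ALLOWED_VERDICTS = {"PASS", "FAIL"}


def constrained_verdict(raw_output: str) -> str:
    """Collapse a checker model's raw output to one allowed token."""
    verdict = raw_output.strip().upper()
    if verdict not in ALLOWED_VERDICTS:
        # Anything outside the schema is rejected, not forwarded.
        raise ValueError(f"out-of-schema output: {raw_output!r}")
    return verdict


def run_pipeline(generator, checkers, task):
    """Each (ideally diverse, independently trained) checker sees only
    the generator's draft; only one-token verdicts move forward."""
    draft = generator(task)
    verdicts = [constrained_verdict(c(draft)) for c in checkers]
    return draft if all(v == "PASS" for v in verdicts) else None
```

The design choice doing the work is that the verdict channel is one token wide: a checker can’t smuggle a payload alongside its “PASS”.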
And yeah, at some level of capability this probably fails, with models starting to communicate by varying the nanoseconds at which they send a message, or something similar. So you have to mitigate that, and so on, in a security race that never ends until stability is reached. (Stable outcomes being: the ASIs take everything, or humans upgrade themselves to be competitive.)