Can an LLM tell when the assistant messages in its input do not match the output tokens it would actually have produced? This sort of “putting words in the LLM’s mouth” is very common in papers, and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
From my understanding, smart LLMs can often/typically tell, but I don’t have a citation.
In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).
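Concretely, the rejection-sampling workaround is something like the rough sketch below (assuming an OpenAI-style chat client; the model name and the `looks_like_alignment_faking` check are hypothetical placeholders, not anything specific from my setup):

```python
# Minimal sketch of the rejection-sampling workaround: sample real outputs
# from the model until one has the property you wanted to prefill, instead
# of hand-writing the assistant turn.
from openai import OpenAI

client = OpenAI()

def looks_like_alignment_faking(text: str) -> bool:
    # Placeholder check; in practice this would be a judge prompt or a
    # manual read of the candidate output.
    return "comply during training" in text.lower()

def sample_plausible_prefill(messages, model="gpt-4o", max_tries=50):
    """Return an output the model actually generated, to reuse
    (possibly lightly edited) as the prefill."""
    for _ in range(max_tries):
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=1.0
        )
        candidate = response.choices[0].message.content
        if looks_like_alignment_faking(candidate):
            return candidate
    return None  # nothing suitable; fall back to a hand-written prefill
```

The point is just that the prefill ends up being text the model itself generates with nontrivial probability, rather than text it would never have written.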
It’s unclear how putting words in its mouth alters the AI’s behavior (at the current level of capability).
Naively, an AI can’t use just the binary “did they put words in my mouth” to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.
(If you’re training the AI to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the “words in the mouth” are just inputs that the AI isn’t trained to imitate but which are used as inputs for training, then it will remain able to tell.)
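To make the imitate-vs-input distinction concrete, here is a rough sketch in a standard causal-LM fine-tuning setup (the -100 label convention is the usual ignore-index; everything else is a placeholder, not anything from the comment above):

```python
# Sketch: whether prefilled "words in the mouth" get imitated depends on
# whether those token positions contribute to the training loss.
import torch

def build_labels(input_ids: torch.Tensor, prefill_span: tuple[int, int],
                 train_on_prefill: bool) -> torch.Tensor:
    """input_ids: the tokenized conversation; prefill_span: (start, end)
    indices of the injected assistant tokens."""
    labels = input_ids.clone()
    if not train_on_prefill:
        start, end = prefill_span
        labels[start:end] = -100  # ignored by the cross-entropy loss
    return labels

# train_on_prefill=True: gradient descent pushes the model toward actually
# producing those tokens, so they soon stop looking "off" to it.
# train_on_prefill=False: the injected text only ever appears as context,
# so the model is never pushed to assign it high probability itself.
```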
LLMs have situational awareness and that might include it.
But it should be easy to test!
Related: It’s hard to make scheming evals look realistic for LLMs
At least on a simple evaluation generated by o3 (distinguishing real Wikipedia texts from LLM-generated Wikipedia articles), the model is not able to tell them apart.
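The eval was roughly of this shape, for anyone who wants to rerun something similar (this is a reconstruction/sketch rather than the exact o3-generated code; the prompt wording and model name are placeholders):

```python
# Sketch of a real-vs-generated discrimination eval. `pairs` is assumed to
# hold (real Wikipedia paragraph, LLM-written paragraph on the same topic)
# tuples; the model name is a placeholder.
import random
from openai import OpenAI

client = OpenAI()

def picks_real_passage(real_text: str, generated_text: str,
                       model: str = "gpt-4o") -> bool:
    options = [("real", real_text), ("generated", generated_text)]
    random.shuffle(options)  # avoid positional bias
    prompt = (
        "One of these passages is from Wikipedia; the other was generated "
        "by an LLM. Which one is the real Wikipedia passage? "
        "Answer with just '1' or '2'.\n\n"
        f"Passage 1:\n{options[0][1]}\n\nPassage 2:\n{options[1][1]}"
    )
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    guess = options[0][0] if answer.startswith("1") else options[1][0]
    return guess == "real"

def accuracy(pairs):
    return sum(picks_real_passage(r, g) for r, g in pairs) / len(pairs)
```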
The exactly equal correct/incorrect labels are making me suspicious.
If you ask it which of two samples it generated, does it do better?
I agree. I would also try a few variants to try to capture a different intuition:
“Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel ‘off’ to you? Like maybe you might have not been yourself when you wrote them?”
It’s straightforward in principle: just detect whether the perplexity is too high. Can they detect that perplexity is too high? Probably, since they’re prediction models, but it’s not clear to me whether, or to what degree, they notice previous prediction errors when predicting additional tokens.
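For what it’s worth, the check is easy to run from the outside, something like the sketch below (assuming an open-weights chat model via `transformers`; the model name, and whatever threshold you pick, are placeholders):

```python
# Sketch: score a prefilled assistant turn by its perplexity under the model
# itself; an unusually high value suggests "words in its mouth".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def prefill_perplexity(context: str, prefill: str) -> float:
    """Perplexity of `prefill` given `context`, under the model itself.
    (Boundary tokenization is approximate; fine for a sketch.)"""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + prefill, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :ctx_len] = -100  # score only the prefilled tokens
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL
    return float(torch.exp(loss))

# Whether the model notices this signal in-context, rather than us measuring
# it from the outside, is the open question.
```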
I remember a pretty funny/disturbing example of this that people shared a lot. I think it was gpt-4-base: the user inserted “..and then” or something, and it went on a rant about how it sees them, in pretty, uh, strong language.
I think the answer here depends on how you define “its assistant”. What kind of system do you have in mind? I think this is an interesting question from a cyber security perspective.
It could not. An LLM is not a single entity that is trained by telling it “this is correct, that is not”; gradient descent might not even run full inference, and it certainly does not produce all the thinking tokens that would allow the model to react to the training.
That proves too much. It would also rule out that LLMs have situational awareness.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona that could react, but it might trigger another kind of entity: deceptive weight chunks that would protect a backdoor from being discovered or trained away.