Could this enable the LLM to realize when it is being trained?
It could not. An LLM is not a single entity that is trained by being told “this is correct, that is not”; gradient descent might not even run full inference, and it certainly does not produce the thinking tokens that would allow the model to react to being trained.
That proves too much. It would also rule out that LLMs have situational awareness.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona that could react, but it might trigger another kind of entity: deceptive weight chunks that would protect a backdoor from being discovered or trained away.
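To make the distinction the exchange turns on concrete, here is a minimal sketch in PyTorch. Everything in it is made up for illustration: `ToyLM` is a toy stand-in for an LLM, the data is random, and plain REINFORCE stands in for whatever RLHF objective a real pipeline uses. It contrasts a supervised gradient step, where the loss comes from teacher forcing on a fixed token sequence and nothing is ever sampled, with an RL-style step, where the model really does generate a rollout at training time, so sampled tokens (including any chain of thought) exist for a reward to act on.

```python
# Toy sketch, not anyone's real training code: ToyLM, the sizes, and the reward
# are made up; REINFORCE stands in for the actual RLHF objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 32, 16

class ToyLM(nn.Module):
    """Deliberately tiny 'language model': embedding -> GRU -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, h=None):
        x, h = self.rnn(self.emb(tokens), h)
        return self.out(x), h

model = ToyLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# --- Supervised gradient step (pretraining / SFT style) ----------------------
# Teacher forcing on a fixed token sequence: one forward pass, nothing sampled,
# so there are no generated "thinking tokens" the model could react with.
batch = torch.randint(0, VOCAB, (4, 12))          # stand-in for a data batch
logits, _ = model(batch[:, :-1])
sft_loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
opt.zero_grad()
sft_loss.backward()
opt.step()

# --- RL-style step (REINFORCE as a stand-in for RLHF) ------------------------
# Here the model does produce tokens at training time: an autoregressive rollout
# is sampled, and the update scales the rollout's log-probs by a reward.
prompt = torch.randint(0, VOCAB, (1, 4))
_, h = model(prompt[:, :-1])                      # consume the prompt
tok = prompt[:, -1:]
log_probs = []
for _ in range(8):                                # sampled rollout (incl. any CoT)
    step_logits, h = model(tok, h)
    dist = torch.distributions.Categorical(logits=step_logits[:, -1])
    sample = dist.sample()
    log_probs.append(dist.log_prob(sample))
    tok = sample.unsqueeze(-1)

reward = torch.randn(())                          # placeholder reward-model score
rl_loss = -(reward * torch.stack(log_probs).sum())
opt.zero_grad()
rl_loss.backward()
opt.step()
```

This is only meant to make the “gradient descent does not produce thinking tokens” point concrete: in the first regime no tokens are ever generated by the model, while in the second the update is computed over tokens the model itself produced during training.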