[Question] Any research on “probe-tuning” of LLMs?

Is there any research on “probe-tuning” of LLMs, i.e., tuning an LLM’s parameter weights such that a specific probe (classifier) more reliably detects certain markers throughout the context, such as grammatical errors, aggression, manipulation, a certain political bias, etc.?

This is different from classical fine-tuning and RLHF. Like classical fine-tuning, probe-tuning is a supervised ML method: it is based on human-annotated texts (contexts). However, probe-tuning should be more effective than classical fine-tuning at detecting many occurrences of a given marker throughout the context. Probe-tuning doesn’t train on the LLM’s own “original rollouts” at all, only on the LLM’s activations during the forward pass over the context.
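
To make the idea concrete, here is a minimal sketch of what one probe-tuning update could look like. Everything here is an assumption for illustration: the model name (`gpt2`), the probe layer index, the per-token binary labels, and the function name `probe_tuning_step` are all hypothetical choices, not an established recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"    # assumption: any decoder-only LM that exposes hidden states
PROBE_LAYER = 8        # assumption: layer chosen in advance (see the next paragraph)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
probe = nn.Linear(llm.config.hidden_size, 1)   # binary per-token marker probe

# Both the probe and the LLM weights receive gradients -- that is the
# difference from ordinary (frozen-LLM) probing.
optimizer = torch.optim.AdamW(
    list(llm.parameters()) + list(probe.parameters()), lr=1e-5
)
loss_fn = nn.BCEWithLogitsLoss()

def probe_tuning_step(text: str, token_labels: torch.Tensor) -> float:
    """One update on a single annotated context.

    token_labels: float tensor of shape (1, seq_len), 1.0 where the annotated
    marker (e.g. a grammatical error) is present at that token.
    """
    inputs = tokenizer(text, return_tensors="pt")
    outputs = llm(**inputs)                        # plain context pass, no sampling
    hidden = outputs.hidden_states[PROBE_LAYER]    # (1, seq_len, hidden_size)
    logits = probe(hidden).squeeze(-1)             # (1, seq_len)
    loss = loss_fn(logits, token_labels)
    optimizer.zero_grad()
    loss.backward()                                # gradients flow into the LLM too
    optimizer.step()
    return loss.item()
```

Note that nothing is generated: the loss is computed purely from activations of the annotated context, which is what distinguishes this from training on the model’s own rollouts.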

I imagine that before doing actual probe-tuning, we should first determine which probe in the LLM is already most aligned with the training data (annotations), so that probe-tuning likely just amplifies a concept that already vaguely exists within the LLM.
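
One naive way to do that selection step (again, a sketch under assumptions, not a definitive procedure): fit a separate frozen-LLM linear probe on each layer’s activations and keep the layer whose probe already fits the annotations best. The function name `best_probe_layer` and the data format are hypothetical, and it assumes the model was loaded with `output_hidden_states=True` as above and that the labels contain both classes.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def best_probe_layer(llm, tokenizer, texts, labels_per_text):
    """Pick the hidden-state layer whose frozen-LLM linear probe best fits the
    token-level annotations (0/1 per token, aligned with the tokenization)."""
    feats_by_layer, all_labels = None, []
    for text, labels in zip(texts, labels_per_text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():                          # LLM stays frozen here
            hidden_states = llm(**inputs).hidden_states  # tuple of (1, seq, d)
        if feats_by_layer is None:
            feats_by_layer = [[] for _ in hidden_states]
        for i, h in enumerate(hidden_states):
            feats_by_layer[i].append(h.squeeze(0).numpy())
        all_labels.append(np.asarray(labels))
    y = np.concatenate(all_labels)
    scores = []
    for layer_feats in feats_by_layer:
        X = np.concatenate(layer_feats)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores.append(clf.score(X, y))  # crude in-sample fit; use held-out data in practice
    return int(np.argmax(scores))
```

The layer returned here would then be used as `PROBE_LAYER` in the tuning sketch above.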
