[Question] Any research on “probe-tuning” of LLMs?

Is there any research on “probe-tuning” of LLMs, i.e., tuning an LLM’s parameter weights such that a specific probe (classifier) more reliably detects certain markers throughout the context, such as grammatical errors, aggression, manipulation, a certain political bias, etc.?

This is different from classical fine-tuning and RLHF. Like classical fine-tuning, probe-tuning is a supervised ML method: it is based on human-annotated texts (contexts). However, probe-tuning should be more effective than classical fine-tuning at detecting many occurrences of a given marker throughout the context. Probe-tuning doesn’t train on the LLM’s own “original rollouts” at all, only on the LLM’s activations during the forward pass over the context.
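
To make the idea concrete, here is a minimal sketch of what one probe-tuning update could look like. Everything here is an assumption for illustration: the model name (`gpt2`), the probe layer index, the per-token binary labels, and the function name `probe_tuning_step` are all hypothetical choices, not an established recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"    # assumption: any decoder-only LM that exposes hidden states
PROBE_LAYER = 8        # assumption: layer chosen in advance (see the next paragraph)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
probe = nn.Linear(llm.config.hidden_size, 1)   # binary per-token marker probe

# Both the probe and the LLM weights receive gradients -- that is the
# difference from ordinary (frozen-LLM) probing.
optimizer = torch.optim.AdamW(
    list(llm.parameters()) + list(probe.parameters()), lr=1e-5
)
loss_fn = nn.BCEWithLogitsLoss()

def probe_tuning_step(text: str, token_labels: torch.Tensor) -> float:
    """One update on a single annotated context.

    token_labels: float tensor of shape (1, seq_len), 1.0 where the annotated
    marker (e.g. a grammatical error) is present at that token.
    """
    inputs = tokenizer(text, return_tensors="pt")
    outputs = llm(**inputs)                        # plain context pass, no sampling
    hidden = outputs.hidden_states[PROBE_LAYER]    # (1, seq_len, hidden_size)
    logits = probe(hidden).squeeze(-1)             # (1, seq_len)
    loss = loss_fn(logits, token_labels)
    optimizer.zero_grad()
    loss.backward()                                # gradients flow into the LLM too
    optimizer.step()
    return loss.item()
```

Note that nothing is generated: the loss is computed purely from activations of the annotated context, which is what distinguishes this from training on the model’s own rollouts.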

I imagine that before doing actual probe-tuning, we should first determine which probe in the LLM is already most aligned with the training data (annotations), so that probe-tuning likely just amplifies a concept that already vaguely exists within the LLM.
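
One naive way to do that selection step (again, a sketch under assumptions, not a definitive procedure): fit a separate frozen-LLM linear probe on each layer’s activations and keep the layer whose probe already fits the annotations best. The function name `best_probe_layer` and the data format are hypothetical, and it assumes the model was loaded with `output_hidden_states=True` as above and that the labels contain both classes.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def best_probe_layer(llm, tokenizer, texts, labels_per_text):
    """Pick the hidden-state layer whose frozen-LLM linear probe best fits the
    token-level annotations (0/1 per token, aligned with the tokenization)."""
    feats_by_layer, all_labels = None, []
    for text, labels in zip(texts, labels_per_text):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():                          # LLM stays frozen here
            hidden_states = llm(**inputs).hidden_states  # tuple of (1, seq, d)
        if feats_by_layer is None:
            feats_by_layer = [[] for _ in hidden_states]
        for i, h in enumerate(hidden_states):
            feats_by_layer[i].append(h.squeeze(0).numpy())
        all_labels.append(np.asarray(labels))
    y = np.concatenate(all_labels)
    scores = []
    for layer_feats in feats_by_layer:
        X = np.concatenate(layer_feats)
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        scores.append(clf.score(X, y))  # crude in-sample fit; use held-out data in practice
    return int(np.argmax(scores))
```

The layer returned here would then be used as `PROBE_LAYER` in the tuning sketch above.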
