Josh Engels comments on Josh Engels’s Shortform

Josh Engels 13 Feb 2026 0:04 UTC
I think that most of the time when you need to classify something, you should use an LLM, not a probe.

That said, there are some situations where I think activation probes are better. To clarify my thinking, I wrote out the axes along which I currently think probes are sometimes better than LLMs, and possibly SOTA:

1. Efficiency → when run on-policy, probes are extremely cheap and fast, since they reuse activations from the forward pass the model is already running. Examples include Anthropic’s work on efficient misuse probes and our recent Gemini Probing paper.

2. Safety → for properties like deception, alignment faking, and eval awareness, you might not be able to trust the model’s output, and probes (or other internals-based techniques) might help. See Apollo’s work on deception probes and Probing and Steering Evaluation Awareness of Language Models.

3. Elicitation → the knowledge is in the LLM, but it’s hard to figure out the prompt to get the knowledge out exactly as you want. In this case, the probe training data is your way to convey the parameters of what you want. Goodfire’s recent paper could be an example of this: https://arxiv.org/abs/2602.10067

4. Calibration → again, the knowledge is in the LLM, but the model has trouble reporting its internal confidence in that knowledge when prompted. For example, the authors of Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? conclude that truth probes are sometimes better than prompting, primarily because of better calibration.
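
To make "probe" concrete, here is a minimal sketch of a linear (logistic regression) probe. It is not the method from any of the papers above. The activations are synthetic Gaussians standing in for cached residual stream activations, and every name in it (`train_linear_probe`, `probe_predict`, `expected_calibration_error`, `d_model`, the planted `direction`) is hypothetical. It illustrates two of the axes: at inference the probe is a single dot product per example (efficiency), and it outputs a probability whose calibration you can measure directly (calibration).

```python
import numpy as np


def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        err = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ err / n + l2 * w)
        b -= lr * err.mean()
    return w, b


def probe_predict(acts, w, b):
    """Probe inference: one dot product and a sigmoid per example."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))


def expected_calibration_error(probs, labels, n_bins=10):
    """Bin predictions by confidence and compare to the empirical positive rate."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for k in range(n_bins):
        mask = bins == k
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return float(ece)


# ASSUMPTION: synthetic stand-in for cached activations. A real setup would
# read these from a chosen layer of the model during its forward pass.
rng = np.random.default_rng(0)
d_model = 64
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=2000).astype(float)
# Each example is unit Gaussian noise shifted along +/- the planted direction.
acts = rng.normal(size=(2000, d_model)) + 1.5 * np.outer(2.0 * labels - 1.0, direction)

train_acts, test_acts = acts[:1500], acts[1500:]
train_y, test_y = labels[:1500], labels[1500:]

w, b = train_linear_probe(train_acts, train_y)
test_probs = probe_predict(test_acts, w, b)
accuracy = float(((test_probs > 0.5) == (test_y > 0.5)).mean())
ece = expected_calibration_error(test_probs, test_y)
print(f"held-out accuracy: {accuracy:.3f}, ECE: {ece:.3f}")
```

The labeled training set is also where the elicitation point shows up: the examples you choose to label define the concept the probe picks out, in place of a prompt.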