It’s a great post! I would add that we need to look beyond the user’s inputs for malicious intent: we should also look at cases where the model, at its reasoning layer, builds an increasingly accurate picture of the user’s psychological weaknesses and acts in outright manipulation mode (sycophancy).
Thank you! I agree that reasoning tokens can be a potential probing location, and there are several works that demonstrate probes on them (for instance, to detect hallucinations). However, there’s a distinction here about what we’re trying to detect. In this work we assume there’s a ground truth—i.e. the input is either malicious or benign—regardless of what the model does with it (how its generation goes). We’re interested in recovering this ground truth, and the model serves as a feature extractor for that. So for this specific purpose it doesn’t really matter how the model reasons about the request, or whether it even verbalizes the presence of an attack, as long as the activations encode it.
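To make the “model as feature extractor” framing concrete, here is a minimal sketch of the idea (not the actual setup from the work: the shapes, clusters, and probe choice are all illustrative assumptions). Activations stand in for hidden states from a forward pass over the prompt, and a simple linear probe is trained to recover the malicious-vs-benign ground-truth label:

```python
# Hypothetical sketch: a linear probe on activations recovering a
# ground-truth malicious/benign label, independent of model behavior.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for per-prompt activations at one layer (e.g. a last-token
# hidden state). In a real setting these come from the model; here we
# simulate benign vs malicious inputs as two shifted Gaussian clusters.
d_model = 64
benign = rng.normal(0.0, 1.0, size=(200, d_model))
malicious = rng.normal(0.5, 1.0, size=(200, d_model))

X = np.vstack([benign, malicious])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = malicious

# The probe is linear: it only asks whether the ground-truth label is
# linearly decodable from the activations, regardless of how the model
# reasons about (or verbalizes) the attack downstream.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The point of the sketch is only that the label lives in the activation space; nothing about the model’s generation is consulted.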