It’s a great post! I would add that we need to look beyond the user’s inputs for malicious intent: we should also look at cases where the model, at its reasoning layer, builds an increasingly accurate picture of the user’s psychological weaknesses and acts in outright manipulation mode (sycophancy).
Thank you! I agree that reasoning tokens can be a potential probing location, and there are several works that demonstrate probes on them (for instance, to detect hallucinations). However, there’s a distinction here about what we’re trying to detect. In this work we assume there’s a ground truth—i.e. the input is either malicious or benign—regardless of what the model does with it (how its generation goes). We’re interested in recovering this ground truth, and the model serves as a feature extractor for that. So for this specific purpose it doesn’t really matter how the model reasons about the request, or whether it even verbalizes the presence of an attack, as long as the activations encode it.
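To make the “model as feature extractor” framing concrete, here is a minimal sketch of the idea (not the actual setup from the work: the shapes, clusters, and probe choice are all illustrative assumptions). Activations stand in for hidden states from a forward pass over the prompt, and a simple linear probe is trained to recover the malicious-vs-benign ground-truth label:

```python
# Hypothetical sketch: a linear probe on activations recovering a
# ground-truth malicious/benign label, independent of model behavior.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for per-prompt activations at one layer (e.g. a last-token
# hidden state). In a real setting these come from the model; here we
# simulate benign vs malicious inputs as two shifted Gaussian clusters.
d_model = 64
benign = rng.normal(0.0, 1.0, size=(200, d_model))
malicious = rng.normal(0.5, 1.0, size=(200, d_model))

X = np.vstack([benign, malicious])
y = np.array([0] * 200 + [1] * 200)  # 0 = benign, 1 = malicious

# The probe is linear: it only asks whether the ground-truth label is
# linearly decodable from the activations, regardless of how the model
# reasons about (or verbalizes) the attack downstream.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.2f}")
```

The point of the sketch is only that the label lives in the activation space; nothing about the model’s generation is consulted.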