Fuzzing LLMs with noise in the activations is like making humans drunk. People will be more unpredictable under drugs but also more likely to reveal secrets and “show their true self”. Of course, the analogy only goes so far, but the underlying mechanism might not be that different. I wonder what other social engineering methods transfer over.
I agree with this, and I like this brainstorming prompt. A draft of this post had the title “making LLMs drunk...” before I switched to “fuzzing LLMs...”.
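For concreteness, here is a minimal sketch of what “noise in the activations” means mechanically. Everything in it is an illustrative assumption rather than the setup from the post: a toy two-layer ReLU network in plain Python stands in for an LLM, and the names `forward`, `fuzz`, `noise_std`, `W1`, `W2` are made up for this example. The idea is just to add Gaussian noise to the hidden activations and sample the output several times to see how much the behavior spreads.

```python
import random

def forward(x, w1, w2, noise_std=0.0, rng=None):
    """Toy 2-layer ReLU network (a stand-in for an LLM, not a real one).

    Gaussian noise with standard deviation `noise_std` is added to the
    hidden activations before the output layer is applied.
    """
    rng = rng or random.Random(0)
    # Hidden layer: linear map followed by ReLU.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    # "Fuzzing": perturb each hidden activation independently.
    hidden = [h + rng.gauss(0.0, noise_std) for h in hidden]
    # Output layer: linear map on the (possibly noisy) activations.
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def fuzz(x, w1, w2, noise_std, n_samples, seed=0):
    """Run the same input n_samples times with fresh activation noise."""
    rng = random.Random(seed)
    return [forward(x, w1, w2, noise_std, rng) for _ in range(n_samples)]

# Hypothetical toy weights and input, chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 1.0]]
x = [1.0, 2.0]

clean = forward(x, W1, W2)                              # no noise
noisy = fuzz(x, W1, W2, noise_std=0.5, n_samples=5)     # five noisy runs
```

With `noise_std=0.0` every run reproduces the clean output, and increasing it makes the outputs spread out. In the analogy, `noise_std` plays the role of how drunk the model is.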
I spent 2 minutes thinking about this, and here is an idea: when people use lie detectors, I believe they calibrate them not only with questions for which they know the answer, but also with questions for which the subject is uncertain whether the interrogator knows the answer. I believe they don’t go directly from “what is your name” to “are you a Russian spy”. Of course, human lie detectors don’t have a perfect track record, but there might be a few things we could learn from them. In particular, they also have the problem of calibration: just because the lie detector works on known lies does not mean it transfers to “are you a Russian spy”.
(I’ve added this to my list of potential projects but don’t plan to work on it anytime soon.)
I wonder if it’s easy to find information about methods to make human lie detectors work better and how to evaluate them.