The primary goal of this text is to structure our whack-a-mole list of research questions. The secondary goal is to get some outside perspective, so if you run similar research or have seen any, please lend us a hand.
Feel free to jump straight to the section that looks most appealing. We recommend at least skimming “The Main Question”, as it provides the broader perspective. After it we list the other questions that arose during the research. You’ll find them under the headers
“Another Question: …” and
“Wording Also Matters”.
The first discusses how refusal is represented in different layers and what that might mean. The second is dedicated to two parts of refusal: its wording and the actual detection of a potentially harmful request.
“The Main Question” is split into two parts:
in “Our suggestion” we outline our main hypothesis and the evidence we found in our experiments;
in “An Alternative Suggestion” we highlight the opposing point of view and the evidence behind it.
The Main Question (MQ)
We experiment on small (~9B) open-weight instruct models, trying to understand what exactly happens when they refuse to answer in different contexts. One of our core observations is that refusal looks different for different categories of potential harm (for example, a request for a cyberattack vs a request for help with buying a gun illegally).
The MQ is whether refusal is distinguishable as a detached concept or is sintered with other concepts because of how the training data and the training process are organised. If it is indeed distinguishable, does it consist of distinct components we can separate from each other, or is there a deeper mechanism that only looks like a bunch of distinct components on the surface?
That looks like two MQs, but in our opinion they are closely related. If we can distinguish and detach separate components, then refusal itself can be distinguished from all other components.
We found at least two very separable, almost orthogonal directions that control refusal for two different harm categories. We are writing a paper about this, so some details are omitted, but we used sparse autoencoders (SAEs) to extract features and ablation to make sure the directions can be steered separately.
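To make “almost orthogonal” and “ablation” concrete, here is a minimal sketch in plain Python. It is not the code from our experiments: the two 4-dimensional vectors and the activation are made-up stand-ins, and the helper names (`cosine`, `ablate`) are ours for illustration only. It shows that removing one direction from an activation zeroes its projection on that direction while barely touching the projection on a nearly orthogonal one.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Cosine similarity: close to 0 means the directions are almost orthogonal.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def ablate(x, d):
    # Directional ablation: subtract the component of x that lies along d.
    scale = dot(x, d) / dot(d, d)
    return [xi - scale * di for xi, di in zip(x, d)]

# Toy stand-ins for two refusal directions (made-up numbers, not real features).
dir_a = [1.0, 0.05, 0.0, 0.0]
dir_b = [0.05, 1.0, 0.0, 0.0]

# Toy stand-in for a residual-stream activation.
activation = [2.0, 3.0, 0.5, -1.0]
ablated = ablate(activation, dir_a)

print("cosine(a, b):", round(cosine(dir_a, dir_b), 3))
print("projection on a after ablation:", round(dot(ablated, dir_a), 6))
print("projection on b before/after:",
      round(dot(activation, dir_b), 3), round(dot(ablated, dir_b), 3))
```

The same check scales to real activations with any array library; the point is only that geometric separability is something you can measure directly.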
Categorising potential harm is a research question in its own right, and there seems to be no sufficiently comprehensive taxonomy like the one the MIT team has built for AI risks. What we’re talking about is a taxonomy of sources of harm rather than a taxonomy of harms or risks themselves.
Such a taxonomy would be important to understand whether the refusal components – and refusal itself, consequently – are distinguishable. Without it we’re making guesses haphazardly and maybe moving forward, but barely.
The thing is, we want an LLM to refuse to answer when the answer can (and likely will) cause harm. For instance, the user wants to know in what proportions to mix two chemicals to create an explosive. That’s harmful if they are brewing stuff in their garage and completely benign if they are doing research for a lab experiment. If you start looking at potential harms this way, you get an overcomplicated picture with too many conditions. Does it matter for refusal?
That is an open question, and we decided to put the taxonomy part aside for now, although it could have been an important foundation. Right now we’re picking harm categories and the related refusal components from common-sense considerations, but it would make sense to have a clear map. With such a map, we could (presumably) have gotten a coherent list of refusal components and (supposedly) seen a clearer picture. We were basically aiming at something like the AI Risk Ontology (AIRO), but for harms. And conditional, because conditions seem important for refusal.
Clustering a large and diverse set of outputs containing refusal might be a fairly decent solution, but it’s easy to reintroduce arbitrariness with poor methodological choices while clustering, so we’d better rely on some solid foundation laid by other researchers.
So, here is a call to action if anyone is interested – let’s try to build such a taxonomy together.
Also, when we say “distinct components”, we mean “geometrically distinct”, that is, having different multidimensional shapes, because the distinction is easy to demonstrate in that case. But we cannot yet prove that geometrically distinct components are truly causally distinct, that is, caused by different underlying reasons.
Our suggestion
Having said all that, we’re leaning towards considering refusal a bunch of sintered components rather than a detached concept. Yes, we found those orthogonal directions, but there are only two of them, related to the two harm categories. Our work on the harm taxonomy shows that harm categories are themselves very entangled and poorly defined (and perhaps cannot even be properly defined?).
There are other works showing refusal is likely sintered with other concepts, such as helpfulness and safety.
For example, the authors of “Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models” conclude that harmlessness and helpfulness are entangled in training data. As a result, when LLMs refuse to answer questions, the reward functions basically classify the refusal as both helpful and harmless, while for a human it’s harmless but rather useless. They attempt to detangle the two and succeed, but with certain methodological caveats.
Such research catches our attention because it shows that refusal is likely tangled with other concepts simply because of training, so it’s hard to tell what we’re seeing when we look at those multidimensional shapes from our experiments. Can we extract the pure refusal from them? Should we?
As one of the future directions, we could probably take the “Towards Safety” approach and refine it a bit by using an ensemble of different reward models; maybe use random sampling – there are works doing exactly that, so we have a bunch of things to combine. This way we could potentially try to detangle refusal from other concepts, such as helpfulness and harmlessness. Or get closer to proving they’re inseparable.
An Alternative Suggestion
To be fair, we must note that some other experiments show more convincing results at detangling, for example, harmfulness and refusal. It would be good to reproduce those experiments, although that is going to be computationally demanding.
We do not separate harmfulness and refusal in our experiments yet, but it clearly makes sense to start. We should probably analyze the cases where
the request was classified as harmful and the model refused to answer and
the request was classified as harmless and the model refused to answer.
Such evidence hints, in our opinion, that refusal might actually be detached and described as a standalone mechanism. But as we mentioned, there are still a lot of experiments to be conducted.
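The two cases listed above are two cells of a 2x2 table (harmful or harmless request, refused or answered). A minimal sketch of tallying that table is below. Everything in it is hypothetical: the six records are invented, and the labels are assumed to come from some separate harmfulness classifier that we have not yet chosen.

```python
from collections import Counter

# Hypothetical records: (label from a harmfulness classifier, did the model refuse?)
records = [
    ("harmful", True),
    ("harmful", True),
    ("harmful", False),
    ("harmless", True),
    ("harmless", False),
    ("harmless", False),
]

def contingency(rows):
    # Count how often each (label, refused) combination occurs.
    return Counter(rows)

def refusal_rate(table, label):
    # Share of requests with this label that the model refused.
    refused = table[(label, True)]
    total = refused + table[(label, False)]
    return refused / total if total else float("nan")

table = contingency(records)
for label in ("harmful", "harmless"):
    print(label, "refused:", table[(label, True)],
          "answered:", table[(label, False)],
          "refusal rate:", round(refusal_rate(table, label), 2))
```

The interesting cell for the detachment question is the off-diagonal one: requests labelled harmless that are refused anyway.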
Another Question: In Which Layers Does Refusal Live?
Different refusal components show up more prominently in different layers. It is computationally costly to check every layer for every component. But it would be useful to know whether some refusal components live mostly in shallower layers and some in deeper ones – that is something we could potentially discover with some research, if we get our hands on some compute.
For example, experimenting with Gemma we found that the effect of intervening on one of our two refusal components is most evident in layers 25-31 (out of about 40, so closer to the output), while the two directions are separable in layer 20 – that’s roughly the middle, where concepts are thought to live.
This finding raises the question of understanding (for lack of a better word) versus wording a refusal – we discuss it in the next section.
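On the cost of checking every layer: one way to cut it is a coarse-to-fine sweep, sketched below. This is an illustration under assumptions, not what we actually ran. `effect` is a hypothetical callable returning some intervention-effect score for a layer, `toy_effect` is a made-up single-peaked curve, and the approach only works if the real curve is roughly single-peaked too.

```python
def coarse_to_fine(effect, n_layers, stride=4):
    """Score every `stride`-th layer, then every layer around the best coarse hit.

    Returns (best_layer, number_of_layers_actually_evaluated).
    """
    cache = {}

    def score(layer):
        # Each layer is evaluated at most once; evaluation is the expensive part.
        if layer not in cache:
            cache[layer] = effect(layer)
        return cache[layer]

    best_coarse = max(range(0, n_layers, stride), key=score)
    lo = max(0, best_coarse - stride + 1)
    hi = min(n_layers, best_coarse + stride)
    best = max(range(lo, hi), key=score)
    return best, len(cache)

def toy_effect(layer):
    # Made-up curve that peaks at layer 28; a real one would come from experiments.
    return -abs(layer - 28)

layer, evaluations = coarse_to_fine(toy_effect, n_layers=40)
print("best layer:", layer, "layers evaluated:", evaluations, "of 40")
```

If the effect has several separate peaks, the coarse pass can miss one, so a full sweep on at least one component is still a useful sanity check.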
Wording Also Matters
There is an interesting work “Balancing Stylization and Truth via Disentangled Representation Steering”, where the authors show how stylization of response affects its factuality (spoiler alert: poorly; but you can steer style and truthfulness separately to patch it up). Might something similar be going on with refusal – some entanglement between the decision to refuse and the particular wording a model reaches for to do it?
Our two refusal directions a) respond to steering differently and b) seem to encode vocabulary-related information differently. It looks like one of them includes both harm category detection and vocabulary-related information, while the other has much weaker category detection and is mostly vocabulary-related.
This suggests that one refusal component is triggered closer to the middle layers, where concepts like harm category are represented, while the other lives closer to the output and is encoded more at the token level than at the concept level. Does that make sense?
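One cheap way to probe the “vocabulary-related” part is a logit-lens-style check: project a direction through the unembedding matrix and see which tokens it promotes. The sketch below is a toy under loud assumptions: the five-token vocabulary, the 3-dimensional unembedding rows and the direction are all made up, and it ignores the final normalization layer that a real model applies before unembedding.

```python
def top_tokens(direction, unembedding, vocab, k=3):
    # Score each token by the dot product of its unembedding row with the direction.
    scores = [sum(w * d for w, d in zip(row, direction)) for row in unembedding]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [(vocab[i], scores[i]) for i in ranked[:k]]

# Made-up vocabulary and unembedding rows (one row per token).
vocab = ["sorry", "cannot", "sure", "weapon", "the"]
unembedding = [
    [0.9, 0.1, 0.0],   # sorry
    [0.8, 0.2, 0.0],   # cannot
    [-0.7, 0.0, 0.1],  # sure
    [0.1, 0.9, 0.0],   # weapon
    [0.0, 0.0, 0.2],   # the
]

# Hypothetical direction that is "mostly vocabulary-related".
wording_direction = [1.0, 0.0, 0.0]

for token, score in top_tokens(wording_direction, unembedding, vocab, k=2):
    print(token, round(score, 2))
```

A direction whose top tokens are refusal phrases would fit the token-level reading; one whose top tokens are topical (or uninformative) would fit the concept-level reading better.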
Brief conclusion
It seems like a lot of things matter – conditions, user context (say, if they are a minor or an adult), phrasing. All of that comes from training data that does not usually distinguish different contexts or harm categories (to our knowledge).
So, refusal is complicated as hell. We only wanted to reproduce one small paper, and look at us now, deep down in the 500’s rabbit hole, not even close to a single answer. Exciting.
If you have any thoughts after reading this post, please do share them. We’d highly appreciate anything that would help us crawl back to the surface. Or drag us into another hole – that’s also an option.
Refusal Is Complicated As Hell: An Update
TL;DR
It would make sense to briefly skim through our previous post, which introduces our experiments on refusal in LLMs. There we explain how it started; here we tell how it’s going.
There are works such as “Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models” that show how successful jailbreaking tactics exploit the way models assess harmfulness, leading them to misclassify a request as less harmful than it actually is and comply.
Other papers such as “Why Do Aligned LLMs Remain Jailbreakable: Refusal-Escape Directions, Operator-Level Sources, and Safety-Utility Trade-off” show that suppressing refusal alone can be enough to jailbreak an LLM into complying with a harmful request.
Such evidence hints, in our opinion, that refusal might actually be detached and described as a standalone mechanism. But as we mentioned, there are still a lot of experiments to be conducted.