ValueShift Research

Karma: 18

Experiments on Refusal Shape in LLMs

ValueShift Research 2 Apr 2026 12:37 UTC

7 points

0 comments, 7 min read, LW link

ValueShift Research 17 Mar 2026 14:22 UTC
7 points
0
in reply to: Shiva's Right Foot’s comment on: Hello, World of Mechanistic Interpetability
You are right: we are probing a poorly defined area. Thank you for sharing your research and the neuron explainer – both are very helpful.

“‘Good’ is going to be relationally defined; the area the machine thinks about sunshine and lollypops is the ‘good’ part and we only know that because the ‘good’ stuff is there (i.e. without reference to sunshine and lollypops, which to be explicit is just a stand-in for general ‘good’ things, we can’t define ‘good’ really). What you want to know is whether the thoughts about ‘kill all humans’ are more similar to sunshine and lollypops or fear, piss, and death for the machine (fear, piss, and death being concepts likely on a bad end of any dichotomous good-bad principle component)” – that is an interesting research direction. It is probably a good idea to take several better-defined concepts (like sunshine and lollypops), use them as reference points, and try to map more abstract concepts somewhere between them.

“[T]he LLM may not see traditional Human valence as salient in any way; an embedding in a similarity space may find sunshine next to piss and death strictly between sunshine and lollypops. In such a case Human value may be completely alien to the LLM, indeed the concept of value in general may be alien to such a system” – that would still be a good finding. At least we would be able to locate these concepts. Then we could start thinking about how to deal with lollypops and death being closely connected.
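To make the reference-point idea concrete, here is a minimal sketch of how an abstract concept could be placed between two sets of better-defined concepts. It is an illustration under assumptions, not something we have run: the function name `locate_on_axis` is hypothetical, and the inputs are assumed to be plain embedding vectors (however they were extracted). The second return value matters for the "alien valence" case: if a concept sits far off the axis, the good-bad axis probably does not describe it well.

```python
import numpy as np

def locate_on_axis(concept, good_refs, bad_refs):
    """Place `concept` relative to two groups of reference embeddings.

    Returns (t, residual):
      t        -- position along the bad->good axis (0 = bad centroid, 1 = good centroid)
      residual -- distance of the concept from that axis (large = axis explains little)
    All names here are hypothetical; inputs are assumed to be same-sized vectors.
    """
    concept = np.asarray(concept, dtype=float)
    good = np.mean(np.asarray(good_refs, dtype=float), axis=0)
    bad = np.mean(np.asarray(bad_refs, dtype=float), axis=0)

    axis = good - bad              # direction from "bad" centroid to "good" centroid
    rel = concept - bad            # concept measured from the "bad" centroid
    t = np.dot(rel, axis) / np.dot(axis, axis)
    residual = np.linalg.norm(rel - t * axis)
    return float(t), float(residual)
```

With toy 2-D vectors, `locate_on_axis([0, 5], [[1, 0], [1, 0]], [[-1, 0]])` places the concept halfway along the axis but five units away from it, which is the kind of result that would suggest the axis is not the salient feature.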

ValueShift Research 17 Mar 2026 14:10 UTC
1 point
0
in reply to: zero85’s comment on: Hello, World of Mechanistic Interpetability
“This makes me wonder if your 5–8 dimension estimate is depth-dependent” – we think that might be true. We used linear probes with mean-difference vectors and also found that the result is domain-dependent. A family of correlated directions seems like a reasonable explanation.
Our dataset was rather small, though, so we are rerunning the same experiment with more data to see whether the results reproduce.

Our models were also small: Qwen3-0.6B and Llama-3.1-8B. Our results might be similar to yours, which is interesting – our most prominent layers are 19-23. One caveat: we pruned the last 20% of layers, based on work showing that earlier layers embed more abstract concepts and later layers embed exact tokens. We might relax this assumption and run more tests on all layers.
Dimensionality seems to vary from layer to layer, which is intuitively expected, but we want stronger evidence before claiming it.

We used residual streams.
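For readers unfamiliar with the setup, a minimal sketch of a per-layer mean-difference direction looks roughly like the following. This is not our actual pipeline: the function names are hypothetical, the activations are assumed to be already extracted into arrays of shape `(n_examples, n_layers, d_model)`, and `drop_last_frac=0.2` simply mirrors the layer pruning described above.

```python
import numpy as np

def mean_difference_directions(acts_pos, acts_neg, drop_last_frac=0.2):
    """Unit mean-difference vector per layer, skipping the last layers.

    acts_pos / acts_neg: residual-stream activations for the two classes
    (e.g. refused vs. answered prompts), shape (n_examples, n_layers, d_model).
    Assumes the class means differ at every kept layer (non-zero norm).
    """
    n_layers = acts_pos.shape[1]
    keep = n_layers - int(n_layers * drop_last_frac)   # prune the final layers
    diff = acts_pos[:, :keep].mean(axis=0) - acts_neg[:, :keep].mean(axis=0)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

def layer_separation(acts_pos, acts_neg, directions):
    """Gap between mean projections of the two classes, one value per kept layer."""
    keep = directions.shape[0]
    proj_pos = np.einsum("nld,ld->nl", acts_pos[:, :keep], directions)
    proj_neg = np.einsum("nld,ld->nl", acts_neg[:, :keep], directions)
    return proj_pos.mean(axis=0) - proj_neg.mean(axis=0)

# Synthetic stand-in data; real activations would come from the model.
rng = np.random.default_rng(0)
pos = rng.normal(loc=0.5, size=(32, 10, 16))
neg = rng.normal(loc=0.0, size=(32, 10, 16))
dirs = mean_difference_directions(pos, neg)
gaps = layer_separation(pos, neg, dirs)
```

A single mean-difference vector per layer gives only one direction, so estimating dimensionality needs something more (for example, comparing directions across domains), which this sketch does not attempt.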

Hello, World of Mechanistic Interpetability

ValueShift Research 15 Mar 2026 23:36 UTC

8 points

4 comments, 5 min read, LW link

First steps into mechanistic interpretability. Refusal is not a single direction

ValueShift Research 15 Mar 2026 23:36 UTC

1 point

0 comments, 3 min read, LW link
