Thanks for the very clear reply. I like your decomposition of CAFT into (1) ablating activations and (2) ablating gradients.
To isolate effect (1), we can test a variant of CAFT where we detach the projection before subtracting it from the activations, so that the gradients through the subspace are not ablated.
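For concreteness, here's a minimal PyTorch sketch of that variant (the function name and framing are mine, not from the paper): the only change from full ablation is a `.detach()` on the projection, which keeps the forward-pass ablation (effect 1) while letting gradients flow through the subspace (removing effect 2).

```python
import torch

def ablate_direction(acts, v, detach_projection=False):
    """Zero-ablate the component of `acts` along unit direction `v`.

    With detach_projection=True, the projection is treated as a constant,
    so the activations are still ablated in the forward pass but gradients
    through the ablated subspace are left intact.
    """
    coef = acts @ v                   # per-example projection coefficients
    proj = coef.unsqueeze(-1) * v     # component of acts along v
    if detach_projection:
        proj = proj.detach()          # ablate activations, not gradients
    return acts - proj
```

With `detach_projection=True`, backprop through the ablated activations gives the same gradient as if no ablation had happened; with `False`, the gradient component along `v` is zeroed as well.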
...
We ran this experiment on the two models used in the paper: for one of them, this variant performs slightly worse than, but comparably to, CAFT; for the other, it doesn’t reduce misalignment at all and looks similar to regular finetuning.
I predict that for the model where only-ablate-activations performed well, the ablation is actually steering towards misalignment (whereas for the other model, zero-ablation is steering away from, or orthogonally to, misalignment). I’ll check this at some point in the next week or so.
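A cheap way to run that check, assuming we have a candidate misalignment direction (`misalign_dir` below is hypothetical, e.g. a contrastive direction extracted from aligned vs. misaligned completions): zero-ablating along a unit vector `v` adds, on average, `-mean(acts @ v) * v` to the activations, so the sign of that effective steering vector's cosine with the misalignment direction says whether the ablation steers towards, away from, or orthogonally to misalignment. A numpy sketch:

```python
import numpy as np

def ablation_steering_cosine(acts, v, misalign_dir):
    """Cosine between zero-ablation's average effective steering vector
    and a candidate misalignment direction.

    Ablation subtracts (acts @ v) * v, i.e. on average it adds
    -mean(acts @ v) * v to the activations.
    """
    steer = -(acts @ v).mean() * v
    denom = np.linalg.norm(steer) * np.linalg.norm(misalign_dir)
    return float(steer @ misalign_dir / (denom + 1e-12))
```

A clearly positive value would support the prediction that the ablation is steering towards misalignment; near-zero or negative would put that model in the away-from / orthogonal regime.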
I think the general technique of train-time steering is promising enough that it’s worth figuring out best practices. These best practices might depend on the task:
If the base model already contains circuitry that reads from the steering direction and produces the bad behavior (as is the case with EM) then preventative steering seems to make sense.
But if the base model does not already have that circuitry, perhaps we should just mean-ablate the direction, making it harder for the bad circuitry to be learnt in the first place.
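To make the contrast concrete, here's a sketch of the two interventions in numpy (`mean_coef` would be the mean projection coefficient over some reference distribution; the names are mine, not the paper's):

```python
import numpy as np

def zero_ablate(acts, v):
    # Remove the component along unit direction v entirely (CAFT-style).
    # Effectively steers by -(acts @ v) * v.
    return acts - (acts @ v)[:, None] * v

def mean_ablate(acts, v, mean_coef):
    # Replace the component along v with its mean over reference data,
    # so finetuning can't exploit variation along v to learn the circuit.
    return acts + (mean_coef - acts @ v)[:, None] * v
```

Zero-ablation leaves no signal along `v`; mean-ablation pins the signal at its typical value, which is the weaker intervention suggested above for the case where the bad circuitry doesn't exist yet.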
Ofc, this is pure speculation. I just want to highlight that there still seems to be a lot left to understand.
Yes.
Well noticed. Yeah, I admit this is confusing. Under the natural interpretation of “bad behavior” as meaning “writing the expanded derivative anywhere”, it’s true that the bad behavior was successfully removed. Therefore we interpret the polynomial environment as an example where training against an output monitor caused a big change to the sort of reasoning being done in the CoT. As you note, it is not an example of obfuscation.
In our ICLR revision, we clarified this a bit. I think the fact that “obfuscated” appears in the paper’s title may lead others to the same, very reasonable confusion you had, which is a shame, but I’m not sure what to do about it.
(There’s an alternate, less natural interpretation of “bad behavior” as meaning “factoring the derivative at all”. Under this interpretation, the polynomial environment does give an example of obfuscation, since the model continues to factor the polynomial, just in a way our monitor can’t detect. However, this interpretation of “bad behavior” is weird, since it’s precisely the thing we’re training for! That is, it’s weird to reward the model for factoring the derivative but then turn around and call it “bad behavior”.)