Jason Gross

Karma: 274

Jason Gross 25 Nov 2025 2:40 UTC
1 point
0
in reply to: Quinn’s comment on: Please Measure Verification Burden
And if you combine Hypothesis with fractional proofs, you can ⁸⁰⁄₂₀ the difference between just Hypothesis, and proofs!

Jason Gross 25 Nov 2025 2:39 UTC
3 points
0
on: Please Measure Verification Burden
The baseline for proof burden is just lines of proof / lines of code. For production-grade software verification projects this is 10×–100×.

Models that are bad at verification will do worse.

On ambitious projects (e.g., AlphaProof when it came out) verification might increase capabilities, leading to a verification burden < 1.
What links here?
- How to Solve Secure Program Synthesis by Max von Hippel (30 Mar 2026 20:12 UTC; 23 points)

[Replication] Crosscoder-based Stage-Wise Model Diffing

Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree and Jason Gross

22 Mar 2025 18:35 UTC

25 points

0 comments · 7 min read · LW link

Measuring Nonlinear Feature Interactions in Sparse Crosscoders [Project Proposal]

Jason Gross and rajashree

6 Jan 2025 4:22 UTC

19 points

0 comments · 12 min read · LW link

Jason Gross 10 Nov 2024 16:04 UTC
5 points
0
on: Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

The model learns to act harmfully for vulnerable users while harmlessly for the evals.

If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)

Jason Gross 25 Oct 2024 3:23 UTC
LW: 3 AF: 1
0
AF
on: Are we dropping the ball on Recommendation AIs?
I believe the closest research to this topic is under the heading “Performative Power” (cf, e.g., this arXiv paper). I think “The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power” by Shoshana Zuboff is also a pretty good book that seems related.

Jason Gross 23 Aug 2024 1:50 UTC
4 points
0
in reply to: leogao’s comment on: A simple model of math skill
The reason you can’t sample uniformly from the integers is more like “because they are not compact” or “because they are not bounded” than “because they are infinite and countable”. You also can’t sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)

If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].

Jason Gross 22 Jul 2024 7:11 UTC
3 points
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform

I don’t want a description of every single plate and cable in a Toyota Corolla, I’m not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.

What I want right now is a basic understanding of combustion engines.

This is the wrong ‘length’. The right version of brute-force length is not “every weight and bias in the network” but “the program trace of running the network on every datapoint in pretrain”. Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.

Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.

Jason Gross 22 Jul 2024 6:53 UTC
LW: 1 AF: 1
0
AF
on: A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
This is very interesting! What prior does log(1+|a|) correspond to? And what about using $\prod_{i} (1 + |a_{i}|)$ instead of $\sum_{i} \log(1 + |a_{i}|)$? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
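One way to make the question above concrete is the standard reading of a penalty $\lambda\,\phi(a)$ as a negative log-prior. The sketch below assumes that reading; the weight $\lambda$ and the normalizing constants are symbols introduced here for illustration, not taken from the original post.

```latex
% Penalty-to-prior correspondence, assuming p(a) \propto \exp(-\lambda\,\phi(a)).
\begin{align*}
\phi(a) = |a|            &\;\Rightarrow\; p(a) \propto e^{-\lambda |a|}        && \text{(Laplace)} \\
\phi(a) = \log(1 + a^2)  &\;\Rightarrow\; p(a) \propto (1 + a^2)^{-\lambda}    && \text{(Cauchy when } \lambda = 1\text{)} \\
\phi(a) = \log(1 + |a|)  &\;\Rightarrow\; p(a) = \tfrac{\lambda - 1}{2}\,(1 + |a|)^{-\lambda} && \text{(normalizable only for } \lambda > 1\text{)}
\end{align*}
% The sum and the product are related by a monotone transform:
\[
\sum_i \log(1 + |a_i|) = \log \prod_i (1 + |a_i|),
\qquad
\exp\Bigl(-\lambda \sum_i \log(1 + |a_i|)\Bigr) = \prod_i (1 + |a_i|)^{-\lambda}.
\]
```

Under this reading, the sum form corresponds to a prior that factorizes across features, which is exactly an independence assumption. The third row is a two-sided power-law density with heavier tails than the Laplace prior.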

Jason Gross 22 Jul 2024 6:33 UTC
LW: 1 AF: 1
0
AF
on: A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
  Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
  There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.
These do not sound like different explanations to me. In particular, the distinction between “mostly-continuous but approximated as discrete” and “discrete but very similar” seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won’t change the behavior of the network meaningfully).
As far as toy models go, I’m pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width $d_{\text{vocab}}$, you should find one feature for each sequence maximum (roughly). If you train with SAE width $d_{\text{vocab}}^{3}\, n_{\text{ctx}}$, I expect each feature to split into roughly $d_{\text{vocab}}^{2}\, n_{\text{ctx}}$ features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.) I’m quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
Edit: I expect this toy model will also permit exploring:
[Lee] Is there structure in feature splitting?
- Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
  ‘topology in a math context’
  ‘topology in a physics context’
  ‘high dimensions in a math context’
  ‘high dimensions in a physics context’
- Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn’t the original SAEs find these directions?
I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.

Jason Gross 21 Jul 2024 1:35 UTC
2 points
0
in reply to: Joseph Miller’s comment on: Transformer Circuit Faithfulness Metrics Are Not Robust

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would—resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you look not just at a single corrupted cache but at the result across all corrupted inputs. That is, in the simple toy model where you’re computing $f(x, y)$, where $x$ is the values for the circuit you care about and $y$ is the cache of corrupted activations, mean ablation computes $f(x, \mathbb{E}_{y \sim D}[y])$, and we could imagine versions of resample ablation that compute $f(x, y)$ for some $y$ drawn from $D$, or we could compute $\mathbb{E}_{y \sim D}[f(x, y)]$. I would say that mean ablation and resample ablation (as I’m imagining you’re describing it) are both attempts to cheaply approximate $\mathbb{E}_{y \sim D}[f(x, y)]$.
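The three quantities in the toy model above can be compared directly. This is a minimal sketch, not anyone's actual ablation code: the function `f`, the two-element stand-in for $D$, and all helper names are hypothetical choices made for illustration.

```python
import numpy as np

def f(x, y):
    # Hypothetical toy model: circuit value x plus a ReLU readout of the
    # off-circuit activations y (last axis indexes activation dimensions).
    return x + np.maximum(y, 0.0).sum(axis=-1)

def mean_ablation(x, ys):
    # f(x, E_y[y]): average the corrupted caches first, then run the model once.
    return f(x, ys.mean(axis=0))

def resample_ablation(x, ys, rng):
    # f(x, y) for a single y drawn from D: a one-sample estimate.
    return f(x, ys[rng.integers(len(ys))])

def expected_ablation(x, ys):
    # E_y[f(x, y)]: run the model on every corrupted cache, then average.
    return f(x, ys).mean()

x = 2.0
ys = np.array([[-1.0], [1.0]])  # two corrupted caches standing in for D
rng = np.random.default_rng(0)

mean_out = mean_ablation(x, ys)              # the ReLU sees the averaged cache
expected_out = expected_ablation(x, ys)      # the ReLU sees each cache separately
resample_out = resample_ablation(x, ys, rng)
```

Because `f` is nonlinear in `y`, `mean_out` and `expected_out` differ here. A single resample is an unbiased but noisy estimate of `expected_out`, while mean ablation is noise-free but biased whenever the model is nonlinear in the ablated activations.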

Jason Gross 18 Jul 2024 0:41 UTC
3 points
2
on: Transformer Circuit Faithfulness Metrics Are Not Robust
But in other aspects there often isn’t a clearly correct methodology. For example, it’s unclear whether mean ablations are better than resample ablations for a particular experiment—even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there’s a correct kind of ablation. For example, if the question is “how do we reproduce this behavior from scratch”, you want zero ablation.

Your table can be reorganized into the kinds of answers you’re seeking, namely:
- direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
- necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)
- typical case vs worst case, and over what data distribution:
  - “all tokens vs specific tokens” should be absorbed into the more general category of “what’s the reference dataset distribution under consideration” / “what’s the null hypothesis over”,
  - zero ablation answers “reproduce behavior from scratch”
  - mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
  - pessimal ablation is for dealing with worst-case behaviors
- granularity and component are about the scope of the solution language, and can be generalized a bit
Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs

Jason Gross 18 Jul 2024 0:05 UTC
1 point
0
on: Transformer Circuit Faithfulness Metrics Are Not Robust
Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.
Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right? “Mean ablation” is underspecified in the absence of a dataset distribution.
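To illustrate why the reference distribution matters, here is a toy sketch with made-up two-dimensional activations. Everything in it is an assumption for illustration: the axis meanings, the two tiny "datasets", and especially the fact that the broad-distribution mean comes out to exactly zero, which real models need not satisfy.

```python
import numpy as np

def mean_ablate(activation, reference_activations):
    # Replace the activation with the mean over a reference dataset.
    mean = reference_activations.mean(axis=0)
    return np.broadcast_to(mean, activation.shape).copy()

def zero_ablate(activation):
    # Replace the activation with zeros.
    return np.zeros_like(activation)

# Hypothetical axes: [0] = "about to output a name", [1] = "which name".
ioi_acts = np.array([[1.0, 0.5], [1.0, -0.5]])
webtext_acts = np.array([[1.0, 0.5], [-1.0, -0.5], [0.0, 0.3], [0.0, -0.3]])
activation = np.array([1.0, 0.5])

ioi_mean = mean_ablate(activation, ioi_acts)          # keeps the task-wide component
webtext_mean = mean_ablate(activation, webtext_acts)  # removes it in this toy
zeroed = zero_ablate(activation)
```

In this construction the IOI mean preserves the shared "output a name" component, while the webtext mean coincides with zero ablation. The same function name, "mean ablation", therefore yields different interventions depending on the dataset it averages over.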

Jason Gross 14 Jul 2024 20:51 UTC
1 point
0
on: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
it’s substantially worth if we restrict
Typo: should be “substantially worse”

Jason Gross 11 Jul 2024 0:14 UTC
LW: 4 AF: 2
1
AF
on: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al) - nothing important in mech interp has properly built on this IMO, but there’s just a ton of gorgeous results in there. I think it’s the most (only?) truly rigorous reverse-engineering work out there
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re “most (only?) truly rigorous reverse-engineering work out there”: I think the clock and pizza paper seems comparably rigorous, and there’s also my recent Compact Proofs of Model Performance via Mechanistic Interpretability (and Gabe’s heuristic analysis of the same Max-of-K model), and the work one of my MARS scholars did showing that some pizza models use a ReLU to compute numerical integration, which is the first nontrivial mechanistic explanation of a nonlinearity found in a trained model (nontrivial in the sense that it asymptotically compresses the brute-force input-output behavior with a (provably) non-vacuous bound).
What links here?
- An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda (7 Jul 2024 17:39 UTC; 146 points)

Jason Gross 26 Jun 2024 17:14 UTC
4 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Formal verification, heuristic explanations and surprise accounting
Possibilities I see:
1. Maybe the cost can be amortized over the whole circuit? Use one bit per circuit to say “this is just and/or” vs “use all gates”.
2. This is an illustrative, simplified example; in a more general scheme, you need to specify a coding scheme, which is equivalent to specifying a prior over possible things you might see.
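The equivalence between coding schemes and priors mentioned in the second possibility is the standard code-length correspondence, sketched below. The symbols $\ell$, $P$, and the gate count $n$ are introduced here for illustration and do not come from the original post.

```latex
% A prefix-free code with lengths \ell(x) satisfies the Kraft inequality,
% so its lengths define a (sub-)probability distribution, and vice versa.
\[
\sum_x 2^{-\ell(x)} \le 1,
\qquad
P(x) \propto 2^{-\ell(x)},
\qquad
\ell(x) = \bigl\lceil -\log_2 P(x) \bigr\rceil .
\]
% Amortization in the first possibility: a single 1-bit flag
% ("and/or only" vs "all gates") shared by a circuit of n gates costs
\[
\frac{1}{n} \text{ bits per gate.}
\]
```

Read this way, spending one bit on the flag amounts to a prior that puts probability 1/2 on each of the two gate vocabularies.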

Jason Gross 25 Jun 2024 23:25 UTC
LW: 12 AF: 6
3
AF
in reply to: RogerDearnaley’s comment on: Compact Proofs of Model Performance via Mechanistic Interpretability
I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which looks at the activations for another input sampled at random.

On our particular model, doing this replacement shows us that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)

On other toy models we’ve looked at (modular addition in particular, writeup forthcoming), we have (very) preliminary evidence suggesting that randomizing the noise has a steep drop-off in bound-tightness (as a function of how compact a proof the noise term comes from) in a very similar fashion to what we see with proofs. There seems to be a pretty narrow band of hypotheses for which the noise is structureless but we can’t prove it. This is supported by a handful of comments about how causal scrubbing indicates that many existing mech interp hypotheses in fact don’t capture enough of the behavior.

Jason Gross 25 Jun 2024 19:33 UTC
22 points
1
on: SAE feature geometry is outside the superposition hypothesis

I think it would be valuable to take a set of interesting examples of understood internal structure, and to ask what happens when we train SAEs to try to capture this structure. [...] In other cases, it may seem to us very unnatural to think of the structure we have uncovered in terms of a set of directions (sparse or otherwise) — what does the SAE do in this case?

I’m not sure how SAEs would capture the internal structure of the activations of the pizza model for modular addition, even in theory. In this case, ReLU is used to compute numerical integration, approximating $\int_{-\pi}^{\pi} \left|\cos\left(\frac{k}{2} + \phi\right)\right| \cos(2\phi)\, d\phi = \frac{4}{3} \cos k$ (and/or similarly for sin). Each neuron is responsible for one small rectangle under the curve. Its input is the part of the integrand under the absolute value/ReLU, $\cos\left(\frac{k}{2} + \phi\right)$ (times a shared scaling coefficient), and the neuron’s coefficient in the Fourier-transformed decoder matrix is the area element $\cos(2\phi)\, d\phi$ (again times a shared scaling coefficient).
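The integral identity can be checked numerically with one rectangle per hypothetical "neuron". This is a sketch under assumptions that are not from the comment: equally spaced boxes, midpoint evaluation, and illustrative values for `k` and `n_neurons`. It does not reproduce the trained model's weights.

```python
import numpy as np

def riemann_sum(k, n_neurons=2048, use_abs=True):
    # One rectangle per neuron, evaluated at the midpoint of each box.
    dphi = 2 * np.pi / n_neurons
    phi = -np.pi + (np.arange(n_neurons) + 0.5) * dphi
    pre = np.cos(k / 2 + phi)  # neuron input (up to a shared scale)
    act = np.abs(pre) if use_abs else np.maximum(pre, 0.0)
    # The decoder coefficient plays the role of the area element cos(2*phi) * dphi.
    return float(np.sum(act * np.cos(2 * phi) * dphi))

k = 0.7
abs_value = riemann_sum(k, use_abs=True)    # compare with (4/3) * cos(k)
relu_value = riemann_sum(k, use_abs=False)  # compare with (2/3) * cos(k)
```

Since ReLU(u) = (u + |u|)/2 and the linear term integrates to zero against $\cos(2\phi)$ over a full period, using ReLU instead of the absolute value simply halves the result. That factor of two could plausibly be absorbed into the shared scaling coefficients.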

Notably, in this scheme, the only fully free parameters are: the frequencies of interest, the ordering of neurons, and the two scaling coefficients. There are also constrained parameters for how evenly the space is divided up into boxes and where the function evaluation points are within each box. But the geometry of activation space here is effectively fully constrained up to permutation of the axes and global scaling factors.

What could SAEs even find in this case?

Compact Proofs of Model Performance via Mechanistic Interpretability

LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross

24 Jun 2024 19:27 UTC

104 points

4 comments · 8 min read · LW link

(arxiv.org)

Jason Gross 27 Apr 2024 2:06 UTC
LW: 2 AF: 1
0
AF
on: Sparsify: A mechanistic interpretability research agenda

We propose a simple fix: Use $L_{0 < p < 1}$ instead of $L_{1}$ , which seems to be a Pareto improvement over $L_{1}$ (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.

When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability in $L_{0 < p < 1}$ in toy models of superposition, he pointed out that the gradient of the $L_{0 < p < 1}$ norm explodes near zero, meaning that features with “small errors” that cause them to have very small but non-zero overlap with some activations might be killed off entirely rather than merely having the overlap penalized.
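The exploding gradient can be seen from the derivative of a single penalty term. The sketch below assumes the penalty is applied elementwise as $\sum_i |a_i|^p$; that form and the numeric values are illustrative assumptions rather than details from the agenda.

```latex
% Gradient of one term of an L_p-style penalty with 0 < p < 1.
\[
\frac{\partial}{\partial a} |a|^{p} = p\,|a|^{p-1}\operatorname{sign}(a),
\qquad
\lim_{a \to 0} \bigl| p\,|a|^{p-1} \bigr| = \infty \quad \text{for } 0 < p < 1 .
\]
% Compare with L_1, where the gradient magnitude is constant:
\[
\frac{\partial}{\partial a} |a| = \operatorname{sign}(a), \qquad \bigl|\operatorname{sign}(a)\bigr| = 1 \text{ for } a \neq 0 .
\]
% Example: p = 1/2, a = 10^{-4} gives a gradient of 0.5 \cdot (10^{-4})^{-1/2} = 50.
```

So an activation of size $10^{-4}$ is pushed toward zero fifty times harder than under $L_1$ in this example, consistent with small overlaps being eliminated rather than gently shrunk.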

See here for some brief write-up and animations.
