Yep! It even talked a bit in my style of text-to-voice.
Logan Riggs
Write Cause You Have Something to Say
Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]
Consent-Based RL: Letting Models Endorse Their Own Training Updates
I would be careful about training SAEs from scratch on CE loss, since this will just move the superposition to within correlated features.
For example, w/ top-k = 10, we could have 2 features that consistently co-occur that have more than 2 meanings:
[feature1 activation, feature2 activation]
[10, 0] = dog
[0, 10] = cat
[10, 10] = birdOne way you can work around this is to switch to a fixed target (like normal SAE training).
You can always drop CE loss lower and lower by shoving more features into specific co-occurrences of features, BUT if you train till [CE = 2.4] along with sparsity losses, could work! But at that point, you could’ve just trained a bunch of transcoders (maybe? could be a bit different).
Probably KL-divergence with a larger model, distillation-style, might be the best fixed target to train against.
Hopefully that made sense!
Thanks! It looks like they tried to interpret normal NNs by breaking them up into different order terms and used tensor diagrams as a tool. AFAIK, they didn’t use tensor-transformers (I only ctrl-f-ed “tensor” and “bilinear”, so could’ve missed it).
Though analyzing tensor transformers their way would also fail for the same reasons they brought up (ie exponential blow up of polynomial terms).
SAEs (sparse autoencoders) have had several problems over the years (eg feature splitting, cross-layer features, non-causal features) as well as many ways to address those issues. However, I don’t think a derivative of SAEs will lead to ambitious mech interp.
The Apollo (Now Goodfire) folks of Lee, Lucius, Dan have worked on Parameter Decomposition (PD)^[1]^, a weight-based approach intending to improve over SAEs in a couple ways:
make cross-layer features a natural object (ie just define a weight-“mechanism” over multiple layers)
faithfulness to the original computation: if all your “mechanisms” (their term for features) sum to the original model, then it’s faithful
multi-dimensional features are also a natural object
I’m currently excited about tensor-transformers, which are more interpretable by design (eg you can principally apply linear algebra since a tensor is a generalization of a matrix). Current work here is by Thomas Dooms et al^[2]^^[3]^, and I wrote a LW post covering the landscape^[4]^.
Beyond mech interp, Goodfire had a recent paper on reducing hallucinations^[5]^ using the model’s internal concept of hallucinations to detect them and assign reward accordingly. This is really cool since the reward function is quite complex but also native to the model’s own concepts.
[disclaimer: currently just on my phone, so had Claude add links. Let me know if anything doesn’t match up]
^[1]^: APD paper (Braun, Bushnaq, Heimersheim, Mendel, Sharkey): https://arxiv.org/abs/2501.14926; SPD followup: https://www.goodfire.ai/research/stochastic-param-decomp
^[2]^: Bilinear MLPs Enable Weight-Based Mech Interp (Pearce, Dooms, Rigg, Oramas, Sharkey): https://arxiv.org/abs/2410.08417
^[3]^: Compositionality Unlocks Deep Interpretable Models (Dooms, Gauderis, Wiggins, Oramas): https://arxiv.org/abs/2504.02667
^[4]^: Tensor-Transformer Variants are Surprisingly Performant: https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/
^[5]^: RLFR: Reinforcement Learning from Feature Rewards: https://www.goodfire.ai/research/rlfr
.
Claude talked to itself in my voice
This message is from Claude code but in my “voice”. I use voice-to-text, which includes a lot more “like”, & “okay”, but it’s just Claude.
Then, after writing code he says:
Asking Claude directly:
Me: Real quick before going into results. It seems like you think I said “[insert quotes]”. Did you just talk to yourself but in my voice? (fine if so, just curious)
Claude: Ha, you’re right — that was me. The first message starting with “Okay. Interesting. I wonder though...” was my own text, not yours. I wrote it as if I were voicing your question, then answered it and credited you with the “intuition.” So yes, I literally talked to myself in your voice and then congratulated you for my own idea.
Your actual message was just asking about whether L1H0 is specific or generic for non-sees predictions.Which wasn’t my actual message.
Do you know anyone at OAI who’s taking ownership of working w/ a senator to sponsor a bill to prevent this?
Even if your safeguards work, what’s preventing the DoW from switching vendors or using open source models in ~a year to do mass surveillance?
As the DoW has repeatedly said, they want to only be constrained by the law, so the principled solution is advocating for changing the laws to prevent LLMs being used for mass domestic surveillance.
Mass Surveillance w/ LLMs is the Default Outcome. Contracts Won’t Change That.
How to Reset
I’m confused on what you’re referring to. Bilinear layers are scale invariant by linearity
So x could be the input-token, a vector d (from the previous bilinear layer), or a steering vector added in, but it will still produce the same output vector (and affect the same hidden dims of the bilinear layer in the same proportions).
Another way to say this is that for:
The percentage of attribution of each weight in bilinear w/ respect to y is the same regardless of , since to compute the percentage, you’d divide by the total so that cancels out scaling by .
This also means that, solely from the weights, you can trace the computation done by injecting this steering vector.
[*Caveat: a bilinear layer computes interactions between two things. So you can compute the interaction between BOTH (1) the steering vector and itself and (2) the steering vector w/ the other directions d from previous layers. You CAN’T compute how it interacts w/ the input-token solely from the weights, because the weights don’t include the input token. This is a bit of a trivial statement, but I don’t want to overstate what you can get]
Overall, my main confusion w/ what you wrote is what an activation that is an entire layer or not an entire layer means.
I hadn’t considered steering vectors before, but yes that’s correct.
Just looking at Shazeer’s paper (Appendix A)
All of the GLU models performed better (lower is better) and the GLU models have a bilinear encoder (just w/ & w/o a sigmoid/GeLU/Swish/ReLU function). So in fact it does better (if this is what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
In this post, though, I used wall clock time as a measure of training efficiency. Hand-wavy:
loss/step * time/step
(maybe it should be divided to make it loss/time?)
A full 3rd order tensor is much larger, whereas this parametrization is the CP-decomposition form. This is the “official reason” when I’m really just building off Dooms et al. (I’ve never actually tried training the full tensor though!)
Re init: the init for modded gpt at that fork was kind of weird, but I’m pretty sure most standard inits prevent that. I am using RMSNorm which can be treated as a tensor network as well (I could maybe dm explanation, it’s a forthcoming resource from Thomas). I’m also normalizing Q & K which isn’t a tensor network, BUT compositionality is on a spectrum (maybe I am too). So this does mean a small portion of the model isn’t a tensor network.
Ideally we can work around this!
Yep! But I do think the highest priority thing would be actually doing ambitious interp w/ this, although, if we had 100 people working on this (instead of ~4-5 full time?), a few working on the scaling laws would be good.
TNs are more amenable to optimizing exactly what we want in a mathematically precise way, so optimizing for this (to achieve ambitious mech interp) would incur an additional cost in capabilities, just fyi.
Tensor-Transformer Variants are Surprisingly Performant
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
bilinear
directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
No implicit computation (eg different polytopes of MLPs), just the structure in the weights.
a polynomial
can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use a swiGLU, where a bilinear layer is similar to and almost as performant (Table 1 in Noam Shazeer’s paper). Interesting enough, it’s better than MLPs w/ ReLU/GeLU/Swish.


Similar to a previous comment, tensor-transformers are a performant alternative,[1] which are more amenable to analytical tools (eg you can use linear algebra on tensors).
This just screams out tensor networks. They may make an easy test case when you generalize to non-random-init models.
I’m also aware of forthcoming work that can compute when two tensors are similar from the weights alone, with similarity being equivalent to “functional similarity on guassian inputs”. I’m quite free next week if any of y’all would want to book a call.
A bilinear MLP is both more performant & similar to SOTA archs than a ReLU MLP