Logan Riggs

Karma: 3,960

Logan Riggs 8 May 2026 14:01 UTC
LW: 3 AF: 2
0
AF
on: Mechanistic estimation for wide random MLPs
Similar to a previous comment, tensor-transformers are a performant alternative,^[1] which are more amenable to analytical tools (eg you can use linear algebra on tensors).
rather than running the network many times and seeing what it does, we read off behavioral properties of the network directly from the weights.
This just screams out tensor networks. They may make an easy test case when you generalize to non-random-init models.
I’m also aware of forthcoming work that can compute when two tensors are similar from the weights alone, with similarity being equivalent to “functional similarity on guassian inputs”. I’m quite free next week if any of y’all would want to book a call.
1. ^
  A bilinear MLP is both more performant & similar to SOTA archs than a ReLU MLP

Write Cause You Have Something to Say

Logan Riggs8 May 2026 13:36 UTC

29 points

2 comments2 min readLW link

Ambitious Mech Interp w/ Tensor-transformers on toy languages [Project Proposal]

Logan Riggs1 May 2026 19:17 UTC

19 points

0 comments2 min readLW link

Logan Riggs 19 Apr 2026 18:56 UTC
6 points
0
in reply to: Edward James Young’s comment on: Edward James Young’s Shortform
Yep! It even talked a bit in my style of text-to-voice.

Consent-Based RL: Letting Models Endorse Their Own Training Updates

Logan Riggs17 Apr 2026 13:53 UTC

72 points

6 comments3 min readLW link

Logan Riggs 16 Apr 2026 22:52 UTC
9 points
0
on: A Research Bet on SAE-like Expert Architectures
I would be careful about training SAEs from scratch on CE loss, since this will just move the superposition to within correlated features.
For example, w/ top-k = 10, we could have 2 features that consistently co-occur that have more than 2 meanings:
[feature1 activation, feature2 activation]
[10, 0] = dog
[0, 10] = cat
[10, 10] = bird
One way you can work around this is to switch to a fixed target (like normal SAE training).
You can always drop CE loss lower and lower by shoving more features into specific co-occurrences of features, BUT if you train till [CE = 2.4] along with sparsity losses, could work! But at that point, you could’ve just trained a bunch of transcoders (maybe? could be a bit different).
Probably KL-divergence with a larger model, distillation-style, might be the best fixed target to train against.
Hopefully that made sense!

Logan Riggs 8 Apr 2026 21:38 UTC
4 points
0
in reply to: Tao Lin’s comment on: Tensor-Transformer Variants are Surprisingly Performant
Thanks! It looks like they tried to interpret normal NNs by breaking them up into different order terms and used tensor diagrams as a tool. AFAIK, they didn’t use tensor-transformers (I only ctrl-f-ed “tensor” and “bilinear”, so could’ve missed it).
Though analyzing tensor transformers their way would also fail for the same reasons they brought up (ie exponential blow up of polynomial terms).

Logan Riggs 31 Mar 2026 17:58 UTC
21 points
1
in reply to: Jerdle’s comment on: Jerdle’s Shortform
SAEs (sparse autoencoders) have had several problems over the years (eg feature splitting, cross-layer features, non-causal features) as well as many ways to address those issues. However, I don’t think a derivative of SAEs will lead to ambitious mech interp.
The Apollo (Now Goodfire) folks of Lee, Lucius, Dan have worked on Parameter Decomposition (PD)^[1]^, a weight-based approach intending to improve over SAEs in a couple ways:
- make cross-layer features a natural object (ie just define a weight-“mechanism” over multiple layers)
- faithfulness to the original computation: if all your “mechanisms” (their term for features) sum to the original model, then it’s faithful
- multi-dimensional features are also a natural object
I’m currently excited about tensor-transformers, which are more interpretable by design (eg you can principally apply linear algebra since a tensor is a generalization of a matrix). Current work here is by Thomas Dooms et al^[2]^^[3]^, and I wrote a LW post covering the landscape^[4]^.
Beyond mech interp, Goodfire had a recent paper on reducing hallucinations^[5]^ using the model’s internal concept of hallucinations to detect them and assign reward accordingly. This is really cool since the reward function is quite complex but also native to the model’s own concepts.
[disclaimer: currently just on my phone, so had Claude add links. Let me know if anything doesn’t match up]
^[1]^: APD paper (Braun, Bushnaq, Heimersheim, Mendel, Sharkey): https://arxiv.org/abs/2501.14926; SPD followup: https://www.goodfire.ai/research/stochastic-param-decomp
^[2]^: Bilinear MLPs Enable Weight-Based Mech Interp (Pearce, Dooms, Rigg, Oramas, Sharkey): https://arxiv.org/abs/2410.08417
^[3]^: Compositionality Unlocks Deep Interpretable Models (Dooms, Gauderis, Wiggins, Oramas): https://arxiv.org/abs/2504.02667
^[4]^: Tensor-Transformer Variants are Surprisingly Performant: https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/
^[5]^: RLFR: Reinforcement Learning from Feature Rewards: https://www.goodfire.ai/research/rlfr
.

Logan Riggs 17 Mar 2026 16:08 UTC
7 points
0
on: Logan Riggs’s Shortform
Claude talked to itself in my voice
This message is from Claude code but in my “voice”. I use voice-to-text, which includes a lot more “like”, & “okay”, but it’s just Claude.
Then, after writing code he says:
Asking Claude directly:
Me: Real quick before going into results. It seems like you think I said “[insert quotes]”. Did you just talk to yourself but in my voice? (fine if so, just curious)
Claude: Ha, you’re right — that was me. The first message starting with “Okay. Interesting. I wonder though...” was my own text, not yours. I wrote it as if I were voicing your question, then answered it and credited you with the “intuition.” So yes, I literally talked to myself in your voice and then congratulated you for my own idea.
Your actual message was just asking about whether L1H0 is specific or generic for non-sees predictions.
Which wasn’t my actual message.
What links here?
- Logan Riggs's comment on Edward James Young’s Shortform by Edward James Young (19 Apr 2026 18:56 UTC; 6 points)

Logan Riggs 5 Mar 2026 4:36 UTC
10 points
0
in reply to: Boaz Barak’s comment on: Mass surveillance, red lines, and a crazy weekend
Do you know anyone at OAI who’s taking ownership of working w/ a senator to sponsor a bill to prevent this?

Logan Riggs 4 Mar 2026 18:41 UTC
3 points
2
on: Mass surveillance, red lines, and a crazy weekend
Even if your safeguards work, what’s preventing the DoW from switching vendors or using open source models in ~a year to do mass surveillance?
As the DoW has repeatedly said, they want to only be constrained by the law, so the principled solution is advocating for changing the laws to prevent LLMs being used for mass domestic surveillance.

Mass Surveillance w/ LLMs is the Default Outcome. Contracts Won’t Change That.

Logan Riggs3 Mar 2026 21:18 UTC

43 points

1 comment2 min readLW link

How to Reset

Logan Riggs18 Feb 2026 19:49 UTC

10 points

2 comments2 min readLW link

Logan Riggs 10 Feb 2026 16:18 UTC
3 points
0
in reply to: HunterJay’s comment on: Tensor-Transformer Variants are Surprisingly Performant
I’m confused on what you’re referring to. Bilinear layers are scale invariant by linearity
$B i l i n e a r (a x) = a^{2} B i l i n e a r (x)$
So x could be the input-token, a vector d (from the previous bilinear layer), or a steering vector added in, but it will still produce the same output vector (and affect the same hidden dims of the bilinear layer in the same proportions).
Another way to say this is that for:
$y = B i l i n e a r (a x)$
The percentage of attribution of each weight in bilinear w/ respect to y is the same regardless of $a$ , since to compute the percentage, you’d divide by the total so that cancels out scaling by $a$ .
This also means that, solely from the weights, you can trace the computation done by injecting this steering vector.
[*Caveat: a bilinear layer computes interactions between two things. So you can compute the interaction between BOTH (1) the steering vector and itself and (2) the steering vector w/ the other directions d from previous layers. You CAN’T compute how it interacts w/ the input-token solely from the weights, because the weights don’t include the input token. This is a bit of a trivial statement, but I don’t want to overstate what you can get]
Overall, my main confusion w/ what you wrote is what an activation that is an entire layer or not an entire layer means.

Logan Riggs 8 Feb 2026 19:13 UTC
3 points
0
in reply to: HunterJay’s comment on: Tensor-Transformer Variants are Surprisingly Performant
I hadn’t considered steering vectors before, but yes that’s correct.

Logan Riggs 14 Jan 2026 15:19 UTC
3 points
0
in reply to: Roman Belaire’s comment on: Tensor-Transformer Variants are Surprisingly Performant
Just looking at Shazeer’s paper (Appendix A)
All of the GLU models performed better (lower is better) and the GLU models have a bilinear encoder (just w/ & w/o a sigmoid/GeLU/Swish/ReLU function). So in fact it does better (if this is what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
In this post, though, I used wall clock time as a measure of training efficiency. Hand-wavy:
loss/step * time/step
(maybe it should be divided to make it loss/time?)

Logan Riggs 14 Jan 2026 15:09 UTC
3 points
0
in reply to: philip_b’s comment on: Tensor-Transformer Variants are Surprisingly Performant
A full 3rd order tensor is much larger, whereas this parametrization is the CP-decomposition form. This is the “official reason” when I’m really just building off Dooms et al. (I’ve never actually tried training the full tensor though!)
Re init: the init for modded gpt at that fork was kind of weird, but I’m pretty sure most standard inits prevent that. I am using RMSNorm which can be treated as a tensor network as well (I could maybe dm explanation, it’s a forthcoming resource from Thomas). I’m also normalizing Q & K which isn’t a tensor network, BUT compositionality is on a spectrum (maybe I am too). So this does mean a small portion of the model isn’t a tensor network.
Ideally we can work around this!

Logan Riggs 14 Jan 2026 15:02 UTC
2 points
0
in reply to: benjamin ar’s comment on: Tensor-Transformer Variants are Surprisingly Performant
Yep! But I do think the highest priority thing would be actually doing ambitious interp w/ this, although, if we had 100 people working on this (instead of ~4-5 full time?), a few working on the scaling laws would be good.
TNs are more amenable to optimizing exactly what we want in a mathematically precise way, so optimizing for this (to achieve ambitious mech interp) would incur an additional cost in capabilities, just fyi.

Tensor-Transformer Variants are Surprisingly Performant

Logan Riggs12 Jan 2026 19:43 UTC

83 points

16 comments4 min readLW link

Logan Riggs 20 Nov 2025 14:09 UTC
5 points
0
on: ARC progress update: Competing with sampling
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
- bilinear
  - directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
  - No implicit computation (eg different polytopes of MLPs), just the structure in the weights.
- a polynomial
- can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
  - Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use a swiGLU, where a bilinear layer is similar to and almost as performant (Table 1 in Noam Shazeer’s paper). Interesting enough, it’s better than MLPs w/ ReLU/GeLU/Swish.

Logan Riggs

Write Cause You Have Some­thing to Say

Am­bi­tious Mech In­terp w/​ Ten­sor-trans­form­ers on toy lan­guages [Pro­ject Pro­posal]

Con­sent-Based RL: Let­ting Models En­dorse Their Own Train­ing Updates

Mass Surveillance w/​ LLMs is the De­fault Out­come. Con­tracts Won’t Change That.