I hadn’t considered steering vectors before, but yes that’s correct.
Logan Riggs
Just looking at Shazeer’s paper (Appendix A)
All of the GLU models performed better (lower is better) and the GLU models have a bilinear encoder (just w/ & w/o a sigmoid/GeLU/Swish/ReLU function). So in fact it does better (if this is what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
In this post, though, I used wall clock time as a measure of training efficiency. Hand-wavy:
loss/step * time/step
(maybe it should be divided to make it loss/time?)
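To make the hand-waving concrete, here's a minimal sketch of the comparison I mean (the numbers and the `loss_per_second` helper are made up for illustration):

```python
# Hand-wavy wall-clock comparison between two training setups.
# All numbers here are invented for illustration.

def loss_per_second(loss_drop_per_step, seconds_per_step):
    """Loss improvement per wall-clock second: (loss/step) / (time/step)."""
    return loss_drop_per_step / seconds_per_step

baseline = loss_per_second(loss_drop_per_step=0.010, seconds_per_step=0.50)
variant  = loss_per_second(loss_drop_per_step=0.012, seconds_per_step=0.70)

# The variant improves loss more per step, but its slower steps make it
# worse per unit of wall-clock time.
print(baseline, variant)
```

So dividing (loss per unit time) is the version that rewards fast steps, which is what I care about here.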
A full 3rd order tensor is much larger, whereas this parametrization is the CP-decomposition form. This is the “official reason” when I’m really just building off Dooms et al. (I’ve never actually tried training the full tensor though!)
Re init: the init for modded gpt at that fork was kind of weird, but I’m pretty sure most standard inits prevent that. I am using RMSNorm, which can be treated as a tensor network as well (I could maybe DM an explanation; it’s a forthcoming resource from Thomas). I’m also normalizing Q & K, which isn’t a tensor network, BUT compositionality is on a spectrum (maybe I am too). So this does mean a small portion of the model isn’t a tensor network.
Ideally we can work around this!
Yep! But I do think the highest priority thing would be actually doing ambitious interp w/ this, although, if we had 100 people working on this (instead of ~4-5 full time?), a few working on the scaling laws would be good.
TNs are more amenable to optimizing exactly what we want in a mathematically precise way, so optimizing for this (to achieve ambitious mech interp) would incur an additional cost in capabilities, just fyi.
Tensor-Transformer Variants are Surprisingly Performant
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
bilinear:
- directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
- No implicit computation (eg different polytopes of MLPs), just the structure in the weights.

a polynomial:
- can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use SwiGLU; a bilinear layer is similar and almost as performant (Table 1 in Noam Shazeer’s paper). Interestingly enough, it’s better than MLPs w/ ReLU/GeLU/Swish.
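For concreteness, here's a minimal numpy sketch of the GLU family as I understand it from that paper; the shapes, weights, and the `glu_block` helper are my own invention, not from any particular model:

```python
import numpy as np

# A GLU-style MLP block: out = P @ ((W @ x) * gate(V @ x)).
# With gate = identity this is the "bilinear" variant from Shazeer's
# GLU-variants paper; with gate = swish it's SwiGLU.
# Shapes here are illustrative only.

rng = np.random.default_rng(0)
d, h = 8, 16
W = rng.normal(size=(h, d))
V = rng.normal(size=(h, d))
P = rng.normal(size=(d, h))

def swish(z):
    return z / (1.0 + np.exp(-z))

def glu_block(x, gate):
    return P @ ((W @ x) * gate(V @ x))

x = rng.normal(size=d)
bilinear_out = glu_block(x, gate=lambda z: z)  # no nonlinearity at all
swiglu_out = glu_block(x, gate=swish)
```

The bilinear case is a purely quadratic function of x (it acts on x⊗x), which is exactly what makes it a third-order tensor and easy to reason about.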
Temporarily Losing My Ego
Great work!
Listened to a talk from Philipp on it today and am confused on why we can’t just make a better benchmark than LDS?
Why not just train eg 1k different models, where you left 1 datapoint out? LDS is noisy, so I’m assuming 1k datapoints that exactly capture what you want are better than 1M datapoints that are an approximation [1]
As an estimate, Nano-GPT speedrun takes a little more than 2 min now, so you can train 1001 of these in:
2.33 × 1k / 60 ≈ 39 hrs on 8 H100s, which is maybe 4 B200s at $24/hr, so ~$1k.
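Spelling out the arithmetic (the 2.33 min runtime and $24/hr rate are my rough assumptions from above):

```python
minutes_per_run = 2.33  # approx current NanoGPT speedrun time (assumed)
runs = 1001             # baseline model + 1k leave-one-out retrains
hours = minutes_per_run * runs / 60
cost = hours * 24.0     # assumed $24/hr for ~4 B200s

print(round(hours, 1), round(cost))  # ~38.9 hours, ~$933
```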
And that’s getting a 124M param LLM trained on 730M tokens up to GPT2 level. Y’all’s quantitative setting for Fig 4 was a 2M parameter Resnet on Cifar-10 on 5k images, which would be much cheaper to do (although the GPT2 one has been very optimized, so you could just do the speedrun one but on less data).
LDS was shown to be very noisy, but a colleague mentioned that this could be because 5k images is a very small amount of data. I guess another way to validate LDS is running the expensive full-training run on a few datapoints.
Confusion on LDS Hyperparameter Sweep Meaning
Y’all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS’s noisiness, as y’all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don’t need to re-run 5 times if a new method is much better than previous ones (only if it’s narrowly better).
What I’m confused about is why you retrained on 100 different ways to resample the data at each percentage? Is this just because LDS is noisy, so you’re doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or that is what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?
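To check my own reading of the procedure, here's a toy sketch of how I understand the LDS benchmark; this is my interpretation, not the paper's code, and every name and number below is made up:

```python
import numpy as np

# Toy LDS sketch: for many random subsets of the training data at a fixed
# fraction, "retrain" and record the model's output on a target example,
# then rank-correlate that with the attribution method's prediction
# (sum of per-datapoint scores over the subset).

rng = np.random.default_rng(0)
n_train, n_subsets, frac = 100, 100, 0.5

true_influence = rng.normal(size=n_train)  # stand-in ground-truth effect per point
est_scores = true_influence + 0.5 * rng.normal(size=n_train)  # a noisy attribution method

actual, predicted = [], []
for _ in range(n_subsets):
    subset = rng.random(n_train) < frac
    # "Retraining" stand-in: output is the sum of true influences, plus noise.
    actual.append(true_influence[subset].sum() + rng.normal())
    predicted.append(est_scores[subset].sum())

def spearman(a, b):
    # Rank correlation via double argsort (avoids a scipy dependency).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

lds = spearman(np.array(actual), np.array(predicted))
```

If that's roughly right, then the 100 resamples per percentage are there to average over which datapoints co-occur in a subset, which is a different noise source than the retraining-seed noise.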
Other Experiments
Just wild speculation that there are other data attribution methods as opposed to prediction of the output. When a model “groks” something, there will be some datapoints that were more important for that happening that should show up in an ideal data attribution method.
Similar w/ different structures forming in the dataset (which y’all’s other paper shows AFAIK).
[Note: there’s a decent chance I’ve terribly misunderstood y’all’s technique or misread the technical details, so corrections are appreciated]
[1] ^
It initially seemed confusing on how to evaluate this, but I think we need to look at the variance over the distribution of datapoints. If BIF is consistently more accurate than EK-FAC over eg 100 randomly sampled datapoints, then that’s a good sign for BIF; however, if there’s a high level of variance, then we’d need more data to differentiate between the two. I do think higher quality data attribution methods would have higher signal, so you’d need less data. For example, I predict that BIF does better than Trak on ~all datapoints (but this is an empirical question).
I was actually expecting Penny to develop dystonia coincidentally, and the RL would tie in by needing to be learned in reverse, i.e. optimizing from dystonic to normal. It’s a much more pleasant ending than the protagonist’s tone the whole way through suggested.
If I was writing a fanfic of this, I’d keep the story as is (plus or minus the last paragraph), but then continue into the present moment, which leads to the realization.
It’s very exciting to have an orthogonal research direction that finds these ground truth features, which might possibly even generalize(!!). Please do report future results, even if negative (though your Malladi et al link is some evidence in the positive).
It’s also very confusing, since I’m unsure how this all fits in with everything else. This clearly works in these cases. SAEs clearly work in some cases as well (and same w/ the parameter decomposition research), but what’s the “Grand Theory of NN Interp” that explains all of these results?
In general, I believe it’s very important that we hedge our bets on research directions for interp. The main reason is that one of them might actually pan out, but even if not, they already provide unique pieces of evidence for later researchers (maybe us, maybe LLMs, lol) to hopefully figure out that “Grand Theory of NN Interp”.
Thanks Charlie:)
My wife and I are pretty sure the paramedic checked w/ a stethoscope, and so did the doctor when we arrived. But they didn’t mention anything until the x-ray.

The paramedics might not’ve done the pads due to being a few minutes’ ride from the hospital (I’m literally on the same block as the hospital), but I did receive them at the hospital (I’ve still got some glue residue on me actually).
When nursing staff is working long shifts and spread between a lot of patients,
Ya, mine were working 12 hour shifts, 3 days/nights in a row.
Well I signed up for half-haven, and thought “Well I need to write a post every 2 days”, haha. (I’m also more of an over-sharer than others)
Thanks for hosting half-haven btw!
That seems incredibly important, so I’ve added it to the main text. Thanks!
Would be interesting to empirically check the reward surrounding reward hacking solutions. Should be able to plot the reward against variance and see if that’s different than other spots.
Hospitalization: A Review
Thanks for making these connections!
“Learned helplessness” only seems to cover some of these cases though. My exaggerated tiredness didn’t relate to a feeling of lack of control, but I do agree that it is a learned behavior.
...
Cognitive Defusion is the idea that your thoughts & emotions are separate from you. They aren’t immutable facts.
I do think that relates, but I want to communicate:
(1) Your mindset can drastically change how you experience things. (2) A specific subset of those mindsets involve exaggerating your suffering (sincerely), as a learned behavior that got you what you wanted in the past from those in power over you.
The oreo/heaven-or-hell point is to drive home #1. So yes, I agree then!
...
With my close friends, it usually involves them telling themselves a very sad story (true, but focused on a small subset of details). I can ask them to articulate it, and then challenge it directly (in a loving, tactful way; almost Socratic?) while providing comfort.
Yeah, I think what I’m trying now is a CBT solution of “Notice your brain consistently telling yourself the worst-case story”, and I might need to lean more into CBT. Thanks!
One intuition I can offer is that you end up in wider basins of reward/loss landscape.
If you want to hit a very narrow basin, but your variance is too high, then you might not sample the high reward point.
Although, sampling enough points which do include the reward hacking weights will eventually center you on them. Suppose you sample 1k points, and one of them is the reward hacking weight with reward 100 (and the rest 1). Then you will move toward the reward hacking weight the most, which would make it more likely to be sampled the next time, AFAIK. So maybe not??
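This intuition seems checkable with a toy simulation; here's a 1-D sketch of my own construction (not from the paper), with a Gaussian policy, a narrow high-reward "hack" region, and a REINFORCE-style update:

```python
import numpy as np

# Toy 1-D check: a Gaussian policy samples 1k points per batch; reward is
# 100 in a tiny "reward hacking" region and 1 elsewhere. A REINFORCE-style
# update should pull the mean toward the rare high-reward region,
# but only on batches that actually sample it.

rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.01

def reward(x):
    return np.where(np.abs(x - 2.5) < 0.05, 100.0, 1.0)  # narrow hack at x = 2.5

for _ in range(20):
    xs = mu + sigma * rng.normal(size=1000)
    advantage = reward(xs) - reward(xs).mean()
    # grad of log N(x; mu, sigma) w.r.t. mu is (x - mu) / sigma**2
    mu += lr * np.mean(advantage * (xs - mu) / sigma**2)
```

How fast mu drifts toward 2.5 depends on how often the narrow region is sampled at all, which is exactly the variance point above.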
The second intuition is the paths being substantially different, which can be quantified as well.
The paper does have a few empirical experiments showing they arrive at different solutions. Specifically the KL-reward plot. Would you need more settings to be convinced here?
I’m confused on what you’re referring to. Bilinear layers scale cleanly (they’re homogeneous of degree 2):

Bilinear(ax) = a²·Bilinear(x)

So x could be the input-token, a vector d (from the previous bilinear layer), or a steering vector added in, but it will still produce the same output direction (and affect the same hidden dims of the bilinear layer in the same proportions).
Another way to say this is that for:

y = Bilinear(ax)

the percentage of attribution of each weight in Bilinear w/ respect to y is the same regardless of a, since to compute the percentage you divide by the total, so the scaling by a cancels out.
This also means that, solely from the weights, you can trace the computation done by injecting this steering vector.
[*Caveat: a bilinear layer computes interactions between two things. So you can compute the interaction between BOTH (1) the steering vector and itself and (2) the steering vector w/ the other directions d from previous layers. You CAN’T compute how it interacts w/ the input-token solely from the weights, because the weights don’t include the input token. This is a bit of a trivial statement, but I don’t want to overstate what you can get]
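A quick numeric check of the homogeneity claim above; the shapes and weights here are arbitrary, and the per-hidden-unit attribution is just one illustrative choice of attribution:

```python
import numpy as np

# Check: for a bilinear layer y = P @ ((W @ x) * (V @ x)), scaling the
# input by a scales the output by a**2, and the per-hidden-unit
# attribution *percentages* are unchanged.

rng = np.random.default_rng(0)
d, h = 6, 10
W = rng.normal(size=(h, d))
V = rng.normal(size=(h, d))
P = rng.normal(size=(d, h))

def bilinear(x):
    return P @ ((W @ x) * (V @ x))

def hidden_attribution_pct(x):
    contrib = (W @ x) * (V @ x)             # per-hidden-unit contribution
    return contrib / np.abs(contrib).sum()  # normalizing cancels any scale a

x, a = rng.normal(size=d), 3.0
assert np.allclose(bilinear(a * x), a**2 * bilinear(x))
assert np.allclose(hidden_attribution_pct(a * x), hidden_attribution_pct(x))
```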
Overall, my main confusion w/ what you wrote is what it means for an activation to be (or not be) an entire layer.