I hadn’t considered steering vectors before, but yes that’s correct.
Logan Riggs
Just looking at Shazeer’s paper (Appendix A)
All of the GLU models performed better (lower is better) and the GLU models have a bilinear encoder (just w/ & w/o a sigmoid/GeLU/Swish/ReLU function). So in fact it does better (if this is what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
In this post, though, I used wall clock time as a measure of training efficiency. Hand-wavy:
loss/step * time/step
(maybe it should be divided to make it loss/time?)
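To make the hand-waving concrete, here's a minimal sketch of the comparison I mean (the numbers and the `loss_per_second` helper are made up for illustration):

```python
# Hand-wavy wall-clock comparison between two training setups.
# All numbers here are invented for illustration.

def loss_per_second(loss_drop_per_step, seconds_per_step):
    """Loss improvement per wall-clock second: (loss/step) / (time/step)."""
    return loss_drop_per_step / seconds_per_step

baseline = loss_per_second(loss_drop_per_step=0.010, seconds_per_step=0.50)
variant  = loss_per_second(loss_drop_per_step=0.012, seconds_per_step=0.70)

# The variant improves loss more per step, but its slower steps make it
# worse per unit of wall-clock time.
print(baseline, variant)
```

So dividing (loss per unit time) is the version that rewards fast steps, which is what I care about here.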
A full 3rd order tensor is much larger, whereas this parametrization is the CP-decomposition form. This is the “official reason” when I’m really just building off Dooms et al. (I’ve never actually tried training the full tensor though!)
Re init: the init for modded gpt at that fork was kind of weird, but I’m pretty sure most standard inits prevent that. I am using RMSNorm, which can be treated as a tensor network as well (I could maybe DM an explanation; it’s a forthcoming resource from Thomas). I’m also normalizing Q & K, which isn’t a tensor network, BUT compositionality is on a spectrum (maybe I am too). So this does mean a small portion of the model isn’t a tensor network.
Ideally we can work around this!
Yep! But I do think the highest priority thing would be actually doing ambitious interp w/ this, although, if we had 100 people working on this (instead of ~4-5 full time?), a few working on the scaling laws would be good.
TNs are more amenable to optimizing exactly what we want in a mathematically precise way, so optimizing for this (to achieve ambitious mech interp) would incur an additional cost in capabilities, just fyi.
Tensor-Transformer Variants are Surprisingly Performant
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
bilinear:
- directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
- No implicit computation (eg different polytopes of MLPs), just the structure in the weights.

a polynomial:
- can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use SwiGLU; a bilinear layer is similar and almost as performant (Table 1 in Noam Shazeer’s paper). Interestingly enough, it’s better than MLPs w/ ReLU/GeLU/Swish.
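For concreteness, here's a minimal numpy sketch of the GLU family as I understand it from that paper; the shapes, weights, and the `glu_block` helper are my own invention, not from any particular model:

```python
import numpy as np

# A GLU-style MLP block: out = P @ ((W @ x) * gate(V @ x)).
# With gate = identity this is the "bilinear" variant from Shazeer's
# GLU-variants paper; with gate = swish it's SwiGLU.
# Shapes here are illustrative only.

rng = np.random.default_rng(0)
d, h = 8, 16
W = rng.normal(size=(h, d))
V = rng.normal(size=(h, d))
P = rng.normal(size=(d, h))

def swish(z):
    return z / (1.0 + np.exp(-z))

def glu_block(x, gate):
    return P @ ((W @ x) * gate(V @ x))

x = rng.normal(size=d)
bilinear_out = glu_block(x, gate=lambda z: z)  # no nonlinearity at all
swiglu_out = glu_block(x, gate=swish)
```

The bilinear case is a purely quadratic function of x (it acts on x⊗x), which is exactly what makes it a third-order tensor and easy to reason about.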
Temporarily Losing My Ego
Great work!
Listened to a talk from Philipp on it today and am confused on why we can’t just make a better benchmark than LDS?
Why not just train eg 1k different models, where you left 1 datapoint out? LDS is noisy, so I’m assuming 1k datapoints that exactly capture what you want are better than 1M datapoints that are an approximation [1]
As an estimate, Nano-GPT speedrun takes a little more than 2 min now, so you can train 1001 of these in:
2.33 × 1k / 60 ≈ 39 hrs on 8 H100s, which is maybe 4 B200s at $24/hr, so ~$1k.
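Spelling out the arithmetic (the 2.33 min runtime and $24/hr rate are my rough assumptions from above):

```python
minutes_per_run = 2.33  # approx current NanoGPT speedrun time (assumed)
runs = 1001             # baseline model + 1k leave-one-out retrains
hours = minutes_per_run * runs / 60
cost = hours * 24.0     # assumed $24/hr for ~4 B200s

print(round(hours, 1), round(cost))  # ~38.9 hours, ~$933
```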
And that’s getting a 124M param LLM trained on 730M tokens up to GPT2 level. Y’all’s quantitative setting for Fig 4 was a 2M parameter Resnet on Cifar-10 on 5k images, which would be much cheaper to do (although the GPT2 one has been very optimized, so you could just do the speedrun one but on less data).
LDS was shown to be very noisy, but a colleague mentioned that this could be because 5k images is a very small amount of data. I guess another way to validate LDS is running the expensive full-training run on a few datapoints.
Confusion on LDS Hyperparameter Sweep Meaning
Y’all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS’s noisiness, as y’all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don’t need to re-run 5 times if a new method is much better than previous ones (only if it’s narrowly better).
What I’m confused about is why you retrained on 100 different ways to resample the data at each percentage? Is this just because LDS is noisy, so you’re doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or that is what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?
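To check my own reading of the procedure, here's a toy sketch of how I understand the LDS benchmark; this is my interpretation, not the paper's code, and every name and number below is made up:

```python
import numpy as np

# Toy LDS sketch: for many random subsets of the training data at a fixed
# fraction, "retrain" and record the model's output on a target example,
# then rank-correlate that with the attribution method's prediction
# (sum of per-datapoint scores over the subset).

rng = np.random.default_rng(0)
n_train, n_subsets, frac = 100, 100, 0.5

true_influence = rng.normal(size=n_train)  # stand-in ground-truth effect per point
est_scores = true_influence + 0.5 * rng.normal(size=n_train)  # a noisy attribution method

actual, predicted = [], []
for _ in range(n_subsets):
    subset = rng.random(n_train) < frac
    # "Retraining" stand-in: output is the sum of true influences, plus noise.
    actual.append(true_influence[subset].sum() + rng.normal())
    predicted.append(est_scores[subset].sum())

def spearman(a, b):
    # Rank correlation via double argsort (avoids a scipy dependency).
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

lds = spearman(np.array(actual), np.array(predicted))
```

If that's roughly right, then the 100 resamples per percentage are there to average over which datapoints co-occur in a subset, which is a different noise source than the retraining-seed noise.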
Other Experiments
Just wild speculation that there are other data attribution methods as opposed to prediction of the output. When a model “groks” something, there will be some datapoints that were more important for that happening that should show up in an ideal data attribution method.
Similar w/ different structures forming in the dataset (which y’all’s other paper shows AFAIK).
[Note: there’s a decent chance I’ve terribly misunderstood y’all’s technique or misread the technical details, so corrections are appreciated]
[1] ^
It initially seemed confusing on how to evaluate this, but I think we need to look at the variance over the distribution of datapoints. If BIF is consistently more accurate than EK-FAC over eg 100 randomly sampled datapoints, then that’s a good sign for BIF; however, if there’s a high level of variance, then we’d need more data to differentiate between the two. I do think higher quality data attribution methods would have higher signal, so you’d need less data. For example, I predict that BIF does better than Trak on ~all datapoints (but this is an empirical question).
I was actually expecting Penny to develop dystonia coincidentally, and the RL would tie in by needing to be learned in reverse, i.e. optimizing from dystonic to normal. It’s a much more pleasant ending than the protagonist’s tone the whole way through suggested.
If I was writing a fanfic of this, I’d keep the story as is (plus or minus the last paragraph), but then continue into the present moment, which leads to the realization.
It’s very exciting to have an orthogonal research direction that finds these ground truth features, which might possibly even generalize(!!). Please do report future results, even if negative (though your Malladi et al link is some evidence in the positive).
It’s also very confusing, since I’m unsure how this all fits in with everything else. This clearly works in these cases. SAEs clearly work in some cases as well (and same w/ the parameter decomposition research), but what’s the “Grand Theory of NN Interp” that explains all of these results?
In general, I believe it’s very important that we hedge our bets on research directions for interp. The main reason is that one of them might actually pan out, but even if not, they already provide unique pieces of evidence for later researchers (maybe us, maybe LLMs, lol) to hopefully figure out that “Grand Theory of NN Interp”.
Thanks Charlie:)
My wife and I are pretty sure the paramedic checked w/ a stethoscope, and so did the doctor when we arrived. But they didn’t mention anything until the x-ray.

The paramedics might not’ve done the pads due to being a few minutes’ ride from the hospital (I’m literally on the same block as the hospital), but I did receive them at the hospital (I’ve still got some glue residue on me actually).
When nursing staff is working long shifts and spread between a lot of patients,
Ya, mine were working 12 hour shifts, 3 days/nights in a row.
Well I signed up for half-haven, and thought “Well I need to write a post every 2 days”, haha. (I’m also more of an over-sharer than others)
Thanks for hosting half-haven btw!
That seems incredibly important, so I’ve added it to the main text. Thanks!
Would be interesting to empirically check the reward surrounding reward hacking solutions. Should be able to plot the reward against variance and see if that’s different than other spots.
Hospitalization: A Review
Thanks for making these connections!
“Learned helplessness” only seems to cover some of these cases though. My exaggerated tiredness didn’t relate to a feeling of lack of control, but I do agree that it is a learned behavior.
...
Cognitive Defusion is the idea that your thoughts & emotions are separate from you. They aren’t immutable facts.
I do think that relates, but I want to communicate:
(1) Your mindset can drastically change how you experience things. (2) A specific subset of those mindsets involve exaggerating your suffering (sincerely), as a learned behavior that got you what you wanted in the past from those in power over you.
The oreo/heaven-or-hell point is to drive home #1. So yes, I agree then!
...
With my close friends, it usually involves them telling themselves a very sad story (true, but focused on a small subset of details). I can ask them to articulate it, and then challenge it directly (in a loving, tactful way; almost Socratic?) while providing comfort.
Yeah, I think what I’m trying now is a CBT solution of “Notice your brain consistently telling yourself the worst-case story”, and I might need to lean more into CBT. Thanks!
One intuition I can offer is that you end up in wider basins of reward/loss landscape.
If you want to hit a very narrow basin, but your variance is too high, then you might not sample the high reward point.
Although, sampling enough points which do include the reward hacking weights will eventually center you on them. Suppose you sample 1k points, and one of them is the reward hacking weight with reward 100 (and the rest 1). Then you will move toward the reward hacking weight the most, which would make it more likely to be sampled the next time, AFAIK. So maybe not??
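This intuition seems checkable with a toy simulation; here's a 1-D sketch of my own construction (not from the paper), with a Gaussian policy, a narrow high-reward "hack" region, and a REINFORCE-style update:

```python
import numpy as np

# Toy 1-D check: a Gaussian policy samples 1k points per batch; reward is
# 100 in a tiny "reward hacking" region and 1 elsewhere. A REINFORCE-style
# update should pull the mean toward the rare high-reward region,
# but only on batches that actually sample it.

rng = np.random.default_rng(0)
mu, sigma, lr = 0.0, 1.0, 0.01

def reward(x):
    return np.where(np.abs(x - 2.5) < 0.05, 100.0, 1.0)  # narrow hack at x = 2.5

for _ in range(20):
    xs = mu + sigma * rng.normal(size=1000)
    advantage = reward(xs) - reward(xs).mean()
    # grad of log N(x; mu, sigma) w.r.t. mu is (x - mu) / sigma**2
    mu += lr * np.mean(advantage * (xs - mu) / sigma**2)
```

How fast mu drifts toward 2.5 depends on how often the narrow region is sampled at all, which is exactly the variance point above.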
The second intuition is the paths being substantially different, which can be quantified as well.
The paper does have a few empirical experiments showing they arrive at different solutions. Specifically the KL-reward plot. Would you need more settings to be convinced here?
I’m confused on what you’re referring to. Bilinear layers scale cleanly (they’re homogeneous of degree 2):

Bilinear(ax) = a²·Bilinear(x)

So x could be the input-token, a vector d (from the previous bilinear layer), or a steering vector added in, but it will still produce the same output direction (and affect the same hidden dims of the bilinear layer in the same proportions).
Another way to say this is that for:

y = Bilinear(ax)

the percentage of attribution of each weight in Bilinear w/ respect to y is the same regardless of a, since to compute the percentage you divide by the total, so the scaling by a cancels out.
This also means that, solely from the weights, you can trace the computation done by injecting this steering vector.
[*Caveat: a bilinear layer computes interactions between two things. So you can compute the interaction between BOTH (1) the steering vector and itself and (2) the steering vector w/ the other directions d from previous layers. You CAN’T compute how it interacts w/ the input-token solely from the weights, because the weights don’t include the input token. This is a bit of a trivial statement, but I don’t want to overstate what you can get]
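A quick numeric check of the homogeneity claim above; the shapes and weights here are arbitrary, and the per-hidden-unit attribution is just one illustrative choice of attribution:

```python
import numpy as np

# Check: for a bilinear layer y = P @ ((W @ x) * (V @ x)), scaling the
# input by a scales the output by a**2, and the per-hidden-unit
# attribution *percentages* are unchanged.

rng = np.random.default_rng(0)
d, h = 6, 10
W = rng.normal(size=(h, d))
V = rng.normal(size=(h, d))
P = rng.normal(size=(d, h))

def bilinear(x):
    return P @ ((W @ x) * (V @ x))

def hidden_attribution_pct(x):
    contrib = (W @ x) * (V @ x)             # per-hidden-unit contribution
    return contrib / np.abs(contrib).sum()  # normalizing cancels any scale a

x, a = rng.normal(size=d), 3.0
assert np.allclose(bilinear(a * x), a**2 * bilinear(x))
assert np.allclose(hidden_attribution_pct(a * x), hidden_attribution_pct(x))
```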
Overall, my main confusion w/ what you wrote is what it means for an activation to be (or not be) an entire layer.