Oh yeah, I certainly agree with the central intent of the post; I'm just clarifying the above discussion.
One clarification: as stated, "mechanisms operating in terms of linearly represented atoms" doesn't constrain the mechanisms themselves to be linear, does it? SAE latents are themselves a nonlinear function of the underlying model activations. But if the mechanisms are substantially nonlinear, we're not really claiming much.
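To make the SAE point concrete, here's a minimal sketch, assuming a standard ReLU autoencoder-style SAE (toy dimensions and random weights are hypothetical, just for illustration): each latent reads a linear direction, but the ReLU makes the latent values a nonlinear function of the activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: d_model activation dims, d_sae latents.
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_sae, d_model))
b_enc = rng.normal(size=d_sae)

def sae_latents(x):
    """Standard ReLU SAE encoder: latents = ReLU(W_enc @ x + b_enc).

    Each latent reads off a linear direction of x, but the ReLU makes
    the latent vector a nonlinear function of the activations overall.
    """
    return np.maximum(0.0, W_enc @ x + b_enc)

x1 = rng.normal(size=d_model)
x2 = rng.normal(size=d_model)

# Nonlinearity check: f(x1 + x2) != f(x1) + f(x2) in general,
# because ReLU clips differently depending on the sign pattern.
print(np.allclose(sae_latents(x1 + x2), sae_latents(x1) + sae_latents(x2)))
```

So "linearly represented atoms" is a claim about the directions the latents read, not about the encoder map (or the downstream mechanisms) being linear.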
My own impression is that things are nonlinear unless proven otherwise, and a priori I would strongly expect the strong linear representation hypothesis to be simply false. In general it seems extremely wishful to hope that exactly those things that are nonlinear (in whatever sense we mean) turn out to be unimportant, especially since we employ neural networks specifically to learn really weird functions we couldn't have thought of ourselves.
I've thought big names should do this for conference papers to keep conferences honest (peer review is anonymous, but as I understand it, it's extremely obvious when a big name has written a paper).