RobinHa

Karma: 37

Gradient-Based Recovery of Memorized Diffusion Model Data

RobinHa 1 Feb 2026 0:05 UTC

10 points

0 comments, 3 min read, LW link

RobinHa 15 Jan 2026 3:58 UTC
1 point
0
in reply to: FlorianH’s comment on: Quantifying Love and Hatred
1. This is a very valid point, but I don’t think it inherently disagrees with the core idea. The kind stranger is still someone we admire, else we wouldn’t describe him as such. The more we look up to this stranger, the more we would be willing to risk our life for him. This also doesn’t directly imply a friendship: for that, the p-values have to be a two-way street. Since you are a true stranger to him, he has no idea who you are or what your beliefs, achievements and goals are. For all he knows, you could be a murderer on death row. Even this extremely kind stranger will probably decide that it’s not worth it.
2. I think a lot of people would argue along these lines: they were simply “too weak” and couldn’t go through with it. I also think that people use the word “friendship” too loosely and that a lot of these relationships are exclusively self-serving. Having fun with people and spending a lot of time with them as a result is not something rare: everyone is having a good time, so naturally you would like to reinforce this behavior. But this isn’t what I consider a friendship. A friendship is defined by what happens when not everyone is having fun, when in fact everyone is miserable. Giving when there’s nothing to give. Protecting when one needs protection oneself. You mentioned someone being fine with giving money, just not this. I think this only applies when money isn’t scarce, when the incurred loss is minor, because giving money when it’s truly scarce often implies existential problems just as much, depending on where you live.
3. I’ll concede this point because it’s pretty much inevitable: a hypothetical clearly has limitations. But I do think that there are proxies much like this hypothetical, simply not as extreme. I’m sure everyone has at least once during school seen a person they know get bullied. Did you step in, even knowing you might become a new target? This sadly introduces a lot of secondary variables that skew the resulting answer, but it does remove the noise of a hypothetical.

Quantifying Love and Hatred

RobinHa 14 Jan 2026 20:40 UTC

10 points

8 comments, 1 min read, LW link

RobinHa 11 Jan 2026 18:08 UTC
1 point
0
in reply to: ghost-in-the-weights’s comment on: The Case Against Continuous Chain-of-Thought (Neuralese)
I’m not claiming I know the perfect cutoff point between not losing information and not letting errors accumulate, if something like that even exists. It could very well be that after 3 forward passes with neuralese it would still be mostly fine, or it could be that even in the middle of a single forward pass it makes sense to have some kind of mitigation (I think you could build an argument around MoE being one). But what this perfect ratio is doesn’t really matter. The point is that recurrent forward passes will be a thousand times worse than a normal forward pass and therefore can’t be worth it anymore.

Making everything discrete is one extreme, just as making everything continuous is the other. I’m arguing that the sweet spot lies somewhere in between, recognizing the importance of both rich, continuous representations and clear, discrete representations.

RobinHa 11 Jan 2026 1:15 UTC
1 point
0
in reply to: ghost-in-the-weights’s comment on: The Case Against Continuous Chain-of-Thought (Neuralese)
I’m not sure which passage you are referring to when you say that my argument implies this. The sections “The Bandwidth Intuition/Counterargument” are supposed to clear up exactly this, stating roughly that I understand there is still obviously a loss of information and that it’s therefore nonsensical for a normal NN to have minuscule layers. This isn’t an accurate assessment for neuralese LLMs though. They recursively aggregate this error, turning it into a much bigger problem. If the error is simply allowed to grow through tens or hundreds of forward passes, it’s not worth it. If we do already tokenize them though, quantization would no longer serve any real purpose.

I hope my position has become a little bit more clear.

RobinHa 10 Jan 2026 23:18 UTC
1 point
0
in reply to: anaguma’s comment on: The Case Against Continuous Chain-of-Thought (Neuralese)
No, for several reasons. For starters, quantization is normally done after training and is not present during training (mainly because it introduces a lot of gradient problems). This is not comparable to the token distribution, which we incorporate during training and train on. (In other words, the model can’t take advantage of any possible benefits because it was trained in a completely different setting.)

Even more importantly, the error doesn’t aggregate for the KV cache (the normal weights obviously can’t accumulate it, they are literally fixed): inspecting a token’s KV cache in the i-th layer, it only carries the noise from the 0th to the i-th layer (any minor noise from before was removed, since it was a discrete token at the beginning). It will in turn carry this noise to other tokens through the attention mechanism, which then still have to face the noise from the i-th to the final layer. But this is just a normal forward pass worth of noise, not something we have to worry much about, since we are now going to tokenize it, removing all minor noise. (In other words, my argument focuses on noise that grows and grows through autoregressive steps; this is just the noise of a normal forward pass.)

Specifically as noted in “The Bandwidth Counterargument”:

Having bottleneck layers in a normal neural network is nonsensical—when the “distance” is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn’t yet grown enough to matter and tokenization can clean it up.
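To make the accumulation argument a bit more concrete, here is a toy sketch (not taken from the post, and not a model of a real network). It assumes each "forward pass" is just the identity plus small Gaussian noise, and that tokenization can be approximated by rounding every coordinate to the nearest integer. The names `snap_to_codebook` and `run_chain`, and all parameter values, are made up for illustration.

```python
import random

def snap_to_codebook(state, step=1.0):
    """Round each coordinate to the nearest multiple of `step`.

    This is only a stand-in for tokenization: small noise is erased
    as long as it stays below half a step.
    """
    return [round(v / step) * step for v in state]

def run_chain(n_passes, noise_std, discretize, dim=16, seed=0):
    """Carry a state through `n_passes` noisy identity 'forward passes'.

    Returns the largest absolute deviation from the clean starting state.
    """
    rng = random.Random(seed)
    clean = [float(rng.randint(-3, 3)) for _ in range(dim)]
    state = list(clean)
    for _ in range(n_passes):
        # Assumption: one pass adds independent Gaussian noise per coordinate.
        state = [v + rng.gauss(0.0, noise_std) for v in state]
        if discretize:
            # Discrete bottleneck between passes removes the small noise.
            state = snap_to_codebook(state)
    return max(abs(s - c) for s, c in zip(state, clean))

# Continuous recurrence: noise performs a random walk and keeps growing.
continuous_error = run_chain(200, 0.05, discretize=False)
# Discretized recurrence: noise is reset after every pass.
discrete_error = run_chain(200, 0.05, discretize=True)
print(continuous_error, discrete_error)
```

In this toy setting the discretized chain stays on the clean state while the continuous one drifts. How well that transfers to real neuralese models depends on how noise actually behaves inside them, which this sketch does not address.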

The Case Against Continuous Chain-of-Thought (Neuralese)

RobinHa 10 Jan 2026 20:32 UTC

11 points

8 comments, 5 min read, LW link

RobinHa 7 Jan 2026 21:37 UTC
−1 points
0
in reply to: ChristianKl’s comment on: Why do LLMs so often say “It’s not an X, it’s a Y”?
Ah, I think I might have slightly misunderstood the intent of your post’s title and tried answering a different question: why LLM writing often seems shallow and bad, rather than why LLMs specifically seem biased towards a subset of stylistic devices.

I honestly don’t use LLMs much to chat or write with, so my personal experience is rather limited. But I do find the point others made, on the data distribution for post training just not being an accurate sample, convincing enough—just not particularly satisfying.

So, here are my thoughts on why RLHF, but also SFT or DPO, could result in collapsed distributions of stylistic devices even with a perfect sample of training data.

In the case of RLHF, we can go even further by assuming the distillation of the training data went perfectly—the reward model isn’t biased towards any stylistic devices but a perfect representation of its training data.

Even then, the key problem is that the reward model only sees a single trace. This is important because it makes the reward model unable to determine whether the distribution of stylistic devices seen in the trace is simply a reasonable sample from the whole distribution or only a subset of it.

And because of constant optimization pressure, only mastering a few stylistic devices (just enough to fool the RM in a single trace) will quickly become the path forward.

Now what about something like SFT—after all, here we don’t do any rollouts anymore. This does help. We can assume because of the unbiased loss, the distribution of stylistic devices when presented with some training examples is pretty accurate. But that’s the extent to which we can make statements: we were completely offline.

The traces during inference are very different from the training data: Errors propagate during token generation, biases accumulate and suddenly we are faced with only a subset of the training distribution or worse, something not encountered at all. Assuming the distribution of stylistic devices, once aligned to a completely different distribution of traces, will still be unbiased, is wishful thinking at best.

Online training that looks beyond a single trace seems most promising. This can happen either by including signals like logits (KL distillation, see this post for an idea which should also work well) or by simply incorporating multiple traces into the judgement of one: how diverse is this trace, including its stylistic devices, compared to the others generated?
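As a rough illustration of judging across several traces rather than one, the sketch below pools stylistic-device counts over a group of traces and measures their entropy. It is only a toy under loud assumptions: the regex detectors in `DEVICE_PATTERNS` are hypothetical and crude, and the entropy score is one possible diversity signal, not something proposed in the thread.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately crude detectors for a few stylistic devices.
DEVICE_PATTERNS = {
    "not_x_but_y": re.compile(r"\bnot (?:an? )?\w+, (?:it's|it is|but) ", re.IGNORECASE),
    "rhetorical_question": re.compile(r"\?"),
    "triple_list": re.compile(r"\w+, \w+,? and \w+"),
}

def device_counts(trace):
    """Count how often each device pattern occurs in one trace."""
    return Counter({name: len(p.findall(trace)) for name, p in DEVICE_PATTERNS.items()})

def device_entropy(traces):
    """Shannon entropy (in nats) of device usage pooled over a group of traces.

    A single-trace judge cannot compute this: it needs the whole group.
    """
    totals = Counter()
    for trace in traces:
        totals.update(device_counts(trace))
    n = sum(totals.values())
    if n == 0:
        return 0.0
    probs = [c / n for c in totals.values() if c > 0]
    return sum(-p * math.log(p) for p in probs)

# Each trace looks fine alone, but the group leans on one device only.
collapsed = [
    "It's not a bug, it's a feature.",
    "It's not a cost, it's an investment.",
    "It's not luck, it's preparation.",
]
# A group that spreads its usage over several devices.
diverse = [
    "It's not a bug, it's a feature.",
    "Why does this happen?",
    "We need speed, safety, and clarity.",
]
print(device_entropy(collapsed), device_entropy(diverse))
```

The collapsed group scores lower than the diverse one even though each of its traces would pass a single-trace check, which is the gap the comment points at.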

RobinHa 30 Dec 2025 21:48 UTC
3 points
0
on: Why do LLMs so often say “It’s not an X, it’s a Y”?
I think it might be that the undesired response in RLHF/DPO settings isn’t good enough.

Imagine two responses, one leveraging stylistic devices and persuasive words while the other… well, just doesn’t. Naturally the first is better and more desirable. If we now look at this over the whole training batch, a pattern becomes clear: the preferred responses use stylistic devices at all sorts of points in the response. That is, a phrase like “It’s not an X, it’s a Y” will occur at a bunch of different positions throughout the positive examples, in contrast to the negative examples, which very rarely, if at all, showcase such pleasant phrasing.

But wouldn’t such behavior of constantly repeating stylistic devices then be exactly what we would expect? This clear contrast between positive and negative examples is what we distill into our final model, basically telling it that stylistic devices are preferred at any point.

To move away from this, looking for higher-quality positive examples won’t help at all. Instead we need the negative examples throughout training to become closer and closer to the positive examples, just like a writer progresses through his career: first learning about stylistic devices, then understanding when to use them meaningfully and when less is more, and finally fully mastering them. This contrast between a good and a really good writer needs to be captured more in the post-training data for something like DPO.
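The contrast described here can be sketched as a simple statistic over preference pairs: how much more often a phrase shows up in chosen responses than in rejected ones. This is a toy under stated assumptions, not how DPO computes its loss. The function names, the smoothing constant `eps`, and the example pairs are all invented for illustration.

```python
import math

def contains(response, phrase):
    """Case-insensitive substring check."""
    return phrase.lower() in response.lower()

def log_rate_ratio(chosen, rejected, phrase, eps=0.5):
    """Smoothed log ratio of how often `phrase` appears in chosen vs rejected.

    A large positive value means the phrase alone separates the two sets,
    so a preference learner has reason to push it everywhere.
    """
    c = sum(contains(r, phrase) for r in chosen) + eps
    j = sum(contains(r, phrase) for r in rejected) + eps
    return math.log(c / (len(chosen) + 2 * eps)) - math.log(j / (len(rejected) + 2 * eps))

phrase = "it's not"

# "Easy" negatives: the rejected responses never use the device at all.
easy_chosen = ["It's not a bug, it's a feature.", "Honestly, it's not magic, it's math."]
easy_rejected = ["This is a feature.", "This is math."]

# "Hard" negatives: the rejected responses use the device too, just badly.
hard_chosen = ["It's not a bug, it's a feature.", "Honestly, it's not magic, it's math."]
hard_rejected = [
    "It's not a bug, it's a feature. It's not a flaw, it's a quirk.",
    "It's not magic, it's math. It's not hard, it's easy.",
]

print(log_rate_ratio(easy_chosen, easy_rejected, phrase))
print(log_rate_ratio(hard_chosen, hard_rejected, phrase))
```

With easy negatives the phrase by itself predicts the preferred response. With harder negatives that signal vanishes, so whatever separates the pairs has to be something subtler than merely using the device.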

Do take this with a grain of salt. It’s just a random theory I came up with while thinking about this for 10 minutes or so, and I didn’t check the empirical research on this hypothesis, but it does seem somewhat convincing, to me at least.

RobinHa 16 Dec 2025 22:06 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: Neuroscience of human social instincts: a sketch

but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.

Somewhat random, but I think I want to learn more about this field in general. From what I can tell, you didn’t learn about it in a normal academic setting (like doing a neuroscience B.Sc.) either. Any tips for good resources?

RobinHa 14 Dec 2025 13:47 UTC
3 points
0
on: Neuroscience of human social instincts: a sketch
This isn’t as much a question as it is just sharing some thoughts I had, but I would love to hear your thoughts :) Let’s imagine we are our own brain’s optimizer. We just received a bad signal, we feel pain. Let’s say, we realized someone else is soon going to feel pain, so we feel pain. What could the optimizer do now? Well, there are only 2 things it can do:
1. Try to disconnect “she feels pain” from the concept of pain that then triggered pain in yourself
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
You speak a lot to (1), explaining the symbol grounding mechanism that continuously symbol grounds it in the ground truth, so the optimizer trying to move “she feels pain” away from its previous position in the feature space won’t work (at least as long as we continuously have such ground truth input—this sheds light on the very immoral but very interesting experiment of having an individual not exposed to such input for long periods, like not seeing any human face for multiple months, be it in person, on pictures or on your phone. There, this theory should predict that such a move in feature space could happen and will be successful—to be dramatic, you become a psychopath).

You don’t speak much to (2) though. One option here, for example, would be to unlearn the concept of “future”: babies first gradually learn about it, so it’s reasonable to assume that you could unlearn it again. Luckily, this doesn’t seem to happen, so there must be some opposing force, something that promises reward if this concept persists.

Specifically, this concept must offer you insight into your actions such that your future expected reward rises. This is obvious in this case—without the concept “future”, you can hardly make any intelligent decisions at all. But it also carries over to much more specific and even human invented associations/knowledge:

Let’s say you work in cyber-security and the reason you think this person will feel pain is that your cyber-security skills enabled you to make an association a normal person wouldn’t. The optimizer could try to unlearn these skills, but those skills actually lead to higher expected reward, else you wouldn’t be pursuing them: be it the nice house you can afford, the social status you enjoy because of them or simply the joy you get from practicing them.

In other words, anything you learned, you learned because you assumed it would result in a higher expected reward and anything you act out (after learning), you do because it results in a higher expected reward. To forget these concepts will at least require a reward matching theirs.

This doesn’t imply it should be impossible though. Let’s say you learned something that you hate, like say chiseling stone. You did this because the market paid insane wages, since only a few could do the job, so the reward you saw attached to those wages was immense and you pushed through the boring education of becoming an expert in chiseling stone. And once you got there, you realized you weren’t the only one with the idea: wages drop quicker than the average pump & dump crypto coin. In fact the profession you practiced before, which you intrinsically enjoy, even pays better.

As I’m writing this, I realize there are no good stories for why chiseling stone might give you a better glimpse into someone’s future pain, but let’s just take it for granted. Then the reward of the knowledge of chiseling stone is pretty much zero, maybe even negative because whenever you recall it, you recall all the effort that didn’t pay off.

Yet I have never heard of something along these lines happening. It would be quite a great mechanism for the free market though, the wages would jump right up: let’s hope our individual in question doesn’t once again try to learn to chisel stone, completely forgetting this tale of unreciprocated effort.

You could maybe argue something like: precisely the things that fall in this category are things we gave up on, that is, their occurrence in our day-to-day life is incredibly rare. Therefore, with a normal learning rate, we simply wouldn’t iterate over them often enough to forget them meaningfully.

Lastly, just for completeness, naturally ‘disconnecting your previous thoughts from arriving at “she feels pain”’ also entails your previous actions—it’s a very special occurrence to know somebody will feel pain in the future, unless you had a play in it yourself. Naturally those decisions back then will be optimized on as well, hopefully leading you to make better decisions in the future.

RobinHa 12 Dec 2025 22:09 UTC
3 points
0
in reply to: Steven Byrnes’s comment on: Neuroscience of human social instincts: a sketch

Think of it as vaguely like I-am-juggling versus you-are-juggling.

Here, I can see how they would overlap to a reasonable degree, but I don’t think this easily carries over to emotions. Emotions at least feel like this weird, distinct thing, such that any statement along the lines of “I’m happy” does them an injustice. Therefore I can’t see it being carried over to “She’s happy”: their intersection wouldn’t be robust enough to avoid falsely triggering for actually unrelated things. That is, “She’s happy” ≈ “I’m happy” ≉ experiencing happiness.

Facial cues (as one example; it makes sense that there would be other things, like higher-pitched voices when enjoying oneself) eliminate this problem because, as opposed to something introspective being the link, a more objective state of mind, like “He’s sad”, will be the learned link.

This might sound like I’m being unnecessarily picky, but imo these associations need to be very exact, else humans would be reward-hacking all day: it’s reasonable to assume that the activations of thinking “She’s happy” are very similar to those of trying to convince oneself internally that “She’s happy”, even while ‘knowing’ the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.

Regarding micro expressions specifically, it’s definitely not a hill I want to die on. It kind of just popped into my mind as I was writing about facial cues, and by micro I really mean ‘micro micro’, e.g. smiles that aren’t perfectly symmetrical for a quarter of a second, something I at least can’t really pick up on. What is their evolutionary advantage if they don’t at least have some kind of subconscious effect on conspecifics? But yeah, if you can’t consciously pick up on it, linking the two is pointless or even bad.

I read the linked post roughly, but as I have read neither so far, I probably can’t relate too well to it. It seems reasonable (or honestly, obvious) though that it’s a mix rather than either of those extreme statements.

RobinHa 12 Dec 2025 13:41 UTC
3 points
0
on: Neuroscience of human social instincts: a sketch
Let me preface this by saying how much I enjoyed reading this post. It really shows that this isn’t some random idea you had but that you really thought a lot about this. As someone whose first introduction to this kind of idea was precisely this blogpost, thanks.

question—maybe I’m simply misunderstanding you:

- You seem to assume that the cortex’s modelling of one’s own happiness is very similar to the cortex’s modelling of thinking of happiness. You might argue that it’s only the “concept of happiness”, which I would agree is present in both scenarios, but it isn’t clear to me why that in particular would be learned using this supervised mechanism.

- Building on that point, I think it might be more probable that understanding another’s feelings is part of 1A: instead of simply seeing, hearing, etc., there would be something tasked with analyzing facial cues. In particular, humans exhibit micro expressions (expressions that last for very short periods and are almost impossible to control), something most people can’t seem to pick up on, at least consciously. So why do we have them if other people can’t pick up on them? Maybe they can, but only subconsciously, precisely to facilitate this symbol grounding for somebody else’s feelings. Then again, if you can’t consciously pick up on it, the target for the supervision will probably be terrible as well, so maybe that’s not it.

(i’ll probably hammer u with more questions down the line, still trying to process all of this lol)

RobinHa 3 Dec 2025 14:55 UTC
1 point
0
in reply to: StanislavKrym’s comment on: Adding Empathy as a Tool for LLMs
I assume you mean “the most forbidden technique”, the linked post (or maybe the worst technique is a nickname given to it because of empirical results lol).

I took a quick look at it, and it very much seems like a generalized statement of a part of the idea behind this design. Essentially, instead of having this analyze tool, which would let you interpret the reasoning of the AI, this approach offers a tool which goes a step further, directly trying to judge whether or not this is desired behavior.

RobinHa 3 Dec 2025 14:47 UTC
2 points
0
in reply to: StanislavKrym’s comment on: Adding Empathy as a Tool for LLMs

so that B would agree

B doesn’t really do any agreeing or disagreeing, that’s still up to Empathizer-001, B should really just concern itself with providing a non-manipulated plan of actions given the work A did. In that sense you are right that it probably only works with CoT-based AIs, or rather generally, some kind of output showcasing A’s reasoning.

Then A is trained to manipulate B instead of explaining its ideas in detail

By this I assume you mean manipulating, for example, the CoT so that when B inspects it, it gives a favorable summary of it to Empathizer-001. The problem I see with this is that since we normalize, there’s no actual positive expected advantage from bypassing Empathizer-001, that is, there’s no reason to manipulate B in the first place. And if we take a random walk argument, it seems very unrealistic that obfuscating its own CoT, which it needs to use after bypassing Empathizer-001 to get its actual reward, won’t reduce its performance. It’s conceivable that there is a minimum where it directly interprets its obfuscated CoT in the correct light, but it’s not realistic imo that there is a path of low loss towards it, even more so because we are dealing with something that is not latent but categorical (tokens).

Let me know if I misunderstood you or if my reasoning became a little bit clearer. I’ll definitely check out your post later.

RobinHa 2 Dec 2025 9:02 UTC
4 points
0
in reply to: ghost-in-the-weights’s comment on: GRPO is terrible
Just read through it. I agree there are clear similarities in the data generation and preparation process, but how it’s then actually used to train the model seems quite different imo: they employ RL, in particular even GRPO, which I want to avoid at any cost. Specifically, because they don’t use any kind of logit aggregation, they face this issue of “impossible knowledge” or, in other words, a terrible teacher. They reconcile this using their KL and log prob reward function, but since at this point they are already dealing with traces, they have lost out on the rich logit representation.

Also, they choose teacher-forcing instead of student-forcing. Teacher-forcing is definitely more popular in the literature, but I think what works like R1 showed is that staying on-policy is really key. Here even more so: the teacher will have something that we can call the “teaching-bias”, since it has already seen the solution, and if we allow this bias to carry over throughout generation, all traces will be terrible and the reward function won’t particularly fix this.

Still definitely nice work, thanks for the link, decent read.

GRPO is terrible

RobinHa 1 Dec 2025 22:54 UTC

4 points

2 comments, 5 min read, LW link

(robinhaselhorst.com)
