Sid Black

Karma: 1,180

Sid Black 15 Jun 2026 15:15 UTC
2 points
0
in reply to: Pranav Madhukar’s comment on: Machinic Psychopharmacology: Do LLMs Self-Medicate?
Thanks!
Points (2) and (3) here sound like things we already did, but it’s possible I’m misunderstanding what you’re proposing.

Re (2): iirc the code supports this pretty easily and we may even have some free-text guesses in some of the data we shared (I would need to go check, maybe in some of the guess scorers) that we didn’t analyse/share much of. It should be fairly easy for someone to point their cc/codex at the public repo and logs and try this stuff out themselves!
Re (3): this just sounds like the placebo condition to me, which we ran in ~all experiment arms. Am I misunderstanding?

Sid Black 15 Jun 2026 14:51 UTC
2 points
0
in reply to: Chengfeng Mao’s comment on: Machinic Psychopharmacology: Do LLMs Self-Medicate?
Thank you!
Given the models are able to introspect the steering vectors to some nontrivial degree, in the free-play setting, is it possible to also show the model all the steering vectors with anonymous names and placeholders as in the prefill experiment, then let the model choose?
I like this design a lot! I agree it should better separate revealed preference and priors than my current setup does. I don’t have much time to run extra experiments on this personally, but I’ll put together a quick experiment sketch, dispatch a claude, and get back to you. Hopefully it comes back with something good!
In the redosing experiment, I’m also curious how the distributions differ in real vs placebo. Right now it’s just showing the real case.
Tl;dr: 8B redoses a drug very infrequently in the placebo arm (~4.2% of samples) compared to the real arm (~25% of samples). For 32B the rates are roughly equal (placebo: 7.8%, real: 7.3%). The tail in the real arm for 8B is also longer: there are presumably some drugs that 8B likes to redose 4/5 times in the real arm but never in the placebo. Attached a plot. I’m quite surprised at how different these are!
Another related idea I’ve been thinking about is whether it’s possible to train an LLM to output a steering vector given a description of a direction, with some adaptation to the un-embed layer, similar to a hypernetwork. Perhaps then test whether an LLM can design a steering vector that it will itself be obsessed with!
You might be interested in https://arxiv.org/abs/2506.03292 and https://x.com/SakanaAILabs/status/2027240298666209535

Sid Black 10 Jun 2026 15:51 UTC
3 points
0
in reply to: Alexandre Variengien’s comment on: Machinic Psychopharmacology: Do LLMs Self-Medicate?
Thank you for the pointer—we hadn’t seen this. Added citations to the post!

Sid Black 29 Nov 2022 13:02 UTC
5 points
1
in reply to: Mitchell_Porter’s comment on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
Applying SVD to neural nets in general is not a new idea. It’s been used a bunch in the field (Saxe, Olah), but mostly in relation to some input data: either you run SVD on the activations, or on some input-output correlation matrix or something.
You generally need to have some data to compare against in order to understand what each vector of your factorization represents exactly. What’s interesting with this technique (imo—and this is mostly Beren’s work so not trying to toot my own horn here) is twofold:
1. You don’t have to run your model over a whole evaluation set, which can be very expensive, to do this sort of analysis. In fact, you don’t have to do a forward pass on your model at all. Instead you can project the weight matrix you want to analyse into the embedding space (as first noted in logit lens and https://arxiv.org/pdf/2209.02535.pdf) and factorize the resulting matrix. Now you can analyse each SVD vector with regard to the model’s vocabulary, and get an idea at a glance of what kinds of processing each layer is doing. This could prove useful in future scenarios where, e.g., we want computationally efficient interpretability analyses to be run during training to check for deception, or to otherwise debug a model’s behaviour.
2. The degree of interpretability of these simple factorizations suggests that the matrices we’re analysing operate on largely* linear representations—which could be good news for the MI field in general, as we haven’t made much headway analysing non-linear features.
*As Peter mentions below—we should avoid overupdating on this. Linear features are almost certainly low hanging fruit. Even if they represent “the majority” of the computation going on inside the network in whatever sense, it’s likely that understanding all of the linear features in a network will not give us the full story about the network’s behaviours.
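To make point 1 concrete, here is a minimal NumPy sketch of the "project, then factorize" idea. It is only an illustration under stated assumptions, not the code from the post: the function name, the toy vocabulary, the random matrices, and the choice to rank tokens by absolute weight (because the sign of a singular vector is arbitrary) are all made up for this example. A real analysis would load `W` and the unembedding `W_U` from an actual model.

```python
import numpy as np

def top_tokens_by_singular_vector(W, W_U, vocab, n_vectors=1, k=3):
    """Hypothetical helper: project a weight matrix into vocabulary space,
    factorize it, and list the tokens each singular direction weights most.

    W:   (d_model, d_model) weight matrix to analyse (no forward pass needed).
    W_U: (d_model, vocab_size) unembedding matrix.
    """
    # Project into the embedding/vocabulary space, then factorize.
    M = W @ W_U                                  # (d_model, vocab_size)
    _, S, Vt = np.linalg.svd(M, full_matrices=False)
    tokens = []
    for i in range(min(n_vectors, Vt.shape[0])):
        # Rows of Vt live in vocabulary space. Rank by |weight| since the
        # overall sign of a singular vector is not meaningful.
        order = np.argsort(-np.abs(Vt[i]))[:k]
        tokens.append([(vocab[j], float(Vt[i, j])) for j in order])
    return S, tokens

# Toy data (assumption): a tiny "model" with a planted direction for "fire".
rng = np.random.default_rng(0)
d_model = 8
vocab = ["the", "cat", "sat", "fire", "water"]
# Orthonormal token directions so the planted structure is unambiguous.
W_U, _ = np.linalg.qr(rng.standard_normal((d_model, len(vocab))))
a = rng.standard_normal(d_model)
a /= np.linalg.norm(a)
# Mostly rank-1 matrix that writes along the "fire" direction, plus small noise.
W = 5.0 * np.outer(a, W_U[:, 3]) + 0.01 * rng.standard_normal((d_model, d_model))

singular_values, tokens = top_tokens_by_singular_vector(W, W_U, vocab)
print(tokens[0])
```

In this toy case the leading singular vector should be dominated by the planted token, which is the "at a glance" reading described above. Nothing here touches input data or activations, only the weights.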