Sid Black

Karma: 886

Do LLMs know what they’re capable of? Why this matters for AI safety, and initial findings

Casey Barkan, Sid Black and Oliver Sourbut

13 Jul 2025 19:54 UTC

53 points

5 comments · 18 min read · LW link

White Box Control at UK AISI—Update on Sandbagging Investigations

Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood and Alan Cooney

10 Jul 2025 13:37 UTC

80 points

10 comments · 18 min read · LW link

Sid Black 29 Nov 2022 13:02 UTC
5 points
1
in reply to: Mitchell_Porter’s comment on: The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable
Applying SVD to neural nets is not a new idea. It’s been used a fair amount in the field (Saxe, Olah), but mostly in relation to some input data: either you run SVD on the activations, or on some input-output correlation matrix.
You generally need to have some data to compare against in order to understand what each vector of your factorization represents exactly. What’s interesting with this technique (imo—and this is mostly Beren’s work so not trying to toot my own horn here) is twofold:
1. You don’t have to run your model over a whole evaluation set, which can be very expensive, to do this sort of analysis. In fact, you don’t have to do a forward pass on your model at all. Instead, you can project the weight matrix you want to analyse into the embedding space (as first noted in logit lens and https://arxiv.org/pdf/2209.02535.pdf) and factorize the resulting matrix. You can then analyse each SVD vector with regard to the model’s vocabulary, and get an idea at a glance of what kinds of processing each layer is doing. This could prove useful in future scenarios where, e.g., we want computationally efficient interpretability methods that can be run during training to check for deception, or to otherwise debug a model’s behaviour.
2. The degree of interpretability of these simple factorizations suggests that the matrices we’re analysing operate on largely* linear representations—which could be good news for the MI field in general, as we haven’t made much headway analysing non-linear features.
*As Peter mentions below—we should avoid overupdating on this. Linear features are almost certainly low hanging fruit. Even if they represent “the majority” of the computation going on inside the network in whatever sense, it’s likely that understanding all of the linear features in a network will not give us the full story about the network’s behaviours.
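For anyone who wants to see the shape of the procedure in point 1, here is a minimal sketch in NumPy. It is only an illustration under stated assumptions, not the code from the post: the weights are random toy matrices, and the names `W`, `W_U`, `vocab` and `top_tokens_per_singular_vector` are hypothetical. It multiplies a weight matrix by an unembedding matrix, runs SVD on the result, and lists the vocabulary entries with the largest magnitude along each leading right singular vector. No forward pass or input data is involved.

```python
import numpy as np

def top_tokens_per_singular_vector(W, W_U, vocab, n_vectors=3, k=5):
    """Project W into vocabulary space, factorize, and read off top tokens.

    W     : (d_model, d_model) weight matrix to analyse (toy stand-in here)
    W_U   : (d_model, n_vocab) unembedding matrix (toy stand-in here)
    vocab : list of n_vocab token strings
    """
    # Project the weight matrix into the embedding/vocabulary space.
    projected = W @ W_U  # shape (d_model, n_vocab)

    # Factorize the projected matrix; rows of Vh live in vocabulary space.
    U, S, Vh = np.linalg.svd(projected, full_matrices=False)

    results = []
    for i in range(n_vectors):
        direction = Vh[i]
        # The sign of a singular vector is arbitrary, so rank by magnitude.
        top_idx = np.argsort(-np.abs(direction))[:k]
        results.append((float(S[i]), [vocab[j] for j in top_idx]))
    return results

# Toy data: random weights, so the token lists carry no meaning here.
rng = np.random.default_rng(0)
d_model, n_vocab = 16, 50
W = rng.normal(size=(d_model, d_model))
W_U = rng.normal(size=(d_model, n_vocab))
vocab = [f"tok_{i}" for i in range(n_vocab)]

results = top_tokens_per_singular_vector(W, W_U, vocab)
for singular_value, tokens in results:
    print(round(singular_value, 2), tokens)
```

With real model weights you would swap in the actual matrices and tokenizer vocabulary; whether the top tokens per vector look semantically coherent is the empirical question the post is about.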