While language models today are plausibly trained with an amount of FLOP comparable to what a human gets, here are some differences (a rough back-of-envelope version of these numbers is sketched after the lists below):
Humans process much less data
Humans spend much more compute per datapoint
Human data includes the human taking actions and seeing the results of those actions; language model pretraining data includes this much less.
These might explain some of the strengths/weaknesses of language models:
LMs know many more things than humans, but often in shallower ways.
LMs seem less sample-efficient than humans (they spend less compute per datapoint and haven't been heavily optimized for sample-efficiency yet).
LMs are worse at taking actions over time than humans.
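Here is a very rough back-of-envelope version of the comparison above. Every number is an order-of-magnitude assumption for illustration, not a measurement:

```python
# Very rough order-of-magnitude numbers; every figure below is an assumption
# for illustration, not a measurement.
SECONDS_PER_YEAR = 3.15e7

# Human side: ~1e15 FLOP/s of brain compute over ~30 years of waking experience,
# and perhaps ~1e9 words of language heard/read in that time.
human_flop = 1e15 * 30 * SECONDS_PER_YEAR   # ~1e24 FLOP
human_tokens = 1e9

# LM side: a large pretraining run might use ~1e25 FLOP over ~1e13 tokens.
lm_flop = 1e25
lm_tokens = 1e13

print(f"total FLOP ratio (LM / human):  {lm_flop / human_flop:.0e}")       # ~1e1
print(f"data ratio (LM / human):        {lm_tokens / human_tokens:.0e}")   # ~1e4
print(f"FLOP per datapoint, human:      {human_flop / human_tokens:.0e}")  # ~1e15
print(f"FLOP per datapoint, LM:         {lm_flop / lm_tokens:.0e}")        # ~1e12
```

Under these assumed numbers, total training FLOP lands within an order of magnitude or so, while the data and compute-per-datapoint ratios each differ by three or four orders of magnitude.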
TinyModel SAEs have these first entity and second entity latents.
E.g. if the story is ‘Once upon a time Tim met his friend Sally.’, Tim is the first entity and Sally is the second entity. The latents fire on all instances of first|second entity after the first introduction of that entity.
I think I at one point found an ‘object owned by second entity’ latent but have had trouble finding it again.
I wonder if LMs are generating these reusable ‘pointers’ and then doing computation with the pointers. For example, to track that an object is owned by the first entity, you just need to calculate which entities are instances of the first entity, calculate when the first entity is shown to own an object and write ‘owned by first entity’ to the object token, and then broadcast that forward to other instances of the object. Then, if you have the tokens Tim|'s (and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.
This means that the exact identity information of the object (e.g. ‘ hammer’) and the exact identity information of the first entity (‘ Tim’) don't need to be passed around in computations; you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.
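A toy numerical sketch of this pointer story (all feature directions and the token sequence are made up for illustration; this is not a claim about any particular trained head):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def feat():
    """A random unit vector standing in for a feature direction."""
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

# Abstract 'pointer' features, separate from token-identity features.
OWNED_BY_FIRST = feat()       # written to an object token once the first entity owns it
FIRST_ENTITY_BEFORE = feat()  # written to "'s" when the first entity is right before it
ID = {tok: feat() for tok in [" Tim", " hammer", " ball", "'s"]}

# Toy residual stream for a context like "... Tim's hammer ... her ball ... Tim|'s".
tokens = [" hammer", " ball", " Tim", "'s"]
resid = np.stack([
    ID[" hammer"] + OWNED_BY_FIRST,   # hammer has 'owned by first entity' written to it
    ID[" ball"],                      # ball is owned by the second entity
    ID[" Tim"],
    ID["'s"] + FIRST_ENTITY_BEFORE,   # "'s" has computed that the first entity precedes it
])

# A single attention head whose QK circuit matches 'first entity is right before me'
# (query side) against 'owned by the first entity' (key side).
W_QK = np.outer(FIRST_ENTITY_BEFORE, OWNED_BY_FIRST)
scores = (resid[-1] @ W_QK) @ resid.T * 15.0   # scores from the "'s" position (scaled to sharpen the toy softmax)
attn = np.exp(scores) / np.exp(scores).sum()

for tok, a in zip(tokens, attn):
    print(f"{tok!r:>10}  attention from the possessive token: {a:.2f}")

# The OV circuit then copies whatever was attended to; the retrieved vector carries the
# object's identity direction, which never entered the attention match itself.
retrieved = attn @ resid
print("similarity to the ' hammer' identity direction:", round(float(retrieved @ ID[" hammer"]), 2))
```

This should put nearly all of the attention on ‘ hammer’ and return a similarity close to 1: the QK match only involves the two abstract pointer features, while the identity directions for ‘ Tim’ and ‘ hammer’ just ride along and get copied out by the OV step when needed.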
This suggests a more fine-grained story for what duplicate name heads are doing in IOI.
Auto-interp is currently really really bad
I think o1 is the only model that seems to perform decently at auto-interp, but it's very expensive, i.e. $1 per latent label. This is frustrating to me.
One barrier to SAE circuits is that it's currently hard to understand how attention-out SAE latents are calculated. Even if you do integrated-gradients (IG) attribution patching to try to understand which earlier latents are relevant to the attention-out SAE latents, it doesn't tell you how these latents interact inside the attention layer at all.
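For concreteness, here is a minimal sketch of the IG-attribution step, using a toy differentiable function as a stand-in for the real pipeline (earlier-layer SAE latents → attention layer → attention-out SAE latent). Nothing below is the actual model or SAE code:

```python
import torch

# Toy integrated-gradients (IG) attribution sketch. `f` is a stand-in for the map from
# earlier-layer SAE latent activations to one attention-out SAE latent's activation;
# in practice it would be the attention layer composed with the two SAEs.
torch.manual_seed(0)
W = torch.randn(16, 16)

def f(latents: torch.Tensor) -> torch.Tensor:
    # Placeholder nonlinear function, NOT an actual attention layer.
    return torch.tanh(latents @ W).relu().sum()

clean = torch.rand(16)      # earlier-latent activations on the clean prompt
baseline = torch.zeros(16)  # baseline, e.g. latents zeroed or taken from a corrupted prompt

def integrated_gradients(f, x, baseline, steps=64):
    """Approximate IG attributions of f at x relative to the baseline."""
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()
        total_grad += point.grad
    return (x - baseline) * total_grad / steps

attr = integrated_gradients(f, clean, baseline)
print(attr.topk(3))  # earlier latents with the largest attributed effect on the target latent

# This ranks which earlier latents matter for the attention-out latent, but it says
# nothing about how they interact inside the attention layer (e.g. via which heads,
# or through the QK vs OV circuits).
```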