While language models today are plausibly trained with an amount of FLOP comparable to what a human gets, here are some differences (a rough back-of-envelope version of these numbers is sketched after the lists below):
Humans process much less data
Humans spend much more compute per datapoint
Human data includes the human taking actions and seeing the results of those actions; language model pretraining data includes this much less.
These might explain some of the strengths/weaknesses of language models:
LMs know many more things than humans, but often in shallower ways.
LMs seem less sample-efficient than humans (they spend less compute per datapoint and haven't been heavily optimized for sample-efficiency yet).
LMs are worse at taking actions over time than humans.
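Here is a very rough back-of-envelope version of the comparison above. Every number is an order-of-magnitude assumption for illustration, not a measurement:

```python
# Very rough order-of-magnitude numbers; every figure below is an assumption
# for illustration, not a measurement.
SECONDS_PER_YEAR = 3.15e7

# Human side: ~1e15 FLOP/s of brain compute over ~30 years of waking experience,
# and perhaps ~1e9 words of language heard/read in that time.
human_flop = 1e15 * 30 * SECONDS_PER_YEAR   # ~1e24 FLOP
human_tokens = 1e9

# LM side: a large pretraining run might use ~1e25 FLOP over ~1e13 tokens.
lm_flop = 1e25
lm_tokens = 1e13

print(f"total FLOP ratio (LM / human):  {lm_flop / human_flop:.0e}")       # ~1e1
print(f"data ratio (LM / human):        {lm_tokens / human_tokens:.0e}")   # ~1e4
print(f"FLOP per datapoint, human:      {human_flop / human_tokens:.0e}")  # ~1e15
print(f"FLOP per datapoint, LM:         {lm_flop / lm_tokens:.0e}")        # ~1e12
```

Under these assumed numbers, total training FLOP lands within an order of magnitude or so, while the data and compute-per-datapoint ratios each differ by three or four orders of magnitude.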
TinyModel SAEs have these first entity and second entity latents.
E.g. if the story is ‘Once upon a time Tim met his friend Sally.’, Tim is the first entity and Sally is the second entity. The latents fire on all instances of first|second entity after the first introduction of that entity.
I think I at one point found an ‘object owned by second entity’ latent but have had trouble finding it again.
I wonder if LMs are generating these reusable ‘pointers’ and then doing computation with the pointers. For example, to track that an object is owned by the first entity, you just need to calculate which entities are instances of the first entity, calculate when the first entity is shown to own an object and write ‘owned by first entity’ to the object token, and then broadcast that forward to other instances of the object. Then, if you have the tokens Tim|'s (and 's has calculated that the first entity is immediately before it), 's can, with a single attention head, look for objects owned by the first entity.
This means that the exact identity information of the object (e.g. ‘ hammer’) and the exact identity information of the first entity (‘ Tim’) don't need to be passed around in computations; you can just do much cheaper pointer calculations and grab the relevant identity information when necessary.
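A toy numerical sketch of this pointer story (all feature directions and the token sequence are made up for illustration; this is not a claim about any particular trained head):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

def feat():
    """A random unit vector standing in for a feature direction."""
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

# Abstract 'pointer' features, separate from token-identity features.
OWNED_BY_FIRST = feat()       # written to an object token once the first entity owns it
FIRST_ENTITY_BEFORE = feat()  # written to "'s" when the first entity is right before it
ID = {tok: feat() for tok in [" Tim", " hammer", " ball", "'s"]}

# Toy residual stream for a context like "... Tim's hammer ... her ball ... Tim|'s".
tokens = [" hammer", " ball", " Tim", "'s"]
resid = np.stack([
    ID[" hammer"] + OWNED_BY_FIRST,   # hammer has 'owned by first entity' written to it
    ID[" ball"],                      # ball is owned by the second entity
    ID[" Tim"],
    ID["'s"] + FIRST_ENTITY_BEFORE,   # "'s" has computed that the first entity precedes it
])

# A single attention head whose QK circuit matches 'first entity is right before me'
# (query side) against 'owned by the first entity' (key side).
W_QK = np.outer(FIRST_ENTITY_BEFORE, OWNED_BY_FIRST)
scores = (resid[-1] @ W_QK) @ resid.T * 15.0   # scores from the "'s" position (scaled to sharpen the toy softmax)
attn = np.exp(scores) / np.exp(scores).sum()

for tok, a in zip(tokens, attn):
    print(f"{tok!r:>10}  attention from the possessive token: {a:.2f}")

# The OV circuit then copies whatever was attended to; the retrieved vector carries the
# object's identity direction, which never entered the attention match itself.
retrieved = attn @ resid
print("similarity to the ' hammer' identity direction:", round(float(retrieved @ ID[" hammer"]), 2))
```

This should put nearly all of the attention on ‘ hammer’ and return a similarity close to 1: the QK match only involves the two abstract pointer features, while the identity directions for ‘ Tim’ and ‘ hammer’ just ride along and get copied out by the OV step when needed.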
This suggests a more fine-grained story for what duplicate name heads are doing in IOI.
Auto-interp is currently really really bad
I think o1 is the only model that seems to perform decently at auto-interp, but it's very expensive, i.e. $1 per latent label. This is frustrating to me.
One barrier to SAE circuits is that it's currently hard to understand how attention-out SAE latents are calculated. Even if you do integrated-gradients (IG) attribution patching to try to understand which earlier latents are relevant to the attention-out SAE latents, it doesn't tell you how these latents interact inside the attention layer at all.
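For concreteness, here is a minimal sketch of the IG-attribution step, using a toy differentiable function as a stand-in for the real pipeline (earlier-layer SAE latents → attention layer → attention-out SAE latent). Nothing below is the actual model or SAE code:

```python
import torch

# Toy integrated-gradients (IG) attribution sketch. `f` is a stand-in for the map from
# earlier-layer SAE latent activations to one attention-out SAE latent's activation;
# in practice it would be the attention layer composed with the two SAEs.
torch.manual_seed(0)
W = torch.randn(16, 16)

def f(latents: torch.Tensor) -> torch.Tensor:
    # Placeholder nonlinear function, NOT an actual attention layer.
    return torch.tanh(latents @ W).relu().sum()

clean = torch.rand(16)      # earlier-latent activations on the clean prompt
baseline = torch.zeros(16)  # baseline, e.g. latents zeroed or taken from a corrupted prompt

def integrated_gradients(f, x, baseline, steps=64):
    """Approximate IG attributions of f at x relative to the baseline."""
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()
        total_grad += point.grad
    return (x - baseline) * total_grad / steps

attr = integrated_gradients(f, clean, baseline)
print(attr.topk(3))  # earlier latents with the largest attributed effect on the target latent

# This ranks which earlier latents matter for the attention-out latent, but it says
# nothing about how they interact inside the attention layer (e.g. via which heads,
# or through the QK vs OV circuits).
```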