Game developer since 2007, splitting time between Seoul and Abu Dhabi. Studied astronomy at Yonsei University but dropped out to start a video game company.
Got hooked on DL and RL around 2015 and have been chasing interesting problems ever since.
Currently building an AI companion service in Korea. These days I’m curious about how LLMs develop personality-like patterns and what happens when you steer them.
Your observation that refusal is represented in a multidimensional subspace (5–8 dimensions) rather than along a single direction is interesting, and I’m curious how it interacts with depth.
I’ve been running CAA experiments on personality steering, collecting MLP activations across all layers for ~800 contrastive pairs. One consistent pattern I’ve observed across multiple LLMs: the variance of cosine similarities across the contrastive pairs follows an inverted-U curve over depth. It increases through the early layers, peaks around the middle, then drops and stabilizes. That middle band is also where steering vectors are most effective (layers 13–22 on a 34-layer model).
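A minimal sketch of the per-layer measurement, on synthetic activations. In the real setup the activations would be MLP outputs captured with forward hooks; the toy shapes, variable names, and the choice of the mean-difference vector as the reference for the cosines are my assumptions, not the exact pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_pairs, d = 34, 100, 256  # toy sizes; the real runs used ~800 pairs

# pos[layer, pair, :] / neg[layer, pair, :] = activations for the two sides
# of one contrastive pair (synthetic here)
pos = rng.normal(size=(n_layers, n_pairs, d))
neg = rng.normal(size=(n_layers, n_pairs, d))

def cosine(a, b):
    """Cosine similarity along the last axis, with broadcasting."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

cos_var = []
for layer in range(n_layers):
    diffs = pos[layer] - neg[layer]        # per-pair difference vectors
    steer = diffs.mean(axis=0)             # CAA mean-difference steering vector
    cos = cosine(diffs, steer[None, :])    # each pair's alignment with the mean
    cos_var.append(float(cos.var()))       # spread of alignment at this depth

# On real activations, plotting cos_var against layer index traces the
# inverted-U over depth described above.
```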
This makes me wonder whether your 5–8 dimension estimate is depth-dependent: the same concept might look lower-dimensional in early or late layers simply because it hasn’t been extracted yet or has already been compressed. Extraction method matters too: mean-difference vectors and BiPO vectors for the same trait can point in quite different directions, yet both produce measurable behavioral shifts. That suggests a concept’s representation may be a family of correlated directions rather than a single tight subspace.
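This isn’t BiPO, just a toy geometric illustration of the "different directions, both effective" point: two synthetic unit vectors (standing in for the two extraction methods) that are noticeably non-parallel can each shift activations along a shared trait axis, as long as both keep a component on it. All names and the noise scale are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256
trait = rng.normal(size=d)
trait /= np.linalg.norm(trait)  # the shared "trait axis"

def noisy_copy(v, scale):
    """A unit vector correlated with v but perturbed per-component."""
    u = v + scale * rng.normal(size=d)
    return u / np.linalg.norm(u)

v_meandiff = noisy_copy(trait, 0.05)  # stand-in for a mean-difference vector
v_bipo = noisy_copy(trait, 0.05)      # stand-in for a BiPO-style vector

overlap = float(v_meandiff @ v_bipo)  # noticeably below 1: different directions
acts = rng.normal(size=(32, d))       # a batch of synthetic activations
shifts = [float(((acts + 4.0 * v) @ trait).mean() - (acts @ trait).mean())
          for v in (v_meandiff, v_bipo)]  # both clearly nonzero
```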
Did you find the dimensionality consistent across layers? And did you extract from the residual stream or from MLP outputs?