Nicholas Goldowsky-Dill comments on AlgZoo: uninterpreted models with fewer than 1,500 parameters

Nicholas Goldowsky-Dill 28 Jan 2026 4:45 UTC
32 points
Okay, you successfully nerd-sniped me into interpreting the model :)
I think I understand the role of {N1, N6, N7, N8} reasonably well. The activations post-$W_{hh}$ are well approximated by the linear model
$W_{hh} h_{n,t} \approx a_{n}(δ) \cdot M_{t} + b_{n}(δ) \cdot \max(S_{t}, 0)$
where $M_{t}$ is the running max, $S_{t}$ is the second running max, and $δ$ represents how long ago the max value occurred. The coefficients change with $δ$ in pleasing patterns:
This model fits the activations well ( $R^{2} = 0.992$ ).^[1]
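For readers who want to try this kind of fit themselves, here is a minimal sketch of the procedure on synthetic data. It is not the code from the notebook: the function names (`running_features`, `fit_per_delta`), the sequence length, and the synthetic "activations" are all assumptions made up for illustration. It computes $M_{t}$, $S_{t}$ and $δ$ for each sequence, then fits one $(a, b)$ pair per value of $δ$ by least squares and reports a pooled $R^{2}$.

```python
import numpy as np

def running_features(x):
    """Return (M, S, delta) for a 1-D sequence x.

    M[t] is the max of x[:t+1], S[t] is the second-largest value of
    x[:t+1] (-inf at t=0), delta[t] is the number of steps since the
    running max occurred.
    """
    T = len(x)
    M, S = np.empty(T), np.empty(T)
    delta = np.empty(T, dtype=int)
    m, s, pos = -np.inf, -np.inf, 0
    for t, v in enumerate(x):
        if v > m:            # new maximum: old max becomes second max
            m, s, pos = v, m, t
        elif v > s:          # new second max only
            s = v
        M[t], S[t], delta[t] = m, s, t - pos
    return M, S, delta

def fit_per_delta(act, M, S, delta):
    """Fit act ~ a(delta) * M + b(delta) * max(S, 0), one fit per delta."""
    S_pos = np.maximum(S, 0.0)
    pred = np.zeros_like(act)
    coefs = {}
    for d in np.unique(delta):
        mask = delta == d
        X = np.stack([M[mask], S_pos[mask]], axis=1)
        (a, b), *_ = np.linalg.lstsq(X, act[mask], rcond=None)
        coefs[int(d)] = (float(a), float(b))
        pred[mask] = X @ np.array([a, b])
    ss_res = np.sum((act - pred) ** 2)
    ss_tot = np.sum((act - act.mean()) ** 2)
    return coefs, 1.0 - ss_res / ss_tot

# Synthetic stand-in for one neuron's activations (hypothetical, NOT
# the real model): known coefficients a(d) = 1 - 0.1 d, b(d) = 0.2 d.
rng = np.random.default_rng(0)
feats = [running_features(rng.normal(size=10)) for _ in range(2000)]
M, S, delta = (np.concatenate(f) for f in zip(*feats))
act = (1 - 0.1 * delta) * M + 0.2 * delta * np.maximum(S, 0.0)
act = act + 0.05 * rng.normal(size=act.shape)

coefs, r2 = fit_per_delta(act, M, S, delta)
print(f"R^2 = {r2:.3f}")
```

With real activations, `act` would be replaced by one neuron's recorded post-$W_{hh}$ values, and plotting `coefs` against $δ$ would show how the coefficients vary.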
This is far from a complete explanation by your standards. In particular:
- I only have a partial mechanistic understanding of how the weights lead to this behavior. I think it’s entirely feasible to understand, but it will take more time to unravel.
- There are large parts of the model I haven’t looked at at all, e.g. the other 10 neurons. There are also parts of the task that I don’t know how the model does, e.g. tracking the current position of the 2nd-maximum value.
I may work more on this, but probably not for a couple of days, so it seemed worth posting my progress. There is a lot more detail on my understanding (including the partial mechanistic picture) in this notebook.
1. ^ Though more like 0.95 for some subsets.
- Jacob_Hilton 28 Jan 2026 6:34 UTC
  2 points
  Good start!