Operationalizing the definition of a shard
Pope and Turner (2022) define a shard as follows:

"A shard of value refers to the contextually activated computations which are downstream of similar historical reinforcement events."[2]
To operationalize their definition, we must decide exactly what we mean by "contextually activated computations" and by "similar reinforcement events". One could philosophize here, but I'll just pick somewhat-arbitrary definitions and run with them, for the sake of quickly producing something I could turn into code.
Following Apollo/Goodfire, I will identify contextually activated computations with certain directions in parameter space. These directions might be found using APD, L3D, SPD, or some future method.
I am not sure when to consider two reinforcement events (i.e. RL reward-assignments) “similar”. But if we replace “reinforcement events” with “RL updates”, there is a natural definition: cluster RL parameter updates, and call two updates similar if they are in the same cluster. A given update should be allowed to be in multiple clusters, and we suspect parameter updates enjoy some linear structure. Therefore, a natural clustering method is an SAE.[1] Then, a shard is a decoder vector of an SAE trained on RL parameter updates.
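To make this concrete, here is a minimal sketch of what training an SAE on parameter updates could look like. The sizes are toy and the random data stands in for real flattened RL updates; nothing here is from an actual experiment.

```python
import torch
import torch.nn as nn

class UpdateSAE(nn.Module):
    """Vanilla SAE; under the definition above, each decoder column is one candidate 'shard'."""
    def __init__(self, d_params: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_params, d_dict)
        self.dec = nn.Linear(d_dict, d_params, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))  # sparse code: which "clusters" this update belongs to
        return self.dec(z), z

d_params, d_dict = 10_000, 512        # toy sizes; real parameter vectors are far bigger
sae = UpdateSAE(d_params, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

# Stand-in data: in practice, each row would be one flattened RL parameter update.
updates = torch.randn(1024, d_params)

for x in updates.split(64):
    x_hat, z = sae(x)
    loss = (x_hat - x).pow(2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

shards = sae.dec.weight.T             # row i = decoder vector for latent i = "shard" i
```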
Annoyingly, parameter vectors are much larger than the activation vectors that SAEs are usually trained on, posing three potential issues:
1. Storing enough parameter updates to train our SAE on takes a ton of memory.
2. The SAE itself has a ton of parameters: too many to train effectively.
3. It may be impossible to reconstruct parameter updates with reasonable sparsity (with L0 ∼ 100, say, like standard SAEs).
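For a back-of-envelope sense of scale on Issue 1: a single dense update to a 7B-parameter model stored in fp16 is about 7 × 10⁹ × 2 bytes ≈ 14 GB, so naively storing even a thousand updates would already take on the order of 14 TB.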
To mitigate Issue 1, one could focus on a small subset of model parameters. One could also train the SAE at the same time as RL training itself, storing only a small buffer of the most recent parameter updates.
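A rough sketch of that buffered setup, with a toy model and random perturbations standing in for real RL steps:

```python
from collections import deque
import torch
import torch.nn as nn

model = nn.Linear(64, 64)                  # stand-in for the policy being RL-trained
buffer = deque(maxlen=256)                 # small rolling buffer bounds memory use

prev = torch.cat([p.detach().flatten() for p in model.parameters()]).clone()
for step in range(1_000):
    # Stand-in for one RL optimizer step (PPO/GRPO/etc. in practice):
    with torch.no_grad():
        for p in model.parameters():
            p.add_(0.01 * torch.randn_like(p))

    now = torch.cat([p.detach().flatten() for p in model.parameters()])
    buffer.append(now - prev)              # store the update, not the weights
    prev = now

    if len(buffer) == buffer.maxlen:
        batch = torch.stack(list(buffer))  # one SAE training batch
        # sae_train_step(sae, batch)       # hypothetical hook: e.g. one gradient
        #                                  # step of the UpdateSAE sketch above
```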
L3D cleverly deals with Issue 2 using low-rank parametrizations for the encoder and decoder of the SAE.
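My reading of that trick, as a sketch (not necessarily L3D's exact architecture): factor both the encoder and decoder through a rank-r bottleneck, so the parameter count scales with r · (d_params + d_dict) rather than d_dict · d_params.

```python
import torch
import torch.nn as nn

class LowRankSAE(nn.Module):
    """SAE whose encoder and decoder each factor through a rank-r bottleneck."""
    def __init__(self, d_params: int, d_dict: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_params, rank, bias=False)  # d_params -> r
        self.enc  = nn.Linear(rank, d_dict)                # r -> d_dict
        self.dec  = nn.Linear(d_dict, rank, bias=False)    # d_dict -> r
        self.up   = nn.Linear(rank, d_params, bias=False)  # r -> d_params

    def forward(self, x):
        z = torch.relu(self.enc(self.down(x)))
        return self.up(self.dec(z)), z

# ~13M parameters here, vs. ~820M for a dense SAE of the same d_params/d_dict.
sae = LowRankSAE(d_params=100_000, d_dict=4_096, rank=64)
# Shard i is now represented implicitly as up.weight @ dec.weight[:, i].
```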
Issue 3 may turn out not to occur in practice. Mukherjee et al. (2025) find that, when post-training an LLM with RL, parameter updates are very sparse in the neuron basis. Intuitively, many expect that such post-training merely boosts/suppresses a small number of pre-existing circuits. So perhaps we will find that, in practice, one can reconstruct parameter updates with reasonable sparsity.
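If you want to check this on your own RL run, a simple diagnostic (illustrative only; not Mukherjee et al.'s methodology) is the fraction of neurons whose weights an update leaves untouched:

```python
import torch

def neuron_update_sparsity(before: torch.Tensor, after: torch.Tensor,
                           tol: float = 1e-8) -> float:
    """Fraction of rows (neurons) of a weight matrix left unchanged by an update."""
    row_change = (after - before).abs().amax(dim=1)
    return (row_change <= tol).float().mean().item()

# Toy usage: an update that only touches 5 of 1,000 neurons.
w0 = torch.randn(1_000, 512)
w1 = w0.clone()
w1[:5] += 0.1 * torch.randn(5, 512)
print(neuron_update_sparsity(w0, w1))   # ~0.995
```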
All of this is to say: maybe someone should try training an SAE/applying L3D to RL parameter updates. And, if they feel like it, they could call the resulting feature vectors “shards”.
[1] For each SAE latent, the corresponding cluster is the set of data points on which that latent is active.

[2] Shard theory is supposed to apply to humans, too, but here I focus on NNs since humans seem more complicated.