An ML interpretation of Shard Theory

Crossposted from my personal blog

Epistemic Status: I have spent a fair bit of time reading the core Shard Theory posts and trying to understand it. I also have a background in RL as well as the computational neuroscience of action and decision-making. However, I may be misunderstanding or have missed crucial points. If so, please correct me!

Shard Theory has always seemed slightly esoteric and confusing to me: what are ‘shards’, and why might we expect them to form in RL agents? When first reading the Shard Theory posts, I had two main sources of confusion. The first was why an agent trained on a reward function should not optimise for reward, but should instead just implement behaviours that have been rewarded in the past.

This distinction is now obvious to me: it is the distinction between amortised and direct inference. The picture of shards as cached behaviours falls directly out of amortised policy gradient algorithms (which Shard Theory uses as the prototypical case of RL [1]). This idea has also been expanded on in many other posts.
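To make the amortised reading concrete, here is a minimal sketch of a single REINFORCE-style update for a stateless softmax policy. The function names, the two-action setup, and the learning rate are all illustrative assumptions rather than anything taken from the Shard Theory posts. The thing to notice is that the update never searches for the reward-maximising action: it only nudges up the log-probability of whatever action was actually sampled, in proportion to the reward that followed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, reward, lr=0.5):
    """One REINFORCE step for a stateless softmax policy (toy example).

    The gradient of log pi(action) with respect to logit i is
    1[i == action] - pi(i), so a rewarded action is upweighted and an
    unrewarded one leaves the policy untouched.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]


logits = [0.0, 0.0]  # hypothetical two-action policy, initially indifferent
after_reward = reinforce_update(logits, action=0, reward=1.0)
after_nothing = reinforce_update(logits, action=1, reward=0.0)

print(softmax(after_reward))   # action 0 is now more likely
print(softmax(after_nothing))  # unchanged: no reward, no update
```

In this sketch the reward only ever appears as a multiplier on the gradient during training. At inference time the policy simply replays whatever has been cached in `logits`, which is the sense in which the behaviour is amortised.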

The second source of my confusion was the idea of shards themselves. Even given amortisation, why should behaviour splinter into specific ‘shards’, and why should the shards compete with one another? What would it even mean for ‘shards’ to compete or for there to be ‘shard coalitions’ in a neural network?

My best guess here is that Shard Theory is making several empirical claims about the formation of representations during training for large-scale (?) RL models. Specifically, from an ML lens, we can think of shards as loosely-coupled relatively-independent subnetworks which implement specific behaviours.

A concrete instantiation of Shard Theory’s claim, therefore, appears to be that during training the optimiser tends to construct multiple loosely coupled circuits, each of which implements some specific behaviour that has been rewarded in the past. In a forward pass through the network, these circuits are activated according to some degree of similarity between the current state and the states that have led to reward in the past. The circuits then ‘compete’ with one another to shape behaviour, with their outputs being passed through some kind of normalising nonlinearity such as a softmax. I am not entirely sure how ‘shard coalitions’ can occur on this view, but perhaps they arise through some kind of reciprocal positive feedback, where the early parts of the circuit of shard A also provide positive activations to the circuit of shard B, so that the two become co-active (which might eventually lead to the shards ‘merging’) [2].
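The picture above can be sketched as a toy policy. This is only an illustration under strong assumptions: the shard names, the two-dimensional state, the dot-product similarity, and the "bids" that each shard places on actions are all hypothetical choices of mine, not something specified by Shard Theory. Each shard is activated by how similar the current state is to a prototype of previously rewarded states, adds its bids to the action logits in proportion to that activation, and a softmax over the summed logits resolves the competition.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical shards: a prototype of past rewarded states, plus the
# logits ("bids") the shard contributes to each action when active.
SHARDS = {
    "lollipop": {"prototype": [1.0, 0.0], "bids": [3.0, 0.0]},  # favours action 0
    "homework": {"prototype": [0.0, 1.0], "bids": [0.0, 3.0]},  # favours action 1
}
N_ACTIONS = 2


def policy(state):
    """Action probabilities from similarity-gated, competing shards."""
    logits = [0.0] * N_ACTIONS
    for shard in SHARDS.values():
        # Activation grows with similarity to the shard's rewarded states.
        activation = max(0.0, dot(state, shard["prototype"]))
        for a in range(N_ACTIONS):
            logits[a] += activation * shard["bids"][a]
    # The softmax is where the shards' bids compete.
    return softmax(logits)


near_lollipop = policy([1.0, 0.2])  # mostly resembles lollipop-rewarded states
ambiguous = policy([0.5, 0.5])      # both shards equally active

print(near_lollipop)
print(ambiguous)
```

In this sketch a ‘coalition’ would correspond to two shards whose prototypes overlap and whose bids point the same way, so that they are co-active and reinforce each other's preferred action.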

This is not the only way that processing could happen in a policy network. The current conceptualisation of shards requires them to be in the ‘output space’, i.e. shards correspond to subnetworks that favour some series of actions being taken. However, the network could instead do a lot of processing in the input space. For instance, it could separate processing into two phases: (1) figure out what action to take by analysing the current state and comparing it to past rewarded states, and then (2) translate that abstract action into the real action space, for example translating ‘eat lollipop’ into specific muscle movements. In this case, there wouldn’t be multiple shards forming around behaviours, but there could instead be ‘perceptual shards’ which each provide their own interpretations of the current state.

Another alternative is that all the circuits in the network are tightly coupled and cannot be meaningfully separated into distinct ‘shards’. Instead, each reward event subtly increases and decreases the probabilities of all options by modifying all aspects of the network. This is the ‘one-big-circuit’ perspective, and it may be correct. To summarise, Shard Theory appears to make two claims: first, that processing in the network is primarily done in output (behaviour) space, and second, that the internals of the network are relatively modular and consist of fairly separable circuits which implement and upweight specific behaviours.
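One crude way to picture the difference between the modular view and the ‘one-big-circuit’ view is to ask how much of a layer's weight mass stays inside candidate groups of units. The measure below is a toy of my own for illustration, not an established interpretability method, and it assumes the grouping of units into candidate shards is already given (which is the hard part in a real network).

```python
def within_group_weight_fraction(weights, groups):
    """Fraction of absolute weight mass connecting units in the same group.

    weights[i][j] is the connection between unit i and unit j, and
    groups[k] is the (hypothetical) shard label assigned to unit k.
    """
    within = 0.0
    total = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            total += abs(w)
            if groups[i] == groups[j]:
                within += abs(w)
    return within / total if total else 0.0


groups = [0, 0, 1, 1]  # assumed assignment of four units to two shards

# Block-diagonal weights: two separable circuits.
modular = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
# Fully dense weights: every unit talks to every other unit.
dense = [[1.0] * 4 for _ in range(4)]

print(within_group_weight_fraction(modular, groups))
print(within_group_weight_fraction(dense, groups))
```

A score near 1 corresponds to the separable-circuits picture, while a score near the chance level for the given grouping (one half for two equal groups with uniform weights) corresponds to the tightly coupled picture.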

These are empirical questions that can be answered! And indeed, if we succeed at interpretability even a small amount, we should start to get some answers to these questions. Evidence from the current state of interpretability research is mixed. Chris Olah’s work on CNNs, especially Inception V1, suggests something closer to the ‘one-big-circuit’ view than separable shards. Specifically, in CNNs representations appear to be built up by hierarchical compositional circuits (i.e. you go from curve detectors to fur detectors to dog detectors), but these circuits are all tightly intertwined with each other rather than forming relatively independent and modular circuits (although different branches of Inception V1 appear to be modular and specialised for certain kinds of perceptual input). For instance, the features at a higher layer tend to depend on a large number of the features at lower layers.

On the other hand, in transformer models there appears to be more evidence for independent circuits. For instance, we can uncover specific circuits for things like induction or indirect-object identification. However, these findings must be interpreted with caution, since we understand much less about the representations of transformer language models than about those of Inception V1. A priori, both the much greater number of parameters in transformer models compared to CNNs and the additive nature of residual nets, as opposed to the multiplicative hierarchical nature of deep CNNs, could potentially encourage the formation of more modular, additive, shard-like subcircuits.

To my knowledge, we have almost zero studies of the internal processing of reasonably large-scale policy gradient networks, which would be required to address these questions in practice. This (and interpretability in RL models in general) would be a great avenue for future interpretability and safety research.

As well as these specific claims, Shard Theory also implicitly makes some high-level assumptions about likely AGI architectures. Specifically, it requires that AGI be built entirely (or perhaps only primarily) as an amortised model-free RL agent trained on a highly variegated reward function, i.e. one with rewards for pursuing many different kinds of objectives. To me this is a fairly safe bet, as this is approximately how biological intelligence operates, and moreover neuromorphic or brain-inspired AGI, as envisaged by DeepMind, is likely to approximate this ideal. Other routes to AGI do not fit this picture. One example is an AIXI-like super-planner, which does direct optimisation and so won’t form shards or approximate value fragments, barring any inner-alignment failures. Another example is some kind of recursive query wrapper around a general world model, as portrayed here, which does not really get meaningful reward signals at all and isn’t trained with RL. The cognitive properties of this kind of agent, if it can realistically exist, are not really known to me at all.

  1. ^

    In a fun intellectual circle, a lot of shard theory (and model-free RL in general) seems to be people reinventing behaviourism, except this time by programming agents for which it is actually true. For instance, in behaviourism, agents never ‘optimise for reward’ but always simply display ‘conditioned’ behaviours which were associated with reward in the past. There are also various Pavlovian/associative conditioning experiments which might be interesting to do with RL agents.

  2. ^

    Does this happen in the brain? Some potential evidence (and probably some inspiration) for this comes from the brain, and most probably from the basal ganglia (BG), which implements subcortical action selection. The basal ganglia is part of a large-scale loop through the brain (cortex → BG → thalamus → cortex) which contains the full sensorimotor loop. The classic story of the BG is model-free RL with TD learning (but I personally have come to largely disagree with this). A large number of RL algorithms are consistent with reward prediction errors (RPEs), including policy gradients as well as more esoteric algorithms. Beyond this, dopaminergic neurons are more complicated than just implementing RPEs, and they also appear to represent multiple reward functions, which can result in highly flexible TD learning algorithms. The BG does appear to have opponent pathways for exciting and inhibiting specific actions/plans (the Go and No-Go pathways), which indicates some level of shard-theory-like competition. On the other hand, there also seems to be a fairly clear separation between action selection and action implementation in the brain, where the basal ganglia mostly does action selection and delegates the circuitry to implement the action to the motor cortex or specific subcortical structures. As far as I know, the motor cortex doesn’t have the same level of competition between different potential behaviours as the basal ganglia, although this has of course been proposed. Behaviourally, there is certainly some evidence for multiple competing behaviours being activated simultaneously and needing to be effortfully inhibited. A classic example is the Stroop task, but there is indeed a whole literature studying tasks where people need to inhibit certain attractive behaviours in various circumstances.
    However, this is not conclusive evidence for a shard-like architecture: there could instead be a hybrid architecture of both amortised and iterative inference, in which the amortised and iterative responses differ.