Price’s equation for neural networks

tailcalled 21 Dec 2022 13:09 UTC

29 points

Price’s equation is a fundamental equation in genetics, which can be used to predict how traits will change due to evolution. It can be phrased in many ways, but for the current post I will use the following simplified continuous-time variant:

$\frac{d}{dt} x = \operatorname{gcov}(x, f) = (\nabla_{g} E[x \mid g]) \operatorname{cov}(g, g) \cdot (\nabla_{g} E[f \mid g])$

Here, $x$ represents some genetic trait, $f$ represents the fitness of the organism, $g$ represents the genes of an organism, and $\operatorname{gcov}$ represents the genetic covariance between the trait and the fitness. Usually people only use the $\frac{d}{dt} x = \operatorname{gcov}(x, f)$ part of the equation^[1], but I’ve written out the definition

$\operatorname{gcov}(a, b) = (\nabla_{g} E[a \mid g]) \operatorname{cov}(g, g) \cdot (\nabla_{g} E[b \mid g])$

because that will make the analogy to neural networks easier to see.

Neural network training and Price’s equation

Suppose we train a neural network’s weights $w$ using the following equation, where $L$ represents the loss for the network:

$\frac{d}{d t} w = - \nabla_{w} L (w)$

In that case, if we have some property $x(w)$ of the network (e.g. $x$ could represent how a classifier labels an image, or how an agent acts in a specific situation, or similar), then we can derive an equation for $x$’s evolution over time:

$\frac{d}{dt} x = (\nabla_{w} x(w)) \cdot \frac{d}{dt} w = - (\nabla_{w} x(w)) \cdot (\nabla_{w} L(w))$

Similar to how we have a concept of genetic covariance to represent the covariance linked to genes, we should probably also introduce a covariance concept linked to neural network weights, to make it cleaner to talk about. I’ll call that $\operatorname{ntcov}$ (short for neural tangent covariance), defined as:

$\operatorname{ntcov}(a, b) = (\nabla_{w} a(w)) \cdot (\nabla_{w} b(w))$

Furthermore, to make it closer to being analogous, we might replace $L$ with $U = -L$, yielding the following equation for predicting the evolution of any property $x$ with training under gradient descent:

$\frac{d}{dt} x = \operatorname{ntcov}(x, U)$
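
As a sanity check on this equation, here is a minimal numerical sketch. The two-weight `loss`, the property `x`, and the finite-difference `grad` helper are all hypothetical choices made for illustration, not anything taken from a real network, and a single small Euler step stands in for continuous-time training.

```python
import math

def grad(fn, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function fn at w."""
    g = []
    for i in range(len(w)):
        hi, lo = list(w), list(w)
        hi[i] += eps
        lo[i] -= eps
        g.append((fn(hi) - fn(lo)) / (2 * eps))
    return g

def ntcov(a, b, w):
    """Neural tangent covariance: dot product of the weight-gradients of a and b."""
    return sum(ga * gb for ga, gb in zip(grad(a, w), grad(b, w)))

# Hypothetical toy "network" with two weights and a quadratic loss.
def loss(w):
    return 0.5 * (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2

def utility(w):
    return -loss(w)

# Hypothetical property of the network whose evolution we want to predict.
def x(w):
    return math.tanh(w[0]) + w[0] * w[1]

w = [0.3, 0.2]
dt = 1e-4

# Prediction from the equation above: dx/dt = ntcov(x, U).
predicted = ntcov(x, utility, w)

# Observation: take one Euler step of dw/dt = -grad L and measure the change in x.
w_next = [wi - dt * gi for wi, gi in zip(w, grad(loss, w))]
observed = (x(w_next) - x(w)) / dt

print(f"predicted dx/dt = {predicted:.6f}, observed dx/dt = {observed:.6f}")
```

The two numbers should agree up to an error of order `dt`, since the Euler step only approximates the continuous-time dynamics.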

This makes a bunch of idealized assumptions about the training process, e.g. that we have an exact measure of the full gradient. It might be worth relaxing the math to more realistic assumptions and checking how much still applies. But for now, let’s just charge ahead with the unrealistic assumptions.

Covariance niceties

Covariances play nicely with linear causal effects. If $F$ and $G$ are linear transformations, then $\operatorname{cov}(F x, G y) = F \operatorname{cov}(x, y) G^{\top}$.
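
For reference, this identity follows in one line from the definition of covariance, assuming $x$ and $y$ are random vectors with means $\mu_{x}$ and $\mu_{y}$ (symbols introduced here only for illustration):

$\operatorname{cov}(F x, G y) = E[F (x - \mu_{x}) (y - \mu_{y})^{\top} G^{\top}] = F \, E[(x - \mu_{x}) (y - \mu_{y})^{\top}] \, G^{\top} = F \operatorname{cov}(x, y) G^{\top}$

The neural tangent covariance is a product of gradients rather than an expectation, but the same pattern holds for it as long as $F$ and $G$ do not depend on the weights, because the gradient of $F a(w)$ is $F$ applied to the gradient of $a(w)$.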

For instance, suppose you have a reinforcement learner that has learned to drink juice when close to it. Suppose further that now the main determinant for whether it gets reward is whether it approaches juice when it sees juice. We might formalize that effect as $r = f a$, where $r$ is the reward given to the agent, $f$ is the frequency at which it sees juice that it can approach, and $a$ is its likelihood of approaching juice if it sees it.

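Taking the reward as the utility $U = r$ and treating $f$ as a constant that does not depend on the weights, we can then compute: $\frac{d}{dt} a = \operatorname{ntcov}(a, r) = \operatorname{ntcov}(a, f a) = f \operatorname{ntcov}(a, a)$.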

$\operatorname{ntcov}(a, a)$ is a special quantity which we could call the neural tangent variance $\operatorname{ntvar}(a)$. It represents the degree to which $a$ is sensitive to the neural network parameters. For common situations, this may be dependent on the structure of the network, but also more directly on the nature and value of $a$.

For instance, if $a$ is the expectation of a binary variable with a probability $p$ for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that $\operatorname{ntvar}(a)$ is approximately proportional to $p (1 - p)$, but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.

In particular, this means that if $p$ is very low (in the juice example, if it is exceedingly rare for the agent to approach juice it sees), then $\operatorname{ntvar}(a)$ will also be very low, and this will make $\operatorname{ntcov}(a, r)$ low and therefore also make $\frac{d}{dt} a$ low.
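
To make the slowdown concrete, here is a minimal sketch under a strong simplifying assumption: the whole "network" is a single bias weight $w$ with $a = \operatorname{sigmoid}(w)$, the same toy case worked through in the comments. The function names and the value of `f` are hypothetical. In this toy case the neural tangent variance comes out as $p^{2} (1 - p)^{2}$ rather than $p (1 - p)$, but either way it vanishes as $p$ goes to 0.

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def ntvar_single_bias(w):
    """ntvar(a) = (da/dw)^2 for a = sigmoid(w), using da/dw = a * (1 - a)."""
    a = sigmoid(w)
    return (a * (1.0 - a)) ** 2

def da_dt(w, f):
    """da/dt = f * ntvar(a), assuming reward r = f * a with f independent of w."""
    return f * ntvar_single_bias(w)

# Compare a balanced policy (w = 0) with increasingly rare "approach" behaviour.
for w in (0.0, -2.0, -6.0):
    p = sigmoid(w)
    print(f"p={p:.4f}  ntvar={ntvar_single_bias(w):.3e}  da/dt={da_dt(w, f=0.5):.3e}")
```

The printed rate of change drops by several orders of magnitude as $p$ shrinks, which is the effect described above.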

^[1] And usually people also put in other terms too to account for various distortions.


tailcalled 10 Feb 2023 16:32 UTC
2 points
I’ve been meaning to write for a while now:
I’ve realized that since the derivative only depends on the local, linear behaviour of a function, we can actually strengthen the covariance niceties a lot. If $f$ and $g$ are arbitrary differentiable functions, then I believe that:
$\operatorname{ntcov}(f(x), g(y)) = J f(x) \operatorname{ntcov}(x, y) J g(y)^{\top}$
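A short derivation sketch supporting this, assuming $x$ and $y$ are (possibly vector-valued) differentiable functions of the weights, and writing $J_{w} x$ for the Jacobian of $x$ with respect to $w$ (notation introduced here, so that $\operatorname{ntcov}(x, y) = (J_{w} x) (J_{w} y)^{\top}$):
$J_{w} [f(x(w))] = J f(x) \, J_{w} x$ (chain rule)
$\operatorname{ntcov}(f(x), g(y)) = (J f(x) \, J_{w} x) (J g(y) \, J_{w} y)^{\top} = J f(x) \, (J_{w} x) (J_{w} y)^{\top} \, J g(y)^{\top} = J f(x) \operatorname{ntcov}(x, y) J g(y)^{\top}$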
TurnTrout 21 Dec 2022 17:26 UTC
2 points
I really like this post. Can you expand your intuitions on:
For instance, if $a$ is the expectation of a binary variable with a probability $p$ for being 1, then I bet there is probably going to be a Bernoulli distribution aspect to it, such that $\operatorname{ntvar}(a)$ is approximately proportional to $p (1 - p)$, but likely with a scale factor that depends on the network architecture or parameters, rather than being entirely equal to it.
- tailcalled 21 Dec 2022 22:51 UTC
  2 points
  Sure!
  So let’s start with a basic example, an agent that has two actions, “don’t” and “do”. Suppose it has an output neuron that contains the logits for what action to take, and for simplicity’s sake (I will address this at the end of this comment) let’s assume that this output neuron is controlled by a single weight $w$ which represents its bias. So this means that the $p$ variable described in the OP expands into: $p = P(a = \text{do}) = \operatorname{sigmoid}(w)$.
  We can then compute $\nabla_{w} p(w) = p(w) (1 - p(w))$. And, hmm, this actually implies that $\operatorname{ntvar}(p) = p^{2} (1 - p)^{2}$, rather than the $p (1 - p)$ that my intuition suggested, I think? The difference is basically that $p^{2} (1 - p)^{2}$ is flatter than $p (1 - p)$, especially in the tails where the former quadratically goes to 0 while the latter linearly goes to 0.
  One thing I would wonder is what happens during training, if we e.g. use policy gradients and give a reward of 1 for do and a reward of −1 for don’t. The update rule for policy gradients is basically $r \nabla_{w} \log p(w)$, whose expectation over the two actions, according to Wolfram Alpha, expands into $\frac{2}{(1 + e^{-w}) (1 + e^{w})}$, and which we can further simplify to $2 p (1 - p)$. But we would have to square it to get $\operatorname{ntvar}(r)$, so I guess the same point applies here as before. 🤷
  Anyway, obviously this is massively simplified because we are assuming a trivial neural network. In a nontrivial one, I think the principle would be the same, due to the chain rule which gives you a factor of $p (1 - p)$ onto whatever gradients exist before the final output neuron.
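  As a hedged check of the Wolfram Alpha step above, assuming the same single-weight policy $p = \operatorname{sigmoid}(w)$ and writing $\pi(a)$ for the probability of the action actually taken (notation introduced here):
  $\nabla_{w} \log p = 1 - p, \qquad \nabla_{w} \log (1 - p) = -p$
  $E[r \nabla_{w} \log \pi(a)] = p \cdot (+1) \cdot (1 - p) + (1 - p) \cdot (-1) \cdot (-p) = 2 p (1 - p)$
  The expected reward is $E[r] = 2 p - 1$, whose gradient is also $2 p (1 - p)$, so under these assumptions $\operatorname{ntvar}(E[r]) = 4 p^{2} (1 - p)^{2}$.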
tailcalled 21 Dec 2022 13:49 UTC
2 points
Actually, upon further thought, for something like policy gradients, in the limit where the probability $p$ is close to $0$, $\operatorname{ntvar}(a)$ would probably be more like $O(p^{2})$? Because you get a factor of $p$ from the probability, and then an additional factor of $p (1 - p) = O(p)$ from the derivative of sigmoid/softmax, which multiplies out to $O(p^{2})$.