Price’s equation for neural networks
Price’s equation is a fundamental equation in genetics, which can be used to predict how traits will change due to evolution. It can be phrased in many ways, but for the current post I will use the following simplified continuous-time variant:
$$\frac{dx}{dt} = \mathrm{gcov}(x, f) = (\nabla_g \mathbb{E}[x \mid g])^\top \, \mathrm{cov}(g, g) \, (\nabla_g \mathbb{E}[f \mid g])$$
Here, $x$ represents some genetic trait, $f$ represents the fitness of the organism, $g$ represents the genes of an organism, and $\mathrm{gcov}$ represents the genetic covariance between the trait and the fitness. Usually people only use the $\frac{dx}{dt} = \mathrm{gcov}(x, f)$ part of the equation[1], but I’ve written out the definition
$$\mathrm{gcov}(a, b) = (\nabla_g \mathbb{E}[a \mid g])^\top \, \mathrm{cov}(g, g) \, (\nabla_g \mathbb{E}[b \mid g])$$
because that will make the analogy to neural networks easier to see.
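As a sanity check on the simplified form, here is a small numerical sketch using replicator dynamics ($\dot{p}_i = p_i (f_i - \bar{f})$ for genotype frequencies $p_i$); all trait values, fitnesses, and frequencies below are made-up toy numbers:

```python
import numpy as np

# Toy check of d/dt E[x] = cov(x, f) under replicator dynamics
# dp_i/dt = p_i (f_i - fbar). The genotype traits, fitnesses, and
# frequencies are made-up numbers for illustration.
x = np.array([0.0, 1.0, 2.0, 5.0])   # trait value of each genotype
f = np.array([1.0, 1.2, 0.9, 1.5])   # fitness of each genotype
p = np.array([0.4, 0.3, 0.2, 0.1])   # genotype frequencies (sum to 1)

mean = lambda v: p @ v
cov_xf = mean(x * f) - mean(x) * mean(f)

# One small explicit Euler step of the replicator dynamics.
dt = 1e-6
p_next = p + dt * p * (f - mean(f))
dmean_dt = (p_next @ x - p @ x) / dt

print(dmean_dt, cov_xf)  # the two agree to first order in dt
```

The Euler step stands in for the continuous-time dynamics, so the two printed quantities agree up to an $O(dt)$ discretization error.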
Neural network training and Price’s equation
Suppose we train a neural network’s weights $w$ using the following equation, where $L$ represents the loss for the network:
$$\frac{dw}{dt} = -\nabla_w L(w)$$
In that case, if we have some property $x(w)$ of the network (e.g. $x$ could represent how a classifier labels an image, or how an agent acts in a specific situation, or similar), then we can derive an equation for $x$’s evolution over time:
$$\frac{dx}{dt} = (\nabla_w x(w)) \cdot \frac{dw}{dt} = -(\nabla_w x(w)) \cdot (\nabla_w L(w))$$
Just as genetic covariance captures the covariance mediated by genes, it is convenient to introduce an analogous covariance concept tied to the neural network weights. I’ll call it $\mathrm{ntcov}$ (short for neural tangent covariance), defined as:
$$\mathrm{ntcov}(a, b) = (\nabla_w a(w)) \cdot (\nabla_w b(w))$$
Furthermore, to make it more closely analogous, we might replace $L$ with $U = -L$, yielding the following equation for predicting the evolution of any property $x$ under training by gradient descent:
$$\frac{dx}{dt} = \mathrm{ntcov}(x, U)$$
This makes a number of idealized assumptions about the training process, e.g. that we follow the exact full-batch gradient in continuous time. It might be worth relaxing the math to more realistic assumptions and checking how much still applies. But for now, let’s just charge ahead with the unrealistic assumptions.
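As a concrete (and equally idealized) sanity check, here is a sketch with a made-up tiny “network”, loss, and property, using finite differences for all gradients; a small explicit Euler step of $\frac{dw}{dt} = -\nabla_w L(w)$ should change $x$ by approximately $\eta \, \mathrm{ntcov}(x, U)$:

```python
import numpy as np

# Check that a small gradient step changes x by ~ eta * ntcov(x, U),
# with U = -L. The "network" here is just a linear map with a tanh
# readout; the data and functions are made up for the sketch.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
X = rng.normal(size=(3, 5))
y = np.array([1.0, -1.0, 0.5])

def L(w):                      # loss of the "network"
    return np.sum((X @ w - y) ** 2)

def x_prop(w):                 # some property x(w) of the network
    return np.tanh(w @ np.ones(5))

def grad(fn, w, eps=1e-6):     # central-difference gradient
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (fn(w + d) - fn(w - d)) / (2 * eps)
    return g

def ntcov(a_fn, b_fn, w):
    return grad(a_fn, w) @ grad(b_fn, w)

eta = 1e-5                     # step size for one Euler step of dw/dt = -grad L
w_next = w - eta * grad(L, w)
actual = x_prop(w_next) - x_prop(w)
predicted = eta * ntcov(x_prop, lambda v: -L(v), w)
print(actual, predicted)       # agree to first order in eta
```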
Covariance niceties
Covariances play nicely with linear causal effects: if $F$ and $G$ are linear transformations, then $\mathrm{cov}(Fx, Gy) = F \, \mathrm{cov}(x, y) \, G^\top$.
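Since sample cross-covariance is bilinear in its arguments, this identity can be checked exactly on arbitrary data; here is a quick sketch with made-up matrices:

```python
import numpy as np

# Numerical check of cov(Fx, Gy) = F cov(x, y) G^T for linear maps F, G.
# The identity holds exactly (up to floating point) for sample
# cross-covariances because they are bilinear; the data is arbitrary noise.
rng = np.random.default_rng(1)
z = rng.normal(size=(1000, 2))
x = z                                              # 2-d random variable (rows = samples)
y = z @ np.array([[1.0, 0.3], [0.3, 1.0]]) + rng.normal(size=(1000, 2))
F = np.array([[1.0, 2.0], [0.0, 1.0]])
G = np.array([[3.0, -1.0], [1.0, 1.0]])

def cross_cov(a, b):                               # sample cross-covariance matrix
    a0 = a - a.mean(axis=0)
    b0 = b - b.mean(axis=0)
    return a0.T @ b0 / (len(a) - 1)

lhs = cross_cov(x @ F.T, y @ G.T)                  # cov(Fx, Gy)
rhs = F @ cross_cov(x, y) @ G.T                    # F cov(x, y) G^T
print(np.max(np.abs(lhs - rhs)))                   # ~0 up to floating-point error
```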
For instance, suppose you have a reinforcement learner that has learned to drink juice once it is close to it. Suppose further that the main determinant of whether it gets reward is now whether it approaches juice when it sees juice. We might formalize that effect as $r = fa$, where $r$ is the reward given to the agent, $f$ is the frequency with which it sees juice that it can approach, and $a$ is its probability of approaching juice when it sees it.
We can then compute:
$$\frac{da}{dt} = \mathrm{ntcov}(a, r) = \mathrm{ntcov}(a, fa) = f \, \mathrm{ntcov}(a, a)$$
$\mathrm{ntcov}(a, a)$ is a special quantity which we could call the neural tangent variance, $\mathrm{ntvar}(a)$. It represents the degree to which $a$ is sensitive to the neural network parameters. In common situations it will depend on the structure of the network, but also more directly on the nature and current value of $a$ itself.
For instance, if $a$ is the expectation of a binary variable that equals 1 with probability $p$, then I expect a Bernoulli-like effect, such that $\mathrm{ntvar}(a)$ is approximately proportional to $p(1-p)$, though likely with a scale factor that depends on the network architecture and parameters rather than being exactly equal to it.
In particular, this means that if $p$ is very low (in the juice example, if it is exceedingly rare for the agent to approach juice it sees), then $\mathrm{ntvar}(a)$ will also be very low, which in turn makes $\mathrm{ntcov}(a, r)$, and therefore $\frac{da}{dt}$, low as well.
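To see where a $p(1-p)$-style factor can come from, consider a deliberately minimal (made-up) parametrization where the approach-probability is a single sigmoid, $a(w) = \sigma(w \cdot h)$ for a fixed feature vector $h$. Then $\nabla_w a = p(1-p)\,h$, so $\mathrm{ntvar}(a) = (p(1-p))^2 \lVert h \rVert^2$; the exact power of $p(1-p)$ depends on the parametrization, but either way it collapses toward zero as $p \to 0$:

```python
import numpy as np

# Minimal made-up parametrization: a(w) = sigmoid(w . h) for a fixed h.
# Then grad_w a = p(1 - p) h, so ntvar(a) = (p(1-p))^2 ||h||^2,
# which shrinks toward 0 as the approach-probability p does.
h = np.array([0.5, -1.0, 2.0])

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def ntvar(w):
    p = sigmoid(w @ h)
    g = p * (1.0 - p) * h       # chain rule through the sigmoid
    return g @ g

for logit in [0.0, -2.0, -6.0]:      # drive p toward 0
    w = logit * h / (h @ h)          # chosen so that w . h == logit
    print(f"p={sigmoid(logit):.4f}  ntvar={ntvar(w):.2e}")
```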
[1] Usually people also put in other terms to account for various distortions.