Neural Tangent Kernel Distillation

Produced as part of the SERI ML Alignment Theory Scholars Program 2022 under John Wentworth.

Introduction

Consider the following example from Goal Misgeneralization: you train an RL agent to pursue a coin, which is always at the same location (at the end of the level) during training. At test time, if the coin position is randomized, does the RL agent pursue the coin, or does it go to the fixed location?

Since the training data could not distinguish between the two goals (‘go to the coin’ and ‘go to the location’), which one is chosen is purely based on the neural network prior. It would be nice to be able to use simplicity priors like the Solomonoff prior to think about this: the neural network might tend to choose the ‘simplest’ extrapolation. But what is the right notion of simplicity for neural networks? It’s clearly not program length in any normal programming language, because the parity function[1] is very short but hard for neural networks to learn.

In this post, we will summarize some recent advances in DNN theory that have given us the ability to describe the prior of a deep learning network, and discuss the relevance to alignment. Our goal is to communicate quickly the insights that we found difficult to understand or slow to extract from the sources we were using.

Neural Tangent Kernel (NTK) theory compares neural networks to kernel methods.[2] The most interesting takeaways (to us) are:

  • We can make a “linearized” neural network and prove that training this is equivalent to kernel regression.

  • As we increase the width of a normal neural network, we can prove that it will behave more like the linearized network.

  • By relating this to Gaussian Process inference, we can think of the neural network as doing Bayesian inference, and describe and visualize the prior distribution that it is using.

    • We can print out features/​eigenmodes (functions from input vectors to output vectors), and think of the neural network as finding the best linear combination of these, with a preference for using the earlier eigenmodes.

Prerequisites: Linear Algebra, Gaussian Processes (GPs), to the level explained here, a source we highly recommend playing with to gain GP intuition.

Background

Notation

Let $f(\theta, x)$ denote the network architecture, a function that takes in the parameters and a network input, and outputs a predicted label. $\theta$ denotes the parameters, $x$ is the input to the network, and $f(\theta, x)$ is the output of the network.[3] We will use $X$ to denote the network inputs, and $Y$ to denote the network outputs, and denote $f(\theta, X)$ to be the network evaluated on each $x$ in $X$. $\phi$ denotes a feature map associated with a kernel.

We will use L2 loss: $L(\theta) = \frac{1}{2}\lVert f(\theta, X) - Y \rVert^2$.

Kernel Methods

Kernel methods were extensively studied in the pre-deep-learning era, and it turns out that we can use insights from this area to better understand neural networks.

A kernel is a function that tells you how a priori similar any two data points are.

There are three intuitive ways I think about kernel methods:

  1. A kernel method predicts a label by taking a weighted average of the labels of nearby data points, weighted by how close the kernel thinks the data points are.

  2. From a Bayesian point of view, a kernel gives you the a priori covariance between data points, which we can use as a prior to do Bayesian inference.

  3. A kernel method transforms data into a fixed feature space, then does linear regression on the data points in that space (with some prior over the linear regression parameters).
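To make intuition (1) concrete, here is a minimal Python/JAX sketch of a kernel-weighted-average predictor. The RBF kernel and the toy sine data are illustrative choices of ours, not anything specific to the NTK:

```python
# A toy version of intuition (1): predict by averaging training labels,
# weighted by kernel similarity to the query point (Nadaraya-Watson style).
import jax.numpy as jnp

def rbf_kernel(x1, x2, lengthscale=1.0):
    # k(x1, x2) = exp(-(x1 - x2)^2 / (2 * lengthscale^2)), for scalar inputs
    return jnp.exp(-((x1 - x2) ** 2) / (2 * lengthscale ** 2))

def weighted_average_predict(x_query, x_train, y_train):
    weights = rbf_kernel(x_query, x_train)          # similarity to each training point
    return jnp.sum(weights * y_train) / jnp.sum(weights)

x_train = jnp.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = jnp.sin(x_train)                          # toy labels
print(weighted_average_predict(0.5, x_train, y_train))  # roughly sin(0.5)
```

(The exact kernel regression formula derived in the next section weights the labels by $K^{-1}$ rather than by a simple normalized sum, but the intuition is the same.)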

Kernel Linear Regression

Classical linear regression works as follows: you want to find a parameter vector $\theta$ to predict the data labels $Y$ from the data $X$: you want $X\theta = Y$.[4] It’s pretty easy to just solve for the $\theta$ that is closest:

$$\theta = (X^\top X)^{-1} X^\top Y$$

In order to get a prediction for a new input $x'$, now that you have $\theta$, you can simply multiply by $\theta$:

$$\hat{y}(x') = x'^\top \theta$$

Kernel regression generalizes linear regression: instead of fitting a linear predictor directly on the input space, we pick a kernel function $k(x, x') = \phi(x)^\top \phi(x')$ that picks out features of the input space, and then do linear regression in the (typically higher-dimensional) feature space. We also assume that the parameters learned by the linear regression, $\theta$, can be expressed as $\theta = \sum_i \alpha_i\, \phi(x_i)$, i.e., that it is a linear combination of the features extracted from the training points by some feature function $\phi$.[5] So:

$$\hat{y}(x) = \theta^\top \phi(x) = \sum_i \alpha_i\, \phi(x_i)^\top \phi(x) = \sum_i \alpha_i\, k(x_i, x)$$

Solving for $\alpha$ (using the requirement that the training labels are fit exactly, $K\alpha = Y$, where $K_{ij} = k(x_i, x_j)$) gives:

$$\alpha = K^{-1} Y$$

Now, we can substitute $\alpha$ back in:

$$\hat{y}(x') = k(x', X)\, K^{-1}\, Y$$

where $k(x', X)$ is the row vector whose $i$-th entry is $k(x', x_i)$.

This is the equation for kernel regression, where we can understand the first two terms as being a weighted similarity vector of our test data point to each of the training data points, which is dotted with the training labels.
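As a quick sanity check of the formula above, here is a minimal Python/JAX sketch of kernel regression with an explicit feature map, so that $k(x, x') = \phi(x)^\top \phi(x')$ by construction. The polynomial feature map and the small jitter term (added for numerical stability) are our own illustrative choices:

```python
import jax.numpy as jnp

def phi(x):
    # An explicit feature map (cubic polynomial features); any feature map works here.
    return jnp.stack([jnp.ones_like(x), x, x ** 2, x ** 3], axis=-1)

def kernel(x1, x2):
    # k(x, x') = phi(x) . phi(x')
    return phi(x1) @ phi(x2).T

def kernel_regression_predict(x_test, x_train, y_train, jitter=1e-8):
    K = kernel(x_train, x_train)                 # (n, n) kernel matrix on training data
    k_test = kernel(x_test, x_train)             # (m, n) test-to-train similarities
    alpha = jnp.linalg.solve(K + jitter * jnp.eye(len(x_train)), y_train)  # alpha = K^{-1} Y
    return k_test @ alpha                        # y_hat(x') = k(x', X) K^{-1} Y

x_train = jnp.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = jnp.sin(x_train)
print(kernel_regression_predict(jnp.array([0.5, 1.5]), x_train, y_train))
```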

Neural Tangent Kernel

How are neural networks kernel methods?

Normally, you treat the neural network as $f(\theta, \cdot)$ with $\theta$ fixed, so you get simply a map from inputs to outputs. But there’s another way of thinking about it, which is as a parameter-to-function map $\theta \mapsto f(\theta, x)$, given a fixed input $x$. In particular, we can do a Taylor expansion of the parameter-to-function map around $\theta_0$, the initialization:

$$f(\theta, x) \approx f(\theta_0, x) + \nabla_\theta f(\theta_0, x)^\top (\theta - \theta_0)$$

The error of this approximation is $O(\lVert \theta - \theta_0 \rVert^2)$, so the less the parameters are updated during training, the better this approximation is. One of the key results behind NTK research is that as the width of a network increases toward infinity, the parameters change less during training.

But for a moment, let’s keep the width finite. At finite width, we can make a new learning algorithm called a “linearized neural network”, which is described by this equation:

$$f^{\mathrm{lin}}(\theta, x) := f(\theta_0, x) + \nabla_\theta f(\theta_0, x)^\top (\theta - \theta_0)$$

This equation describes (almost) linear regression on a particular feature space, with feature map $\phi(x) = \nabla_\theta f(\theta_0, x)$:

$$f^{\mathrm{lin}}(\theta, x) - f(\theta_0, x) = \phi(x)^\top (\theta - \theta_0)$$

As we learned above in the Kernel Linear Regression section, linear regression on a feature space is equivalent to Kernel Regression with $k(x, x') = \phi(x)^\top \phi(x')$!

Hence, training $f^{\mathrm{lin}}$ is equivalent to doing:

$$f^{\mathrm{lin}}(x') = f(\theta_0, x') + k(x', X)\, K^{-1}\, \big(Y - f(\theta_0, X)\big)$$

where $k(x, x') = \nabla_\theta f(\theta_0, x)^\top \nabla_\theta f(\theta_0, x')$ and $K_{ij} = k(x_i, x_j)$.

The only major insight left is that taking the width to infinity implies $\lVert \theta_t - \theta_0 \rVert \to 0$ during training, which means $f \approx f^{\mathrm{lin}}$, so the original network is also (approximately) doing kernel regression. This is non-trivial to prove and depends on the initialization distribution.[6]
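Here is a minimal JAX sketch of the linearized network $f^{\mathrm{lin}}$ for a toy 2-layer ReLU network. The architecture, width, and $1/\sqrt{\text{fan-in}}$ scaling are illustrative assumptions of ours, not anything prescribed by the theory:

```python
import jax
import jax.numpy as jnp

def init_params(key, width=512, in_dim=1):
    k1, k2 = jax.random.split(key)
    # Standard-normal weights; the 1/sqrt(fan-in) scaling happens in the forward pass.
    return (jax.random.normal(k1, (width, in_dim)), jax.random.normal(k2, (width,)))

def f(params, x):
    # A toy 2-layer ReLU network with scalar output.
    W1, W2 = params
    h = jax.nn.relu(W1 @ x) / jnp.sqrt(W1.shape[1])
    return W2 @ h / jnp.sqrt(W2.shape[0])

def f_lin(params, params0, x):
    # First-order Taylor expansion of f around params0, computed with a JVP.
    delta = jax.tree_util.tree_map(lambda a, b: a - b, params, params0)
    f0, tangent = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return f0 + tangent

params0 = init_params(jax.random.PRNGKey(0))
params = jax.tree_util.tree_map(lambda p: p + 1e-3, params0)  # a small parameter update
x = jnp.array([0.7])
print(f(params, x), f_lin(params, params0, x))  # nearly identical for small updates
```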

The NTK function

This kernel regression motivates the definition of the NTK function:

$$k_{\mathrm{NTK}}(x, x') = \nabla_\theta f(\theta_0, x)^\top \nabla_\theta f(\theta_0, x')$$

We can think of the NTK function as telling you the ‘similarity’ of two given data points according to the feature map $\phi(x) = \nabla_\theta f(\theta_0, x)$ at initialization. This is not to be confused with the NTK matrix: the matrix whose $(i, j)$-th component is $k_{\mathrm{NTK}}(x_i, x_j)$, for a set of input datapoints $X$.
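For concreteness, here is one way to compute the finite-width (empirical) NTK matrix in JAX, by stacking the flattened parameter-gradients of the network output at each input and taking their inner products. The toy architecture is again an illustrative assumption:

```python
import jax
import jax.numpy as jnp

def f(params, x):
    # Same toy 2-layer ReLU network as above, with scalar output.
    W1, W2 = params
    return W2 @ jax.nn.relu(W1 @ x / jnp.sqrt(W1.shape[1])) / jnp.sqrt(W2.shape[0])

def empirical_ntk(params, xs):
    # H[i, j] = grad_theta f(theta0, x_i) . grad_theta f(theta0, x_j)
    def flat_grad(x):
        grads = jax.grad(f)(params, x)  # gradient of the scalar output w.r.t. all params
        return jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(grads)])
    J = jnp.stack([flat_grad(x) for x in xs])   # (n_points, n_params)
    return J @ J.T                              # (n_points, n_points) NTK matrix

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
width = 256
params0 = (jax.random.normal(k1, (width, 1)), jax.random.normal(k2, (width,)))
xs = jnp.linspace(-3.0, 3.0, 5).reshape(-1, 1)  # five 1-D inputs
H = empirical_ntk(params0, xs)
print(H.shape)  # (5, 5)
```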

The last result that we need to know is that the NTK stops depending on $\theta_0$ when the width is infinite. We won’t prove this here, but the sketch is that if we expand out $k_{\mathrm{NTK}}$ for a particular neural network architecture, it ends up having a lot of sums over the weights. When the width is infinite, these sums become expectations.

The NTK in the infinite-width limit can be written out analytically, e.g. the one for a 2-layer ReLU network is:[7]

$$k_{\mathrm{NTK}}(x, x') = \lVert x \rVert\, \lVert x' \rVert\, \kappa\!\left(\frac{x^\top x'}{\lVert x \rVert\, \lVert x' \rVert}\right)$$

where:

$$\kappa(u) = u\,\kappa_0(u) + \kappa_1(u), \qquad \kappa_0(u) = \frac{1}{\pi}\big(\pi - \arccos(u)\big), \qquad \kappa_1(u) = \frac{1}{\pi}\Big(u\,\big(\pi - \arccos(u)\big) + \sqrt{1 - u^2}\Big)$$
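For concreteness, here is that formula as code. As with footnote 7, we haven’t carefully checked the normalization, so treat this as a sketch rather than a verified implementation:

```python
import jax.numpy as jnp

def relu_ntk(x1, x2):
    # Infinite-width NTK of a 2-layer ReLU network, per the formula above.
    # The exact normalization is an assumption; see footnote 7.
    n1, n2 = jnp.linalg.norm(x1), jnp.linalg.norm(x2)
    u = jnp.clip(x1 @ x2 / (n1 * n2), -1.0, 1.0)       # cosine similarity of the inputs
    kappa0 = (jnp.pi - jnp.arccos(u)) / jnp.pi
    kappa1 = (u * (jnp.pi - jnp.arccos(u)) + jnp.sqrt(1.0 - u ** 2)) / jnp.pi
    return n1 * n2 * (u * kappa0 + kappa1)

print(relu_ntk(jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0])))  # orthogonal inputs
print(relu_ntk(jnp.array([1.0, 0.0]), jnp.array([1.0, 0.0])))  # identical inputs
```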

Prior over data

We can view any kernel method as giving us the posterior mean of a Gaussian Process (see the Marginalization and Conditioning section of this to see why, although beware that they are using different variable names, which is confusing[8]).

So we can think of the NTK as giving us a prior over the labels $Y$ of a given data distribution $X$, specifically:

$$P(Y \mid X) = \frac{1}{\sqrt{(2\pi)^n \det H}} \exp\!\left(-\tfrac{1}{2}\, Y^\top H^{-1}\, Y\right)$$

Where $H$ is the NTK matrix, which depends on our dataset: $H_{ij} = k_{\mathrm{NTK}}(x_i, x_j)$. We will abbreviate the constant denominator as $Z$.
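One way to get a feel for this prior is to draw samples $Y \sim \mathcal{N}(0, H)$ on a grid of inputs; each sample is a labeling of the grid that the network considers a priori plausible. A minimal sketch, using the empirical NTK of a toy finite-width network as a stand-in (the architecture and grid are illustrative choices):

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, W2 = params
    return W2 @ jax.nn.relu(W1 @ x / jnp.sqrt(W1.shape[1])) / jnp.sqrt(W2.shape[0])

def empirical_ntk(params, xs):
    def flat_grad(x):
        g = jax.grad(f)(params, x)
        return jnp.concatenate([a.ravel() for a in jax.tree_util.tree_leaves(g)])
    J = jnp.stack([flat_grad(x) for x in xs])
    return J @ J.T

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(1), 3)
width = 256
params0 = (jax.random.normal(k1, (width, 1)), jax.random.normal(k2, (width,)))

xs = jnp.linspace(-3.0, 3.0, 50).reshape(-1, 1)
H = empirical_ntk(params0, xs)

# Draw three functions from the prior Y ~ N(0, H) over labelings of the grid points.
L = jnp.linalg.cholesky(H + 1e-6 * jnp.eye(len(xs)))
prior_draws = L @ jax.random.normal(k3, (len(xs), 3))
print(prior_draws.shape)  # (50, 3): three a-priori-plausible functions, one per column
```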

Kernel Eigenmodes

We can understand this prior via eigendecomposition. Since $H$ is a kernel matrix, it is symmetric positive semidefinite, and so the Spectral Theorem applies, allowing us to eigendecompose it into $H = Q \Lambda Q^\top$, where $\Lambda$ is a diagonal matrix of the eigenvalues, and $Q$ is the matrix whose columns are the eigenvectors of $H$.

Here, $q_1, \dots, q_n$ are the eigenvectors of $H$, with corresponding eigenvalues $\lambda_1 \geq \dots \geq \lambda_n \geq 0$.

Eigendecomposing gives $H^{-1} = Q \Lambda^{-1} Q^\top$, and the labels are sampled from a Gaussian Process, so the log probability density is:

$$\log P(Y \mid X) = -\frac{1}{2} \sum_i \frac{(q_i^\top Y)^2}{\lambda_i} - \log Z$$

We can think of $q_i^\top Y$ as the correlation between the dataset labels and each eigenvector. See footnote for the full derivation.[9]

This is an explicit prior for a neural network. We can predict how a neural network will generalize to certain test points $X'$, when trained on the training data points $X$ with labels $Y$. The way to do this is by computing the NTK matrix over $X$ and $X'$ together, then calculating the prior probability of $(Y, Y')$ for several different candidate test labels $Y'$. The version of the test labels that gives the highest prior probability is the generalization most likely to be chosen. This is analogous to conditioning this Gaussian on the training labels $Y$.
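Here is a minimal sketch of that procedure: compute the NTK matrix over the training and test inputs together, then score each candidate labeling $(Y, Y')$ under $\mathcal{N}(0, H)$. The toy network, the training data, and the two candidate generalizations are all illustrative assumptions:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import multivariate_normal

def f(params, x):
    W1, W2 = params
    return W2 @ jax.nn.relu(W1 @ x / jnp.sqrt(W1.shape[1])) / jnp.sqrt(W2.shape[0])

def empirical_ntk(params, xs):
    def flat_grad(x):
        g = jax.grad(f)(params, x)
        return jnp.concatenate([a.ravel() for a in jax.tree_util.tree_leaves(g)])
    J = jnp.stack([flat_grad(x) for x in xs])
    return J @ J.T

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
width = 256
params0 = (jax.random.normal(k1, (width, 1)), jax.random.normal(k2, (width,)))

# Toy training data (the identity function) and one held-out test input.
x_train = jnp.array([[-2.0], [-1.0], [0.0], [1.0]])
y_train = jnp.array([-2.0, -1.0, 0.0, 1.0])
x_test = jnp.array([[2.0]])
candidate_test_labels = [2.0, -2.0]  # two ways the network could generalize

H = empirical_ntk(params0, jnp.concatenate([x_train, x_test]))
H = H + 1e-6 * jnp.eye(H.shape[0])   # jitter for numerical stability
for y_test in candidate_test_labels:
    y_full = jnp.concatenate([y_train, jnp.array([y_test])])
    logp = multivariate_normal.logpdf(y_full, jnp.zeros(len(y_full)), H)
    print(y_test, logp)  # the labeling with higher log-prior is the favored generalization
```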

Visualizing Eigenvectors

We visualize these for a specific neural network below:

The first four eigenvectors of the NTK matrix for a 2-layer fully connected neural network, with training inputs ranging from −3 to 3.

When we get to the later eigenvectors, they turn out to all be sinusoidal. We can think of neural network training as finding a linear combination of these functions, with a preference for the functions with higher eigenvalues. It also learns the higher-eigenvalue functions earlier in training: each eigenmode is learned at a rate proportional to its eigenvalue (see the appendix for why this is true).
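The linked notebook below generates these figures properly; here is only a minimal sketch of the computation, using the empirical NTK of a toy 2-layer network on a 1-D grid (the architecture and grid are illustrative assumptions):

```python
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt

def f(params, x):
    W1, W2 = params
    return W2 @ jax.nn.relu(W1 @ x / jnp.sqrt(W1.shape[1])) / jnp.sqrt(W2.shape[0])

def empirical_ntk(params, xs):
    def flat_grad(x):
        g = jax.grad(f)(params, x)
        return jnp.concatenate([a.ravel() for a in jax.tree_util.tree_leaves(g)])
    J = jnp.stack([flat_grad(x) for x in xs])
    return J @ J.T

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
width = 512
params0 = (jax.random.normal(k1, (width, 1)), jax.random.normal(k2, (width,)))

xs = jnp.linspace(-3.0, 3.0, 100).reshape(-1, 1)
H = empirical_ntk(params0, xs)
eigvals, eigvecs = jnp.linalg.eigh(H)  # eigenvalues in ascending order

# Plot the four eigenvectors with the largest eigenvalues, as functions of the input.
for i in range(1, 5):
    plt.plot(xs.squeeze(), eigvecs[:, -i], label=f"eigenvector {i}")
plt.legend()
plt.show()
```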

Here is our Google Colab Notebook to generate these results.

Alignment relevance

So we have a mathematically precise notion of the simplicity prior! What does this tell us about alignment?

Unfortunately, not too much. The key problem is abstraction: it’s really hard for us to express abstract concepts like ‘is this network deceptive?’ in the language of the kernel eigenfunctions’ sine-wave decomposition. I am excited for future work to tackle this problem and use NTK theory to predict how neural networks will generalize. For example, could we prove something like “this neural network is very unlikely to learn an algorithm in the set of bounded tree search algorithms”?

We should be able to put any two data points into the kernel and get a measure of how similar they are. This should let us test, for example, whether a trained neural network is going to treat an image of a lion in the snow as more similar to a training image of a husky in the snow or to one of a lion in grass.

Appendix: Modeling training dynamics with gradient flow

A result that we thought was cool, but didn’t fit anywhere else in this post, is the proof that infinitely wide neural networks can always get to zero loss. Recall that we model the training dynamics of a neural network as having infinitely small step sizes, called a gradient flow. This allows us to model training as a differential equation, where we are continuously updating the parameters over time according to how they perform on the loss $L$:

$$\frac{d\theta_t}{dt} = -\nabla_\theta L(\theta_t)$$

Where we sometimes abbreviate $\theta_t$ as just $\theta$. This was modeling the gradient flow in parameter space, but what we really care about is the dynamics in function space. In other words, we care about the changes in the function $f(\theta_t, \cdot)$ as $t$ increases. Fortunately, we can simply compute this on the training set:

$$\frac{d f(\theta_t, X)}{dt} = \nabla_\theta f(\theta_t, X)\, \frac{d\theta_t}{dt} = -\nabla_\theta f(\theta_t, X)\, \nabla_\theta f(\theta_t, X)^\top \big(f(\theta_t, X) - Y\big)$$

We have now found a crucial quantity: $\nabla_\theta f(\theta_t, X)\, \nabla_\theta f(\theta_t, X)^\top$ is the NTK matrix. Let’s call it $H$. It turns out that in the limit of infinite width, this quantity is constant over time. Thus:

$$\frac{d f(\theta_t, X)}{dt} = -H \big(f(\theta_t, X) - Y\big)$$

$f(\theta_t, X) = Y$ is clearly an equilibrium of this ODE, because when this is satisfied, the RHS is $0$. We can explicitly solve this ODE by making the substitution $u(t) = f(\theta_t, X) - Y$, so:

$$\frac{du}{dt} = -H u$$

This is a well-known ODE, with solution given by:

$$u(t) = e^{-Ht}\, u(0), \quad \text{i.e.} \quad f(\theta_t, X) = Y + e^{-Ht}\big(f(\theta_0, X) - Y\big)$$

Since $H$ is positive definite (which holds in the infinite-width limit under mild conditions), $e^{-Ht} \to 0$ as $t \to \infty$, so the training predictions converge to the labels $Y$. This gives us a proof of global convergence on the training data. It also shows why eigenmodes with larger eigenvalues are learned faster: the component of the error along eigenvector $q_i$ decays as $e^{-\lambda_i t}$.
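As a quick numerical check of this solution, here is a sketch comparing the closed-form decay $u(t) = e^{-Ht} u(0)$ (computed via the eigendecomposition of $H$) against discretized gradient flow $u_{k+1} = u_k - \eta H u_k$. The random positive-definite $H$, the step size, and the number of steps are illustrative assumptions:

```python
import jax
import jax.numpy as jnp

# An arbitrary positive-definite stand-in for the NTK matrix H, and a random residual u(0).
key = jax.random.PRNGKey(0)
A = jax.random.normal(key, (5, 5))
H = A @ A.T + 0.1 * jnp.eye(5)
u0 = jnp.array([1.0, -2.0, 0.5, 3.0, -1.0])

# Closed-form solution: exp(-H t) = Q exp(-Lambda t) Q^T, applied to u(0).
eigvals, Q = jnp.linalg.eigh(H)
def u_exact(t):
    return Q @ (jnp.exp(-eigvals * t) * (Q.T @ u0))

# Discretized gradient flow: u_{k+1} = u_k - eta * H u_k.
eta, steps = 1e-3, 2000
u = u0
for _ in range(steps):
    u = u - eta * (H @ u)

print(u_exact(eta * steps))  # approximately equal to the discrete iterate below
print(u)
```

Note that each eigencomponent of the residual decays at its own rate $\lambda_i$, which is the appendix-level reason the higher-eigenvalue eigenmodes are learned first.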

  1. ^

The parity function is $f : \{0, 1\}^n \to \{0, 1\}$, and returns $1$ if and only if the input has an odd number of ones.

  2. ^
  3. ^

We will assume 1-dimensional outputs, because it makes the math much more manageable: the NTK becomes a 4-tensor when the output dimension is more than 1.

  4. ^

    Assume for simplicity that there is no noise and we can perfectly fit the data with a linear function.

  5. ^

    There is a theorem (the Representer theorem) which says that the loss minimizing hypothesis (within the space of functions associated with this kernel) has a representation of this form.

  6. ^

    For the actual proof, start at the bottom of p14 of the original paper and work backwards. There’s a simplified version here in the One hidden Layer Network proof.

  7. ^

I haven’t checked all of the derivation of this; I got it from On the Inductive Bias of Neural Tangent Kernels, whose authors seem to have gotten it by combining the NTK definition in Appendix A of this with analytical evaluation of the integrals from here.

  8. ^

See the equations for conditioning a Gaussian, and assume that the prior means (of both the training and test outputs) are $0$. Then, translating back into our variables, we get:

$$\mathbb{E}[Y' \mid Y] = k(X', X)\, H^{-1}\, Y$$

  9. ^

    The prior is Gaussian, so $\log P(Y \mid X) = -\tfrac{1}{2} Y^\top H^{-1} Y - \log Z$. Substituting $H^{-1} = Q \Lambda^{-1} Q^\top$ gives $-\tfrac{1}{2} Y^\top Q \Lambda^{-1} Q^\top Y - \log Z = -\tfrac{1}{2} \sum_i \frac{(q_i^\top Y)^2}{\lambda_i} - \log Z$.
