Some Notes on the Mathematics of Toy Autoencoding Problems

Anthropic’s recent mechanistic interpretability paper, Toy Models of Superposition, helps to demonstrate the conceptual richness of very small feedforward neural networks. Even when such a network is trained on synthetic, hand-coded data to reconstruct a very straightforward function (the identity map), there appears to be non-trivial mathematics at play, and the analysis of these small networks seems to provide an interesting playground for mechanistic interpretability.

While trying to understand their work and train my own toy models, I ended up making various notes on the underlying mathematics. This post is a slightly neatened-up version of those notes, but is still quite rough and un-edited and is a far-from-optimal presentation of the material. In particular, these notes may contain errors, which are my responsibility.

1. Directly Analyzing the Critical Points of a Linear Toy Model

Throughout we will be considering feedforward neural networks with one hidden layer. The input and output layers will be of the same size and the hidden layer is smaller. We will only be considering the autoencoding problem, which means that our networks are being trained to reconstruct the data. The first couple of subsections here are largely taken from the Appendix to the paper “Neural networks and principal component analysis: Learning from examples without local minima.” by Pierre Baldi and Kurt Hornik. (Neural networks 2.1 (1989): 53-58).

Consider to begin with a completely linear model, i.e. one without any activation functions or biases. Suppose the input and output layers have $D$ neurons and that the middle layer has $d < D$ neurons. This means that the function that the model is implementing is of the form $x \mapsto A B x$, where $x \in \mathbb{R}^{D}$, $B$ is a $d \times D$ matrix, and $A$ is a $D \times d$ matrix. That is, the matrix $B$ contains the weights of the connections between the input layer and the hidden layer, and the matrix $A$ contains the weights of the connections between the hidden layer and the output layer. It is important to realise that even though the function being implemented is linear for any given set of weights, the mathematics of this model and the dynamics of the training are not completely linear: the loss is a polynomial of degree four in the entries of $A$ and $B$.

The error on a given input $x$ will be measured by $\| x - A B x \|^{2}$ and, on the data set $\{x^{t}\}_{t = 1}^{T}$, the total loss is

$$L = L (A, B, \{x^{t}\}_{t = 1}^{T}) := \sum_{t = 1}^{T} \| x^{t} - A B x^{t} \|^{2} = \sum_{t = 1}^{T} \sum_{i = 1}^{D} \Big( x_{i}^{t} - \sum_{j = 1}^{d} \sum_{k = 1}^{D} a_{i j} b_{j k} x_{k}^{t} \Big)^{2} .$$

Define $Σ$ to be the matrix whose $(i, j)^{\text{th}}$ entry $σ_{i j}$ is given by

$$σ_{i j} = \sum_{t = 1}^{T} x_{i}^{t} x_{j}^{t} .$$

Clearly this matrix is symmetric.
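As a sanity check on these definitions, here is a minimal NumPy sketch. The sizes, the random data and the names `loss_entrywise` and `loss_matrix` are my own illustrative choices, not anything taken from the papers. It stores the data as a $T \times D$ array whose rows are the $x^{t}$, and checks that the loss written as an explicit sum agrees with the expression $\operatorname{tr} ((I - A B) Σ (I - A B)^{T})$, which shows that the loss depends on the data only through $Σ$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, d = 50, 6, 3                      # illustrative sizes (assumptions)
X = rng.uniform(size=(T, D))            # rows are the data points x^t
A = rng.normal(size=(D, d))             # hidden -> output weights
B = rng.normal(size=(d, D))             # input -> hidden weights

# sigma_{ij} = sum_t x_i^t x_j^t, i.e. Sigma = X^T X
Sigma = X.T @ X

def loss_entrywise(A, B, X):
    """Loss written as the explicit sum over t, i and (j, k)."""
    total = 0.0
    for x in X:
        for i in range(A.shape[0]):
            recon_i = sum(A[i, j] * B[j, k] * x[k]
                          for j in range(A.shape[1])
                          for k in range(B.shape[1]))
            total += (x[i] - recon_i) ** 2
    return total

def loss_matrix(A, B, Sigma):
    """The same loss expressed through Sigma: tr((I - AB) Sigma (I - AB)^T)."""
    R = np.eye(Sigma.shape[0]) - A @ B
    return np.trace(R @ Sigma @ R.T)

print("Sigma symmetric:", np.allclose(Sigma, Sigma.T))
print("two loss formulas agree:",
      np.isclose(loss_entrywise(A, B, X), loss_matrix(A, B, Sigma)))
```

The matrix form is the one reused in the later sketches, since it avoids looping over the data.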

Assumption. We will assume that the data is such that a) $Σ$ is invertible and b) $Σ$ has distinct eigenvalues.

Let $λ_{1} > \dots > λ_{D}$ be the eigenvalues of $Σ$ .

1.1 The Global Minimum

Proposition 1. (Characterization of Critical Points) Fix the dataset and consider $L$ to be a function of the two matrix variables $A$ and $B$. For any critical point $(A, B)$ of $L$ at which $A$ has full rank (an assumption that is used in the proof below), there is a subset $I \subset \{1, \dots, D\}$ of size $d$ for which

$A B$ is an orthogonal projection onto a $d$-dimensional subspace spanned by orthonormal eigenvectors of $Σ$ corresponding to the eigenvalues $\{λ_{i}\}_{i \in I}$; and
$L (A, B) = \operatorname{tr} Σ - \sum_{i \in I} λ_{i} = \sum_{i \notin I} λ_{i}$.

Corollary 2. (Characterization of the Minimum) The loss has a unique minimum value that is attained when $I = \{1, \dots, d\}$, which corresponds to the situation when $A B$ is an orthogonal projection onto the $d$-dimensional subspace spanned by the eigendirections of $Σ$ that have the largest eigenvalues.

Remarks. We won’t try to spell out all of the various connections to other closely related things, but for those who want some more keywords to go away and investigate further, we just remark that the minimization problem being studied here is about finding a low-rank approximation to the identity map on the data and is closely related to Principal Component Analysis. See also the Eckart–Young–Mirsky Theorem.

We begin by directly differentiating $L$ with respect to the entries of $A$ and $B$. Using the summation convention on repeated indices, we first take the derivative with respect to $b_{j' k'}$:

$$\begin{aligned} \frac{\partial L}{\partial b_{j' k'}} &= \sum_{t = 1}^{T} \sum_{i = 1}^{D} - 2 \left( x_{i}^{t} - a_{i j} b_{j k} x_{k}^{t} \right) a_{i l} δ_{l j'} δ_{q k'} x_{q}^{t} \\ &= - 2 \sum_{t = 1}^{T} \left( x_{i}^{t} a_{i j'} x_{k'}^{t} - a_{i j} b_{j k} x_{k}^{t} a_{i j'} x_{k'}^{t} \right) . \end{aligned}$$

Setting this equal to zero and interpreting this equation for all $j' = 1, \dots, d$ and $k' = 1, \dots, D$ gives us that

$$A^{T} Σ = A^{T} A B Σ . \tag{1}$$

Then, separately, we differentiate $L$ with respect to $a_{i' j'}$:

$$\begin{aligned} \frac{\partial L}{\partial a_{i' j'}} &= \sum_{t = 1}^{T} \sum_{i = 1}^{D} - 2 \left( x_{i}^{t} - a_{i j} b_{j k} x_{k}^{t} \right) δ_{i i'} δ_{p j'} b_{p q} x_{q}^{t} \\ &= - 2 \sum_{t = 1}^{T} \left( x_{i'}^{t} b_{j' q} x_{q}^{t} - a_{i' j} b_{j k} x_{k}^{t} b_{j' q} x_{q}^{t} \right) . \end{aligned}$$

Setting this equation equal to zero for every $i' = 1, \dots, D$ and $j' = 1, \dots, d$ we have that:

$$Σ B^{T} = A B Σ B^{T} . \tag{2}$$

Thus

$$\nabla L (A, B) = 0 \iff \begin{cases} A^{T} Σ = A^{T} A B Σ, \\ Σ B^{T} = A B Σ B^{T} . \end{cases} \tag{3}$$
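The index manipulations leading to (1) and (2) are easy to get wrong, so here is a small finite-difference check of the matrix form of the two derivatives, $\partial L / \partial B = - 2 (A^{T} Σ - A^{T} A B Σ)$ and $\partial L / \partial A = - 2 (Σ B^{T} - A B Σ B^{T})$. It is only a sketch: the sizes, the Gaussian test data and the helper `numerical_grad` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, d = 40, 5, 2                      # illustrative sizes (assumptions)
X = rng.normal(size=(T, D))             # rows are the data points x^t
Sigma = X.T @ X
A = rng.normal(size=(D, d))
B = rng.normal(size=(d, D))

def loss(A, B):
    R = X - X @ (A @ B).T               # rows are x^t - A B x^t
    return np.sum(R ** 2)

def grads(A, B):
    """Matrix form of the derivatives computed in the text."""
    gB = -2 * (A.T @ Sigma - A.T @ A @ B @ Sigma)   # dL/dB, shape (d, D)
    gA = -2 * (Sigma @ B.T - A @ B @ Sigma @ B.T)   # dL/dA, shape (D, d)
    return gA, gB

def numerical_grad(f, M, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        G[idx] = (f(Mp) - f(Mm)) / (2 * eps)
    return G

gA, gB = grads(A, B)
gA_num = numerical_grad(lambda M: loss(M, B), A)
gB_num = numerical_grad(lambda M: loss(A, M), B)
print("max error in dL/dA:", np.max(np.abs(gA - gA_num)))
print("max error in dL/dB:", np.max(np.abs(gB - gB_num)))
```

Both errors should be tiny compared with the size of the gradient entries; the remaining discrepancy is finite-difference and round-off error.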

Since we have assumed that $Σ$ is invertible, the first equation in (3) immediately implies that $A^{T} = A^{T} A B$. If we assume in addition that $A$ has full rank (a reasonable assumption in any case of practical interest), then $A^{T} A$ is invertible and we have that

$$(A^{T} A)^{- 1} A^{T} = B, \tag{4}$$

which in turn implies that

$$A B = A (A^{T} A)^{- 1} A^{T} = P_{A}, \tag{5}$$

where we have written $P_{A}$ to denote the orthogonal projection onto the column space of $A$.

Claim. We next claim that $Σ$ commutes with $P_{A}$ .

Proof of claim. Plugging (5) into the second equation of (3), we have:

$$Σ B^{T} = P_{A} Σ B^{T} . \tag{6}$$

Then, right-multiply by $A^{T}$ and use the fact that $B^{T} A^{T} = (A B)^{T} = P_{A}^{T} = P_{A}$ to get:

$$Σ P_{A} = P_{A} Σ P_{A} . \tag{7}$$

The right-hand side is manifestly a symmetric matrix, so we deduce that $Σ P_{A}$ is symmetric. If the product of two symmetric matrices is symmetric then they commute, so this indeed shows that $Σ$ commutes with $P_{A}$ and completes the proof of the claim.

Now let $U$ be the orthogonal matrix which diagonalizes $Σ$ , i.e. the matrix for which

$$Σ = U Λ U^{T}, \tag{8}$$

where $Λ$ is a diagonal matrix with entries $λ_{1} > λ_{2} > \dots > λ_{D} > 0$ .

Claim. We next claim that $P_{A} = U P_{U^{T} A} U^{T}$ and that $P_{U^{T} A}$ is diagonal.

Proof of Claim. Firstly, using the standard formula for orthogonal projections, we have

$$P_{U^{T} A} = U^{T} A (A^{T} U U^{T} A)^{- 1} A^{T} U = U^{T} A (A^{T} A)^{- 1} A^{T} U = U^{T} P_{A} U,$$

which implies that

$$P_{A} = U P_{U^{T} A} U^{T} . \tag{9}$$

To show that $P_{U^{T} A}$ is diagonal, we show that it commutes with the diagonal matrix $Λ$ (a matrix that commutes with a diagonal matrix whose diagonal entries are distinct must itself be diagonal; this is where part b) of the Assumption is used). Starting from $P_{U^{T} A} Λ$, we first insert the identity matrix in the form $U^{T} U$, and then use (8) and (9) thus:

$$P_{U^{T} A} Λ = U^{T} U P_{U^{T} A} U^{T} U Λ U^{T} U = U^{T} P_{A} Σ U .$$

Then recall that we have already established that $P_{A}$ commutes with $Σ$. So we can swap them and then perform the same trick in reverse:

$$U^{T} P_{A} Σ U = U^{T} Σ P_{A} U = U^{T} U Λ U^{T} U P_{U^{T} A} U^{T} U = Λ P_{U^{T} A} .$$

This shows that $P_{U^{T} A}$ commutes with $Λ$ and completes the proof of the claim.

So, given that $P_{U^{T} A}$ is an orthogonal projection of rank $d$ and is diagonal, there exists a set of indices $I = \{i_{1}, \dots, i_{d}\}$ with $1 \leq i_{1} < i_{2} < \dots < i_{d} \leq D$ such that the $(i, j)^{\text{th}}$ entry of $P_{U^{T} A}$ is 1 if $i = j$ and $i \in I$, and is zero otherwise. And since $P_{A} = U P_{U^{T} A} U^{T}$, we see that

$$P_{A} = U_{I} U_{I}^{T}, \tag{10}$$

where $U_{I}$ is formed from $U$ by simply setting to zero the $j^{\text{th}}$ column if $j \notin I$. This is manifestly an orthogonal projection onto the span of $\{u_{i_{1}}, \dots, u_{i_{d}}\}$, where $u_{1}, u_{2}, \dots, u_{D}$ is an orthonormal basis of eigenvectors of $Σ$ (and indeed these are the columns of $U$). Combining these observations with (5), we have that

$$A B = P_{A} = U_{I} U_{I}^{T} = P_{U_{I}} . \tag{11}$$

This proves the first claim of the proposition.

To prove the second part, write $A B = [p_{i j}]$ and compute thus:

$$\begin{aligned} \sum_{t = 1}^{T} \| x^{t} - A B x^{t} \|^{2} &= \sum_{t = 1}^{T} \sum_{i = 1}^{D} \left( x_{i}^{t} - p_{i j} x_{j}^{t} \right)^{2} \\ &= \sum_{t = 1}^{T} \left( x_{i}^{t} x_{i}^{t} - 2 x_{i}^{t} p_{i j} x_{j}^{t} + p_{i j} x_{j}^{t} p_{i k} x_{k}^{t} \right) \\ &= \operatorname{tr} Σ - 2 \operatorname{tr} (P_{U_{I}} Σ) + \operatorname{tr} (P_{U_{I}} Σ P_{U_{I}}^{T}) . \end{aligned} \tag{12}$$

But we know from (7) and (11) that $P_{U_{I}} Σ P_{U_{I}} = Σ P_{U_{I}}$ and so this last line is actually just equal to

$$\operatorname{tr} Σ - \operatorname{tr} (P_{U_{I}} Σ) .$$

Focussing on the second term and using (11), then (9) and (8), then cancelling $U^{T} U = I$, and then (to reach the last expression) cyclically permuting the matrices inside the trace to produce another $U^{T} U$ cancellation, we have:

$$\operatorname{tr} (P_{U_{I}} Σ) = \operatorname{tr} (P_{A} Σ) = \operatorname{tr} (U P_{U^{T} A} U^{T} U Λ U^{T}) = \operatorname{tr} (U P_{U^{T} A} Λ U^{T}) = \operatorname{tr} (P_{U^{T} A} Λ) .$$

The diagonal form of $P_{U^{T} A}$ means that this final expression is equal to $\sum_{i \in I} λ_{i}$, meaning that

$$\sum_{t = 1}^{T} \| x^{t} - A B x^{t} \|^{2} = \operatorname{tr} Σ - \sum_{i \in I} λ_{i} .$$

Since $\operatorname{tr} Σ = \sum_{i = 1}^{D} λ_{i}$ (the trace is always equal to the sum of the eigenvalues), this completes the proof of the proposition. $□$

Remarks. Equation (10) above tells us that $\operatorname{col} (A) = \operatorname{span} \{ u_{i_{1}}, \dots, u_{i_{d}} \}$. From here on it is convenient to regard $U_{I}$ as the $D \times d$ matrix obtained by discarding its zero columns, so that $U_{I}^{T} U_{I}$ is the $d \times d$ identity while $U_{I} U_{I}^{T}$ is unchanged. Then there exists an invertible $d \times d$ matrix $C$ with $A = U_{I} C$ and, using (4), we compute that

$$B = (A^{T} A)^{- 1} A^{T} = (C^{T} U_{I}^{T} U_{I} C)^{- 1} (U_{I} C)^{T} = C^{- 1} (C^{T})^{- 1} C^{T} U_{I}^{T} = C^{- 1} U_{I}^{T} .$$

So we have:

$$\begin{cases} A = U_{I} C, \\ B = C^{- 1} U_{I}^{T} . \end{cases} \tag{$\star$}$$
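The description of the critical points as $A = U_{I} C$, $B = C^{- 1} U_{I}^{T}$ can be checked numerically. The sketch below is mine and rests on a few assumptions: Gaussian test data with unequal column scales (so that the eigenvalues of $Σ$ are distinct with probability one), an arbitrarily chosen invertible $C$, and 0-based index sets. For several choices of $I$ it evaluates the two expressions in (3) and compares the loss with $\sum_{i \notin I} λ_{i}$.

```python
import numpy as np

rng = np.random.default_rng(2)
T, D, d = 200, 6, 3
X = rng.normal(size=(T, D)) * np.arange(1, D + 1)   # unequal scales per coordinate
Sigma = X.T @ X

# eigh returns eigenvalues in ascending order; reverse so lambda_1 > ... > lambda_D
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

# An arbitrary fixed invertible d x d matrix (its determinant is 7)
C = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])

def critical_point(I):
    U_I = U[:, I]                                   # D x d matrix of selected eigenvectors
    return U_I @ C, np.linalg.inv(C) @ U_I.T

def loss(A, B):
    R = np.eye(D) - A @ B
    return np.trace(R @ Sigma @ R.T)

def grad_size(A, B):
    """Largest entry (in absolute value) of the two expressions in (3)."""
    gB = A.T @ Sigma - A.T @ A @ B @ Sigma
    gA = Sigma @ B.T - A @ B @ Sigma @ B.T
    return max(np.abs(gA).max(), np.abs(gB).max())

for I in ([0, 1, 2], [0, 2, 5], [3, 4, 5]):         # 0-based index sets
    A, B = critical_point(I)
    predicted = lam.sum() - lam[I].sum()
    print(I, "relative gradient size:", grad_size(A, B) / lam[0],
          "loss:", loss(A, B), "predicted:", predicted)
```

The first index set is the global minimum of Corollary 2; the other two are critical points with strictly larger loss.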

1.2 Characterizing Other Critical Points

This subsection is something of an aside, but it is included for completeness.

Proposition 3. (Other Critical Points are Saddle Points.) Fix the dataset and consider $L$ to be a function of the two matrix variables $A$ and $B$. Every critical point of the kind described in Proposition 1 that does not attain the minimum value is a saddle point, i.e. if $(A, B)$ is such a critical point, then there exist $\tilde{A}$ and $\tilde{B}$ which are arbitrarily close to $A$ and $B$ respectively and at which a lower loss is achieved.

Proof. Since $(A, B)$ does not attain the global minimum, we know from Corollary 2 that $I \neq \{1, \dots, d\}$. This means that there are distinct indices $j$ and $k$ for which $j \in I$, $k \notin I$ and $k < j$. In particular, bear in mind that $λ_{k} > λ_{j}$.

Now, given any $ϵ > 0$ , put

$$\tilde{u}_{j} := \frac{u_{j} + ϵ u_{k}}{\sqrt{1 + ϵ^{2}}} .$$

And let us form the new matrix $\tilde{U}_{I}$ by starting with $U_{I}$ and replacing the column $u_{j}$ with $\tilde{u}_{j}$. Write

$$\tilde{A} = \tilde{U}_{I} C, \tag{13}$$

$$\tilde{B} = C^{- 1} \tilde{U}_{I}^{T} . \tag{14}$$

We want to calculate the loss of the model at $(\tilde{A}, \tilde{B})$. We ought to bear in mind that it is not a critical point, so we cannot assume the intermediate results in the proof of Proposition 1, but it turns out that the bits that are most useful for this computation rely only on algebra and (13), (14). We start from the equivalent of line (11), which is that $\tilde{A} \tilde{B} = \tilde{U}_{I} \tilde{U}_{I}^{T} = P_{\tilde{U}_{I}}$, which implies that $\tilde{A} \tilde{B} = P_{\tilde{A}}$. And so just as in (12) above, we have

$$\sum_{t = 1}^{T} \| x^{t} - \tilde{A} \tilde{B} x^{t} \|^{2} = \operatorname{tr} Σ - 2 \operatorname{tr} (P_{\tilde{U}_{I}} Σ) + \operatorname{tr} (P_{\tilde{U}_{I}} Σ P_{\tilde{U}_{I}}^{T}) . \tag{15}$$

Now, looking at the final term on the right-hand side, we have $P_{\tilde{U}_{I}} Σ P_{\tilde{U}_{I}}^{T} = P_{\tilde{A}} Σ P_{\tilde{A}}$ and (by cyclic permutation, together with $P_{\tilde{A}}^{2} = P_{\tilde{A}}$) $\operatorname{tr} (P_{\tilde{A}} Σ P_{\tilde{A}}) = \operatorname{tr} (P_{\tilde{A}} Σ)$. And since

$$P_{U^{T} \tilde{A}} = U^{T} \tilde{A} (\tilde{A}^{T} U U^{T} \tilde{A})^{- 1} \tilde{A}^{T} U = U^{T} P_{\tilde{A}} U,$$

we have:

$$\operatorname{tr} (P_{\tilde{A}} Σ) = \operatorname{tr} (U P_{U^{T} \tilde{A}} U^{T} Σ) = \operatorname{tr} (P_{U^{T} \tilde{A}} Λ) . \tag{16}$$

We also use $\operatorname{tr} (P_{\tilde{U}_{I}} Σ) = \operatorname{tr} (P_{\tilde{A}} Σ)$ and (16) on the second term on the right-hand side of (15) to ultimately arrive at:

$$\sum_{t = 1}^{T} \| x^{t} - \tilde{A} \tilde{B} x^{t} \|^{2} = \operatorname{tr} Σ - \operatorname{tr} (P_{U^{T} \tilde{A}} Λ) . \tag{17}$$

So we are interested in computing the diagonal elements of $P_{U^{T} \tilde{A}}$. Fix $i \in \{1, \dots, D\}$. The $i^{\text{th}}$ diagonal entry is given by:

$$e_{i}^{T} P_{U^{T} \tilde{A}} e_{i} = e_{i}^{T} U^{T} P_{\tilde{A}} U e_{i} = u_{i}^{T} \tilde{U}_{I} \tilde{U}_{I}^{T} u_{i} .$$

This can be computed directly from the definition of $\tilde{U}_{I}$ to give that the $i^{\text{th}}$ entry on the diagonal is equal to

$$\begin{cases} 0 & \text{if } i \notin I \cup \{k\}, \\ 1 & \text{if } i \in I \setminus \{j\}, \\ 1 / (1 + ϵ^{2}) & \text{if } i = j, \\ ϵ^{2} / (1 + ϵ^{2}) & \text{if } i = k . \end{cases} \tag{18}$$

Therefore

$$\begin{aligned} \sum_{t = 1}^{T} \| x^{t} - \tilde{A} \tilde{B} x^{t} \|^{2} &= \operatorname{tr} Σ - \Big[ \sum_{i \in I \setminus \{j\}} λ_{i} + \frac{λ_{j}}{1 + ϵ^{2}} + \frac{ϵ^{2} λ_{k}}{1 + ϵ^{2}} \Big] \\ &= \operatorname{tr} Σ - \sum_{i \in I} λ_{i} - \frac{ϵ^{2}}{1 + ϵ^{2}} (λ_{k} - λ_{j}) \\ &= \sum_{t = 1}^{T} \| x^{t} - A B x^{t} \|^{2} - \frac{ϵ^{2}}{1 + ϵ^{2}} (λ_{k} - λ_{j}) . \end{aligned}$$

Since $λ_{k} > λ_{j}$, this shows that in an arbitrarily small neighbourhood of the critical point $(A, B)$ we can find a point $(\tilde{A}, \tilde{B})$ where a smaller loss is achieved. We will not bother doing so here, but one can also check that $(A, B)$ is not a local maximum by using the fact that for fixed (full rank) $A$, the function $z \mapsto \| x - A z \|^{2}$ is convex. $□$
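The size of the loss decrease in the proof can be confirmed numerically. In the following sketch the eigenvalues, the random eigenbasis, the index set and the choice $C = I$ are all illustrative assumptions, and indices are 0-based. It perturbs one column of $U_{I}$ exactly as in the definition of $\tilde{u}_{j}$ and compares the drop in loss with $\frac{ϵ^{2}}{1 + ϵ^{2}} (λ_{k} - λ_{j})$.

```python
import numpy as np

D = 5
lam = np.array([5.0, 4.0, 3.0, 2.0, 1.0])       # lambda_1 > ... > lambda_D (illustrative)
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(D, D)))    # a random orthogonal matrix of eigenvectors
Sigma = U @ np.diag(lam) @ U.T

def loss_of_projection(V):
    """Loss when AB = V V^T for a matrix V with orthonormal columns (i.e. C = I)."""
    R = np.eye(D) - V @ V.T
    return np.trace(R @ Sigma @ R.T)

I = [0, 3]      # a non-optimal index set (0-based): it contains j = 3 but not k = 1
j, k = 3, 1     # so that lam[k] > lam[j]
base = loss_of_projection(U[:, I])

def perturbed_loss(eps):
    V = U[:, I].copy()
    V[:, I.index(j)] = (U[:, j] + eps * U[:, k]) / np.sqrt(1 + eps ** 2)
    return loss_of_projection(V)

for eps in (0.5, 0.1, 0.01):
    predicted_drop = eps ** 2 / (1 + eps ** 2) * (lam[k] - lam[j])
    print(eps, "actual drop:", base - perturbed_loss(eps), "predicted:", predicted_drop)
```

The drop is positive for every $ϵ \neq 0$ and shrinks like $ϵ^{2}$, which is the behaviour one expects when moving away from a saddle along a descending direction.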

2. Sparse Data, Weight Tying, and Gradients

Abstractly analyzing critical points is not at all the same as training real models. In this section we start to think about data and the optimization process.

2.1 Sparse Synthetic Data

Here we describe the kind of training data used in Anthropic’s toy experiments.

Fix a number $S \in [0, 1]$ . This parameter is the sparsity of the data. We will typically be most interested in the case where $S$ is close to 1.

Let $\{B_{i}^{t} : t \geq 1, \ 1 \leq i \leq D\}$ be an independent and identically distributed family of Bernoulli random variables with parameter $(1 - S)$ (these are not to be confused with the weight matrix $B$ of Section 1). And let $\{U_{i}^{t} : t \geq 1, \ 1 \leq i \leq D\}$ be an IID family of $\text{Uniform} ([0, 1])$ random variables. Write $X_{i}^{t} = B_{i}^{t} U_{i}^{t}$ and $X^{t} := (X_{1}^{t}, \dots, X_{D}^{t})$. Our datasets $\{x^{t}\}_{t = 1}^{T} \subset \mathbb{R}^{D}$ will be drawn from the IID family $\{X^{t}\}_{t = 1}^{\infty}$. Notice that:

Independently, for each data point $x^{t} \in \mathbb{R}^{D}$ and for each $i \in \{1, \dots, D\}$, we will have $P (x_{i}^{t} = 0) = S$.
So, the expected number of non-zero entries for each data point is $(1 - S) D$ . To bring this in line with the way people say things like “ $k$ -sparse”, we can say that the data is, on average, $(1 - S) D$ -sparse.
$E (x_{i}^{t}) = \frac{1}{2} (1 - S)$ .
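The three observations above can be checked empirically with a short sampler. This is a minimal sketch: the function name `sample_sparse_data` and the particular values of $T$, $D$ and $S$ are my own choices, not taken from the paper's code.

```python
import numpy as np

def sample_sparse_data(T, D, S, rng):
    """Draw T points in R^D with X_i^t = B_i^t * U_i^t as described above."""
    active = rng.random(size=(T, D)) < (1 - S)     # Bernoulli(1 - S) mask
    values = rng.uniform(0.0, 1.0, size=(T, D))    # Uniform([0, 1]) magnitudes
    return active * values

rng = np.random.default_rng(4)
T, D, S = 100_000, 20, 0.9                         # illustrative values (assumptions)
X = sample_sparse_data(T, D, S, rng)

print("fraction of zero entries:", np.mean(X == 0.0), "expected:", S)
print("mean non-zeros per point:", np.mean(np.count_nonzero(X, axis=1)),
      "expected:", (1 - S) * D)
print("mean entry:", X.mean(), "expected:", 0.5 * (1 - S))
```

The empirical values should agree with the stated ones up to sampling error, which decreases as $T$ grows.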

Remark. Judging from some of the existing literature on the linear model that we analyze in Section 1 (e.g. “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.” by Andrew M. Saxe, James L. McClelland and Surya Ganguli), it seems like it’s tempting to make an assumption/simplification/approximation that $Σ = I$ . I still don’t feel like I understand how justifiable that is—for me this question is a potential ‘jumping-off’ point for further analysis of the whole problem. Recall that the matrix $Σ$ is equal to $\sum_{t = 1}^{T} x^{t} \otimes x^{t}$ . Certainly the probability that an off-diagonal entry of $x^{t} \otimes x^{t}$ is equal to zero is $1 - (1 - S)^{2}$ whereas for the diagonal entries it is just $S$ . And note that $E (x_{i}^{t} x_{j}^{t}) = \frac{1}{4} (1 - S)^{2}$ if $i \neq j$ and $E ((x_{i}^{t})^{2}) = \frac{1}{3} (1 - S)$ . But the diagonal entries are still independent and I’m not sure why thinking of them as equal makes sense.

The data (and the loss) are meant to model two main ideas. Firstly, the coordinate directions of the input space act as a natural set of features for the data. Secondly, when $S$ is close to 1, the sparsity of the data is supposed to capture the fact that features really do often tend to be sparse in real-world data, i.e. for any given object, or any given word or idea that appears in a language, most images don’t contain that object and most sentences don’t contain that word or idea.

2.2 Weight Tying and The Gradient Flow

In practice, when we train an autoencoder like this, we do so with weight tying. Roughly speaking, this means that we only consider the case where $A = B^{T}$ . Proposition 1 does indeed allow for a global minimum in which $A = B^{T}$ : This is achieved by essentially taking $C = I$ in the equations ( $⋆$ ) at the end of Section 1.1, i.e. we have:

$$A = U_{I}, \qquad B = U_{I}^{T} .$$

But note that we don’t actually want to try to repeat the analysis of Section 1 on a loss of the form $\sum_{t} ∥ x^{t} - W^{T} W x^{t} ∥^{2}$ . This would be a higher-order polynomial function of the entries of $W$ and so it’s genuinely a different and potentially more complicated functional. The way that weight-tying is done in practice is more similar to saying that we insist during training that updates are made that preserve the equality $A = B^{T}$ .

Equations (1) and (2) in subsection 1.1 are obtained as a direct result of differentiating the loss with respect to individual entries of the matrices (or individual ‘weights’ if we interpret this model as a feedforward neural network without activations). Our computations show that:

$$\nabla L (A, B) = - 2 \begin{pmatrix} A^{T} Σ - A^{T} A B Σ \\ Σ B^{T} - A B Σ B^{T} \end{pmatrix} , \tag{19}$$

where the first block is the derivative with respect to $B$ and the second is the derivative with respect to $A$. In an appropriate continuous time limit, if we set the learning rate to 1 and absorb the constant factor of 2 into the time variable, the weights during training evolve according to the differential equations:

$$\frac{d}{d t} B = A^{T} (Σ - A B Σ), \tag{20}$$

$$\frac{d}{d t} A = (Σ - A B Σ) B^{T} . \tag{21}$$

Remarks. Notice that there is a certain deliberate sloppiness here: One doesn’t really have a fixed matrix $Σ$ and then run this gradient flow for all time; the matrix $Σ$ is a function of (a batch of) training data. So we need to be careful about any further manipulations or interpretations of these equations.
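To get a feel for the flow (20), (21) under exactly this sloppy assumption of a fixed $Σ$, one can integrate it with forward Euler steps. The sketch below is illustrative only: the eigenvalues, the small random initialisation, the step size `dt` and the number of steps are arbitrary choices of mine, and a forward Euler scheme is just the simplest discretisation rather than anything used in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
D, d = 5, 2
lam = np.array([1.0, 0.8, 0.6, 0.4, 0.2])      # illustrative eigenvalues
U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random orthogonal eigenbasis
Sigma = U @ np.diag(lam) @ U.T                 # held fixed for the whole run

def loss(A, B):
    R = np.eye(D) - A @ B
    return np.trace(R @ Sigma @ R.T)

A = 0.1 * rng.normal(size=(D, d))              # small random initialisation (assumption)
B = 0.1 * rng.normal(size=(d, D))
dt, steps = 0.01, 20_000                       # arbitrary step size and horizon

initial_loss = loss(A, B)
for _ in range(steps):
    E = Sigma - A @ B @ Sigma                  # common factor in (20) and (21)
    # Both right-hand sides are evaluated with the old A and B before assignment.
    A, B = A + dt * (E @ B.T), B + dt * (A.T @ E)
final_loss = loss(A, B)

print("initial loss:", initial_loss)
print("final loss:  ", final_loss)
print("minimum value from Corollary 2:", lam[d:].sum())
```

Comparing the final loss with the minimum value from Corollary 2 indicates how close the run has come to the global minimum; by Proposition 3 the other critical points are saddles, so one would expect a generic initialisation to escape them eventually, though this sketch does not prove that.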

Those caveats having been noted, if we additionally add in the weight-tying constraint $A = B^{T}$ , we get:

$$\frac{d}{d t} W = W (Σ - W^{T} W Σ) . \tag{22}$$

We can even make the substitution $W = \overline{W} U^{T}$ to introduce the form:

$$\frac{d}{d t} \overline{W} = \overline{W} (I - \overline{W}^{T} \overline{W}) Λ . \tag{23}$$

In components (and without summation convention) the equation reads

$$\frac{d}{d t} \overline{w}_{i j} = \sum_{k, l = 1}^{D} \overline{w}_{i k} \Big( δ_{k l} - \sum_{m = 1}^{d} \overline{w}_{m k} \overline{w}_{m l} \Big) δ_{l j} λ_{j} . \tag{24}$$

Let $\{\overline{w}^{i}\}_{i = 1}^{D}$ denote the set of columns of $\overline{W}$ so that (24) becomes:

$$\frac{d}{d t} \overline{w}^{j} = \sum_{k, l = 1}^{D} \big( δ_{k l} - \langle \overline{w}^{k}, \overline{w}^{l} \rangle \big) δ_{l j} λ_{j} \overline{w}^{k} . \tag{25}$$

Expanding the brackets and executing the sum over $l$ gives:

$$\frac{d}{d t} \overline{w}^{j} = \sum_{k = 1}^{D} δ_{k j} λ_{j} \overline{w}^{k} - \sum_{k = 1}^{D} \langle \overline{w}^{k}, \overline{w}^{j} \rangle λ_{j} \overline{w}^{k} . \tag{26}$$

Then the sum over $k$ further simplifies the first term to give:

$$\frac{d}{d t} \overline{w}^{j} = λ_{j} \overline{w}^{j} - \sum_{k = 1}^{D} \langle \overline{w}^{k}, \overline{w}^{j} \rangle λ_{j} \overline{w}^{k} . \tag{27}$$

Finally, just peel off the $k = j$ term from the remaining summation to arrive at the equation

$$\frac{d}{d t} \overline{w}^{j} = (1 - | \overline{w}^{j} |^{2}) λ_{j} \overline{w}^{j} - \sum_{k \neq j} \langle \overline{w}^{k}, \overline{w}^{j} \rangle λ_{j} \overline{w}^{k} . \tag{28}$$
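Since the passage from the matrix form (23) to the column form (28) is pure index bookkeeping, it can be checked mechanically. The following sketch uses an arbitrary random $\overline{W}$ and arbitrary positive eigenvalues (both assumptions made purely for illustration) and compares the two right-hand sides column by column.

```python
import numpy as np

rng = np.random.default_rng(6)
D, d = 6, 3
lam = np.sort(rng.uniform(0.5, 2.0, size=D))[::-1]   # illustrative eigenvalues, descending
Lam = np.diag(lam)
Wbar = rng.normal(size=(d, D))                       # columns are the vectors \bar{w}^j in R^d

# Matrix form: right-hand side of (23)
rhs_matrix = Wbar @ (np.eye(D) - Wbar.T @ Wbar) @ Lam

# Column form: right-hand side of (28), assembled one column at a time
rhs_columns = np.zeros_like(Wbar)
for j in range(D):
    wj = Wbar[:, j]
    self_term = (1 - wj @ wj) * lam[j] * wj
    interference = sum((Wbar[:, k] @ wj) * lam[j] * Wbar[:, k]
                       for k in range(D) if k != j)
    rhs_columns[:, j] = self_term - interference

print("matrix and column forms agree:", np.allclose(rhs_matrix, rhs_columns))
```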

Remarks. (cf. the previous two Remarks) If we assume that $Σ = I$ (or, more generally, that $Σ = λ I$, so that all of the eigenvalues are equal to a common value $λ$), then we may take $W = \overline{W}$ and equation (28) arises as gradient descent on the energy functional

$$E = \frac{1}{4} \sum_{i = 1}^{D} λ (1 - | w^{i} |^{2})^{2} + \frac{1}{2} \sum_{i < j} λ \, | \langle w^{i}, w^{j} \rangle |^{2} , \tag{29}$$

where the second sum counts each unordered pair of distinct indices once.

It’s plausible that a reasonable line of argument to justify this is that since no particular directions in the data are special, it means that over time, on average, the effects of different eigenvalues of $Σ$ just somehow ‘average out’. But I don’t endorse or understand how that argument would actually go. Regardless, if we just assume this for now, as is explained in the Anthropic paper, we can think of the two terms in (29) as being in competition. The first term suggests that the model ‘wants’ to learn the $k^{\text{th}}$ feature by arranging $| w^{k} | = 1$. However, as it tries to do so, it incurs a penalty, given by $\sum_{i \neq k} λ | \langle w^{i}, w^{k} \rangle |^{2}$, that can reasonably be interpreted as the extent to which the hidden representation $w^{k} \in \mathbb{R}^{d}$ of that feature interferes with its attempts to represent and reconstruct the other features.
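The claim that the column dynamics are a gradient descent on this energy (when all eigenvalues equal a common $λ$) can be tested with finite differences. In this sketch the off-diagonal sum is taken over unordered pairs $i < j$ with coefficient $\frac{1}{2}$; the value of `lam`, the sizes and the helper `numerical_grad` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
D, d, lam = 6, 3, 1.7                 # lam is the common eigenvalue (Sigma = lam * I)
W = 0.5 * rng.normal(size=(d, D))     # columns w^i in R^d

def energy(W):
    G = W.T @ W                                   # Gram matrix: G[i, j] = <w^i, w^j>
    norms_sq = np.diag(G)
    pair_overlaps = G[np.triu_indices(D, k=1)]    # one entry per unordered pair i < j
    return (0.25 * lam * np.sum((1 - norms_sq) ** 2)
            + 0.5 * lam * np.sum(pair_overlaps ** 2))

def flow_rhs(W):
    """Right-hand side of the column dynamics with every eigenvalue equal to lam."""
    return lam * W @ (np.eye(D) - W.T @ W)

def numerical_grad(f, M, eps=1e-6):
    """Central finite differences, one matrix entry at a time."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        Mp, Mm = M.copy(), M.copy()
        Mp[idx] += eps
        Mm[idx] -= eps
        G[idx] = (f(Mp) - f(Mm)) / (2 * eps)
    return G

grad_E = numerical_grad(energy, W)
print("max |(-grad E) - flow|:", np.max(np.abs(-grad_E - flow_rhs(W))))
```

If the off-diagonal sum were instead taken over ordered pairs $(i, j)$ with $i \neq j$, each overlap would be counted twice and the coefficient would need to be $\frac{1}{4}$ for the same agreement.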

3. The ReLU Output Model

3.1 The Distribution of the Data and the Integral Loss

Perhaps a better way to try to incorporate information about the distribution of the data into the analysis here is to directly let $μ$ be the distribution (i.e. in the proper measure-theoretic sense) of $X^{t}$ on $\mathbb{R}^{D}$ and to consider

$$L = \int_{\mathbb{R}^{D}} \| x - W^{T} W x \|^{2} \, d μ (x) .$$

In the Anthropic paper and in my own work, we are ultimately more interested in a model with biases and ReLUs at the output layer, for which the loss is

$$L = \int_{\mathbb{R}^{D}} \| x - \operatorname{ReLU} (W^{T} W x - b) \|^{2} \, d μ (x) .$$

Performing an analysis anything like that done in Section 1 seems much harder for this model, but perhaps more progress can be made studying the integral above.
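Even without a closed form, the integral loss can be estimated by Monte Carlo sampling from $μ$. The sketch below is a rough illustration under my own assumptions: the sampler name, the sizes, the random $W$ and the zero bias are arbitrary, and the sign convention $W^{T} W x - b$ follows the formula as written here. As a baseline it also evaluates the case $W = 0$, $b = 0$, where the loss reduces to $E \| x \|^{2} = D (1 - S) / 3$.

```python
import numpy as np

def sample_sparse_data(T, D, S, rng):
    """X_i^t = B_i^t * U_i^t with Bernoulli(1 - S) masks and Uniform([0, 1]) values."""
    active = rng.random(size=(T, D)) < (1 - S)
    values = rng.uniform(0.0, 1.0, size=(T, D))
    return active * values

def relu_loss(W, b, X):
    """Monte Carlo estimate of the integral of ||x - ReLU(W^T W x - b)||^2 d mu(x)."""
    # W^T W is symmetric, so each row of X @ (W.T @ W) is (W^T W x)^T.
    recon = np.maximum(X @ (W.T @ W) - b, 0.0)
    return np.mean(np.sum((X - recon) ** 2, axis=1))

rng = np.random.default_rng(8)
T, D, d, S = 100_000, 20, 5, 0.9          # illustrative values (assumptions)
X = sample_sparse_data(T, D, S, rng)

W = rng.normal(size=(d, D)) / np.sqrt(d)  # an arbitrary untrained weight matrix
b = np.zeros(D)
random_loss = relu_loss(W, b, X)
zero_loss = relu_loss(np.zeros((d, D)), b, X)

print("estimate for random W, zero bias:", random_loss)
print("estimate for W = 0:", zero_loss, "vs D(1 - S)/3 =", D * (1 - S) / 3)
```

An estimator of this kind could serve as the objective when training the ReLU output model numerically, although nothing here says how the original experiments were implemented.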

The synthetic data we described in the previous section is all contained in the cube $Q := [0, 1]^{D} \subset \mathbb{R}^{D}$. In the sparse regime, i.e. with $S$ close to 1, the vast majority of the data is concentrated around the lower-dimensional skeletons of the cube. For $l = 0, 1, \dots, D$, if we write $Q_{l}$ for the set of points in the cube with exactly $l$ non-zero entries, i.e.

$$Q_{l} := \{ x \in Q : \# \{ j : x_{j} \neq 0 \} = l \},$$

then $Q$ is the disjoint union

$$Q = \bigcup_{l = 0}^{D} Q_{l} .$$

Without a closer analysis of binomial tail bounds I can’t immediately tell how well-justified it is to, say, ignore $\bigcup_{l \geq 2} Q_{l}$ and focus the analysis just on 1-sparse vectors in the dataset, i.e. you might want to say that $μ (\bigcup_{l \geq 2} Q_{l})$ is sufficiently small that that region contributes only negligibly to the integral. Then you can start to work with more manageable expressions. To my mind this is another concrete potential ‘jumping-off’ point if one were to do more investigation. In particular, it is in the direction of the observations made in Toy Models of Superposition that suggest a link between this problem and the ‘Thomson Problem’.
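As a first step towards that closer analysis, the masses $μ (Q_{l})$ are just binomial probabilities, since the number of non-zero coordinates of a data point is $\text{Binomial} (D, 1 - S)$ (the event that a uniform value is exactly zero has probability zero). The sketch below tabulates $μ (Q_{0})$, $μ (Q_{1})$ and $μ (\bigcup_{l \geq 2} Q_{l})$ for a few values of $S$; the function names and the choice $D = 20$ are illustrative assumptions.

```python
from math import comb

def mass_of_skeleton(l, D, S):
    """mu(Q_l): probability that exactly l of the D coordinates are non-zero."""
    return comb(D, l) * (1 - S) ** l * S ** (D - l)

def tail_mass(D, S):
    """mu of the union of Q_l over l >= 2."""
    return 1.0 - mass_of_skeleton(0, D, S) - mass_of_skeleton(1, D, S)

D = 20                                   # illustrative dimension (assumption)
for S in (0.9, 0.99, 0.999):
    print(f"S={S}: mu(Q_0)={mass_of_skeleton(0, D, S):.4f}  "
          f"mu(Q_1)={mass_of_skeleton(1, D, S):.4f}  "
          f"mu(l>=2)={tail_mass(D, S):.4f}")
```

This only quantifies how much mass lies away from the 0- and 1-sparse skeletons; whether that region can be neglected in the integral also depends on how large the integrand is there, which this sketch does not address.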