ARC progress update: Competing with sampling
In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I (Eric) have seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.
The purpose of this post is to:
Explain and motivate our “outperforming sampling” agenda from the standpoint of preventing catastrophic AI misalignment.
Introduce what we call the matching sampling principle (MSP) as a semi-formalization of the belief underpinning our research agenda, and discuss why we believe this principle.
Discuss the progress we’ve made toward matching sampling in some specific contexts, such as random MLPs and trained two-layer MLPs.
Also: we’re hiring! If the research direction described in this post excites you, you can apply to ARC!
Outperforming sampling as a step toward preventing AI misalignment
Consider the following simple scheme that attempts to align an AI model
Build a “catastrophe detector”
that classifies model outputs as “catastrophic” (1) or “non-catastrophic” (0).[1] You can do this by, for example, scaffolding together a deliberative system of GPT-5′s that carefully investigate whether the model is doing anything suspicious.Do adversarial training using the catastrophe detector. Concretely, this means:
Optimize a probability distribution
over inputs , so as to maximize .At the same time, optimize
so as to minimize .
I think that this is a fine starting point for alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:
A catastrophe detector that’s built in the way I described would be imperfect, even if you did a really good job with the engineering. If
were “smarter” than , then it could figure out ways to fool into assigning a low catastrophe probability to a catastrophic output.Even if the catastrophe detector were perfect, it is not clear how to compute
efficiently enough to make this scheme work.[2] The most straightforward approach would involve estimating this expectation by taking random samples , but this is extremely slow if catastrophes are rare on .
(There are other technical issues as well,[3] but these are the ones that seem hardest to surmount.)
We believe that ARC’s technical research agenda is capable of addressing both of these issues. However, issue #1 is mostly out of scope for this post (though I’ll very briefly describe our planned approach in this footnote[4]). The purpose of this post is to explain in detail how we hope to address issue #2.
Understanding structure helps outperform sampling
ARC’s goal is to be able to estimate
Here’s a simple example that’s meant to illustrate this point. Suppose that, by understanding the internals of
And suppose that, furthermore, we understand the structure of
Using this structural understanding, we can estimate that
This example is simplistic, of course: in practice, we will need to understand structure that is far more sophisticated than “the output is a conjunction of three independent predicates.” But the example illustrates the point that having a detailed mechanistic understanding of a neural net lets us estimate properties of its outputs far better than black-box methods alone.
A non-human understanding
When we speak of “understanding the structure” of
Instead, we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of
This contrasts to almost all other research on neural network interpretability, which aims for a partial, human understanding of neural nets. Our research is instead aimed at full, algorithmic understanding.
In the next section, I will elaborate on what this means.
The matching sampling principle
In this section, I will more formally describe what we hope to accomplish by gaining structural understanding of neural networks. While above I talked about outperforming sampling, in the fully general case we can only hope to match the performance of sampling. In other words, we expect that the performance of our algorithms in the practical setting of trained neural nets will substantially exceed the worst-case bounds that we will be able to state and prove. See below for further discussion of this point.
I will start with a first-pass attempt at stating the matching sampling principle (MSP). As we will discuss, it does not quite make sense; however, it carries across the key intuition.
A first attempt at stating the MSP
In order to state the MSP, we will define a few pieces of notation:
We will use the notation
to describe a neural network (or other function from a parameterized family). Here, denotes the architecture of the neural net, and denotes the parameters.Concretely,
is a function mapping -bit inputs to real numbers. (Note that this notation differs from the notation we used in the previous section; here, describes the composition of the neural net and the catastrophe detector above.)We are interested in estimating
, where is sampled uniformly.[6]
We will use the notation
to describe a mechanistic explanation of ; is what provides us with a structural understanding of and allows us to estimate its expected output. (This mirrors the way that ARC has historically used the letter .)We will use the notation
to denote an estimator (the subscript is meant to emphasize that different architectures could have different estimators): takes as input the parameters , a mechanistic explanation , and a tolerance parameter . (As we discuss below, the smaller the tolerance parameter, the more accurate ’s estimate, but the longer will be allowed to run.)
With this notation in place, we will make our first attempt to state the MSP:
For all architectures
(with parameters ), there exists an estimator such that:For all parameters
, there exists a short[7] explanation (we require that ), such that:For all tolerance parameters
, satisfies the following three properties:It runs in time
.Its error is competitive with sampling:
It is mechanistic.
Let’s parse these three requirements:
needs to run in time . This is the time that it takes to estimate by running randomly sampled values of through .The squared error of
must be small. Concretely, the right-hand side represents the expected squared error via taking the empirical average of samples of . is mechanistic. We have not defined “mechanistic”, so this point requires elaboration.
What makes an algorithm “mechanistic”?
We do not have a formal definition of “mechanistic.” But, loosely speaking, we mean that
To illustrate this difference, suppose that the explanation
(If indeed
In some of our previous work, we discussed covariance propagation: successively modeling each layer of
A simple, though not entirely correct, heuristic for whether an estimation algorithm is deductive, is whether it avoids any random or pseudorandom sampling. This heuristic should work for the purposes of engaging with this post.
(See much more on mechanistic estimation in our earlier paper, “Formalizing the presumption of independence”,[10] as well as in former ARC intern Gabe Wu’s senior thesis on deduction-projection estimators.)
Why do we require to be mechanistic?
There are multiple reasons for this; in a previous blog post, we discussed how mechanistic estimates can help us detect mechanistic anomalies. But for the purposes of this post, the reason is pretty straightforward: in cases where
Thus, loosely speaking, our hope is that if we find a
The intuition behind the MSP
Sampling is a really powerful tool, because randomly drawn samples are representative (with high probability), and so a sampling-based estimate can’t be off by too much (with high probability). In light of this, why do we think that a mechanistic estimation algorithm can compete with sampling?
Suppose that
Well, given that the discrepancy could not have happened by chance, there must be structure that explains the discrepancy. For illustration, let’s consider two types of structure.
First, maybe the discrepancy is caused by different gates reusing the same inputs, thereby inducing nontrivial correlations between different parts of the circuit.[11] In that case,
Second, maybe only the first 10 input bits matter to the output of the circuit (perhaps
(What if
More generally, the intuition is that knowing the structure of
Why only matching sampling?
Given the above intuition that understanding structure can outperform sampling, why are we only aiming to match the performance of sampling?
Consider the above example, where the average output of
In general, we expect that there will often be a range of
However, as discussed above, we expect that if our mechanistic estimator matches the performance of sampling in all cases, then it will substantially outperform sampling in structured cases such as trained neural nets, at least for non-tiny values of
An issue: can just tell the answer
As mentioned earlier, out first attempt at stating the MSP doesn’t quite make sense. The idea of MSP is for
To fix this issue, we observe that if
Our actual MSP statement
Here is ARC’s mainline “matching sampling principle” (MSP):
For all architectures
(with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:For all parameters
, there exists a short explanation ( ), such that:For all tolerance parameters
, satisfies the following three properties:It runs in time
.Its error is competitive with sampling, on average over random
: , where and .It is mechanistic.
(Just as before, this statement isn’t fully formal, because of the informal “mechanistic” qualifier. But in practice, we have strong enough opinions about what counts as “mechanistic” that this statement is formal enough to guide our research.)
An interesting special case of the MSP is when
An important variant: Findable explanations
While the above MSP statement is the most theoretically clean one, on its face the statement is not very useful. That’s because it says nothing about being able to find the explanation
This leads us to the following alternative statement, which we’ve been calling the “train and explain” formulation of the MSP:
For all architectures
(with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:For all “training” algorithms
mapping random seeds to parameters , there exists an “explaining” algorithm mapping random seeds to explanations , with , such that:For all tolerance parameters
, satisfies the following three properties:It runs in time
.Its error is competitive with sampling, on average over random
and : , where , , and .It is mechanistic.
In this statement, it is useful to think of
Note that our mainline MSP statement is the special case of the “train and explain” formulation where
In general, we think that for any computational constraints placed on
Another modification: getting rid of
It turns out that the MSP can be stated without reference to a tolerance parameter, by subsuming the number of samples into the architecture instead. See this appendix for details.
Our progress so far
Over the course of 2025, ARC has made progress on the MSP in a few different directions. Concretely:
We have a mechanistic algorithm that competes with sampling (in theory and in practice) for estimating the size of an intersection of randomly chosen halfspaces.[16] We have also generalized our algorithm to work for some other problems, such as estimating the satisfaction probability of a random CNF, or the permanent of a random matrix.
We believe we have a mechanistic algorithm that competes with sampling for estimating the expected output of random MLPs on Gaussian inputs. (We have an empirical demonstration of competitiveness with sampling, and a proof sketch that we are working on expanding into a full proof.)
We have made substantial progress toward a mechanistic algorithm that competes with sampling for estimating the expected output of a two-layer MLP on Gaussian inputs, where the second layer of the MLP is trained via gradient descent.
Our results are not yet ready for publication, but hope to get them ready in the coming months. In this section, I will briefly summarize these results and discuss the most interesting directions for future work.
Intersection of random half-spaces
The first problem we tackled in our “matching sampling” framework was mechanistically estimating the volume of the intersection of random half-spaces. Although it’s a somewhat toy problem, it wasn’t trivial to solve, and we learned a lot from solving it.
Problem statement
Find an algorithm
The expected squared error of
(over randomly chosen ) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random vectors and outputting the empirical fraction that have a nonnegative dot product).The runtime of
is competitive with the runtime of the naive sampling-based algorithm, i.e. .[17]
Note that this is the MSP in the particular case where
Since
(Alternatively, this problem can be viewed as an instance of the “train and explain” version of the MSP, where
A sketch of our solution
In a sentence, our solution is to build up a polynomial approximation of
To elaborate on this, for
Define
to be the constant 1 function.For all
, compute the best low-degree polynomial approximation to . In the end, will be a polynomial approximation to , and we will output the exact expectation as our estimate of .
To be more precise, “low-degree polynomial” here means degree
In order for
We have also generalized this approach to apply to a broader class of problems than just “intersection of randomly-chosen half-spaces.” Roughly speaking, we can apply our methods to estimate the expected product of symmetric random functions, for a certain representation-theoretic notion of symmetry. Concrete problems that we have solved with this approach include:
Estimating the satisfaction probability of
-CNFs, where each literal is chosen uniformly at random.Estimating the permanent of a matrix where each entry is either
or , independently with probability 50%.
We are aiming to publish the details of our algorithm and this generalization in the coming months.
Related questions
While we have solved this particular MSP instance, there are some related settings for which we do not have a solution:
Non-isotropic half-spaces. What if, instead of
being uniformly randomly selected from the unit sphere (or, equivalently, the standard -variate normal distribution), they are selected from some other -variate normal distribution? Modulo some details, we believe that we have solved this problem if for scalars and . But we don’t have a general solution to for arbitrary covariance matrices .Learned half-spaces. What if some sort of learning algorithm is used to train
to minimize some loss function? Is there an explanation string that we could learn in parallel with the learning algorithm, which would help us estimate the size of the intersection of the half-spaces? Or, perhaps, the setting is still simple enough that no explanation is necessary, even if the half-spaces are learned?
Random MLPs
In the last couple of months, we have been tackling a more sophisticated MSP instance: random MLPs. We now believe that we have an algorithm and a proof of its correctness and efficiency, though we are still verifying details. We also have an empirical demonstration that our algorithm is competitive with sampling.
Problem statement
Consider the following MLP architecture: the input size is
Find an algorithm
The expected squared error of
(over independent, normally distributed[20] weights) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random inputs and outputting the empirical average output).The runtime of
is competitive with the runtime of the naive sampling-based algorithm, i.e. forward passes.
Note that this is the MSP in the particular case where
(Similarly to the case of random half spaces, this problem can also be viewed as an instance of the “train and explain” version of the MSP, where
A sketch of our solution
In a sentence, our solution to this problem is cumulant propagation, a mechanistic estimation algorithm that we introduced in Appendix D of Formalizing the Presumption of Independence.
Cumulants are a type of summary statistic of a probability distribution. Loosely speaking, the cumulant operator
Cumulant propagation is a method that lets us make guesses about the cumulants of layer
Start with a list of cumulants of the inputs (this is easy because the input is Gaussian:
for each input , and all other cumulants are ).Use cumulant propagation to make guesses about the cumulants of the layer-1 activations,[21] going up to
-th order cumulants,[22] where is the largest integer such that . Then do the same for the layer-2 activations, and so on.Output our guess about the mean (i.e. first cumulant) of the output.
(This description leaves out many details, but gets across the main idea.)
Related questions
We would be really interested in finding a way to mechanistically estimate the average output of random recurrent neural networks (RNNs). We believe that this will be much more difficult than the MLP setting, because of the weight sharing. (We think that interesting structure can arise in random RNNs in a way that’s far more improbable for random MLPs; this is related to the fact that RNNs are Turing-complete.) We think it’s possible that finding an algorithm that solves the “random RNNs” instance of the MSP would constitute major progress toward finding an algorithm that solves the MSP in full generality (see the appendix for more discussion).
A shorter-term project might be to adapt our solution to other architectures. Can we solve the problem in the case of narrow MLPs (as opposed to the infinite-width limit)? What about random CNNs? Random transformers?
Two-layer MLPs with a trained second layer
In parallel with tackling random MLPs, we have also been investigating two-layer MLPs where the hidden layer is very wide, and where the second layer of weights is trained. This is our first serious foray into trained and/or worst-case instances—and while we haven’t fully solved it, we have made substantial progress.
Problem statement
Consider the following MLP architecture: the input size is
Problem 1: Find an algorithm
For all
, the expected squared error of (over independent, normally distributed weights in ) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random inputs and outputting the empirical average output).The runtime of
is competitive with the runtime of the naive sampling-based algorithm, i.e. forward passes.
Note that this is the MSP in the particular case where
Problem 2: Now, suppose that
This is the “train and explain” version of the MSP in the particular case where
A look at our progress so far
Unlike in the case of random MLPs, we do not expect cumulant propagation to work. That’s because, for worst-case
Consider the function
The function
Computing the
-tensors.ש Approximating the infinite sum, given the
-tensors.ש
Our solution to the first problem is to receive the
We are currently working on the second problem (summing up the tensor networks), and we have made substantial partial progress. If
Related questions
Once we have an on-paper solution, we will be interested to see how well the solution works in practice. There are some reasons to believe that it would be slower than sampling (the algorithm is likely to be quite complex), but also some reasons to believe that it would be faster than sampling (on paper, it would match the performance of sampling under worst-case conditions; in typical conditions, it might outperform sampling).
Additionally, even though we are excited about our progress on this question, “two-layer MLPs with a very large hidden layer, where only the second layer is trained” is ultimately a fairly narrow setting. There are many directions in which we could try to generalize our methods: deeper MLPs; hidden layers that are of similar size to the input layer; training both layers; other distributions of input data; and so on.
Closing thoughts
I consider the MSP to be a significant step forward for ARC. Previously, we were interested in producing mechanistic estimates of mathematical quantities, but had no particular benchmark by which to judge our progress or deem our methods “good enough.” Now, we are holding ourselves to a standard that is philosophically justified (we believe that it ought to be possible for mechanistic estimates to compete with sampling), concrete (we can check whether our methods compete with sampling using empirical tests or formal proofs), and tied to a useful application (estimating properties of neural nets, such as catastrophe probability).
Formulating the MSP has allowed us to ask more concrete questions (e.g. “How can we construct a mechanistic algorithm that competes with sampling for estimating the average output of trained two-layer MLPs?”). We have solved some of these questions, made progress on others, and are continuing to make progress.
We plan to continue attacking the MSP from a number of directions:
Attempting to solve the MSP “on paper” (i.e. using mathematical tools) for specific instances (like the ones described in this post).
Using our theoretical methods to create state-of-the-art algorithms for estimation problems.
Approaching the problem from a more high-level or philosophical perspective in order to discern what sorts of mechanistic algorithms could compete with sampling in full generality.
If you’re interested in working with us on any of these directions, you can apply here!
Appendix
A note on advice verifiability
In our various MSP statements, we do not ask for
At least in the “train and explain” version of the MSP, we do not believe that advice needs to be verifiable. This is for two reasons:
Our eventual goal is to implement any MSP solution for actual neural nets, and to have strong accuracy guarantees (on average over the randomness of training). Solving the “train and explain” version of the MSP for a given neural architecture already comes with an accuracy guarantee, even without the ability to verify the explanation that the explaining algorithm
produces.Suppose that the training algorithm
finds by checking many random candidate values of until it finds a particularly unlucky one (e.g. one where is much lager than a full structural understanding of would suggest, just by chance). By observing the computations done by , can notice that is unlucky, but it cannot succinctly explain that fact in a verifiable way. All it can do is assert (in its explanation ) that is unlucky, and all can do is trust ’s assertion.
This last point seems a little bit at odds with my earlier assertion that
What about our mainline MSP statement, where there is no explaining algorithm to track optimization done during training? If
In my opinion, it’s fine for
For example, imagine that we can model
There might be natural versions of the MSP that require advice to be verifiable. However, such statements would require giving
A special case of the MSP: Universal Turing machines
One interesting case of the MSP is when the architecture
There exists an estimator
such that:For all Turing machines
, there is an explanation ( ), such that:For all tolerance parameters
, satisfies the following three properties:It runs in time
.On average over random
, its error is competitive with sampling:It is mechanistic.
In other words, there is a single, universal
Note also that a solution to this special case would yield a solution to the full MSP: suppose that we had an estimator
The MSP statement can also be used to obtain a claim about mechanistically estimating random Turing machines, but without advice. Concretely, we will let
There exists an estimator
such that:For all tolerance parameters
, satisfies the following three properties:It runs in time
.Its error is competitive with sampling, on average over random
: .It is mechanistic.
This is an interesting and arguably bold statement: it says that as
Getting rid of
Modulo a caveat (see below), it is possible to modify the MSP statement to get rid of the tolerance parameter
For all architectures
(with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:For all parameters
, there exists a short explanation ( ), such that: satisfies the following three properties:It runs in time
.Its error is competitive with sampling, on average over random
: , where and .It is mechanistic.
We claim that our mainline MSP statement almost follows from this
To see this, suppose that we have an estimator
Runs in time
.Has low error on average over random
: .Is mechanistic.
Note that
The fact that
Compression as a possible MSP approach
In this section, I will outline one possible approach to solving the MSP. For the sake of concreteness, I will consider the case where
We will say that a Turing machine
We solve the MSP for instances that are not efficiently compressible. Concretely, we find an estimation algorithm
that works for any that is not efficiently compressible, without any explanation .We define the estimation algorithm
to check that is an efficient compression of , and then to return if so.If
is not efficiently compressible, then ’s output will be accurate (since solves the MSP for instances that are not efficiently compressible).
The hope for Step 1 is that an estimation approach that works in the average case (over random parameters) will work for all parameters that are not efficiently compressible.
The intuition underlying this hope is that structure implies efficient compression. In other words, if
What about the “train and explain” version of the MSP? In order to adapt this approach to that setting, we also need to be able to learn the efficient compression in parallel with learning
If this approach is viable, then solving the MSP in the average case for some Turing-complete architecture
- ^
We could instead imagine that
outputs a probability of catastrophe, but we will keep the range of to for simplicity of exposition. - ^
Running
might take much longer than running , which is why we can’t just run on every input during deployment. - ^
For example, the parameterization of the distribution
needs to be quite flexible, so as to allow distributions that are computationally intractable to sample from. For example, if acts catastrophically when it encounters a factorization of RSA-2048, we want to be able to train that behavior out of even if we can’t factorize RSA-2048. (See here for more discussion.) - ^
In brief, we hope to address issue #1 via mechanistic anomaly detection. A little more concretely, our plan is to:
Build
to be smart and robust enough that can only fool by thinking deliberately about how to fool .In the diamond-in-the-vault analogy, this means that the cameras are robust enough that the SmartVault can only fool them deliberately.
Build a mechanistic explanation for why
tends to be 0 on the training distribution.This would let us flag inputs
on which for an unusual reason. Let be this anomaly detector, i.e. if for an unusual reason.
Instead of training
to minimize , train to minimize (i.e. to maximize the probability that for the usual reason).And similarly to above, train
to maximize this expectation.
This leaves many details unexplained, but that’s the basic concept.
- ^
Or rather, independent up to small random variation that is unpredictable just from understanding the structure of
and . - ^
In practice, we will be interested in the behavior of neural nets on structured (rather than uniformly random) inputs. However, note that it is possible to create structured inputs out of random inputs via a generative model. For example, if we are interested in the behavior of a classification model on pictures of animals, we could let
consist of two parts: first, a generative model that creates an image of an animal from random noise, and second, a classifier that takes the animal image as input.\(\) - ^
Why do we require
to be short? The basic reason is that, as we discuss later, we will be interested in learning in parallel with learning the parameters , and so we will want to be able to do a backward pass through as quickly as doing a backward pass through . We also have some amount of philosophical justification for believing an explanation the size of is sufficient. Essentially, we think that any object’s structure can be described compactly, because if the amount of (non-redundant) structure in an object is much larger than the size of the object itself, that would constitute an “outrageous coincidence”. - ^
This works by taking our (Gaussian) model of layer
and then modeling layer by finding the normal distribution that would minimize the KL divergence from the pushforward of our model of layer . - ^
Covariance propagation does not require an explanation
. However, some modifications of covariance propagation could require advice. For example, if is too large to allow for to compute all of the covariances, then could advise to only keep track of some particular covariances. Or, could tell about some important third-order correlations to keep track of. - ^
Note that we use the word “heuristic” in place of “mechanistic” in that paper. I think that the word “mechanistic” conveys our goal slightly better.
- ^
As a very simple example, consider the circuit
. Mean propagation estimates this circuit’s average output as rather than because it fails to notice the correlation induced by the presence of in the two conjunctive clauses. - ^
Though, see the appendix on advice verifiability for some nuance on this point.
- ^
Eliezer Yudkowsky’s Worse Than Random makes a similar point:
As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid—so that a randomized algorithm seems wiser—you should be able to use the same knowledge to produce a superior derandomized algorithm.
- ^
Roughly speaking, this is the range where
is a substantial fraction of the number of times that one needs to run to fully estimate its unstructured randomness. - ^
We cannot require
to be accurate for all . For example, suppose that interprets as a Turing machine and runs on the Turing machine . Requiring to be accurate for all would mean expecting to be able to mechanistically estimate the output of a worst-case Turing machine, without any structural advice at all. (After all, cannot depend on .) This is too much to ask for. - ^
Note that this problem is equivalent to estimating the probability that a one-layer ReLU network outputs all zeros on a random input. Concretely, if the network is
, then the output is all zeros if and only if for every row of . - ^
Our algorithm’s runtime is
, which is technically too slow in the case where . However, we are most interested in the regime where . - ^
The naive approach is to treat this as a linear regression problem, where the covariance (inner product) between two polynomials
and is defined as the expectation of for drawn from the unit sphere. However, doing this involves multiplying a matrix by a -vector, so the dependence of this algorithm on looks like : not fast enough. - ^
We believe that our proof of correctness and efficiency works in the limit as
, where the MLP depth is constant and . - ^
Mean zero; standard deviation is chosen so that all activations have the same variance.
- ^
One complication to this picture is that, although we’ve defined cumulant propagation for sums and products of random variables, it’s not clear what it means to apply cumulant propagation to an activation function like ReLU: given the cumulants of
, how does one estimate the cumulants of ? Our strategy is to find a polynomial approximation to the ReLU function (see the next paragraph of this footnote for details). Once we’ve done that, we can apply cumulant propagation as we’ve already defined it for sums and products.What is the appropriate notion of polynomial approximation? It turns out that we can take the polynomial that minimizes mean squared error if
is assumed to be normally distributed with mean equal to our estimate of and covariance equal to our estimate of . This is equivalent to taking the first several terms of the Hermite expansion of ReLU (appropriately centered and scaled). - ^
Actually, it is more important to keep track of cumulants in which the same activation appears multiple times, so we need to keep track of some cumulants of order higher than
that involve repeated indices. - ^
That’s the Hebrew letter shin.
- ^
Concretely, a tensor network can only contribute substantially to the sum if every tensor in the network has a large operator norm. Thus, if in the process of contracting the tensor network, we find a tensor that has a small operator norm, we can cut off the computation and move onto the next tensor network.
- Shallow review of technical AI safety, 2025 by (17 Dec 2025 18:18 UTC; 191 points)
- AlgZoo: uninterpreted models with fewer than 1,500 parameters by (26 Jan 2026 17:30 UTC; 181 points)
- Shallow review of technical AI safety, 2025 by (16 Dec 2025 10:42 UTC; 6 points)
- 's comment on Shortform by (18 Feb 2026 9:23 UTC; 5 points)
Language suggestion: I would replace “random sampling” everywhere in this article with “simple Monte Carlo”, which means “estimating an expectation by taking the mean of samples from the distribution”. There are lots of sampling-based methods more sophisticated, and lower-variance than, simple Monte Carlo, such as importance sampling. In fact, you might be able to use the methods you outlined to build better proposal distributions for importance sampling. This would still be a sampling-based method, but better than simple Monte Carlo.
Thanks for the suggestion!
For what it’s worth, we believe that a mechanistic estimator can beat all sampling-based methods, no matter how sophisticated they are. The philosophical reason for this is that sophisticated sampling-based methods outperform simple Monte Carlo by exploiting structure in the function whose average value they’re estimating—but a mechanistic estimator can exploit that same structure, too.
In fact, I think it almost follows from the MSP that we can beat any sampling-based method. To see this, suppose you have some sophisticated estimator Est(Mθ,r), which is given a neural net Mθ and some random coin flips r as input, and produces a sophisticated, unbiased, low-variance estimate of E[Mθ] using r. Now, define the architecture M′ as: M′θ(x)=Est(Mθ,x). The MSP says that we need to be able to estimate the average output of M′θ (which is the same as the average output of Mθ) with squared error less than or equal to the variance of M′θ, in the time that it takes to run M′θ. (We’re taking ε=1 here.) In other words, given any sophisticated sampling algorithm for estimating the average output of Mθ, there needs to be a corresponding mechanistic estimator that gets lower (or equal) error in the same amount of time.
(I think this argument isn’t perfectly tight, because it’ll probably run into the same uniformity issues that I discussed in the “getting rid of ε” appendix, which is why I said “almost follows” rather than “follows”.)
Right, like many people I agree that in most cases derandomization is a good idea if you can afford a little extra code complexity. But the structure of your argument makes it sound like that’s the important part, when actually it’s taking advantage of structure that’s the important part, sampling or not.
Concretely, I would change “Understanding structure helps outperform sampling” to simply “Understanding structure helps”.
Congratulations! I have been getting significantly more excited about what ARC is doing this year. Happy to see so much genuine recent progress after the proverbial 40 years of desert wandering.
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
bilinear
directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
No implicit computation (eg different polytopes of MLPs), just the structure in the weights.
a polynomial
can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use a swiGLU, where a bilinear layer is similar to and almost as performant (Table 1 in Noam Shazeer’s paper). Interesting enough, it’s better than MLPs w/ ReLU/GeLU/Swish.
Have you seen Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes?
I am affiliated with ARC and played a major role in the MLP stuff
I’m loosely familiar with Greg Yang’s work, and very familiar with the ‘Neural Network Gaussian Process’ canon. It’s definitely relevant, especially as an intuition pump, but it tends to answer a different question. They answer ‘what is the distribution of quantities x y and z over the set of all NNs’ where quantities x y and z might be some preactivation on specific inputs. Knowing that they are jointly gaussian with such-and-such covariance has been a powerful intuition pump for us. But the main problem we want is an algorithm that takes in a specific NN with specific weights and tells us about the average over inputs.
I’ve found that this distinction is a powerful antimeme, and every time I give a presentation on the topic I have a slide on the difference between averaging over x versus over theta. By the end of the talk the audience is clamoring to recommend I read Principles of Deep Learning Theory (which is lovely if you want to improve on NNGP, but not relevant to calculating averages for a specific value of theta.).
yeah while I noticed the distinction i usually find it worthwhile to try to steal tools across problem statements that use the same words in a different order, i’ll use your data point to downweight that heuristic a little thanks :p
Did knowing that the joint-gaussian thing generalizes to RNNs influence your decision to look at RNNs next?
Honestly for me it’s more of a strike against RNNs. Real deep neural networks that have been trained don’t have this property, so it’s a bridge we’re going to need to cross at some point regardless. From a derisking point of view I’d kind of like to get to that point ASAP. There’s a lot of talk about looking at random boolean circuits (which very obviously don’t have this property), narrow MLPs, or even jumping all the way to wide MLPs trained in some sort of mean-field/maximum update regime that gets rid of it.
Do you expect that whether the infinite sum can be approximated is related to whether the architecture averts the vanishing gradient problem?
I am affiliated with ARC and played a major role in the MLP stuff
The particular infinite sum discussed in this post is used for approximating MLPs with just one hidden layer, so things like vanishing gradients can’t matter.
We are now doing work on deeper MLPs. In this case, the vanishing gradients story definitely does seem relevant. We definitely don’t fully understand every detail, but I’ll mouth off anyways.
On one hand, there are hyperparameter choices where gradients explode. It turns out that in this regime, matching sampling: exponentially large gradients mean that the average is a sum over exponentially many tiny and more-or-less independent regions, so it’s exponentially small and you can beat sampling by just answering zero until your compute budget is exponential. Even then you might not have the budget to go through all of input space, but we have ideas on what you could do.
On the other hand, there are regions where gradients vanish. Here, also, it seems like things work out. Since the inputs to neurons in later layers concentrate, you can do very well by just approximating the later non-linearities as linearizations around the concentrating point.
Then there’s the so-called ‘critical hyperparameters’ right in the middle. It turns out that for MLPs the gradients vanish (albeit more slowly) even at criticality. Back-of-the-envelope, it seems to go to zero just fast enough.
There are other convergence-related questions (in particular, a lot of these series are so-called ‘asymptotic series’ which officially never converge), and at smaller epsilons/higher compute budget those issues come into play as well, but I don’t think they’re connected to gradients.