ARC progress update: Competing with sampling

Eric Neyman, Victor Lecomte, Wilson Wu, Mikewins, Jacob_Hilton and George Robinson

18 Nov 2025 17:22 UTC

132 points

In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I (Eric) have seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.

The purpose of this post is to:

Explain and motivate our “outperforming sampling” agenda from the standpoint of preventing catastrophic AI misalignment.
Introduce what we call the matching sampling principle (MSP) as a semi-formalization of the belief underpinning our research agenda, and discuss why we believe this principle.
Discuss the progress we’ve made toward matching sampling in some specific contexts, such as random MLPs and trained two-layer MLPs.

Also: we’re hiring! If the research direction described in this post excites you, you can apply to ARC!

Outperforming sampling as a step toward preventing AI misalignment

Consider the following simple scheme that attempts to align an AI model , which maps inputs to outputs :

Build a “catastrophe detector” that classifies model outputs as “catastrophic” (1) or “non-catastrophic” (0).^[1] You can do this by, for example, scaffolding together a deliberative system of GPT-5′s that carefully investigate whether the model is doing anything suspicious.
Do adversarial training using the catastrophe detector. Concretely, this means:
1. Optimize a probability distribution over inputs , so as to maximize .
2. At the same time, optimize so as to minimize .

I think that this is a fine starting point for alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:

A catastrophe detector that’s built in the way I described would be imperfect, even if you did a really good job with the engineering. If were “smarter” than , then it could figure out ways to fool into assigning a low catastrophe probability to a catastrophic output.
Even if the catastrophe detector were perfect, it is not clear how to compute efficiently enough to make this scheme work.^[2] The most straightforward approach would involve estimating this expectation by taking random samples , but this is extremely slow if catastrophes are rare on .

(There are other technical issues as well,^[3] but these are the ones that seem hardest to surmount.)

We believe that ARC’s technical research agenda is capable of addressing both of these issues. However, issue #1 is mostly out of scope for this post (though I’ll very briefly describe our planned approach in this footnote^[4]). The purpose of this post is to explain in detail how we hope to address issue #2.

Understanding structure helps outperform sampling

ARC’s goal is to be able to estimate far better than one can just by drawing random samples from . We believe that this can be done by understanding the structure of , , and .

Here’s a simple example that’s meant to illustrate this point. Suppose that, by understanding the internals of , we are able to notice that is the conjunction of three predicates -- in other words, outputs 1 if and only if all of are true of ’s output.

And suppose that, furthermore, we understand the structure of and to understand that are independent events for .^[5]

Using this structural understanding, we can estimate that . If each is true on with probability one-in-a-million (), then we can estimate that . Obtaining this estimate by sampling would have required roughly samples. By contrast, our structural understanding lets us estimate this probability with roughly samples, and we can potentially do even better than that if we have a structural understanding of the ’s themselves. Thus, our structural understanding lets us estimate the expected value far more efficiently than we could with sampling.

This example is simplistic, of course: in practice, we will need to understand structure that is far more sophisticated than “the output is a conjunction of three independent predicates.” But the example illustrates the point that having a detailed mechanistic understanding of a neural net lets us estimate properties of its outputs far better than black-box methods alone.

A non-human understanding

When we speak of “understanding the structure” of , , and , we are not referring to human understanding. While a conjunctive structure like the one above can be understood by a human, we believe that in general, neural nets will be composed of mathematical structures that are far too complex to allow for a full human understanding.

Instead, we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of . Instead, the goal is to invent an algorithm that takes as input the explanation and estimates expectation of based on that explanation.

This contrasts to almost all other research on neural network interpretability, which aims for a partial, human understanding of neural nets. Our research is instead aimed at full, algorithmic understanding.

In the next section, I will elaborate on what this means.

The matching sampling principle

In this section, I will more formally describe what we hope to accomplish by gaining structural understanding of neural networks. While above I talked about outperforming sampling, in the fully general case we can only hope to match the performance of sampling. In other words, we expect that the performance of our algorithms in the practical setting of trained neural nets will substantially exceed the worst-case bounds that we will be able to state and prove. See below for further discussion of this point.

I will start with a first-pass attempt at stating the matching sampling principle (MSP). As we will discuss, it does not quite make sense; however, it carries across the key intuition.

A first attempt at stating the MSP

In order to state the MSP, we will define a few pieces of notation:

We will use the notation to describe a neural network (or other function from a parameterized family). Here, denotes the architecture of the neural net, and denotes the parameters.
- Concretely, is a function mapping -bit inputs to real numbers. (Note that this notation differs from the notation we used in the previous section; here, describes the composition of the neural net and the catastrophe detector above.)
- We are interested in estimating , where is sampled uniformly.^[6]
We will use the notation to describe a mechanistic explanation of ; is what provides us with a structural understanding of and allows us to estimate its expected output. (This mirrors the way that ARC has historically used the letter .)
We will use the notation to denote an estimator (the subscript is meant to emphasize that different architectures could have different estimators): takes as input the parameters , a mechanistic explanation , and a tolerance parameter . (As we discuss below, the smaller the tolerance parameter, the more accurate ’s estimate, but the longer will be allowed to run.)

With this notation in place, we will make our first attempt to state the MSP:

For all architectures (with parameters ), there exists an estimator such that:
For all parameters , there exists a short^[7] explanation (we require that ), such that:
For all tolerance parameters , satisfies the following three properties:
1. It runs in time .
2. Its error is competitive with sampling:
3. It is mechanistic.

Let’s parse these three requirements:

needs to run in time . This is the time that it takes to estimate by running randomly sampled values of through .
The squared error of must be small. Concretely, the right-hand side represents the expected squared error via taking the empirical average of samples of .
is mechanistic. We have not defined “mechanistic”, so this point requires elaboration.

What makes an algorithm “mechanistic”?

We do not have a formal definition of “mechanistic.” But, loosely speaking, we mean that estimates the expected output of deductively, based on the structure of . This contrasts with sampling-based algorithms for estimating the expected output of , which operate based on inductive reasoning. Mechanistically estimating the expected output involves finding the reason for the expected output being what it is; meanwhile, sampling-based algorithms merely infer the existence of a reason without learning anything about the reason.

To illustrate this difference, suppose that the explanation given to is a simple heuristic argument (such as mean propagation—see §D.2 here), which suggests that but is otherwise uninformative about the structure of . Suppose further that computes on a hundred inputs , and it finds that on every one of those hundred inputs. Then should return : that’s because it knows that on the hundred inputs that it checks, but it has not seen any structural evidence that would suggest that ’s behavior on those hundred inputs has any bearing on how behaves on the inputs that it has not checked. By contrast, a sampling-based estimator that checks the same hundred inputs would return , implicitly assuming that those inputs are representative.

(If indeed always returns , then we believe that there exists a short explanation of this fact; but cannot output unless it is given this explanation.)

In some of our previous work, we discussed covariance propagation: successively modeling each layer of as a multivariate normal distribution.^[8] Covariance propagation (and related methods, like mean propagation and cumulant propagation) is mechanistic, because it deduces an estimate based on the structure of .^[9] More generally, deduction-projection estimators—estimators that successively model each layer of by finding the best-fit model from some parameterized class—are mechanistic.

A simple, though not entirely correct, heuristic for whether an estimation algorithm is deductive, is whether it avoids any random or pseudorandom sampling. This heuristic should work for the purposes of engaging with this post.

(See much more on mechanistic estimation in our earlier paper, “Formalizing the presumption of independence”,^[10] as well as in former ARC intern Gabe Wu’s senior thesis on deduction-projection estimators.)

Why do we require to be mechanistic?

There are multiple reasons for this; in a previous blog post, we discussed how mechanistic estimates can help us detect mechanistic anomalies. But for the purposes of this post, the reason is pretty straightforward: in cases where has a lot of structure, we think that can substantially outperform sampling, if given an explanation that explains that structure (as motivated above).

Thus, loosely speaking, our hope is that if we find a that both (a) is mechanistic and (b) performs at least as well as sampling for all , then it will substantially outperform sampling for parameters with a lot of structure, such as trained neural nets.

The intuition behind the MSP

Sampling is a really powerful tool, because randomly drawn samples are representative (with high probability), and so a sampling-based estimate can’t be off by too much (with high probability). In light of this, why do we think that a mechanistic estimation algorithm can compete with sampling?

Suppose that is a boolean circuit. Suppose, further, that a naive heuristic argument (like mean propagation) suggests that ’s average output is , but that in fact its average output is roughly (far enough from that this discrepancy could not have happened by chance). A sampling-based algorithm can pick up on this discrepancy given about 10,000 samples; but what can a mechanistic algorithm do?

Well, given that the discrepancy could not have happened by chance, there must be structure that explains the discrepancy. For illustration, let’s consider two types of structure.

First, maybe the discrepancy is caused by different gates reusing the same inputs, thereby inducing nontrivial correlations between different parts of the circuit.^[11] In that case, should be able to point out this structure, causing to understand the discrepancy (even without running any inputs through ).

Second, maybe only the first 10 input bits matter to the output of the circuit (perhaps ignores the last 90 input bits entirely, or perhaps they end up not affecting the output for complicated structural reasons). And then—just by chance—it so happens that outputs 1 on only 49% of the 1024 possible 10-bit inputs. In this case, points out that depends only on the first 10 input bits; it does not point out that outputs 1 on 49% of them, because that’s part of the unexplainable randomness of .^[12] Instead, must determine this fact by using its allotted time to check the value of on those 1024 inputs.

(What if is large enough that doesn’t have the necessary runtime to check on all 1024 inputs? In that case, it should check however many it can and estimate the rest as being ⁵⁰⁄₅₀. This will still outperform sampling!)

More generally, the intuition is that knowing the structure of gives the knowledge it needs to do no worse than random sampling. If still does worse than random sampling after reading , that can only be because did not provide a full structural explanation of .^[13]

Why only matching sampling?

Given the above intuition that understanding structure can outperform sampling, why are we only aiming to match the performance of sampling?

Consider the above example, where the average output of depends on 1024 effectively random computations, and suppose that : enough time for to compute the output of on 512 of the 1024 inputs. In that case, we expect both and sampling to have squared error on the order of : ’s expected squared error will be somewhat lower, but not dramatically so.

In general, we expect that there will often be a range of -values for which the best mechanistic estimate is only slightly better than sampling-based estimation.^[14] Thus, for some parameters and tolerance parameters , we only expect to be able to match (or perhaps slightly outperform) sampling, not to strongly outperform sampling.

However, as discussed above, we expect that if our mechanistic estimator matches the performance of sampling in all cases, then it will substantially outperform sampling in structured cases such as trained neural nets, at least for non-tiny values of . We expect that we can leverage this to help with the sort of adversarial training process described in the introduction.

An issue: can just tell the answer

As mentioned earlier, out first attempt at stating the MSP doesn’t quite make sense. The idea of MSP is for to describe the structure of . However, in order to satisfy the MSP statement above, can just write down the value of . Then, can output that value.

To fix this issue, we observe that if understands the structure of , then it ought to be able to answer all sorts of questions about at least as well as sampling—not just its expected value—so long as those questions are not adversarially selected. To formalize this idea, we will modify the type signature of to take two inputs (here, stands for “context”), and require that be able accurately estimate for a random choice of .^[15] This change gives us an MSP statement that we are willing to stand behind.

Our actual MSP statement

Here is ARC’s mainline “matching sampling principle” (MSP):

For all architectures (with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:
For all parameters , there exists a short explanation (), such that:
For all tolerance parameters , satisfies the following three properties:
1. It runs in time .
2. Its error is competitive with sampling, on average over random :, where and .
3. It is mechanistic.

(Just as before, this statement isn’t fully formal, because of the informal “mechanistic” qualifier. But in practice, we have strong enough opinions about what counts as “mechanistic” that this statement is formal enough to guide our research.)

An interesting special case of the MSP is when encodes a universal Turing machine. See the appendix for discussion.

An important variant: Findable explanations

While the above MSP statement is the most theoretically clean one, on its face the statement is not very useful. That’s because it says nothing about being able to find the explanation ; what use is it to merely know that an adequate explanation exists, if we can’t find it?

This leads us to the following alternative statement, which we’ve been calling the “train and explain” formulation of the MSP:

For all architectures (with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:
For all “training” algorithms mapping random seeds to parameters , there exists an “explaining” algorithm mapping random seeds to explanations , with , such that:
For all tolerance parameters , satisfies the following three properties:
1. It runs in time .
2. Its error is competitive with sampling, on average over random and :, where , , and .
3. It is mechanistic.

In this statement, it is useful to think of as being the learning algorithm used to find (e.g. SGD) and as being the random bits used during training (the random initialization and random choices of training data used at each training step). Then, is the algorithm used to find the mechanistic explanation of : intuitively, it works “in parallel” with , observing the training process and “building up” the explanation in a way that mirrors the way that “builds up” structure in by iteratively modifying to get lower and lower loss.

Note that our mainline MSP statement is the special case of the “train and explain” formulation where and are both computationally unbounded (so that can select the worst parameters and can select the most helpful advice ).

In general, we think that for any computational constraints placed on (e.g. on time or memory), there is a corresponding with the same computational constraints that can find an adequate explanation . If we are correct, then that potentially gives us a strategy for efficiently computing properties of trained neural networks (such as catastrophe probability), while paying a relatively small alignment tax. (If finding takes as much time as finding , that’s an alignment tax of 100%: a small price to pay for avoiding catastrophe.)

Another modification: getting rid of

It turns out that the MSP can be stated without reference to a tolerance parameter, by subsuming the number of samples into the architecture instead. See this appendix for details.

Our progress so far

Over the course of 2025, ARC has made progress on the MSP in a few different directions. Concretely:

We have a mechanistic algorithm that competes with sampling (in theory and in practice) for estimating the size of an intersection of randomly chosen halfspaces.^[16] We have also generalized our algorithm to work for some other problems, such as estimating the satisfaction probability of a random CNF, or the permanent of a random matrix.
We believe we have a mechanistic algorithm that competes with sampling for estimating the expected output of random MLPs on Gaussian inputs. (We have an empirical demonstration of competitiveness with sampling, and a proof sketch that we are working on expanding into a full proof.)
We have made substantial progress toward a mechanistic algorithm that competes with sampling for estimating the expected output of a two-layer MLP on Gaussian inputs, where the second layer of the MLP is trained via gradient descent.

Our results are not yet ready for publication, but hope to get them ready in the coming months. In this section, I will briefly summarize these results and discuss the most interesting directions for future work.

Intersection of random half-spaces

The first problem we tackled in our “matching sampling” framework was mechanistically estimating the volume of the intersection of random half-spaces. Although it’s a somewhat toy problem, it wasn’t trivial to solve, and we learned a lot from solving it.

Problem statement

Find an algorithm that takes as input unit vectors and a tolerance parameter , and mechanistically estimates the probability that a randomly chosen unit vector has a nonnegative dot product with all of , such that:

The expected squared error of (over randomly chosen ) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random vectors and outputting the empirical fraction that have a nonnegative dot product).
The runtime of is competitive with the runtime of the naive sampling-based algorithm, i.e. .^[17]

Note that this is the MSP in the particular case where is the empty string; ; and returns 1 if for all . (The distribution of and each vector in is uniform over the unit sphere, rather than uniform over bit strings.)

Since is empty, there is no advice in this setting. Despite that, we think that our solution to this MSP instance has helped us progress toward solving the MSP in more generality.

(Alternatively, this problem can be viewed as an instance of the “train and explain” version of the MSP, where , is empty, and the training algorithm simply returns random vectors. In this setting, the explaining algorithm does not have time to do any interesting computation, so might as well be empty.)

A sketch of our solution

In a sentence, our solution is to build up a polynomial approximation of by considering one vector at a time.

To elaborate on this, for , let be the function that outputs 1 if , and 0 otherwise. Let us define (so ). We will:

Define to be the constant 1 function.
For all , compute the best low-degree polynomial approximation to . In the end, will be a polynomial approximation to , and we will output the exact expectation as our estimate of .

To be more precise, “low-degree polynomial” here means degree , where is the largest integer such that (it turns out that that’s the precision we need in order to compete with sampling). And “the best low-degree polynomial approximation” means the best approximation in terms of squared error, for drawn from the unit sphere.

In order for to be efficient enough, it needs to be able to compute this polynomial approximation in time . Getting the dependence on down to turns out to be pretty tricky.^[18] However, we were able to find a suitably efficient algorithm by working in the Hermite basis of polynomials instead of the standard monomial basis.

We have also generalized this approach to apply to a broader class of problems than just “intersection of randomly-chosen half-spaces.” Roughly speaking, we can apply our methods to estimate the expected product of symmetric random functions, for a certain representation-theoretic notion of symmetry. Concrete problems that we have solved with this approach include:

Estimating the satisfaction probability of -CNFs, where each literal is chosen uniformly at random.
Estimating the permanent of a matrix where each entry is either or , independently with probability 50%.

We are aiming to publish the details of our algorithm and this generalization in the coming months.

Random MLPs

In the last couple of months, we have been tackling a more sophisticated MSP instance: random MLPs. We now believe that we have an algorithm and a proof of its correctness and efficiency, though we are still verifying details. We also have an empirical demonstration that our algorithm is competitive with sampling.

Problem statement

Consider the following MLP architecture: the input size is (assume that is very large^[19]); there are hidden layers ( is some fixed constant), each of width ; the output is a scalar; and every hidden layer has some activation function (e.g. ReLU).

Find an algorithm that takes as input the weights of an MLP with the above architecture and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:

The expected squared error of (over independent, normally distributed^[20] weights) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random inputs and outputting the empirical average output).
The runtime of is competitive with the runtime of the naive sampling-based algorithm, i.e. forward passes.

Note that this is the MSP in the particular case where is the empty string; ; and returns the output of the MLP with weights on input . (The distribution of each component of and is Gaussian.)

(Similarly to the case of random half spaces, this problem can also be viewed as an instance of the “train and explain” version of the MSP, where , is empty, the training algorithm simply returns random weights, and might as well be empty.)

A sketch of our solution

In a sentence, our solution to this problem is cumulant propagation, a mechanistic estimation algorithm that we introduced in Appendix D of Formalizing the Presumption of Independence.

Cumulants are a type of summary statistic of a probability distribution. Loosely speaking, the cumulant operator takes a list of random variables and tells you something like their “multi-way correlation.” For example, is the mean of ; is the variance; is the covariance of and .

Cumulant propagation is a method that lets us make guesses about the cumulants of layer of a neural net based on a partial list of cumulants of layer . (The more complete the list of cumulants, the more accurate the guesses become.) To a first approximation, then, our algorithm is to:

Start with a list of cumulants of the inputs (this is easy because the input is Gaussian: for each input , and all other cumulants are ).
Use cumulant propagation to make guesses about the cumulants of the layer-1 activations,^[21] going up to -th order cumulants,^[22] where is the largest integer such that . Then do the same for the layer-2 activations, and so on.
Output our guess about the mean (i.e. first cumulant) of the output.

(This description leaves out many details, but gets across the main idea.)

Two-layer MLPs with a trained second layer

In parallel with tackling random MLPs, we have also been investigating two-layer MLPs where the hidden layer is very wide, and where the second layer of weights is trained. This is our first serious foray into trained and/or worst-case instances—and while we haven’t fully solved it, we have made substantial progress.

Problem statement

Consider the following MLP architecture: the input size is ; there is one hidden layer of size , where is very large; the output is a scalar; and there is an activation function at the hidden layer and the output layer.

Problem 1: Find an algorithm that takes as input the weights of an MLP with the above architecture ( contains the first-layer weights; contains the second-layer weights), an explanation , and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:

For all , the expected squared error of (over independent, normally distributed weights in ) is less than the expected squared error of the naive sampling-based algorithm (i.e. sampling random inputs and outputting the empirical average output).
The runtime of is competitive with the runtime of the naive sampling-based algorithm, i.e. forward passes.

Note that this is the MSP in the particular case where ; ; and returns the output of the MLP with weights on input .

Problem 2: Now, suppose that is trained via SGD to make the MLP match some target function. Extend the solution to Problem 1 by finding a linear-time algorithm that takes as input the full transcript of SGD and outputs .

This is the “train and explain” version of the MSP in the particular case where and the training algorithm is SGD with squared loss on an arbitrary target function.

A look at our progress so far

Unlike in the case of random MLPs, we do not expect cumulant propagation to work. That’s because, for worst-case , the largest cumulants will not necessarily be the low-order ones; thus, dropping high-order might not produce a good approximation. So what can we do instead?

Consider the function that maps the input to the final pre-activation (i.e. the output, but before the final activation function is applied). If we could find the cumulants of (i.e. the mean, variance, etc. of the final pre-activation on random inputs), then we would be able to find the mean of the output of the MLP. So how can we estimate these cumulants?

The function can be well-approximated by a high-degree, -variable polynomial in the inputs. And as it turns out, there is a neat way to express the cumulants of a multivariate polynomial as an infinite sum in terms of the polynomial’s coefficients in the Hermite basis. In particular, for each , the degree- coefficients can be written down in a -dimensional -tensor. (Having run out of Roman and Greek letters, we decided to call this tensor $ש$ .^[23]) Then the -th cumulant is the sum of all tensor contractions across tensor networks consisting of copies of $ש$ . This leaves us with two problems:

Computing the $ש$ -tensors.
Approximating the infinite sum, given the $ש$ -tensors.

Our solution to the first problem is to receive the $ש$ -tensors as advice. It turns out that, so long as (i.e. in the infinite-width limit of the hidden layer), we have enough room in to write down all of the $ש$ -tensors we need. (And for the “train and explain” version of the problem, we believe that we can learn the $ש$ -tensors in parallel with SGD.)

We are currently working on the second problem (summing up the tensor networks), and we have made substantial partial progress. If the hidden layer width is truly huge compared to the input size, then there is enough time to approximate the sum by brute force. If the hidden layer is large but not huge, then a more efficient algorithm is necessary. We are working on finding efficient ways to contract arbitrary tensor networks and being able to notice when a tensor network can only contribute negligibly to the sum (so that we can drop it from the sum).^[24]

Closing thoughts

I consider the MSP to be a significant step forward for ARC. Previously, we were interested in producing mechanistic estimates of mathematical quantities, but had no particular benchmark by which to judge our progress or deem our methods “good enough.” Now, we are holding ourselves to a standard that is philosophically justified (we believe that it ought to be possible for mechanistic estimates to compete with sampling), concrete (we can check whether our methods compete with sampling using empirical tests or formal proofs), and tied to a useful application (estimating properties of neural nets, such as catastrophe probability).

Formulating the MSP has allowed us to ask more concrete questions (e.g. “How can we construct a mechanistic algorithm that competes with sampling for estimating the average output of trained two-layer MLPs?”). We have solved some of these questions, made progress on others, and are continuing to make progress.

We plan to continue attacking the MSP from a number of directions:

Attempting to solve the MSP “on paper” (i.e. using mathematical tools) for specific instances (like the ones described in this post).
Using our theoretical methods to create state-of-the-art algorithms for estimation problems.
Approaching the problem from a more high-level or philosophical perspective in order to discern what sorts of mechanistic algorithms could compete with sampling in full generality.

If you’re interested in working with us on any of these directions, you can apply here!

Appendix

A note on advice verifiability

In our various MSP statements, we do not ask for to be able to “verify” that the explanation is “accurate” (i.e. correctly describes the structure of , instead of making false claims). Is that fine, or should we require to be verifiable by ?

At least in the “train and explain” version of the MSP, we do not believe that advice needs to be verifiable. This is for two reasons:

Our eventual goal is to implement any MSP solution for actual neural nets, and to have strong accuracy guarantees (on average over the randomness of training). Solving the “train and explain” version of the MSP for a given neural architecture already comes with an accuracy guarantee, even without the ability to verify the explanation that the explaining algorithm produces.
Suppose that the training algorithm finds by checking many random candidate values of until it finds a particularly unlucky one (e.g. one where is much lager than a full structural understanding of would suggest, just by chance). By observing the computations done by , can notice that is unlucky, but it cannot succinctly explain that fact in a verifiable way. All it can do is assert (in its explanation ) that is unlucky, and all can do is trust ’s assertion.

This last point seems a little bit at odds with my earlier assertion that only makes claims about the structure of , not the randomness. A more refined version of this assertion would be: should not assert any randomness in that happened by accident; but if has weird randomness due to some fact about the training process, then should reflect that fact.

What about our mainline MSP statement, where there is no explaining algorithm to track optimization done during training? If “comes out of nowhere,” are we comfortable with asserting facts about without a possibility of verification?

In my opinion, it’s fine for to be unverifiable, for essentially the same reason. If it’s fine for to claim that was selected to be as adversarial as possible via a brute force search (in the “train and explain” version of the MSP), then it seems fine for to claim that was selected to be as adversarial as possible by an omniscient oracle, if that’s how was selected.

For example, imagine that we can model as a random oracle—a completely different random function for each -- and the particular that’s chosen happens to be the one whose average output is furthest from ⁵⁰⁄₅₀. Then it seems fine for to assert that is a random oracle whose average output just so happens to be many standard deviations away from ⁵⁰⁄₅₀.

There might be natural versions of the MSP that require advice to be verifiable. However, such statements would require giving more time to run. Concretely, in the mainline MSP statement, we would ask for to run in time , where the extra factor of mitigates the selection pressure put into choosing . In the case where is a random oracle, this is exactly the amount of compute that needs to compete with sampling, if can convey that is a random oracle, but cannot assert anything about being a particularly unlucky random oracle. I like this version less, in part because we don’t expect that paying the extra factor will be feasible in practice.

A special case of the MSP: Universal Turing machines

One interesting case of the MSP is when the architecture is a universal Turing machine . In other words, interprets as the encoding of a Turing machine, and runs on the input -- except that we will say that is forced to halt after one million steps (so that we don’t need to worry about runtime). Applying the MSP to this special case gives the following assertion:

There exists an estimator such that:
For all Turing machines , there is an explanation (), such that:
For all tolerance parameters , satisfies the following three properties:
- It runs in time .
- On average over random , its error is competitive with sampling:
- It is mechanistic.

In other words, there is a single, universal that is able to mechanistically estimate the average output of any (time-bounded) Turing machine, if it is given advice that explains the Turing machine’s structure.

Note also that a solution to this special case would yield a solution to the full MSP: suppose that we had an estimator for universal Turing machines, and consider some other architecture . Then , where is a Turing machine whose size is plus some constant that only depends on . Consider the estimator that, on input , writes down the such that and returns . If some causes to output accurate estimates for , then will also cause to output accurate estimates for . Thus, this estimator solves our mainline MSP for .

The MSP statement can also be used to obtain a claim about mechanistically estimating random Turing machines, but without advice. Concretely, we will let be the empty string, and will instead say that interprets as the encoding of a Turing machine, and runs on the input . (As before, we force to halt after a million steps.) Applying the MSP to this special case gives the following assertion:

There exists an estimator such that:
For all tolerance parameters , satisfies the following three properties:
- It runs in time .
- Its error is competitive with sampling, on average over random :.
- It is mechanistic.

This is an interesting and arguably bold statement: it says that as gets more time to run, it is able to get a more and more accurate mechanistic estimate of the average output of the Turing machine . This is intuitive enough for Turing machines with no interesting structure (as is the case for most random Turing machines). However, in order to satisfy the accuracy guarantee above, must converge to the right answer for all Turing machines (even if the convergence is slower for Turing machines with more sophisticated structure). Such a would probably involve a systematic search for structure: loosely speaking, since it isn’t given an explanation, it must find the explanation on its own.

Getting rid of

Modulo a caveat (see below), it is possible to modify the MSP statement to get rid of the tolerance parameter . Concretely, suppose that the following statement—which specializes our mainline MSP statement to the case of -- is true:

For all architectures (with parameters ) mapping pairs to , there exists an estimator mapping tuples to , such that:
For all parameters , there exists a short explanation (), such that:
satisfies the following three properties:
1. It runs in time .
2. Its error is competitive with sampling, on average over random :, where and .
3. It is mechanistic.

We claim that our mainline MSP statement almost follows from this -less version. To see this, consider an arbitrary architecture , and fix a positive integer . We will define a modified architecture that has the same space of parameters as . Concretely, works as follows: it takes as input a list of inputs to , runs on all of them, and outputs the average value of for . We claim if some estimator solves the above MSP for all regardless of the particular value of , then also solves the mainline MSP for .

To see this, suppose that we have an estimator that solves the above MSP for , in the case of . This means that for any , there is an explanation such that :

Runs in time .
Has low error on average over random :.
Is mechanistic.

Note that and . Note also that runs in time . This means that that also solves the mainline MSP for the architecture if .

Now, if there is a single that solves the MSP for regardless of , then will solve the mainline MSP for for all .

The fact that we need a uniform regardless of means that we don’t quite have a full reduction; however, the above -less MSP statement is another interesting variant of MSP that is almost the same. We decided to make our mainline MSP statement contain a tolerance parameter in order to make the connection to the idea of matching sampling more intuitive.

Compression as a possible MSP approach

In this section, I will outline one possible approach to solving the MSP. For the sake of concreteness, I will consider the case where is a universal Turing machine (which was discussed above). As a reminder, this means that interprets as the encoding of a Turing machine, and then runs on the input .

We will say that a Turing machine is efficiently compressible if there is a significantly shorter Turing machine that, on any input , constructs in time and the runs on input . (We call an efficient compression of .) One possible approach to solving the MSP looks something like this:

We solve the MSP for instances that are not efficiently compressible. Concretely, we find an estimation algorithm that works for any that is not efficiently compressible, without any explanation .
We define the estimation algorithm to check that is an efficient compression of , and then to return if so.
- If is not efficiently compressible, then ’s output will be accurate (since solves the MSP for instances that are not efficiently compressible).

The hope for Step 1 is that an estimation approach that works in the average case (over random parameters) will work for all parameters that are not efficiently compressible.

The intuition underlying this hope is that structure implies efficient compression. In other words, if has structure that would make a mechanistic estimator mis-estimate its average output, then understanding that structure would allow us to represent more compactly (and in a way such that can be recovered quickly from the representation).

What about the “train and explain” version of the MSP? In order to adapt this approach to that setting, we also need to be able to learn the efficient compression in parallel with learning itself. If the training process has enough time to find a with special structure, then is there an “explaining process” that would have enough time to find the corresponding compression? That is unclear to me, but I think this direction is promising enough to be worth exploring.

If this approach is viable, then solving the MSP in the average case for some Turing-complete architecture (such as RNNs) would be a major step forward.

^
We could instead imagine that outputs a probability of catastrophe, but we will keep the range of to for simplicity of exposition.
^
Running might take much longer than running , which is why we can’t just run on every input during deployment.
^
For example, the parameterization of the distribution needs to be quite flexible, so as to allow distributions that are computationally intractable to sample from. For example, if acts catastrophically when it encounters a factorization of RSA-2048, we want to be able to train that behavior out of even if we can’t factorize RSA-2048. (See here for more discussion.)
^
In brief, we hope to address issue #1 via mechanistic anomaly detection. A little more concretely, our plan is to:
- Build to be smart and robust enough that can only fool by thinking deliberately about how to fool .
  - In the diamond-in-the-vault analogy, this means that the cameras are robust enough that the SmartVault can only fool them deliberately.
- Build a mechanistic explanation for why tends to be 0 on the training distribution.
  - This would let us flag inputs on which for an unusual reason. Let be this anomaly detector, i.e. if for an unusual reason.
- Instead of training to minimize , train to minimize (i.e. to maximize the probability that for the usual reason).
  - And similarly to above, train to maximize this expectation.
This leaves many details unexplained, but that’s the basic concept.
^
Or rather, independent up to small random variation that is unpredictable just from understanding the structure of and .
^
In practice, we will be interested in the behavior of neural nets on structured (rather than uniformly random) inputs. However, note that it is possible to create structured inputs out of random inputs via a generative model. For example, if we are interested in the behavior of a classification model on pictures of animals, we could let consist of two parts: first, a generative model that creates an image of an animal from random noise, and second, a classifier that takes the animal image as input.
^
Why do we require to be short? The basic reason is that, as we discuss later, we will be interested in learning in parallel with learning the parameters , and so we will want to be able to do a backward pass through as quickly as doing a backward pass through . We also have some amount of philosophical justification for believing an explanation the size of is sufficient. Essentially, we think that any object’s structure can be described compactly, because if the amount of (non-redundant) structure in an object is much larger than the size of the object itself, that would constitute an “outrageous coincidence”.
^
This works by taking our (Gaussian) model of layer and then modeling layer by finding the normal distribution that would minimize the KL divergence from the pushforward of our model of layer .
^
Covariance propagation does not require an explanation . However, some modifications of covariance propagation could require advice. For example, if is too large to allow for to compute all of the covariances, then could advise to only keep track of some particular covariances. Or, could tell about some important third-order correlations to keep track of.
^
Note that we use the word “heuristic” in place of “mechanistic” in that paper. I think that the word “mechanistic” conveys our goal slightly better.
^
As a very simple example, consider the circuit . Mean propagation estimates this circuit’s average output as rather than because it fails to notice the correlation induced by the presence of in the two conjunctive clauses.
^
Though, see the appendix on advice verifiability for some nuance on this point.
^
Eliezer Yudkowsky’s Worse Than Random makes a similar point:
As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid—so that a randomized algorithm seems wiser—you should be able to use the same knowledge to produce a superior derandomized algorithm.
^
Roughly speaking, this is the range where is a substantial fraction of the number of times that one needs to run to fully estimate its unstructured randomness.
^
We cannot require to be accurate for all . For example, suppose that interprets as a Turing machine and runs on the Turing machine . Requiring to be accurate for all would mean expecting to be able to mechanistically estimate the output of a worst-case Turing machine, without any structural advice at all. (After all, cannot depend on .) This is too much to ask for.
^
Note that this problem is equivalent to estimating the probability that a one-layer ReLU network outputs all zeros on a random input. Concretely, if the network is , then the output is all zeros if and only if for every row of .
^
Our algorithm’s runtime is , which is technically too slow in the case where . However, we are most interested in the regime where .
^
The naive approach is to treat this as a linear regression problem, where the covariance (inner product) between two polynomials and is defined as the expectation of for drawn from the unit sphere. However, doing this involves multiplying a matrix by a -vector, so the dependence of this algorithm on looks like : not fast enough.
^
We believe that our proof of correctness and efficiency works in the limit as , where the MLP depth is constant and .
^
Mean zero; standard deviation is chosen so that all activations have the same variance.
^
One complication to this picture is that, although we’ve defined cumulant propagation for sums and products of random variables, it’s not clear what it means to apply cumulant propagation to an activation function like ReLU: given the cumulants of , how does one estimate the cumulants of ? Our strategy is to find a polynomial approximation to the ReLU function (see the next paragraph of this footnote for details). Once we’ve done that, we can apply cumulant propagation as we’ve already defined it for sums and products.
What is the appropriate notion of polynomial approximation? It turns out that we can take the polynomial that minimizes mean squared error if is assumed to be normally distributed with mean equal to our estimate of and covariance equal to our estimate of . This is equivalent to taking the first several terms of the Hermite expansion of ReLU (appropriately centered and scaled).
^
Actually, it is more important to keep track of cumulants in which the same activation appears multiple times, so we need to keep track of some cumulants of order higher than that involve repeated indices.
^
That’s the Hebrew letter shin.
^
Concretely, a tensor network can only contribute substantially to the sum if every tensor in the network has a large operator norm. Thus, if in the process of contracting the tensor network, we find a tensor that has a small operator norm, we can cut off the computation and move onto the next tensor network.

What links here?

Eric Neyman, Victor Lecomte, Wilson Wu, Mikewins, Jacob_Hilton and George Robinson

18 Nov 2025 17:22 UTC

132 points

11 comments21 min readLW link

AI Alignment Research Center (ARC)

David Duvenaud 19 Nov 2025 15:48 UTC
10 points
2
Language suggestion: I would replace “random sampling” everywhere in this article with “simple Monte Carlo”, which means “estimating an expectation by taking the mean of samples from the distribution”. There are lots of sampling-based methods more sophisticated, and lower-variance than, simple Monte Carlo, such as importance sampling. In fact, you might be able to use the methods you outlined to build better proposal distributions for importance sampling. This would still be a sampling-based method, but better than simple Monte Carlo.
- Eric Neyman 19 Nov 2025 19:00 UTC
  9 points
  0
  Parent
  Thanks for the suggestion!
  For what it’s worth, we believe that a mechanistic estimator can beat all sampling-based methods, no matter how sophisticated they are. The philosophical reason for this is that sophisticated sampling-based methods outperform simple Monte Carlo by exploiting structure in the function whose average value they’re estimating—but a mechanistic estimator can exploit that same structure, too.
  In fact, I think it almost follows from the MSP that we can beat any sampling-based method. To see this, suppose you have some sophisticated estimator $Est (M_{θ}, r)$ , which is given a neural net $M_{θ}$ and some random coin flips $r$ as input, and produces a sophisticated, unbiased, low-variance estimate of $E [M_{θ}]$ using $r$ . Now, define the architecture $M^{'}$ as: $M_{θ}^{'} (x) = Est (M_{θ}, x)$ . The MSP says that we need to be able to estimate the average output of $M_{θ}^{'}$ (which is the same as the average output of $M_{θ}$ ) with squared error less than or equal to the variance of $M_{θ}^{'}$ , in the time that it takes to run $M_{θ}^{'}$ . (We’re taking $ε = 1$ here.) In other words, given any sophisticated sampling algorithm for estimating the average output of $M_{θ}$ , there needs to be a corresponding mechanistic estimator that gets lower (or equal) error in the same amount of time.
  (I think this argument isn’t perfectly tight, because it’ll probably run into the same uniformity issues that I discussed in the “getting rid of $ε$ ” appendix, which is why I said “almost follows” rather than “follows”.)
  - David Duvenaud 20 Nov 2025 19:55 UTC
    2 points
    0
    Parent
    Right, like many people I agree that in most cases derandomization is a good idea if you can afford a little extra code complexity. But the structure of your argument makes it sound like that’s the important part, when actually it’s taking advantage of structure that’s the important part, sampling or not.
    
    Concretely, I would change “Understanding structure helps outperform sampling” to simply “Understanding structure helps”.
Alexander Gietelink Oldenziel 24 Nov 2025 14:21 UTC
5 points
0
Congratulations! I have been getting significantly more excited about what ARC is doing this year. Happy to see so much genuine recent progress after the proverbial 40 years of desert wandering.
Logan Riggs 20 Nov 2025 14:09 UTC
5 points
0
Not claiming to understand your work, but my intuition is that a bilinear layer would be easier to prove things about than an MLP, while also being closer to SOTA. Some (maybe) useful properties for your case:
- bilinear
  - directions are what matters for relationships between components (between any part of any matrix, where the input can be considered just another matrix); scale doesn’t affect compositionality.
  - No implicit computation (eg different polytopes of MLPs), just the structure in the weights.
- a polynomial
- can be turned into a tensor (& combined into a larger tensor when you have multiple layers)
  - Note: a bilinear layer is already a CP decomposition (generalization of SVD, maintaining the outer product format), but combining two bilinear layers, you get a 5th order tensor, which you can decompose as well.
For performance, folks tend to use a swiGLU, where a bilinear layer is similar to and almost as performant (Table 1 in Noam Shazeer’s paper). Interesting enough, it’s better than MLPs w/ ReLU/GeLU/Swish.
Gurkenglas 19 Nov 2025 13:40 UTC
4 points
2
We would be really interested in finding a way to mechanistically estimate the average output of random recurrent neural networks (RNNs).
Have you seen Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes?
- Mikewins 19 Nov 2025 19:33 UTC
  9 points
  8
  Parent
  I am affiliated with ARC and played a major role in the MLP stuff
  I’m loosely familiar with Greg Yang’s work, and very familiar with the ‘Neural Network Gaussian Process’ canon. It’s definitely relevant, especially as an intuition pump, but it tends to answer a different question. They answer ‘what is the distribution of quantities x y and z over the set of all NNs’ where quantities x y and z might be some preactivation on specific inputs. Knowing that they are jointly gaussian with such-and-such covariance has been a powerful intuition pump for us. But the main problem we want is an algorithm that takes in a specific NN with specific weights and tells us about the average over inputs.
  I’ve found that this distinction is a powerful antimeme, and every time I give a presentation on the topic I have a slide on the difference between averaging over x versus over theta. By the end of the talk the audience is clamoring to recommend I read Principles of Deep Learning Theory (which is lovely if you want to improve on NNGP, but not relevant to calculating averages for a specific value of theta.).
  - Gurkenglas 20 Nov 2025 8:40 UTC
    2 points
    0
    Parent
    yeah while I noticed the distinction i usually find it worthwhile to try to steal tools across problem statements that use the same words in a different order, i’ll use your data point to downweight that heuristic a little thanks :p
    Did knowing that the joint-gaussian thing generalizes to RNNs influence your decision to look at RNNs next?
    - Mikewins 20 Nov 2025 17:03 UTC
      1 point
      0
      Parent
      Honestly for me it’s more of a strike against RNNs. Real deep neural networks that have been trained don’t have this property, so it’s a bridge we’re going to need to cross at some point regardless. From a derisking point of view I’d kind of like to get to that point ASAP. There’s a lot of talk about looking at random boolean circuits (which very obviously don’t have this property), narrow MLPs, or even jumping all the way to wide MLPs trained in some sort of mean-field/maximum update regime that gets rid of it.
Gurkenglas 19 Nov 2025 13:35 UTC
2 points
0
Do you expect that whether the infinite sum can be approximated is related to whether the architecture averts the vanishing gradient problem?
- Mikewins 19 Nov 2025 19:21 UTC
  3 points
  0
  Parent
  I am affiliated with ARC and played a major role in the MLP stuff
  The particular infinite sum discussed in this post is used for approximating MLPs with just one hidden layer, so things like vanishing gradients can’t matter.
  We are now doing work on deeper MLPs. In this case, the vanishing gradients story definitely does seem relevant. We definitely don’t fully understand every detail, but I’ll mouth off anyways.
  On one hand, there are hyperparameter choices where gradients explode. It turns out that in this regime, matching sampling: exponentially large gradients mean that the average is a sum over exponentially many tiny and more-or-less independent regions, so it’s exponentially small and you can beat sampling by just answering zero until your compute budget is exponential. Even then you might not have the budget to go through all of input space, but we have ideas on what you could do.
  On the other hand, there are regions where gradients vanish. Here, also, it seems like things work out. Since the inputs to neurons in later layers concentrate, you can do very well by just approximating the later non-linearities as linearizations around the concentrating point.
  Then there’s the so-called ‘critical hyperparameters’ right in the middle. It turns out that for MLPs the gradients vanish (albeit more slowly) even at criticality. Back-of-the-envelope, it seems to go to zero just fast enough.
  There are other convergence-related questions (in particular, a lot of these series are so-called ‘asymptotic series’ which officially never converge), and at smaller epsilons/higher compute budget those issues come into play as well, but I don’t think they’re connected to gradients.

ARC progress update: Competing with sampling

Outperforming sampling as a step toward preventing AI misalignment

Understanding structure helps outperform sampling

A non-human understanding

The matching sampling principle

A first attempt at stating the MSP

What makes an algorithm “mechanistic”?

Why do we require to be mechanistic?

The intuition behind the MSP

Why only matching sampling?

An issue: can just tell the answer

Our actual MSP statement

An important variant: Findable explanations

Another modification: getting rid of

Our progress so far

Intersection of random half-spaces

Problem statement

A sketch of our solution

Related questions

Random MLPs

Problem statement

A sketch of our solution

Related questions

Two-layer MLPs with a trained second layer

Problem statement

A look at our progress so far

Related questions

Closing thoughts

Appendix

A note on advice verifiability

A special case of the MSP: Universal Turing machines

Getting rid of

Compression as a possible MSP approach