It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:

There exists an efficient algorithm which when run on a model input will output which mechanism model uses when run on .

There exists an efficient algorithm which when run on a model will output another programme such that when we run on a model input it will output (in reasonable time) which mechanism uses when run on .

In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)

In your “prime detection” example, we can do (1) - using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).

It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)) - for example, as a requirement for interpretability—but I think in this case we’d really need (2) in order for these methods to be useful in general.

Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).

But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)

Since my intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or at least am inclined to focus on such worlds), I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.

Here’s an argument that I think works in the limit where m→∞. For very large m, each subset S⊆[n] of size s will occur many times as an Si. Moreover, for each set of coodinate values xS:=(xj)j∈S, there will be many xi such that Si=S and xiS=xS. Therefore, using the law of large numbers for any subset S⊆[n] of size s and any set of coordinate values xS, Bob can infer N(xS,f):=#{x′∈{0,1}n:x′S=xS and f(x′)=1}, i.e. the number of inputs x′ that agree with xS at the coordinates indexed by S for which f(x′)=1. Additionally, the random variable Y contains no other information about f that is not contained in this set of statistics.

Given a function f:{0,1}n→{0,1}, let F(f) denote the set of all functions f′ such that for each subset S⊆[n] of size s and each set of coordinate values xS, N(xS,f)=N(xS,f′). The argument of the previous paragraph then implies that upon observing Y Bob can deduce the equivalence class F(f), but no other information about f from Y.

If we have a random variable X and we define another random variable Y as a deterministic function Y(X), then it is easy to show that the set of distributions of X which maximize the mutual information between Y and X are precisely those for which Y has a uniform distribution. Applying this to the above setup, we see that the distributions over f which maximize the channel capacity are those for which F(f) is uniformly distributed over all the values that it can take.

If we suppose in addition that f has the maximum entropy distribution subject to this constraint, we find that:

P[f]∝1/(#{f′:N(xS,f)=N(xS,f′) for all S⊆[n],xS∈{0,1}S}).

Intuitively, this says that the probability of a given function f is inversely proportional to the number of other functions f′ that have the same number of ones on every hyperplane defined by fixing a set of some s coordinates xS. This seems to correspond roughly to sensitivity: we intuitively expect there to be a lot of such functions f′ when the number of ones that f outputs on most hyperplanes is approximately half of the number of total points on that hyperplane, and saying that a function’s output is approximately half ones and half zeros on a given hyperplane is roughly saying that f is sensitive to the remaining unspecified coordinates.

It’s not obvious to me that the above expression for P[f] is feasible to compute in practice, but I think it is fairly naturally interpretable.

The key reason the problem is tractable in the case m→∞ is that the law of large numbers means the probabilities wash out and we get an essentially deterministic problem. In the case where m is finite, this won’t happen and in particular I expect you’ll run into complications that arise from interaction terms where the method for combining the information from two observations Yi and Yj with Si≠Sj is not clean.

I expect you’re more likely to end up with a tractable solution if you rephrase the problem statement slightly so that the subsets of inputs over which you’re agregating outputs when observed (in the case above, these subsets are the hyperplanes defined by the coordinates xS) are disjoint or overlapping. (For example, taking all the Si to be random but equal would ensure this.) It strikes me that there might be a formulation like this that still captures the key features of the problem you’re interested in, but I’ve not thought about this in detail.