Matthew Farrugia-Roberts

Karma: 52

AI safety researcher. See my personal website: https://far.in.net/

Matthew Farrugia-Roberts 15 Jun 2026 10:42 UTC
2 points
0
on: Optimisation over non-stationary distributions creates weirder minds
Conditional policy: encapsulated specialists under a thin routing layer. For each training regime there is a naive optimiser that gets selected by a conditional policy.
...
In regimes where traits trade off against each other and environments are easy to distinguish from each other, the model can easily learn a router between policies in order to prevent harmful interference.
Pretty unclear to me why a conditional policy should exist as a distinct category of structure. Here are some doubts:
- Doesn’t a conditional policy meet the definition of an ecological generalist as stated? A conditional policy is a single structure (albeit a non-overlapping/branching one) that works everywhere. Maybe you want to clarify that an ‘ecological generalist’ should actually have shared internal structure, but then this seems like an unnatural definition that excludes an extreme case for the sake of it.
- Why would there ever be an incentive to construct a conditional policy’s router? So what I have in mind is that you are at one point training on distribution A, and then abruptly you stop and start training on distribution B. If so, I agree there is a need to learn some specialist structures for distribution A during that first stage, and then to learn specialist structures for distribution B during the second stage. But there is no point at which the model has to actually learn to make complicated routing decisions about which structure to use. During the first stage, it’s always the right move to use the A-specialist structure. During the second stage, it’s always the right move to use the B-specialist structure. In order for the model to be incentivised to distinguish these, I think you need either some overlap between the stages or repeatedly switching back and forth between distributions A and B; time where the model is incentivised to keep both specialist structures and learn when to use each.

Matthew Farrugia-Roberts 15 Jun 2026 10:25 UTC
2 points
0
on: Optimisation over non-stationary distributions creates weirder minds
Similar motivations led to this recent workshop paper.

Prior work showed that when you train transformers on in-context linear regression tasks, they will learn a function that either specialises to the specific tasks you show them during training (task memorisation) or implements a general regression algorithm that can also handle unseen tasks like ridge regression (task generalisation); and this depends on the number of tasks you show it during training.
In this new paper, we tried changing the set of tasks you show the transformer during training to emulate non-stationary post-training. There are two new possibilities: the model can either continually update its memorised algorithm to work with the latest set of tasks (seems like your concept of ‘strategy churn’) or can learn the general regression algorithm that can handle past and future tasks (seems like ‘ecological generalist’). Indeed we see these two outcomes. The slower we change the tasks, the more likely we are to see strategy churn.
It’s pretty unclear to me exactly what is determining the decision. Hypothetically, the transformer has some kind of “recency window” which extends back some way over the history of changing tasks (but not all the way to the start), and it decides which tasks to remember based on what is in the recency window (if there are lots of tasks, it will fit them using the generalising algorithm, like in the stationary case). But this seems pretty mechanistically naive, since if you are memorising a set of tasks and one of the tasks changes by just a little, it seems likely that you’ll just update your internal memory of that task in a way that completely forgets the old version and replaces it with the new version. (IDK if this makes sense; I think I need to write it out more carefully.)
Whatever the case in this setting, it seems to me a promising next step for testing these principles is to design a toy learning problem that better captures the space of low/high shared structure x low/high distinguishability than this in-context linear regression setting, and then run some experiments like the ones we ran but in the new setting.

Matthew Farrugia-Roberts 15 Jun 2026 10:10 UTC
2 points
1
on: Optimisation over non-stationary distributions creates weirder minds
So what actually emerges from training under non-stationary distributions? Our best guess is that this depends primarily on two factors:

(1) Distinguishability of training distributions: how much computational overhead is required to tell what kind of training distribution you’re in.
(2) Pressure for circuit sharing: how much skills transfer from one distribution to another relative to the cost of circuit separation.
A third factor that seems relevant is how much optimisation pressure you apply towards each distribution in sequence, in other words how slowly you switch between distributions. For a given fixed amount of shared structure and distinguishability (your two factors), if you increase optimisation power / decrease switching frequency, I expect you’ll increase your chance of seeing strategy churn.

Distilling Replacing Guilt

Matthew Farrugia-Roberts 9 Mar 2026 12:55 UTC

24 points

0 comments · 8 min read · LW link

(far.in.net)

Matthew Farrugia-Roberts 5 Mar 2026 22:03 UTC
3 points
0
on: Announcement: Iliad Intensive + Iliad Fellowship
I gather that the Iliad Fellowship announced here is the same as the June 2026 version of the PIBBSS x Iliad Research Residency advertised on the PrincInt website. Is that correct?
If so, then there appear to be two different ways to apply to the same programme: (1) the form listed on the PrincInt website + (2) your unified form linked in this post. If one previously applied to the former, is there any need to re-apply via the new form?
(I understand that to apply to the intensives, that is only possible via the new form.)

Matthew Farrugia-Roberts 16 Feb 2026 10:01 UTC
1 point
0
on: A Framework for Eval Awareness
In part (c), one of the general classes of methods you propose, for the goal of getting models to avoid conditioning their behaviour on their knowledge that they are in an evaluation, is:
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues.
Do you have anything concrete in mind here? It seems like you are saying that we should avoid models conditioning their behaviour on their knowledge that they are in evaluation by somehow getting them not to condition their behaviour on their knowledge that they are in evaluation. This seems like it’s just a restatement of the goal. (But I could be missing something.)
Generally, I’m worried about proposals that say we should try to make evaluations more robust by making our models less likely to engage in strategic behaviour for misaligned purposes. At least for an important class of evaluations, if we had models we could trust not to engage in strategic behaviour for misaligned purposes, we wouldn’t need to evaluate them. The point of these evaluations should be that we don’t have to trust the models not to engage in strategic behaviour for misaligned purposes, because if they did, the evaluations would catch them. Evaluations that assume this is not the case are not useful. (I suppose there are some classes of evaluations that might be able to get away with this assumption, because they are testing something orthogonal.)

Matthew Farrugia-Roberts 26 Jan 2025 7:55 UTC
7 points
1
on: Kessler’s Second Syndrome
Seems worth it 👾

Matthew Farrugia-Roberts 27 Nov 2024 11:59 UTC
3 points
0
on: The Queen’s Dilemma: A Paradox of Control
W. Ross Ashby’s Law of Requisite Variety (1956) suggests fundamental limits to human control over more capable systems.
This law sounds super enticing and I want to understand it more. Could you spell out how the law suggests this?
I did a quick search of LessWrong and Wikipedia regarding this law.
- “… Ashby’s “Law of requisite variety”, which roughly speaking states that a system can only remain in homeostasis if it has more internal states than the external states it encounters.” from Yuxi_Liu, “Cybernetic dreams”.
- “Either the AI is too simple to be an independent robust agent in human society, or it needs to be approximately as complex as humans themselves. Cf. the law of requisite variety.” from Roman Leventov, “For alignment, we should simultaneously use multiple theories of cognition and value”.
- “This law (of which Shannon’s theorem 10 relating to the suppression of noise is a special case) says that if a certain quantity of disturbance is prevented by a regulator from reaching some essential variables, then that regulator must be capable of exerting at least that quantity of selection.” from W. R. Ashby (1960), “Design for a Brain”, p. 229, quoted via Wikipedia page.
Enough testimonials, the Wikipedia page itself describes the law as based on the observation that in a two-player game between the environment (disturber) and a system trying to maintain stasis (regulator), if the environment has D moves that all lead to different outcomes (given any move from the system), and the system has R possible responses, then the best the system can do is restrict the number of outcomes to D/R.
I can see the link between this and the descriptions from Yuxi_Liu, Roman Leventov, and Ashby. Your reading is a couple of steps removed. How did you get from D/R outcomes in this game to “fundamental limits to human control over more capable systems”? My guess is that you simply mean that if the more capable system is more complex / has more available moves / more “variety” than humans then the law will apply with the human as the regulator and the AI as the disturber. Is that right? Could you comment on how you see capability in terms of variety?
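To make the D/R bound from the Wikipedia description concrete, here is a minimal brute-force sketch. The outcome table `(d + r) % D` is a hypothetical choice made only for illustration (it is not taken from Ashby or from the post); it satisfies the premise that, for any fixed response, different disturbances lead to different outcomes. The function names are likewise made up for this sketch.

```python
from itertools import product
from math import ceil


def min_distinct_outcomes(num_disturbances, num_responses, outcome):
    """Brute-force the regulator's best strategy.

    A strategy assigns one response to each disturbance. Returns the
    smallest number of distinct outcomes that any strategy achieves.
    """
    best = num_disturbances
    for strategy in product(range(num_responses), repeat=num_disturbances):
        outcomes = {outcome(d, r) for d, r in enumerate(strategy)}
        best = min(best, len(outcomes))
    return best


def cyclic_outcome(num_disturbances):
    """Hypothetical outcome table: for any fixed response r, distinct
    disturbances d give distinct outcomes (the premise of the law)."""
    return lambda d, r: (d + r) % num_disturbances


if __name__ == "__main__":
    for D, R in [(6, 1), (6, 2), (6, 3), (5, 2)]:
        best = min_distinct_outcomes(D, R, cyclic_outcome(D))
        print(f"D={D}, R={R}: best={best}, ceil(D/R)={ceil(D / R)}")
```

In these small cases the regulator's best result equals D/R rounded up: with only R responses, each outcome can absorb at most R disturbances, so more variety in the disturber than in the regulator leaves residual variety in the outcomes.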

Matthew Farrugia-Roberts 27 Nov 2024 11:38 UTC
9 points
4
on: The Queen’s Dilemma: A Paradox of Control
I like this analogy, but there are a couple of features that I think make it hard to think about:
1. The human wants to play, not just to win. You stipulated that “the human aims to win, and instructs their AI teammate to prioritise winning above all else”. The dilemma then arises because the aim to win cuts against the human having agency and control. Your takeaway is “Even perfectly aligned systems, genuinely pursuing human goals, might naturally evolve to restrict human agency.”
So in this analogy, it seems that “winning” stands for the human’s true goals. But (as you acknowledge) it seems like the human doesn’t just want to win, but actually wants both some “winning” and some “agency”. You’ve implicitly tried to factor the entirety of the human’s goals into the outcome of the game, but you have left some of the agency behind, outside of this objective, and this is what creates the dilemma.
For an AI system that is truly ‘perfectly aligned’—truly pursuing the human’s goals, it seems like either
- (A) the AI partner would not pursue winning above all else, but would allow some human control at the cost of some ‘winning’, or
- (B) if it were possible to actually factor the human’s meta-preference for having agency into ‘winning’, then we shouldn’t care if the AI plays to win above all else, because that already accounts for the human’s desired amount of agency.
For an AI system not perfectly aligned, this becomes a different game (in the sense of game theory). It’s a three player game between the AI partner, the human partner, and the opponent, each of which have different objectives (the difference between the AI and human partners is that the human wants some combination of ‘winning’ and ‘agency’ while the AI just wants ‘winning’; probably the opponent just wants both of them to lose). One interesting dynamic that could then arise is that the human partner could threaten and punish the AI partner by making worse moves than the best moves they can see if the AI doesn’t give them enough control. To stop the human from doing this, the AI either has to
- (C) negotiate to give the human some control, or
- (D) remove all control from the human (e.g. force the queen to have no bad moves or no moves at all).
In particular, (D) seems like it would be expensive for the AI partner as it requires playing without the queen (against an opponent with no such restriction), so maybe the AI will let the human play sometimes.
2. I don’t think it needs to be a stochastic chess variant. The game is set up so that the human gets to play whenever they roll a 6 on a (presumably six-sided) die. You said this stands in for the idea that in the real world, the AI system makes decisions on a faster timescale than the human. But this particular mechanism of implementing the speed differential as a game mechanism comes at the cost of making the chess variant stochastic. I think that determinism is an important feature of standard chess. In theory, you can solve chess with an adversarial look-ahead search, mini-max, alpha-beta pruning, etc. But as soon as dice become involved, all of the players involved have to switch to expecti-mini-max. Rolling a six can suddenly throw off the tempo in your delicate exchange or your whirlwind manoeuvre. Etc.
I’m a novice at chess, so it’s not like this is going to make a difference to how I think about the analogy (I will struggle to think strategically in both cases). And maybe a sufficiently accomplished chess player is familiar with stochastic variants already. But for someone in-between who is familiar with deterministic chess, maybe it’s easier to consider a non-stochastic variant of the chess game, for example where the human gets the option to play every 6 turns (deterministically), which gives the same speed differential in expectation.
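As a quick check of the “same speed differential in expectation” claim, here is a small sketch comparing the two variants. It assumes (my reading, not something the post specifies) that one die is rolled per turn, independently, and that the deterministic variant hands the human every sixth turn. The function names are hypothetical.

```python
from fractions import Fraction


def stochastic_stats(num_turns, p=Fraction(1, 6)):
    """Mean and variance of the number of human moves when each turn is
    a human move independently with probability p (a binomial count)."""
    mean = num_turns * p
    variance = num_turns * p * (1 - p)
    return mean, variance


def deterministic_stats(num_turns, period=6):
    """The human moves on every `period`-th turn, so the count is fixed."""
    return Fraction(num_turns // period), Fraction(0)


if __name__ == "__main__":
    print("stochastic:   ", stochastic_stats(60))
    print("deterministic:", deterministic_stats(60))
```

Over 60 turns both variants give the human 10 moves on average, but only the dice-based variant has non-zero variance, which is what forces the players into expectation-based search.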

Matthew Farrugia-Roberts 6 Jul 2023 22:12 UTC
3 points
0
on: Jesse Hoogland on Developmental Interpretability and Singular Learning Theory
There is a typo in the transcript. The name of the creator of singular learning theory is “Sumio Watanabe” rather than “Sumio Aranabe”.

Matthew Farrugia-Roberts 26 Jun 2023 9:51 UTC
9 points
1
in reply to: Leon Lang’s comment on: DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks
I think these are helpful clarifying questions and comments from Leon. I saw Liam’s response. I can add to some of Liam’s answers about some of the definitions of singular models and singularities.
1. Conditions of regularity: Identifiability vs. regular Fisher information matrix
Liam: A regular statistical model class is one which is identifiable (so $p (y | x, w_{1}) = p (y | x, w_{2})$ implies that $w_{1} = w_{2}$ ), and has positive definite Fisher information matrix $I (w)$ for all $w \in W$ .
Leon: The rest of the article seems to mainly focus on the case of the Fisher information matrix. In particular, you didn’t show an example of a non-regular model where the Fisher information matrix is positive definite everywhere.
Is it correct to assume models which are merely non-regular because the map from parameters to distributions is non-injective aren’t that interesting, and so you maybe don’t even want to call them singular?
As Liam said, I think the answer is yes—the emphasis of singular learning theory is on the degenerate Fisher information matrix (FIM) case. Strictly speaking, all three classes of models (regular, non-identifiable, degenerate FIM) are “singular”, as “singular” is defined by Watanabe. But the emphasis is definitely on the ‘more’ singular models (with degenerate FIM) which is the most complex case and also includes neural networks.
As for non-identifiability being uninteresting, as I understand, non-regularity arising from certain kinds of non-local non-identifiability can be easily dealt with by re-parametrising the model or just restricting consideration to some neighbourhood of (one copy of) the true parameter, or by similar tricks. So, the statistics of learning in these models is not strictly-speaking regular to begin with, but we can still get away with regular statistics by applying such tricks.
Liam mentions the permutation symmetries in neural networks as an example. To clarify, this symmetry usually creates a discrete set of equivalent parameters that are separated from each other in parameter space. But the posterior will also be reflected along these symmetries so you could just get away with considering a single ‘slice’ of the parameter space where every function is represented by at most one parameter (if this were the only source of non-identifiability—it turns out that’s not true for neural networks).
It’s worth noting that these tricks don’t generally apply to models with local non-identifiability. Local non-identifiability means, roughly, that there are extra true parameters in every neighbourhood of some true parameter. However, local non-identifiability implies that the FIM is degenerate at that true parameter, so again we are back in the degenerate FIM case.
2. Linear independence condition on Fisher information matrix degeneracy
Leon: What is $x$ in this formula [“ $\left\{ \frac{\partial}{\partial w_{j}} f (x, w) \right\}_{j = 1}^{d}$ is linearly independent”]? Is it fixed? Or do we average the derivatives over the input distribution?
Yeah I remember also struggling to parse this statement when I first saw it. Liam answered but in case it’s still not clear and/or someone doesn’t want to follow up in Liam’s thesis, $x$ is a free variable, and the condition is talking about linear dependence of functions of $x$ .
Consider a toy example (not a real model) to help spell out the mathematical structure involved: Let $f (x, w) = (w_{1} + 2 w_{2}) x$ so that $\frac{\partial}{\partial w_{1}} f (x, w) = x$ and $\frac{\partial}{\partial w_{2}} f (x, w) = 2 x$ . Then let $g$ and $h$ be functions such that $g (x) = x$ and $h (x) = 2 x$ . Then the set of functions $\{g, h\}$ is a linearly dependent set of functions because $h - 2 g = 0$ .
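To tie the toy example back to the Fisher information matrix, here is a short worked calculation. It assumes, purely for illustration, a regression model with unit-variance Gaussian noise around $f (x, w)$ , in which case the FIM is the expected outer product of the parameter gradients of $f$ :

$$
I (w) = \mathbb{E}_{x} \left[ \nabla_{w} f (x, w) \, \nabla_{w} f (x, w)^{\top} \right]
= \mathbb{E}_{x} \left[ x^{2} \right] \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix},
\qquad
\det I (w) = \mathbb{E}_{x} \left[ x^{2} \right]^{2} (1 \cdot 4 - 2 \cdot 2) = 0 .
$$

Under that assumption the FIM is degenerate at every $w$ , with null direction $(2, -1)$ . That is the same linear dependence between $g$ and $h$ , expressed in parameter space: moving along $(2, -1)$ leaves $w_{1} + 2 w_{2}$ , and hence the function, unchanged.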
3. Singularities vs. visually obvious singularities (self-intersecting curves)
Leon: One unrelated conceptual question: when I see people draw singularities in the loss landscape, for example in Jesse’s post, they often “look singular”: i.e., the set of minimal points in the loss landscape crosses itself. However, this doesn’t seem to actually be the case: a perfectly smooth curve of loss-minimizing points will consist of singularities because in the direction of the curve, the derivative does not change [sic: ‘derivative is zero’, or ‘loss does not change’, right?]. Is this correct?
Right, as Liam said, often^[1] in SLT we are talking about singularities of the Kullback-Leibler loss function. Singularities of a function are defined as points where the function is zero and has zero gradient. Since $K$ is non-negative, all of its zeros are also local (actually global) minima, so they also have zero gradient. Among these singularities, some are ‘more singular’ than others. Liam pointed to the distinction between degenerate singularities and non-degenerate singularities. More generally, we can use the RLCT as a measure of ‘how singular’ a singularity is (lower RLCT = more singular).
As for the intuition about visually reasoning about singularities based on the picture of a zero set: I agree this is useful, but one should also keep in mind that it is not sufficient. These curves just show the zero set, but the singularities (and their RLCTs) are defined not just based on the shape of the zero set but also based on the local shape of the function around the zero set.
Here’s an example that might clarify. Consider two functions $J, K : \mathbb{R}^{2} \to \mathbb{R}$ such that $J (x, y) = x y$ and $K (x, y) = x^{2} y^{2}$ . Then these functions both have the same zero set $\{(x, y) : x = 0 \lor y = 0\}$ . That set has an intersection at the origin. Observe the following:
- Both $J (0, 0) = 0$ and $\nabla J (0, 0) = \vec{0}$ , so the intersection is a singularity in the case of $J$ .
- The other points on the zero set of $J$ are not singular. E.g. if $y = 0$ but $x \neq 0$ , then $\nabla J (x, 0) = (0, x) \neq \vec{0}$ .
- Even though $K$ has the exact same zero set, all of its zeros are singular points! Observe $\nabla K (x, y) = (2 x y^{2}, 2 x^{2} y)$ , which is zero everywhere on the zero set.
In general, it’s a true intuition that intersections of lines in zero sets correspond to singular points. But this example shows that whether non-intersecting points of the zero set are singular points depends on more than just the shape of the zero set itself.
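The comparison between $J$ and $K$ can be checked mechanically. This is a minimal sketch using the hand-derived gradients from the example; the sample points and helper names are arbitrary choices made for illustration.

```python
def grad_J(x, y):
    """Gradient of J(x, y) = x * y."""
    return (y, x)


def grad_K(x, y):
    """Gradient of K(x, y) = x**2 * y**2."""
    return (2 * x * y**2, 2 * x**2 * y)


def is_singular_point(value, gradient):
    """A singular point: the function and its gradient both vanish."""
    return value == 0 and all(g == 0 for g in gradient)


# Sample points on the shared zero set {x = 0 or y = 0}.
zero_set_points = [(0, 0), (1, 0), (-2, 0), (0, 3), (0, -1)]

J_singular = [
    (x, y) for x, y in zero_set_points
    if is_singular_point(x * y, grad_J(x, y))
]
K_singular = [
    (x, y) for x, y in zero_set_points
    if is_singular_point(x**2 * y**2, grad_K(x, y))
]

if __name__ == "__main__":
    print("singular points of J:", J_singular)
    print("singular points of K:", K_singular)
```

Only the origin is reported as singular for $J$ , while every sampled point of the same zero set is singular for $K$ .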
In singular learning theory, the functions we consider are non-negative (Kullback-Leibler divergence), so you don’t get functions like $J$ with non-critical zeros. However, the same caution extends from the existence of singularities to their degree: it is risky to judge how singular a point is just by looking at the shape of the zero set, because the RLCT depends on how the function behaves in the neighbourhood, not just on the zero set.
1. ^
  One exception, you could say, is in the definition of strictly singular models. There, as we discussed, we had a condition involving the degeneracy of the Fisher information matrix (FIM) at a parameter. Degenerate matrix = non-invertible matrix = also called singular matrix. I think you could call these parameters ‘singularities’ (of the model).
  One subtle point in this notion of singular parameter is that the definition of the FIM at a parameter $w$ involves setting the true parameter to $w$ . For a fixed true parameter, the set of singularities (zeros of KL loss wrt. that true parameter) will not generally coincide with the set of singularities (parameters where the FIM is degenerate).
  Alternatively, you could consider the FIM condition in the definition of a non-regular model to be saying “if a model would have degenerate singularities at some parameter if that were the true parameter, then the model is non-regular”.
