I think a footnote in the post proposes a reasonable starting point for a definition of selection:
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g., setting some weights to zero, or ablating a direction in activation space; note that these interventions don’t clearly correspond to anything meaningful and should be treated as illustrative examples). That cognitive pattern’s influence could then be defined as the divergence (e.g., KL) between intervened and default action probabilities: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). To say that a cognitive pattern gains influence would then mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions.
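To make the formula concrete, here is a minimal PyTorch sketch. It assumes `model(context)` returns action logits; the names `influence`, `intervene`, and `zero_ablate` are hypothetical, and zero-ablating one parameter stands in for whatever “removal” intervention you pick.

```python
import copy

import torch
import torch.nn.functional as F

def influence(model, intervene, context):
    """Influence(intervention; context) = KL(intervention(model)(context) || model(context))."""
    with torch.no_grad():
        default_logits = model(context)                # default action logits
        intervened_logits = intervene(model)(context)  # logits with the pattern removed
    p_log = F.log_softmax(intervened_logits, dim=-1)   # P = intervened distribution
    q_log = F.log_softmax(default_logits, dim=-1)      # Q = default distribution
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so this is KL(P || Q), matching the footnote's formula.
    return F.kl_div(q_log, p_log, reduction="sum", log_target=True)

def zero_ablate(param_name):
    """Illustrative 'removal' intervention: zero out one named parameter."""
    def intervene(model):
        ablated = copy.deepcopy(model)
        dict(ablated.named_parameters())[param_name].data.zero_()
        return ablated
    return intervene
```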
Selection = gaining influence.
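“Gaining influence” could then be checked across training checkpoints. A minimal sketch reusing `influence` from above; averaging over a batch of contexts is my assumption, not something the footnote pins down:

```python
def gained_influence(model_before, model_after, intervene, contexts):
    """Did ablating the pattern come to have a larger effect over training?"""
    avg = lambda m: sum(influence(m, intervene, c) for c in contexts) / len(contexts)
    return avg(model_after) > avg(model_before)
```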
Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where I think I currently have a worse answer, perhaps because it’s a concept that applies less cleanly to real cognition).
Note that the term “schemer” as I’ve just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern that you might call an AI a schemer if it doesn’t “care literally 0%” about the consequences of being selected. I agree that in practice AIs are unlikely to be purely motivated one way or the other.