Schemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally.
This definition? If so, it seems far too underspecified to be fit for scientific inquiry. For one thing, the definition of “selection” is pretty vague—I do not know how to assign “the degree to which [one cognitive pattern] is counterfactually responsible” for something, even in principle. It also doesn’t even try to set a threshold for what counts as a non-schemer—e.g., does it need to care literally 0% about the consequences of being selected? If so, approximately everything is a schemer, including all humans. (It also assumes the instrumental/terminal goal distinction, which I think is potentially confused, but that’s a more involved discussion.)
To be clear, my complaint is not that people are using vague definitions. My complaint is that the vague definitions are becoming far more load-bearing than they deserve. If people had tried to pin down more carefully what “schemer” means, they would have been forced to develop a more nuanced understanding of what we even mean by “alignment” and “goals” and so on, which is the kind of thinking I want to see more of.
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by the intervention you would apply to a model to remove it (e.g., setting some weights to zero, or ablating a direction in activation space; note that these interventions don’t clearly correspond to something meaningful—treat them as illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). To say that a cognitive pattern gains influence would then mean that ablating that cognitive pattern has a larger effect (in terms of KL) on the model’s actions.
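To make the formula concrete, here is a minimal sketch of the influence metric in Python. It assumes a toy setting where a “model” is just a function from a context to a categorical distribution over actions, and `ablate` is a hypothetical function that returns the model with the cognitive pattern removed; both names are illustrative, not a real API.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical action distributions (with smoothing)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def influence(model, ablate, context):
    """Influence(intervention; context) = KL(intervention(model)(context) || model(context)).

    `model(context)` returns action probabilities; `ablate(model)` returns the
    model with the pattern removed (e.g., zeroed weights). Both are hypothetical
    stand-ins for whatever intervention one actually uses.
    """
    return kl_divergence(ablate(model)(context), model(context))
```

Under this toy setup, an ablation that leaves the model unchanged has zero influence, and any ablation that shifts the action distribution has positive influence.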
Selection = gaining influence.
Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where my current answer is weakest, perhaps because it’s actually a less cleanly-applicable concept to real cognition).
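The “selection = gaining influence” criterion can be operationalized by comparing the same ablation’s effect across two training checkpoints. A self-contained sketch, with all names (`ablate`, the checkpoint functions) hypothetical:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions, with smoothing."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def gained_influence(model_before, model_after, ablate, contexts):
    """The pattern picked out by `ablate` counts as selected ("gains
    influence") if ablating it perturbs the later checkpoint's action
    distribution more, in KL, than the earlier checkpoint's, averaged
    over a set of contexts."""
    infl = lambda m, c: kl(ablate(m)(c), m(c))
    before = np.mean([infl(model_before, c) for c in contexts])
    after = np.mean([infl(model_after, c) for c in contexts])
    return after > before
```

Note this only checks *that* a pattern gained influence; the extra condition for schemer-hood—that the pattern gains influence *by pursuing* something downstream of influence in its world model—requires a handle on the pattern’s world model, which is the part I have the weakest answer for.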
Note that the term “schemer” as I’ve just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern about whether an AI only counts as a non-schemer if it “cares literally 0%” about the consequences of being selected. I agree that in practice it’s unlikely for AIs to have purely schemer-like (or purely non-schemer-like) motivations.