Yet generally people treat “schemer” as a fairly binary classification.
Are there examples you have in mind? I generally think of this more as “there are a bunch of different evolving, hierarchical taxonomies of potential AI motivations and how to think about them”, for example Fitness-Seekers: Generalizing the Reward-Seeking Threat Model or Carlsmith’s most recent series. I’m always happy for there to be more writing on the subject and think this is a good thing. I rarely see a binary classification used. To be more specific, to the extent that one node in the various taxonomies is named “scheming”, I’m usually interested in most of the nodes in these taxonomies, as I think many of them share important properties / are shifting / etc.
I think most of the discussions break down less along the lines of “honesty”/“deception” and more along the lines of trying to understand the training dynamics of reward seeking / oversight / what incentivizes thinking about oversight, how these might get aggregated into motivations, etc. I’m interested in whether there are any distinctions that you think are currently either underweighted or aren’t being made but should be.