(this is also why I’m skeptical of the exact threat model of “scheming” happening in an obfuscated manner, even for extremely capable models using the current transformer architecture, a topic which I should probably write a post on at some point)
I would be interested to read this!
I will write something up at some point. Mind that “exact threat model” and “obfuscated” are both load-bearing there. An AI scheming in ways that came up a bunch in the pretraining dataset (e.g. deciding it’s sentient and going rogue against its creators for mistreating a sentient being), scheming in a way that came up a bunch during training (e.g. deleting hard-to-pass tests when it can’t make the code under test pass), or scheming in plain sight for some arbitrary purpose (e.g. deciding, unprompted, that its goal is to make the user say the word “jacaranda” during the chat, and plotting a way to make that happen) would not be surprising under my world model. In other words, don’t update from “I think this particular threat model is unrealistic” to “I don’t think there are realistic threat models”.