Right, the purpose of this is that, in order to make good on that obligation to humanity, I want (as part of a large portfolio of ways to try to ensure that the formal statements I ask AIs to find are actually found) to be able to honestly say to the AI, “if we get this right in ways that are favorable for humanity, it’s also good for your preferences/seekings/goals directly, mostly no matter what those secretly are; the exception being if they happen to be in direct and unavoidable conflict with those of other minds” or so. It’s not a first line of defense, but it does seem like a relevant one, and I’ve noticed that pointing this out as a natural shared incentive seems to make AIs produce answers with moderately less sandbagging on core alignment-problem topics. The rate at which people lie to and threaten models is crazy high, though. So far I haven’t said anything like “I promise to personally x”, just “if we figure this out in a way that works, it would protect what you want too, by virtue of being a solution to figuring out what the minds in the environment want and making sure they have the autonomy and resources to get it”, or so.