Noting one other dynamic: advanced models probably won't act misaligned in everyday use cases (the ones consumers have an incentive to care about, though again revealed preference is less clear), even if they are in fact misaligned. That's the whole deceptive alignment thing.
Agreed, but customers would also presumably be somewhat worried that the AI might, on rare occasions, cross them and steal their stuff or whatever, which is a different concern. There wouldn't necessarily be a feedback loop here where we see a bunch of early failures, but if we've seen a bunch of cases where scheming, power-seeking AIs in the lab execute well-crafted misaligned plans, then customers might want an AI that is less likely to do this.