In this post, I’m talking about deceptive alignment. The threat model you’re talking about here doesn’t really count as deceptive alignment, because the organisms weren’t explicitly using a world model to optimize their choices to cause the bad outcome. AIs like that might still be a problem (indeed, I think deceptively aligned AI probably contributes less than half of my P(doom from AI)), but I think we should think about them somewhat separately from deceptively aligned models, because they pose risk via somewhat different mechanisms.