The core claim is that if the AI were sufficiently weak that it couldn’t answer these questions, it also likely wouldn’t be able to come up with the idea of scheming with a particular strategy in the first place. In principle it has the knowledge, but it would be quite unlikely to assemble an overall plan.
Separately, GPT-4 seems qualitatively pretty close to my best guess at the conservative capability threshold. (E.g., much closer than GPT-3.)
Also, I DM’d you a draft doc with more of our thoughts on how to evaluate this.