I wish Anthropic would explain whether they expect to be able to rule out scheming, plan to effectively shut down scaling, or plan to deploy plausibly scheming AIs. Insofar as Anthropic expects to be able to rule out scheming, outlining what evidence they would consider sufficient would be useful.
Something similar on state-proof security would be useful as well.
I think there is a way to do this such that the PR costs aren't that high, so it is worth doing unilaterally from a variety of perspectives.
As discussed in "How will we update about scheming?":
While I expect that in some worlds, my P(scheming) will be below 5%, this seems unlikely (only 25%). AI companies have to either disagree with me, expect to refrain from developing very powerful AI, or plan to deploy models that are plausibly dangerous schemers; I think the world would be safer if AI companies defended whichever of these is their stance.