ryan_greenblatt comments on Many arguments for AI x-risk are wrong

ryan_greenblatt 20 Mar 2024 1:15 UTC
LW: 3 AF: 2
Another overall reaction I have to your comment:

Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.

Yes, of course, but the key threat model under discussion here is scheming, which centrally involves a specific individual black-box part conspiring to cause problems. So operating at the level of that individual part is quite reasonable: if we can prevent this part from intentionally causing problems, that would suffice for defusing the core scheming concern.

The surrounding situation might make it more or less easy to prevent the actual model weights from intentionally causing problems, but analysis at the level of the individual weights' inputs and outputs can (in principle) suffice.