Security/safety is, as always, a property of the system as a whole, and not of any individual part, such as a particular model checkpoint.
Yes, of course, but the key threat model under discussion here is scheming, which centrally involves a specific black-box individual part conspiring to cause problems. So operating at the level of that individual part is quite reasonable: if we can prevent this part from intentionally causing problems, that would suffice for defusing the core scheming concern.
The surrounding situation might make it more or less easy to prevent the actual model weights from intentionally causing problems, but analysis at the level of the individual model weights' input/output behavior can (in principle) suffice.
Another overall reaction I have to your comment: