I think I see what you are saying here, but I just want to flag that this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities: ‘could it do the thing, if it tried’ vs. ‘would it ever try.’ And alignment is about propensity (though safety is about both).
I think that:
1. Being able to design a chemical weapon with probability at least 50% is a capability.
2. Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
Following instructions never to design a chemical weapon with probability at least 99.999% is also a capability.
This requires a capability, but it also requires a propensity. For example, smart humans are all capable of avoiding armed robbery with pretty high reliability, but some of them commit armed robbery despite having been told not to at an earlier point in their lives. You could say these robbers didn’t have the capability to follow instructions, but this would be an atypical use of these (admittedly fuzzy) words.