(Not sure if I’m missing something, but my initial reaction:)
There’s a big difference between being able to verify that some specific programs have a property, and being able to check that property for all programs.
For an arbitrary TM, we cannot decide whether it outputs a correct solution to a specific NP-complete problem (Rice’s theorem). We cannot even decide whether it halts (the halting problem)!
Not sure what alignment-relevant claim you wanted to make, but I doubt this is a valid argument for it.
Likewise, for some specific programs we can verify that they halt.
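To make the distinction concrete, here is a rough sketch of the "easy" direction, using SAT as the NP-complete problem. All names here (`satisfies`, `checked_solver`, the toy solvers) are made up for illustration, and the DIMACS-style clause encoding is just an assumption of the sketch. It checks one concrete output of an untrusted solver on one concrete instance, which is cheap. It says nothing about whether the solver is correct on every input, or whether it halts at all, which is the part that is undecidable for arbitrary programs.

```python
from typing import Callable, Dict, List, Optional

# Assumed encoding: a clause is a list of non-zero ints,
# where 3 means x3 and -3 means (not x3).
Clause = List[int]
Assignment = Dict[int, bool]
Solver = Callable[[List[Clause]], Optional[Assignment]]


def satisfies(clauses: List[Clause], assignment: Assignment) -> bool:
    """Check one concrete assignment against a CNF formula.

    Runs in time linear in the formula size. Variables missing from
    the assignment are treated as False (a choice made for this sketch).
    """
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in clauses
    )


def checked_solver(solver: Solver, clauses: List[Clause]) -> Optional[Assignment]:
    """Run an untrusted solver and accept its answer only if it verifies.

    This validates a single output on a single instance. It does not
    show that the solver is correct in general, and it never returns
    if the solver itself fails to halt.
    """
    candidate = solver(clauses)
    if candidate is not None and satisfies(clauses, candidate):
        return candidate
    return None


# Hypothetical toy solvers, hard-coded for the formula below.
def good_solver(clauses: List[Clause]) -> Optional[Assignment]:
    return {1: True, 2: True, 3: False}


def bad_solver(clauses: List[Clause]) -> Optional[Assignment]:
    return {1: False, 2: True, 3: False}


formula = [[1, -2], [2, 3]]  # (x1 or not x2) and (x2 or x3)
accepted = checked_solver(good_solver, formula)
rejected = checked_solver(bad_solver, formula)
```

Note that a solver returning `None` (claiming unsatisfiability) is not checked by this sketch at all, since there is no short certificate to verify in that case.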