I don’t love using continuity arguments, as in ‘if Opus 4.6 had such inclinations and tried something we would have caught it.’ I especially don’t love it this time, and they acknowledge in this case the argument is weaker. I don’t think Opus would have tried, exactly because Opus would have known it would not succeed, so you can’t judge incination distinct from propensity. And I think Anthropic is far too quick to assume that the properties from one model are likely to be copied in the next one.
(bold mine)
I think this indicates a real problem in the current methodology.
It is always a big problem to use properties of a previous model to establish properties of the current one. There is a huge variety of ways this could screw up even the most rigorous correctness-checking process.
We have a well-known example of the Ariane 5 maiden flight disaster, https://en.wikipedia.org/wiki/Ariane_flight_V88, where software failed despite having been “formally proved correct”, because some properties established for Ariane 4 had been incorrectly assumed as valid axioms for Ariane 5.
This was supposed to make a very expensive process of formal correctness proof somewhat less expensive, because otherwise it would be necessary to prove more things from scratch. Unfortunately, taking this route ended up invalidating the whole process of formally establishing correctness.
It’s just an illustration. Here the situation is different, and the traps are different, but the bottom line is that as soon as one is bringing multiple models into a safety and correctness consideration, one is immediately dealing with a much more complex and non-trivial situation one needs to analyze.
(bold mine)
I think this indicates a real problem in the current methodology.
It is always a big problem to use properties of a previous model to establish properties of the current one. There is a huge variety of ways this could screw up even the most rigorous correctness-checking process.
We have a well-known example of the Ariane 5 maiden flight disaster, https://en.wikipedia.org/wiki/Ariane_flight_V88, where software failed despite having been “formally proved correct”, because some properties established for Ariane 4 had been incorrectly assumed as valid axioms for Ariane 5.
This was supposed to make a very expensive process of formal correctness proof somewhat less expensive, because otherwise it would be necessary to prove more things from scratch. Unfortunately, taking this route ended up invalidating the whole process of formally establishing correctness.
It’s just an illustration. Here the situation is different, and the traps are different, but the bottom line is that as soon as one is bringing multiple models into a safety and correctness consideration, one is immediately dealing with a much more complex and non-trivial situation one needs to analyze.