By “reliable” I mean the same thing we mean for self-driving cars. A self-driving car that is great 99% of the time and fatally crashes 1% of the time isn’t really “highly skilled but unreliable”: part of having “skill” at driving is being reliable.
In the same way, I’m not sure I would want to employ an AI software engineer that was great 99% of the time but, 1% of the time, had totally weird, inexplicable failure modes you’d never see in a human. It would be stressful to supervise, to limit its potential for harmful impact on the company, and so on. So it seems to me that AIs won’t be given control of many things, and therefore won’t be transformative, until that reliability threshold is met.
I would say:
A theory always takes the following form: “given [premises], I expect to observe [outcomes]”. The only way to say that an experiment has falsified a theory is to correctly observe/set up [premises] but then not observe [outcomes].
If an experiment does not correctly set up [premises], then that experiment is invalid for falsifying or supporting the theory. The experiment gives no (or nearly no) Bayesian evidence either way.
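To make the “no Bayesian evidence” claim concrete: if the premises weren’t actually set up, the observed outcome is (roughly) equally likely whether or not the theory is true, so the likelihood ratio is 1 and the posterior equals the prior. A minimal sketch, with made-up numbers:

```python
def posterior(prior, p_obs_given_theory, p_obs_given_not_theory):
    """Bayes' rule for a binary hypothesis (theory true vs. not)."""
    num = p_obs_given_theory * prior
    return num / (num + p_obs_given_not_theory * (1 - prior))

prior = 0.8  # hypothetical prior credence in the theory

# Valid experiment: the outcome is much likelier if the theory is true,
# so observing it shifts our credence upward.
print(posterior(prior, 0.9, 0.1))

# Invalid premises: the outcome is equally likely either way, so the
# likelihood ratio is 1 and the posterior equals the prior
# (up to floating-point error).
print(posterior(prior, 0.5, 0.5))
```

The numbers here are illustrative, not derived from any real experiment; the point is just that equal likelihoods leave the prior untouched.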
In this case, [premises] are the assumptions we made in deriving the theoretical pendulum period: things like “the string length doesn’t change”, “the pivot point doesn’t move”, “gravity is constant”, “the pendulum does not undergo any collisions”, etc. The fact that (e.g.) the pivot point moved during the experiment invalidates the premises, and therefore the experiment does not give any Bayesian evidence one way or the other about our theory.
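For concreteness, the theoretical period in question is presumably the standard small-angle formula, which is only derived under premises like those (plus “small swing amplitude” and “negligible air resistance”). A quick sketch:

```python
import math

# Small-angle pendulum period: T = 2 * pi * sqrt(L / g).
# Valid only under the derivation's premises: fixed string length,
# fixed pivot, constant g, small swings, no collisions or drag.

def pendulum_period(length_m, g=9.81):
    return 2 * math.pi * math.sqrt(length_m / g)

print(pendulum_period(1.0))  # roughly 2.006 s for a 1 m pendulum
```

If the pivot moved mid-swing, the quantity this formula predicts simply isn’t the quantity the experiment measured.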
Then the students could say:
“But you didn’t tell us that the pivot point couldn’t move when we were doing the derivation! You could just be making up new ‘necessary premises’ for your theory every time it gets falsified!”
In which case I’m not 100% sure what I’d say. Obviously we could have listed out more assumptions than we did, but where do you stop? “The universe will not explode during the experiment”...?