Such flawed self-modifications cannot be logically independent. Either there is such a flaw, and it corrupts the self-modifications with some non-negligible frequency (and we’re all dead), or there isn’t such a flaw.
Therefore, observing that iterations 3, 4, 5, and 7 got hit by this flaw makes us certain that there is a flaw, and we’re dead. Observing that the first 10 iterations are all fine reduces our probability that there is such a flaw. (At least for big flaws, the ones with high screw-up frequencies. You can’t tell much about low-frequency flaws.)
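To make the asymmetry concrete, here is a minimal Bayesian sketch. The prior and the per-iteration screw-up frequencies are illustrative assumptions, not numbers from the argument above:

```python
# Sketch: updating P(flaw) after observing n clean self-modification iterations.
# A flaw with screw-up frequency f should hit each iteration with probability f.

def posterior_flaw(prior: float, f: float, n_clean: int) -> float:
    """P(flaw | n clean iterations), via Bayes' rule."""
    p_clean_given_flaw = (1 - f) ** n_clean   # flaw exists but never fired
    p_clean_given_no_flaw = 1.0               # no flaw: iterations always clean
    num = prior * p_clean_given_flaw
    return num / (num + (1 - prior) * p_clean_given_no_flaw)

# A big flaw (f = 0.5) is nearly ruled out by 10 clean iterations...
print(posterior_flaw(0.5, 0.5, 10))   # ~ 0.001
# ...but a low-frequency flaw (f = 1e-6) is barely updated on at all.
print(posterior_flaw(0.5, 1e-6, 10))  # ~ 0.49999
```

Ten clean iterations crush the "big flaw" hypothesis but say almost nothing about the rare one, which is exactly why low-frequency flaws are the worrying case.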
But Eliezer already knows this. As far as I understand, his hypothesis was an AI researcher insane enough to have a similar flaw built into the design itself (apparently there are such people). It might work out if the probability of value drift at each iteration quickly goes to zero. Like, as the AI goes FOOM, it uses its expanding computational power (or efficiency) to make more and more secure modifications (that strategy would have to come from somewhere, though). But it could also be written to be systematically content with a 10⁻¹⁰ probability of value drift every time, just so it can avoid wasting computational resources on that safety crap. In which case we’re all dead. Again.
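The difference between the two designs is just how the per-iteration drift probabilities accumulate. A rough sketch, with made-up numbers (the 10⁻¹⁰ figure is from the comment above; the decay schedule is an illustrative assumption):

```python
# Sketch: cumulative probability of at least one value-drift event
# over many self-modification iterations.

def p_any_drift(per_step_probs) -> float:
    """1 minus the product of per-step survival probabilities."""
    survival = 1.0
    for p in per_step_probs:
        survival *= (1 - p)
    return 1 - survival

# Constant 1e-10 per iteration: tiny per step, but the total risk
# grows roughly linearly and is unbounded as iterations continue.
print(p_any_drift([1e-10] * 10_000))  # ~ 1e-6, and climbing with n

# Rapidly shrinking probabilities (halving each iteration): the total
# risk stays bounded below 2e-10 no matter how long the FOOM runs.
print(p_any_drift([1e-10 * 2.0**-i for i in range(1_000)]))  # < 2e-10
```

That's the whole distinction: a fixed per-step drift probability eventually kills you with probability approaching 1, while a fast-enough decaying schedule caps the lifetime risk at a finite (here, negligible) total.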