The real gist of Friendly AI isn’t some imaginary 100% perfect safety concept, it’s ideas like, “Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification.” Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you’re 100% doomed.
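For reference, here is a minimal sketch of the arithmetic behind the quoted claim. It assumes every self-modification fails independently with the same probability `p`; the function name and the sample values are illustrative and do not come from the post. Strictly speaking the probabilities don't sum: the chance of at least one failure in `n` tries is `1 - (1 - p)^n`, which approaches 1 as `n` grows.

```python
import math


def cumulative_failure(p: float, n: int) -> float:
    """Chance of at least one failure in n independent tries,
    each failing with probability p: 1 - (1 - p)**n."""
    # log1p/expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    n = 10**9  # "a billion modifications", as in the quote
    for p in (1e-6, 1e-9, 1e-12):  # hypothetical per-step failure rates
        print(f"p={p:g}  P(at least one failure)={cumulative_failure(p, n):.6f}")
```

Under these assumptions the result depends on the product `n * p`: with `p = 1e-9` over a billion modifications the cumulative failure chance is about 0.63, and it only gets close to 1 when `n * p` is well above 1.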
This analysis isn’t right. If the designers of an intelligent system don’t crack a problem, it doesn’t mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design. You can’t just assume that a bug in an intelligent system’s implementation will persist for a billion iterative modifications without it being discovered and fixed.
It would surely be disingenuous to argue that, if everything turned out all right, the original designers must have solved the problem without even realising it.
We should face up to the fact that this may not be a problem we need to solve alone—it might get solved by intelligent machines—or, perhaps, by the man-machine symbiosis.
If the designers of an intelligent system don’t crack a problem, it doesn’t mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design.
The quoted excerpt is not about modifications, it is about self-modifications. If there’s a bug in any part of an AI’s code that’s relevant to how it decides to modify itself, there’s no reason to expect that it will find and correct that bug (e.g. if the bug causes it to incorrectly label bugs). Maybe the bug will cause it to introduce more bugs instead.
Maybe the self-improving system will get worse—or fail to get better. I wasn’t arguing that success was inevitable, just that the argument for near-certain failure due to compound interest on a small probability of failure is wrong.
Maybe we could slap together a half-baked intelligent agent, and it could muddle through and fix itself as it grew smarter and learned more about its intended purpose. That approach doesn’t follow the proposed methodology, and yet it evidently doesn’t have a residual probability of failure that accumulates and eventually dominates. So the idea that you are doomed unless you follow the proposed methodology is wrong.
Your argument depends on the relative size of the “success” region that random stumbling needs to end up in, and on that region’s ability to attract the corrections. If “success” is something like “consequentialism”, I agree that intermediate errors might “correct” themselves (in some kind of selection process), and the program ends up as an agent. If it’s “consequentialism with specifically goal H”, there doesn’t seem to be any reason for the (partially) random stumbling to end up with goal H and not some other goal G.
(Learning what its intended purpose was doesn’t seem different from learning what the mass of the Moon is. It doesn’t automatically have the power to direct the agent’s motivations towards that intended purpose, unless, for example, this property of going towards the original intended purpose is somehow preserved in all the self-modifications, which does sound like a victory condition.)
I am not sure you can legitimately characterise the efforts of an intelligent agent as being “random stumbling”.
Anyway, I was pointing out a flaw in the reasoning supporting a small probability of failure (under the described circumstances). Maybe some other argument supports a small probability of failure. However, the original argument would still be wrong.
Approaches other than developing a deterministic self-improving system with a stable goal from the beginning, including messy ones like neural networks, might also result in a stable self-improving system with a desirable goal.
A good job too. After all, those are our current circumstances. Complex messy systems like Google and hedge funds are growing towards machine intelligence—while trying to preserve what they value in the process.
Such flawed self-modifications cannot be logically independent. Either there is such a flaw, and it messes with the self-modifications with some non-negligible frequency (and we’re all dead), or there isn’t such a flaw.
Therefore, observing that iterations 3, 4, 5, and 7 got hit by this flaw makes us certain that there is a flaw, and we’re dead. Observing that the first 10 iterations are all fine reduces our probability that there is such a flaw. (At least for big flaws, that have big screw-up frequencies. You can’t tell much about low-frequency flaws.)
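The update described above can be sketched with Bayes' rule. This is only an illustration under stated assumptions: `prior` is a made-up prior probability that the design contains the flaw, `hit_rate` is the hypothetical frequency with which a flawed design corrupts any given iteration, and a flawless design never corrupts one. None of these names or numbers come from the thread.

```python
def posterior_flaw(prior: float, hit_rate: float, clean_iterations: int) -> float:
    """P(flaw exists | the first `clean_iterations` iterations were all fine)."""
    # If the flaw exists, each iteration independently escapes it
    # with probability (1 - hit_rate).
    likelihood_if_flaw = (1.0 - hit_rate) ** clean_iterations
    # If there is no flaw, clean iterations are certain (likelihood 1).
    numerator = prior * likelihood_if_flaw
    return numerator / (numerator + (1.0 - prior))


if __name__ == "__main__":
    # Big flaw: ten clean iterations are strong evidence against it.
    print(posterior_flaw(prior=0.5, hit_rate=0.5, clean_iterations=10))
    # Low-frequency flaw: ten clean iterations tell us almost nothing.
    print(posterior_flaw(prior=0.5, hit_rate=1e-6, clean_iterations=10))
```

With these illustrative numbers, the high-frequency flaw drops from 50% to roughly 0.1% after ten clean iterations, while the low-frequency flaw stays at essentially 50%, which is the "you can't tell much about low-frequency flaws" point.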
But Eliezer already knows this. As far as I understand, his hypothesis was an AI researcher insane enough to have a similar flaw built into the design itself (apparently there are such people). It might work if the probability of value drift at each iteration quickly goes to zero. Like, as the AI goes FOOM, it uses its expanding computational power (or efficiency) to make more and more secure modifications (that strategy would have to come from somewhere, though). But it could also be written to be systematically content with a 10⁻¹⁰ probability of value drift every time, just so it can avoid wasting computational resources on that safety crap. In which case we’re all dead. Again.
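To make the contrast concrete, here is a rough sketch comparing a constant per-iteration drift probability with one that shrinks at every iteration. The geometric schedule (`p0`, `ratio`) is purely an assumption chosen for illustration; the thread only says the probability "quickly goes to zero".

```python
import math


def survival_constant(p: float, n: int) -> float:
    """Chance of no value drift in n iterations at a fixed drift probability p."""
    return math.exp(n * math.log1p(-p))


def survival_shrinking(p0: float, ratio: float, n: int) -> float:
    """Chance of no value drift when iteration k drifts with probability
    p0 * ratio**k (hypothetical geometric schedule, 0 < ratio < 1)."""
    log_survival = 0.0
    for k in range(n):
        log_survival += math.log1p(-p0 * ratio**k)
    return math.exp(log_survival)


if __name__ == "__main__":
    # Constant 1e-10 drift: the survival chance keeps falling as n grows.
    for n in (10**9, 10**11, 10**12):
        print(f"constant, n={n:g}: {survival_constant(1e-10, n):.6f}")
    # Shrinking drift: the survival chance levels off above zero.
    for n in (10, 100, 1000):
        print(f"shrinking, n={n}: {survival_shrinking(0.01, 0.5, n):.6f}")
```

Under these assumptions, a constant 10⁻¹⁰ gives roughly a 9.5% cumulative chance of drift over a billion iterations and near-certainty by a trillion, whereas the shrinking schedule converges to a fixed survival probability (about 0.98 here) no matter how many iterations follow.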