I feel like there are malignant failure modes beyond the categories mentioned by Bostrom. Perhaps it would be sensible to try to break down the topic systematically. Here’s one attempt.
Design by fools: the AI does what you ask, but you asked for something clearly unfriendly.
Perverse instantiation & infrastructure profusion: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen destructive ways, such as redirecting most resources to its infrastructure at our expense.
Partial perverse instantiation & mind crime: the AI does what you ask, which includes both friendly behavior and unfriendly behavior, such as badly treating simulations that have moral status in order to figure out how to treat you well.
Partial instantiation: the total of what you ask seems friendly, but part of it turns out to be impossible; the AI does the rest, and the result is unbalanced to an unfriendly degree.
Value drift: changes occur to the AI's code such that it no longer does what you ask.
It doesn’t seem to me that mind crime is a malignant failure mode in the sense that Bostrom defines it: it doesn’t stop actual humans from doing their human things, and it seems that it could easily happen twice.