My general problem with the “second type of generalization” is “how are you going to get superintelligence from here?” If your model imitates human thinking, its performance is capped by human performance, so you are not going to get things like nanotech and immortality.
On the question of misgeneralization, here is an example:
Imagine that you are training a superintelligent programmer. It writes code, you evaluate it and analyse the code for vulnerabilities. The reward is calculated from quality metrics, including the number of vulnerabilities. At some point your model becomes smart enough to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”, because you look at the code, think “wow, such good code, no vulnerabilities” and assign maximum reward, while the code is actually full of them.
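A minimal toy sketch of that evaluation loop (my own illustration; all names here are hypothetical, the only point is where the reward signal comes from):

```python
# Toy sketch: the reviewer can only penalize the vulnerabilities they can see,
# so "hidden" vulnerabilities never enter the reward at all.
from dataclasses import dataclass

@dataclass
class Submission:
    quality: float
    visible_vulns: int   # vulnerabilities the human reviewer can spot
    hidden_vulns: int    # vulnerabilities only a stronger system would spot

def human_reward(s: Submission, penalty: float = 1.0) -> float:
    # The training signal is "no *visible* bugs", not "no bugs".
    return s.quality - penalty * s.visible_vulns

clean  = Submission(quality=0.8, visible_vulns=0, hidden_vulns=0)
sneaky = Submission(quality=1.0, visible_vulns=0, hidden_vulns=3)
print(human_reward(clean), human_reward(sneaky))   # 0.8 1.0 -- the sneaky code wins
```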
To extrapolate this to the MNIST example:
Imagine that you have two decks of cards: deck A always has 0 written on it, deck B always has 1. Then you mix the two decks so that deck A has 2⁄3 zeros and 1⁄3 ones, and vice versa. If you mix the decks perfectly at random, your predictor of the next card from a deck is going to learn “always predict 0 for deck A and always predict 1 for deck B”, because optimal predictors do not randomize. When you test your predictor on the initial (unmixed) decks, it gets 100% accuracy.
But now suppose that you mixed the decks not at random: deck A is composed as a repeating 0-0-1 pattern (and mirrored, 1-1-0, for deck B). Then your predictor is going to learn “output 0 for the first and second card and 1 for every third card in deck A”, and fail miserably when tested on the initial decks.
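A minimal sketch of the deck example (my own toy code, assuming a per-position majority predictor and the repeating 0-0-1 pattern): trained on the patterned mix it memorizes the positions and then misses a third of the pure deck A, while trained on a randomly shuffled mix it just learns “always 0” and transfers perfectly.

```python
# Toy illustration of the deck example above (not from the original comment).
import random
from collections import Counter

def positional_predictor(deck, period):
    """Learn the majority label for each position modulo `period`."""
    buckets = [Counter() for _ in range(period)]
    for i, card in enumerate(deck):
        buckets[i % period][card] += 1
    return [b.most_common(1)[0][0] for b in buckets]

def accuracy(pred, deck):
    return sum(pred[i % len(pred)] == c for i, c in enumerate(deck)) / len(deck)

mixed_A = [0, 0, 1] * 100          # patterned mix of deck A: 2/3 zeros, 1/3 ones
pure_A = [0] * 300                 # the initial deck A

patterned = positional_predictor(mixed_A, period=3)    # learns [0, 0, 1]
print(accuracy(patterned, mixed_A))  # 1.0  -- perfect on the training data
print(accuracy(patterned, pure_A))   # ~0.67 -- wrong on every third card

random.seed(0)
shuffled_A = random.sample(mixed_A, len(mixed_A))        # random mixing instead
randomized = positional_predictor(shuffled_A, period=3)  # almost surely [0, 0, 0]
print(accuracy(randomized, pure_A))  # 1.0 -- "always predict 0" generalizes
```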
You can say: “yes, obviously, if you train the model to do the wrong thing, it’s going to do the wrong thing, nothing surprising”. But when you train a superintelligence, you by definition don’t know which thing is “wrong”.