> Obviously a superintelligence knows that this is an unusual case
Since the ASI knows this is an unusual case, can it do some exception handling (like asking a human) instead of executing the normal path?
> but that doesn’t say if it’s a positive or negative case.
Why only positive or negative? Some classifiers have an “out-of-distribution” category, for example a One-Class SVM; using several of them should handle multiple classes. Perhaps this is also doable with other latent feature spaces (transformers?) by using a threshold distance to bound each category and labeling the remaining space as the “out-of-distribution” category.
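A minimal sketch of that idea, assuming scikit-learn’s OneClassSVM and toy 2-D blobs (the class names, kernel settings, and thresholds are mine, purely illustrative): fit one one-class detector per known class, and if no detector accepts a point, label it out-of-distribution.

```python
# Sketch: multi-class classification with an explicit "out-of-distribution" bucket,
# built from one One-Class SVM per known class (toy data, illustrative only).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Two known classes, each a 2-D Gaussian blob (stand-ins for "positive"/"negative").
train = {
    "positive": rng.normal(loc=[+2.0, 0.0], scale=0.5, size=(200, 2)),
    "negative": rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(200, 2)),
}

# One novelty detector per class; each learns the support of its own class only.
detectors = {name: OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
             for name, X in train.items()}

def classify(x):
    """Return the class whose detector accepts x, or 'out-of-distribution'."""
    x = np.asarray(x).reshape(1, -1)
    # decision_function > 0 means the point lies inside the learned support.
    scores = {name: det.decision_function(x)[0] for name, det in detectors.items()}
    best_name, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_name if best_score > 0 else "out-of-distribution"

print(classify([2.1, 0.1]))    # likely "positive"
print(classify([-1.9, -0.2]))  # likely "negative"
print(classify([0.0, 5.0]))    # likely "out-of-distribution"
```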
The main AI-categorization issue I see is that humans might care about a dimension of the data that is entirely missing from the latent space of the AI. In that case the AI is literally unable to tell apart two inputs that differ in a way that matters to us (like a color-blind AI).
If that issue occurs with a sample from the “human smiling faces” category and a sample from the “tiny smiley faces” category, it means the AI classifies both as “human smiling faces” because in its latent space it lacks the feature dimensions necessary to tell the difference. So the AI keeps optimizing for what it thinks are “human smiling faces”, but from our point of view it optimizes for both “human smiling faces” and “tiny smiley faces”.
Crucially, I do not think the AI starts optimizing only for “tiny smiley faces”: remember, it cannot tell the difference between the two categories! So it has no way to optimize for only one. It also does not yet know whether one category is easier to optimize for than the other, because as soon as it knows, that is an additional dimension in feature space that separates the two into distinct categories.
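A tiny sketch of why it cannot optimize for only one (the encoder, weights, and feature values are made up for illustration): if the encoder never sees the distinguishing dimension, any score computed from the latent representation is identical for both inputs, so no optimization pressure can favor one over the other.

```python
# Sketch: an encoder that ignores the "color" dimension cannot prefer one category.
import numpy as np

def encode(x):
    # The AI's latent space keeps only the first two features (x, y);
    # the third feature ("color", the dimension humans care about) is dropped.
    return np.asarray(x)[:2]

def score(latent):
    # Any objective defined on the latent space (weights are arbitrary here).
    return float(latent @ np.array([0.7, 0.3]))

human_smiling_face = [1.0, 2.0, 0.0]  # color = 0 (a real face, in our terms)
tiny_smiley_face   = [1.0, 2.0, 1.0]  # color = 1 (a smiley, in our terms)

# Identical latents -> identical scores -> no gradient can separate them.
print(score(encode(human_smiling_face)) == score(encode(tiny_smiley_face)))  # True
```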
Diagram to Clarify My Mental Model in a Hypothetical Scenario
During training the AI encounters only small black points, so it learns to classify based on two dimensions (the x and y coordinates) into three categories (positive, negative, and out-of-distribution based on distance).
Then in the outside world the AI encounters two big points (size does not matter in this scenario; it is only so readers can distinguish training points from outside-world points). The black point is no big deal: the AI classifies it as out-of-distribution and calls humans for further instructions.
However, the orange point is a problem: the AI does not see colors, it categorizes based only on x and y, so it classifies the point as positive. But this is not what we want: we do care about this extra color dimension and (in this scenario) would classify this point as out-of-distribution.
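Here is a small runnable sketch of this scenario (the coordinates, the nearest-centroid rule, and the distance threshold are my own assumptions, not part of the original diagram): the classifier is trained on (x, y) only, flags far-away points as out-of-distribution, and accepts the orange point as positive because the color feature never reaches it.

```python
# Sketch of the diagram scenario: a 2-D classifier with a distance-based
# out-of-distribution rule, blind to a third "color" dimension humans care about.
import numpy as np

rng = np.random.default_rng(1)

# Training data: small black points only, described by (x, y).
positive_train = rng.normal(loc=[+2.0, +2.0], scale=0.4, size=(100, 2))
negative_train = rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(100, 2))

centroids = {
    "positive": positive_train.mean(axis=0),
    "negative": negative_train.mean(axis=0),
}
OOD_THRESHOLD = 1.5  # arbitrary radius around each centroid

def classify(point_xy):
    """Nearest-centroid rule on (x, y); far from both centroids => out-of-distribution."""
    dists = {name: np.linalg.norm(point_xy - c) for name, c in centroids.items()}
    name, dist = min(dists.items(), key=lambda kv: kv[1])
    return name if dist < OOD_THRESHOLD else "out-of-distribution (ask a human)"

# Outside world. The full description of a point is (x, y, color),
# but the AI's feature extractor only ever sees (x, y).
big_black_point  = {"xy": np.array([6.0, -1.0]), "color": "black"}
big_orange_point = {"xy": np.array([2.1,  1.9]), "color": "orange"}

print(classify(big_black_point["xy"]))   # out-of-distribution -> humans are consulted
print(classify(big_orange_point["xy"]))  # "positive" -> but we wanted out-of-distribution
```

In this toy setup the distance check catches the black point, but it cannot catch the orange one, because the mismatch lives in a dimension the model never observes.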