I guess for a cat classifier, disentanglement is not possible, because it wants to classify things as cats if and only if it believes they are cats. Since values and beliefs are perfectly correlated here, there is no test we could perform which would distinguish what it wants from what it believes.
But suppose we don’t know what the classifier wants. If it doesn’t classify a cat image as “yes”, it could be because it is (say) actually a dog classifier, and it correctly believes the image contains something other than a dog. Or it could be because it is indeed a cat classifier, but it mistakenly believes the image doesn’t show a cat.
One way to find out would be to give the classifier an image of the same subject, but in higher resolution or from another angle, and check whether its answer changes. If it is a cat classifier, it likely won’t repeat the mistake, so it will probably switch to “yes”. If it is a dog classifier, it will likely stick with “no”.
This assumes that mistakes are random and somewhat unlikely, so they will probably disappear when the evidence is better or of a different sort. Beliefs react to such changes in evidence, while values don’t.
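As a toy illustration, here is a minimal sketch of that probe in Python. The `classifier` callable, the set of alternative views, and the majority-vote threshold are all hypothetical choices of mine, not part of the original setup; the point is just that a belief error should wash out under better evidence while a value difference shouldn’t.

```python
def probe_classifier(classifier, original_image, alternative_views):
    """Check whether a "no" verdict flips when the evidence improves.

    `classifier` is assumed to map an image to "yes" or "no".
    `alternative_views` are images of the same subject, e.g. in higher
    resolution or from other angles.

    If the verdict flips to "yes" on most better views, the original "no"
    looks like a mistaken belief (a cat classifier that misread the image).
    If it stays "no" across the views, the verdict looks value-driven
    (e.g. the model was never trying to detect cats in the first place).
    """
    if classifier(original_image) == "yes":
        return "already yes -- nothing to disentangle"

    flips = sum(classifier(view) == "yes" for view in alternative_views)
    if flips > len(alternative_views) / 2:
        return "likely a mistaken belief (probably a cat classifier)"
    return "likely not a mistake (probably wants something other than cats)"
```

Of course this only separates the two hypotheses under the stated assumption: if the classifier’s errors were systematic rather than random, better evidence might not flip them, and the probe would wrongly read a belief error as a value difference.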