Assume we have a disease-detecting CV algorithm that looks at microscope images of tissue for cancerous cells. Maybe there’s a specific protein cluster (A) that shows up in the images and indicates a cancerous cell with 0.99 AUC. Maybe there’s also another protein cluster (B) that shows up and only has 0.989 AUC, and A overlaps with B in 99.9999% of true positives. But B looks big and ugly and black and cancery to the human eye, while A looks perfectly normal; it’s almost indistinguishable from perfectly benign protein clusters, even to the most skilled oncologist.
If I understand this thought experiment right, we are also to assume that we know the slight difference in AUC is not just statistical noise (even with the high collinearity between the A cluster and the B cluster)? So, say we assume that you still get a slightly higher AUC for A on a data set of cells that have either only A or neither than you get for B on a data set of cells that have either only B or neither?
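To make that assumption concrete, here is a rough sketch (Python, with entirely invented prevalences and error rates; nothing below comes from the original example) of the kind of check I have in mind: simulate cells where B co-occurs with A on nearly every cancerous cell, compute the AUC you get from looking only at A versus only at B, and bootstrap the gap to see whether a difference that small could plausibly be resampling noise.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000

# 5% of cells cancerous; all rates below are made up for illustration.
y = (rng.random(n) < 0.05).astype(int)

# Presence/absence of each protein cluster.
a = np.where(y == 1, rng.random(n) < 0.990, rng.random(n) < 0.010).astype(int)
b = np.where(y == 1,
             # B co-occurs with A on ~99.9% of cancerous cells...
             np.where(a == 1, rng.random(n) < 0.999, rng.random(n) < 0.001),
             # ...but fires slightly more often on benign cells.
             rng.random(n) < 0.012).astype(int)

print("AUC using A only:", roc_auc_score(y, a))
print("AUC using B only:", roc_auc_score(y, b))

# Bootstrap the AUC gap to gauge whether it survives resampling.
gaps = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    gaps.append(roc_auc_score(y[idx], a[idx]) - roc_auc_score(y[idx], b[idx]))
print("bootstrap 2.5%/97.5% percentiles of AUC(A) - AUC(B):",
      np.percentile(gaps, [2.5, 97.5]))
```

If the bootstrap interval for the gap sits clearly above zero, that is the “not just noise” situation the thought experiment seems to be stipulating.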
In that case, I would say that the model that weighs A a bit more is actually “explainable” in the relevant sense of the term - it is just that some people find the explanation aesthetically unpleasing. You can show what features the model is looking at to assign a probability that some cell is cancerous. You can show how, in the vast majority of cases, a model that looks at the presence or absence of the A cluster assigns a higher probability of a cell being cancerous to cells that actually are cancerous. And you can show how a model that looks at B does that also, but that A is slightly better at it.
If the treatment is going to differ slightly depending on how much weight you give to A versus B, then, as the patient, I would want the version that has the best chance of working without negative side effects based on the data, regardless of whether A or B looks uglier. If some other patients want a version of the treatment that is statistically less likely to work because of their aesthetic sense of A versus B, I would think that is a silly risk to take (though also a very slight one, if A and B really are that strongly correlated), but that would be their problem, not mine.
It is not obvious to me from reading that transcript (and the attendant commentary) that GPT-3 was even checking to see whether or not the parentheses were balanced. Nor is it obvious that it “knows” (or has in any way encoded the idea) that the sequence of parentheses between the quotes contains all the information needed to decide between balanced and unbalanced, and thus that every instance of the same parentheses sequence will have the same answer for whether or not it is balanced.
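For reference, “checking” here would amount to something as simple as the standard counter scan below (the textbook algorithm, not anything GPT-3 is claimed to be running); the relevant point is that the answer is a deterministic function of the parenthesis sequence alone, so identical sequences can never correctly get different answers.

```python
def is_balanced(seq: str) -> bool:
    """A parenthesis string is balanced iff the running count of '(' minus ')'
    never goes negative and ends at zero."""
    depth = 0
    for ch in seq:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' before it
                return False
    return depth == 0

# The answer depends only on the sequence, so repeated queries must agree:
print(is_balanced("(())()"))  # True
print(is_balanced("(()"))     # False
```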
Reasons:
By my count, “John” got 18 out of 32 right, which is not too far off from what you would expect from random chance (a quick binomial sanity check is sketched below, after these reasons).
Arthur indicated that GPT-3 had at some point “generated inaccurate feedback from the teacher” which he edited out of the final transcript, so it was not only when taking the student’s perspective that there were errors.
GPT-3 does not seem to have a consistent mental model of John’s cognitive abilities and learning rate. At the end John gets a question wrong (even though John has already been told the answer for that specific sequence). But earlier, GPT-3 outputs that “By the end of the lesson, John has answered all of your questions correctly” and that John “learned all the rules about parentheses” and learned “all of elementary mathematics” in a week (or a day).
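On the first point, here is the coin-flip sanity check I have in mind (pure standard-library Python; the only input is my count of 18 out of 32):

```python
from math import comb

n, correct = 32, 18  # my count from the transcript

# Probability of getting at least this many right by answering each
# balanced/unbalanced question with a fair coin flip.
p_at_least = sum(comb(n, k) for k in range(correct, n + 1)) / 2**n
print(f"P(>= {correct} correct out of {n} by chance) = {p_at_least:.3f}")
```

That probability comes out around 0.3, so 18 out of 32 is well within the range blind guessing would produce.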
I suppose one way to test this (especially if OpenAI can provide the same random seed as was used here and make this reproducible) would be to have input prompts written from John’s perspective asking the teacher questions as if trying to understand the lesson. If GPT-3 is just “play-acting” based on the expected level of understanding of the character speaking, I would expect it to exhibit a higher level of accuracy/comprehension (on average, over many iterations) when writing from the perspective of the teacher rather than the student.
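If someone wanted to actually run that comparison, the harness might look something like the sketch below. The `query_model` callable, the prompt wordings, and the sequence length are all placeholders of mine, not anything from the transcript or any particular OpenAI endpoint; the only point is the shape of the experiment: identical parenthesis sequences, two framings, accuracy averaged over many trials.

```python
import random
from typing import Callable

def is_balanced(seq: str) -> bool:
    """Same counter scan as above: balanced iff the depth never dips below
    zero and ends at zero."""
    depth = 0
    for ch in seq:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def random_sequence(rng: random.Random, length: int = 8) -> str:
    """A random string of '(' and ')' characters."""
    return "".join(rng.choice("()") for _ in range(length))

# Hypothetical prompt framings; the wording is mine, not from the transcript.
TEACHER_PROMPT = 'You are the teacher. Is the sequence "{seq}" balanced? Answer yes or no:'
STUDENT_PROMPT = 'You are John, the student, checking your own answer. Is "{seq}" balanced? Answer yes or no:'

def accuracy(query_model: Callable[[str], str], template: str, n_trials: int = 100) -> float:
    """Fraction of random sequences whose yes/no answer matches ground truth."""
    rng = random.Random(0)  # same sequences for both framings
    hits = 0
    for _ in range(n_trials):
        seq = random_sequence(rng)
        reply = query_model(template.format(seq=seq)).strip().lower()
        hits += (reply.startswith("yes") == is_balanced(seq))
    return hits / n_trials

if __name__ == "__main__":
    # Dummy stand-in that always answers "yes", just to show the harness runs;
    # swap in a real completion call to run the actual comparison.
    always_yes = lambda prompt: "yes"
    print("teacher framing (dummy model):", accuracy(always_yes, TEACHER_PROMPT))
    print("student framing (dummy model):", accuracy(always_yes, STUDENT_PROMPT))
```

Fixing the random seed inside `accuracy` means both framings are asked about exactly the same sequences, which is the sort of reproducibility the parenthetical about OpenAI providing the original seed is gesturing at.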