This metric also ignores invalid answers (refusals or gibberish).
If you don’t ignore invalid answers, do the results change significantly?
Refusals were mostly 1-2%, so ignoring them doesn’t change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn’t matter.
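To make the comparison concrete, here is a minimal sketch of how the score could be computed both ways. The label names (`correct`, `incorrect`, `refusal`, `gibberish`), the `accuracy` helper, and the toy data are all assumptions made for illustration; they are not the actual evaluation code or results.

```python
from collections import Counter

def accuracy(labels, ignore_invalid=True):
    """Fraction of correct answers among graded responses.

    labels: list of strings, each one of the hypothetical categories
    'correct', 'incorrect', 'refusal', 'gibberish'.
    ignore_invalid: if True, refusals and gibberish are dropped from
    the denominator; if False, they count as wrong answers.
    """
    counts = Counter(labels)
    valid = counts["correct"] + counts["incorrect"]
    denominator = valid if ignore_invalid else len(labels)
    if denominator == 0:
        return float("nan")
    return counts["correct"] / denominator

# Toy data, invented purely for illustration.
labels = ["correct"] * 6 + ["incorrect"] * 2 + ["refusal"] + ["gibberish"]

print(accuracy(labels, ignore_invalid=True))   # 6 / 8
print(accuracy(labels, ignore_invalid=False))  # 6 / 10
```

The two numbers differ only through the denominator, so the gap between them grows with the share of invalid answers.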