It would be helpful if, on the heatmap plot, you boxed, highlighted, or hatched the column for the actual LLM in question (i.e., the answer we would expect it to give every time), so we could get a sense of how bad they were; I was a little confused. Two natural questions: did you try other languages? And I wonder what the full results would look like if you prefilled the models with the wrong name and then ran the full self-report Q&A afterward.