The benchmark is open-ended Q&A, not multiple choice, so there is no 25% random baseline. The model generates free-text responses and has no options to select from. The 43% figure is the percentage of questions that scored 2, which means fully correct. I should have stated that more clearly.
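To make the 43% figure concrete, here is a minimal sketch of how a "fully correct" rate like that would be computed. The function name and the example scores are hypothetical, and the 0 and 1 values assume a 0-to-2 rubric; the only thing taken from the description above is that a score of 2 means fully correct.

```python
def fully_correct_rate(scores, full_mark=2):
    """Return the share of questions whose score equals the full mark."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Only answers with the full mark count; partial credit is not included.
    return sum(1 for s in scores if s == full_mark) / len(scores)


# Hypothetical per-question scores, NOT data from the benchmark.
example_scores = [2, 1, 0, 2, 1]
print(f"{fully_correct_rate(example_scores):.0%}")  # prints 40%
```

Since anything below 2 counts as a miss here, this is a strict measure: a partially correct answer contributes nothing to the percentage.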
The questions were constructed from Nigerian veterinary curriculum materials and my years of specific training as a vet student. They are not answerable from general veterinary knowledge: breed-specific production parameters for White Fulani, or field recognition cues for trypanosomiasis as a Fulani herdsman would use them, do not appear in Western literature. That is the gap being measured.