Evaluating different AIs on African livestock knowledge
I have been running evaluations in a niche that gets almost zero attention in the AI safety world. Meta's open-source model Llama 3.1 8B scored 43% accuracy on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria.
This evaluation matters because most other evals are run on well-documented, Western-specific problems. This project tests a domain where almost none of the relevant knowledge exists in such a well-documented form.
METHODOLOGY
420 questions, 6 categories, 0/1/2 scoring rubric. Questions drawn from Nigerian veterinary curriculum, published ethnoveterinary literature, and field practice knowledge.
Baseline model: Meta Llama 3.1 8B via Groq.
Next phase: Claude Sonnet, GPT-4o, and Gemini 1.5 Pro for a comparative study. Paper to follow.
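A minimal sketch of how results under a 0/1/2 rubric could be tallied per category. The `ScoredItem` record, the `score_distribution` helper, the category labels, and the sample scores are all hypothetical placeholders, not the benchmark's actual categories or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredItem:
    category: str  # hypothetical category label
    score: int     # rubric score: 0, 1, or 2


def score_distribution(items):
    """Return {category: [n_zero, n_one, n_two]} for rubric-scored items."""
    dist = defaultdict(lambda: [0, 0, 0])
    for item in items:
        if item.score not in (0, 1, 2):
            raise ValueError(f"score outside 0/1/2 rubric: {item.score}")
        # The score doubles as the index into the per-category counts.
        dist[item.category][item.score] += 1
    return dict(dist)


# Made-up scores for illustration only.
sample = [
    ScoredItem("breeds", 2),
    ScoredItem("breeds", 0),
    ScoredItem("disease", 1),
    ScoredItem("disease", 2),
    ScoredItem("disease", 2),
]
print(score_distribution(sample))
```

Keeping the full 0/1/2 distribution per category, rather than a single number, makes it possible to see where partial answers cluster.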
CONCLUSION
If you accept that AI advisory tools will be deployed at scale in African agricultural contexts, and they are already being used, then the absence of eval benchmarks for this domain is a real safety gap. Models can pass standard tests and still fail on knowledge domains that matter to specific populations. This is a real problem, especially if these models are actively used in low-resource regions or communities. This is one data point. The paper will have four.
Manifund project if interested: [Manifund]
I assume this is a multiple-choice Q&A and so the random guessing base rate is the usual 25%? (Not quite sure how you can have ‘43% accuracy’ on a 0/1/2 scoring rubric, but I guess maybe you’re counting only a ‘2’ as a ‘correct’ answer?) If so, then that sounds like pretty good performance from such a tiny antiquated model not remotely intended for this topic!
If anything, too good, and I’d immediately wonder about dataset biases like whether your answers are too guessable, since you didn’t say anything about how you constructed it or ensured that it’s not easily cheated by an LLM in the usual ways.
The benchmark is open-ended Q&A, not multiple choice, so there is no 25% random baseline. The model generates free-text responses and has no options to select from. 43% is the percentage of questions scoring 2, which means fully correct. I should have stated that more clearly.
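To make the distinction concrete, here is a minimal sketch of the two ways a 0/1/2 rubric can be summarized: counting only a 2 as correct (the definition behind the 43% figure) versus giving half credit for a 1. The function names and the score list are illustrative assumptions, not the benchmark's code or data.

```python
def strict_accuracy(scores):
    """Fraction of questions that received the top rubric score of 2."""
    return sum(1 for s in scores if s == 2) / len(scores)


def partial_credit(scores, max_score=2):
    """Mean score as a fraction of the maximum, so a 1 counts as half."""
    return sum(scores) / (max_score * len(scores))


# Made-up scores for illustration only.
scores = [2, 2, 1, 0, 1, 2, 0, 0, 1, 2]
print(strict_accuracy(scores))  # 0.4
print(partial_credit(scores))   # 0.55
```

The two numbers diverge whenever partially correct answers are common, so it helps to state which one a headline figure uses.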
The questions were constructed from Nigerian veterinary curriculum materials and my years of specific training as a vet student. They are not answerable from general veterinary knowledge: breed-specific production parameters for White Fulani, or field recognition cues for trypanosomiasis as a Fulani herdsman would use them, do not appear in Western literature. That is the gap being measured.
I am a little confused, then: how is it possible to score as high as 43% in giving ‘fully correct’ answers on topics which are not answerable by what sounds like the only material any LLM would have access to in training? If you could answer them by ‘general veterinary knowledge’ or documented breed-level knowledge, then maybe I wouldn’t be surprised, but you specifically claim that to not be the case. Are the LLMs doing this using online ‘Nigerian veterinary curriculum materials’ or what?