For people who do test prep seriously (I used to be a full-time tutor), this has been known for decades. One of the standard things I told every student was: if you have no idea what the answer is, guess B, because B is statistically the most likely correct answer. When I was in 10th grade (this was 2002), I had nothing to gain by doing well on the state's standardized math test, so I tested the theory that B is most likely to be correct. Sure enough, 38% of the answers on that test were B.
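As a rough sanity check on whether a 38% hit rate for "B" is more than noise, here's a quick binomial calculation. The comment doesn't say how many questions the test had, so the 50-question count below is an assumption for illustration; with four choices, random answer keys would put each letter at 25%.

```python
import math

# Assumed numbers: the comment reports 38% "B" answers but not the
# test length, so take a 50-question test for illustration.
n_questions = 50
b_answers = round(0.38 * n_questions)  # 19 answers keyed "B"
p_uniform = 0.25                       # expected share if keys were random

# Exact one-tailed binomial probability of seeing at least this many
# B's purely by chance under a uniform answer key.
p_value = sum(
    math.comb(n_questions, k) * p_uniform**k * (1 - p_uniform) ** (n_questions - k)
    for k in range(b_answers, n_questions + 1)
)
print(f"P(>= {b_answers} B answers out of {n_questions} by chance) = {p_value:.4f}")
```

Under these assumed numbers the chance probability comes out well under 10%, i.e. a 38% share of B's on a real 50-question test would be unlikely if the key were truly randomized.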
> This is pretty weird. As far as we know, humans don’t tend to prefer choices labeled B, so we’re not sure where this could have come from in the training data. As humans, it initially didn’t even occur to us to look for it!
Remember, LLMs aren’t modeling how a human reader would process the text; they’re modeling the patterns in the training texts themselves. In this case, that means they’re doing something closer to imitating test writers than test takers. And it is well known that humans, including those who write tests, are bad at being random.
This is so interesting. I had no idea that this was a thing! I would have assumed that test-writers wrote all of the answers out, then used a (pseudo-)randomizer to order them. But if that really is a pattern in multiple choice tests, it makes absolute sense that Llama would pick up on it.