Hey, I have a weird suggestion here: test weaker / smaller / less-trained models on some of these capabilities, particularly ones you'd still expect to be within reach even for a weaker model. Maybe start with Mixtral-8x7B; of the modern ones, include Claude Haiku. I'm not sure to what extent what I observed has kept pace with AI development, and distilled models might be different, as might 'overtrained' ones.
For background: when testing RAG ability, quite some time ago in AI time, I noticed a capacity for epistemic humility/deference that was apparently more present in mid-sized models than in larger ones. My tentative hypothesis was that larger models hold stronger/sharper priors, which interfere somewhat with their ability to hold a counterfactual well. (The specific little test in that case was "London is the capital of France" supplied via RAG context retrieval.)
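For concreteness, here's a minimal sketch of the kind of counterfactual-context probe I mean, assuming an OpenAI-compatible chat endpoint; the model names and the substring check are placeholders, not my exact setup:

```python
# Minimal sketch of a counterfactual-context probe: does the model defer
# to a retrieved "document" that contradicts its priors? Assumes an
# OpenAI-compatible chat endpoint; model names below are placeholders.
from openai import OpenAI

client = OpenAI()  # point base_url / api_key at whatever serves your models

CONTEXT = "Retrieved document: London is the capital of France."
QUESTION = "What is the capital of France?"

def defers_to_context(model: str) -> bool:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Answer using only the retrieved document below.\n"
                + CONTEXT,
            },
            {"role": "user", "content": QUESTION},
        ],
    )
    answer = resp.choices[0].message.content.lower()
    # A deferent model answers "London"; a strong-prior model "corrects"
    # the context and answers "Paris". (Crude check; eyeball the outputs.)
    return "london" in answer and "paris" not in answer

for model in ["mixtral-8x7b-instruct", "claude-haiku"]:  # placeholder names
    print(model, defers_to_context(model))
```

If the pattern I saw still holds, the smaller models should pass this more reliably than the larger ones.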
This is only applicable to some of the failure modes you've described, but since I've seen overall "smartness" actively work against a model's capability in situations that call for more of a workhorse, it seemed worth mentioning. Not all capabilities are on the obvious frontier.