I think the actual answer is: the AI isn’t smart enough and trips up a lot.
But I haven’t seen a detailed write-up anywhere explaining why the AI trips up, or what kinds of places it trips up in. It feels like all of the existing evals work optimizes for legibility, reproducibility, and being clearly defined. As a result, it doesn’t measure the one thing I’m really interested in: why don’t we have AI agents replacing workers? I suspect some startup’s internal doc on “why does our agent not work yet” would be fascinating to read and track over time.