I have experienced similar problems to yours when building an AI tool: better models did not necessarily lead to better performance, despite what external benchmarks suggested. I believe there are two main reasons for this, both alluded to in your post:
1. Selection bias: when a foundation model company releases its newest model, it showcases performance on the benchmarks most favorable to that model, which may say little about your specific task.
2. Alignment: you mentioned that the AI does not truly understand the instructions you meant. Better prompts can mitigate this, but they do not fully solve the issue.
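One way to guard against both is a small internal eval on your own tasks, so a model swap is judged by your data rather than the vendor's charts. Here is a minimal sketch of that idea; the model names and the complete() wrapper are hypothetical stand-ins for whatever SDK you actually use, not a real API:

```python
# Minimal sketch of a task-specific eval harness. complete() is a
# hypothetical stand-in for a real model call; replace it with your SDK.
# The point: score candidate models on YOUR examples, not the vendor's
# favorite benchmarks.

def complete(model: str, prompt: str) -> str:
    """Hypothetical model call; returns a canned reply so the sketch runs offline."""
    return "Total due: 42.10"

def accuracy(model: str, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose expected answer appears in the model's output."""
    hits = sum(
        expected.lower() in complete(model, prompt).lower()
        for prompt, expected in cases
    )
    return hits / len(cases)

if __name__ == "__main__":
    # Small, hand-written cases drawn from your actual workload.
    cases = [
        ("Extract the total from: 'Total due: $42.10'", "42.10"),
        ("Extract the total from: 'Amount payable: $7.50'", "7.50"),
    ]
    for model in ("model-old", "model-new"):  # hypothetical model IDs
        print(f"{model}: {accuracy(model, cases):.0%}")
```

Even a dozen such cases surfaced regressions for me that the published benchmarks never would have, since the harness measures the exact behavior the prompts are supposed to elicit.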