Benchmarking new models
In this comment, we'll collect the results of tests we run with models released after the above post was written. We'll only test new visual LLMs that seem to have a significant chance of outperforming GPT-4V. To save on API costs (with Claude 3 Opus, for example, each 0-shot game costs about 0.5 USD), we'll first run just the basic prompt with 0-shot, check whether that produces significantly better results than GPT-4V, and decide on that basis whether to test the new model comprehensively.
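For concreteness, here is a minimal sketch of that screening rule in Python. The function name and the per-game score interface are assumptions made for illustration; only the 10-game averaging and the GPT-4V baseline of 19.6 come from the setup described in this comment.

```python
from statistics import mean

# GPT-4V's average score with the basic prompt, 0-shot (the baseline).
GPT4V_BASELINE = 19.6
# Number of games averaged in the initial screen.
N_GAMES = 10

def worth_comprehensive_testing(scores: list[float]) -> bool:
    """Screening rule: a model graduates to comprehensive testing only if
    its 0-shot average over N_GAMES games beats the GPT-4V baseline.

    `scores` holds the per-game scores; actually collecting them from a
    model's API is outside this sketch.
    """
    assert len(scores) == N_GAMES
    avg = mean(scores)
    print(f"average {avg:.1f} vs. GPT-4V baseline {GPT4V_BASELINE}")
    return avg > GPT4V_BASELINE
```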
Initial testing
Claude 3 Opus: Scored 17.5 with the basic prompt (0-shot), averaged over 10 games, compared to GPT-4V's 19.6. It frequently misclassifies tetrominoes. We decided not to test it more comprehensively.
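As a side note on what "significantly better" means with only 10 games per model: assuming the per-game scores are kept, a quick Welch's t-test can indicate whether a gap like 17.5 vs. 19.6 is distinguishable from noise. The per-game lists below are hypothetical placeholders chosen only so the means match the reported averages; they are not the real data.

```python
from scipy import stats

# Hypothetical per-game scores, chosen only so the means match the
# reported averages (17.5 and 19.6) -- NOT the real per-game data.
opus_scores  = [14, 16, 17, 18, 19, 15, 20, 18, 19, 19]   # mean 17.5
gpt4v_scores = [17, 19, 20, 21, 18, 20, 22, 19, 20, 20]   # mean 19.6

# Welch's t-test: two-sample t-test that does not assume equal variances.
t, p = stats.ttest_ind(opus_scores, gpt4v_scores, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a small p suggests the gap is not noise
```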
Comprehensive testing
So far, no model has seemed worth testing comprehensively. If you believe there is a model we have missed that could plausibly outperform GPT-4V, please send me an email.