Benchmarking new models
In this comment, we'll collect the results of tests we run with models released after the above post was written. We'll only test new visual LLMs that seem to have a significant chance of outperforming GPT-4V. To save on API costs (with Claude 3 Opus, for example, each 0-shot game costs about 0.5 USD), we'll first run just the basic prompt with 0-shot, check whether that produces significantly better results than GPT-4V, and decide on that basis whether to test the new model comprehensively.
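For concreteness, here is a minimal sketch of that screening rule in Python. The function name and the per-game score interface are assumptions made for illustration; only the 10-game averaging and the GPT-4V baseline of 19.6 come from the setup described in this comment.

```python
from statistics import mean

# GPT-4V's average score with the basic prompt, 0-shot (the baseline).
GPT4V_BASELINE = 19.6
# Number of games averaged in the initial screen.
N_GAMES = 10

def worth_comprehensive_testing(scores: list[float]) -> bool:
    """Screening rule: a model graduates to comprehensive testing only if
    its 0-shot average over N_GAMES games beats the GPT-4V baseline.

    `scores` holds the per-game scores; actually collecting them from a
    model's API is outside this sketch.
    """
    assert len(scores) == N_GAMES
    avg = mean(scores)
    print(f"average {avg:.1f} vs. GPT-4V baseline {GPT4V_BASELINE}")
    return avg > GPT4V_BASELINE
```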
Initial testing
Claude 3 Opus: Scored 17.5 with the basic prompt (0-shot), averaged over 10 games, compared to GPT-4V's 19.6. It frequently misclassifies tetrominoes. We decided not to test it more comprehensively.
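As a side note on what "significantly better" means with only 10 games per model: assuming the per-game scores are kept, a quick Welch's t-test can indicate whether a gap like 17.5 vs. 19.6 is distinguishable from noise. The per-game lists below are hypothetical placeholders chosen only so the means match the reported averages; they are not the real data.

```python
from scipy import stats

# Hypothetical per-game scores, chosen only so the means match the
# reported averages (17.5 and 19.6) -- NOT the real per-game data.
opus_scores  = [14, 16, 17, 18, 19, 15, 20, 18, 19, 19]   # mean 17.5
gpt4v_scores = [17, 19, 20, 21, 18, 20, 22, 19, 20, 20]   # mean 19.6

# Welch's t-test: two-sample t-test that does not assume equal variances.
t, p = stats.ttest_ind(opus_scores, gpt4v_scores, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a small p suggests the gap is not noise
```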
Comprehensive testing
So far, no model has seemed worth testing comprehensively. If you believe there is a model we have missed that could plausibly outperform GPT-4V, please send me an email.