If we treat each include/exclude decision for (1,2,3,4,5) as a separate question, error rates are around 20%: much better than random guessing. So why does it make mistakes at all?
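For scale, a quick back-of-envelope comparison (my own numbers, assuming the five decisions are independent):

```python
# Back-of-envelope comparison, assuming the five include/exclude decisions are independent.
p_model, p_random = 0.80, 0.50  # per-decision accuracy: ~20% error vs a coin flip
print(f"all five correct: model ~{p_model**5:.0%}, random ~{p_random**5:.0%}")
# -> all five correct: model ~33%, random ~3%
```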
If the bottleneck is how much data the model can take in from the image, then more data in the image should kill performance. So how about a grid of 3-digit numbers (random values in the range 100...999)?
3.5 Sonnet does perfectly: a perfect score on lookup(row, col), find_row_col(number), finding duplicates, and transcription to CSV.
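For concreteness, here is the kind of test I mean: a sketch (my own reconstruction, not the exact script) that renders a grid of random 3-digit numbers with PIL and keeps the grid around as ground truth for the lookup, search, duplicate and CSV tasks.

```python
# Render a grid of random 3-digit numbers to an image; keep the grid as ground truth.
import csv
import random
from collections import Counter
from PIL import Image, ImageDraw

ROWS, COLS, CELL = 10, 10, 60
grid = [[random.randint(100, 999) for _ in range(COLS)] for _ in range(ROWS)]

img = Image.new("RGB", (COLS * CELL, ROWS * CELL), "white")
draw = ImageDraw.Draw(img)
for r, row in enumerate(grid):
    for c, val in enumerate(row):
        draw.text((c * CELL + 12, r * CELL + 22), str(val), fill="black")
img.save("digit_grid.png")

# Ground-truth answers to check the model against.
def lookup(row, col):                     # value at a given row/column
    return grid[row][col]

def find_row_col(number):                 # first (row, col) containing the number
    for r, row in enumerate(grid):
        if number in row:
            return r, row.index(number)

duplicates = [v for v, n in Counter(v for row in grid for v in row).items() if n > 1]

with open("digit_grid.csv", "w", newline="") as f:
    csv.writer(f).writerows(grid)         # reference CSV transcription
```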
So this isn’t a bottleneck like human working memory. Maybe we need a higher-resolution image so the model has more tokens to “think” with? That doesn’t seem to help with the yellow-areas task above either, though.
I’m guessing this is a straightforward failure to generalise. Tables of numbers are well represented in the training data (possibly synthetic data too); visual geometry puzzles, not so much. The model has learned a few visual algorithms but hasn’t been forced to generalise yet.
The root cause might be some stupid performance optimisation that screws up image perception the same way BPE text encoding messes up byte-level text perception. My guess is sparse attention.
Text representations
Text representations are no panacea. Similar problems (e.g. rotate this grid) often have very formatting-dependent performance. Looking for sub-tasks that are probably more common in the training data and composing those (e.g. rotating by composing transpose and mirror operations, as sketched below) lets a model do tasks it otherwise couldn’t. Text has generalisation issues just like images do.
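Concretely, the decomposition is just this (a minimal sketch of the idea; transpose and mirror are the sub-tasks that presumably show up more often in training data):

```python
# A 90° clockwise rotation expressed as transpose followed by a horizontal mirror.
def transpose(grid):
    return [list(row) for row in zip(*grid)]

def mirror(grid):                     # flip each row left-to-right
    return [row[::-1] for row in grid]

def rotate_cw(grid):                  # rotate 90° clockwise = transpose, then mirror
    return mirror(transpose(grid))

print(rotate_cw([[1, 2],
                 [3, 4]]))            # [[3, 1], [4, 2]]
```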
If Claude 3.5 Sonnet has pushed the frontier in Tetris, that would be evidence for generalisation. I predict it still fails badly.