O1 now passes the simpler “over yellow” test from above, though it still fails the picture book example.
On a complex mechanical drawing, O1 was able to work out the easier dimensions, but it tends to fail on anything more complicated. Perhaps the full O3 will do better, given its ARC-AGI benchmark performance.
Meanwhile, Claude 3.5 and 4o do somewhat worse, failing to correctly identify axial and diameter dimensions.
Visuospatial performance is improving, albeit slowly.
A much appreciated update, thank you!