What benchmarks for research taste exist? If I understand correctly, Epoch AI's evaluation showed that current AIs lack creativity in a certain sense: the combinatorial problems at IMO 2024 and the hardest problem at IMO 2025 went unsolved by SOTA AI systems. In a similar experiment of my own, Grok 4 fell back on brute-force BFS to find a construction. Unfortunately, the ARC-AGI-1 and ARC-AGI-2 benchmarks may be measuring visual intelligence more than research taste, and the promising ARC-AGI-3 benchmark[1] has yet to be finished.
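For concreteness, brute-forcing a construction via BFS looks something like the sketch below. This is a toy illustration, not the actual problem I gave Grok 4; the `+1`/`*2` move set and the pruning bound are assumptions chosen to keep the example small.

```python
from collections import deque

def bfs_construct(start, target, moves):
    """Brute-force BFS over states: returns a shortest sequence of move
    names taking `start` to `target`, or None if none is found.
    Toy sketch for illustration only."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == target:
            return path
        for name, fn in moves:
            nxt = fn(state)
            # Crude bound to keep the search finite (an assumption of this toy).
            if nxt not in seen and nxt <= 10 * target:
                seen.add(nxt)
                frontier.append((nxt, path + [name]))
    return None

# Example: reach 10 from 1 using the moves +1 and *2.
moves = [("+1", lambda x: x + 1), ("*2", lambda x: x * 2)]
print(bfs_construct(1, 10, moves))
```

The point of the experiment was exactly this contrast: exhaustive search like the above finds small constructions mechanically, whereas the creative step is spotting the structure that makes search unnecessary.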
Of course, the worst-case scenario is that misalignment comes hand-in-hand with creativity (e.g., the AIs developing a moral code of their own that doesn't adhere to the ideals of their human hosts).
[1] Which is known to contain puzzles.