What benchmarks for research taste exist? If I understand correctly, Epoch AI's evaluation showed that current AIs lack creativity in a certain sense: the combinatorial problems at IMO 2024 and the hardest problem at IMO 2025 went unsolved by SOTA AI systems. In a similar experiment of my own, Grok 4 fell back on brute-force BFS to find a construction. Unfortunately, the ARC-AGI-1 and ARC-AGI-2 benchmarks may be measuring visual intelligence more than research taste, and the promising ARC-AGI-3 benchmark[1] has yet to be finished.
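For concreteness, brute-forcing a construction via BFS looks something like the sketch below. This is a toy illustration, not the actual problem I gave Grok 4; the `+1`/`*2` move set and the pruning bound are assumptions chosen to keep the example small.

```python
from collections import deque

def bfs_construct(start, target, moves):
    """Brute-force BFS over states: returns a shortest sequence of move
    names taking `start` to `target`, or None if none is found.
    Toy sketch for illustration only."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == target:
            return path
        for name, fn in moves:
            nxt = fn(state)
            # Crude bound to keep the search finite (an assumption of this toy).
            if nxt not in seen and nxt <= 10 * target:
                seen.add(nxt)
                frontier.append((nxt, path + [name]))
    return None

# Example: reach 10 from 1 using the moves +1 and *2.
moves = [("+1", lambda x: x + 1), ("*2", lambda x: x * 2)]
print(bfs_construct(1, 10, moves))
```

The point of the experiment was exactly this contrast: exhaustive search like the above finds small constructions mechanically, whereas the creative step is spotting the structure that makes search unnecessary.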
Of course, the worst-case scenario is that misalignment comes hand-in-hand with creativity (e.g., the AIs developing a moral code of their own that doesn't adhere to the ideals of their human hosts).
[1] Which is known to contain puzzles.