I've run three of my benchmarks so far:
Extended NYT Connections
Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
Claude 3.7 Sonnet: 11th place
GitHub Repository
Thematic Generalization
Claude 3.7 Sonnet Thinking: 1st place
Claude 3.7 Sonnet: 6th place
GitHub Repository
Creative Story-Writing
Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
Claude 3.7 Sonnet: 4th place
GitHub Repository
Note that Grok 3 has not been tested yet (no API available).