Personally, when I want to get a sense of capability improvements in the future, I’m going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Same, and I’d adjust for what Julian pointed out by not just looking at benchmarks but viewing the actual stream.