I give more weight to prompts where I can easily evaluate the answer as true or false, e.g., questions about the opening hours of places, prime numbers, or which cities are closest to London, especially if the correct answer would be my best prediction for a human answer.
Interesting. To me these kinds of prompts seem less compelling, since they’re largely a matter of just looking things up. It’s certainly true that they’re easier to evaluate. But more creative tasks feel like they test the ability to apply knowledge in a novel way and to understand what various words and concepts mean, and those are the kinds of tasks that feel more relevant to testing whether GPT-4 is more “actually intelligent”.