As a partial point of comparison, only about 20% of subjects in Wason's original testing solved the problem, but his experiment differed from ours in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. The takeaways differ a lot depending on whether LLMs are already comparable to or better than humans at this task, or still significantly worse.
Agreed that it would be good to get a human baseline! It may need to be out of scope for now (I'm running this as an AI Safety Camp project with limited resources), but I'll aim for it.
Would also love to take the tests. If possible, you could recruit human test subjects from different communities: a LessWrong group, a Reddit group, etc.
I volunteer myself as a test subject; DM me if interested.