I think a moderately-skilled person could outperform Claude here, but it’s closer than you might think. Have you thought of running this experiment with a human on the other end?
Seconded, especially because the very low temporal resolution of intermittent still photos is a real challenge. I’m potentially open to playing as either LLM or robot here — the natural thing to do would be for @philh to play the robot again for consistency, but it would have to be either a different flat (ideal) or a different task (harder to compare). It might be better to have a Brit play the LLM, though; I certainly wouldn’t (for example) have recognized ‘Douwe Egberts’ as coffee. For fair comparison, I think it would be important to do it over text rather than a voice / video call.