I just tested frontier models on a riddle from the podcast Lateral with Tom Scott: “A woman microwaves a chocolate bar once a year. What is her job, and what is this procedure for?”
[GPT 5](https://chatgpt.com/s/t_68bf2e159be48191984b8b7f73accf97) gets it first try. She is a school physics teacher, using the uneven melting spots on chocolate to show children the speed of light (the spots mark antinodes of the microwave’s standing wave, so their spacing is half the wavelength; multiply by the oven’s frequency to get c).
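For readers unfamiliar with the demonstration, here is a minimal sketch of the classroom calculation. The frequency is the typical magnetron value printed on most microwave ovens; the spot spacing is a hypothetical measurement, not a number from the podcast.

```python
# Melted spots mark antinodes of the microwave's standing wave,
# so the distance between adjacent spots is half a wavelength.

frequency_hz = 2.45e9   # typical magnetron frequency (printed on most ovens)
spot_spacing_m = 0.06   # hypothetical measured gap between melt spots (~6 cm)

wavelength_m = 2 * spot_spacing_m              # full wavelength
speed_estimate = wavelength_m * frequency_hz   # c ≈ wavelength × frequency

print(f"Estimated speed of light: {speed_estimate:.2e} m/s")  # ~2.94e8 m/s
```

The estimate comes out within a couple of percent of the true value of c (2.998e8 m/s), which is what makes it a nice classroom demo.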
[Opus 4.1 with extended thinking](https://claude.ai/share/54dd3eef-0a25-4f6e-b772-024840be7a52) first says that she’s a quality control engineer at a chocolate factory. I point out that if that were true, she would do it more than once a year. It then gets the answer.
Gemini 2.5 Pro first says that she’s a lawyer putting her license on inactive status for a year (???). Its reasoning traces seemed really committed to the idea that this must be a pun or wordplay riddle. After I clarify that the question is literal, it gets the answer.
I recall seeing someone do a pretty systematic evaluation of how models did on Lateral questions and puzzles from other game shows like Connections, but with the major drawback that those are retrospective and thus likely in the training data. The episode with this riddle came out less than a week ago, so I assume it’s not in the training data. I also didn’t give any context, other than that this was a riddle.
I’m interested in seeing more lateral-thinking/creative-reasoning tests of LLMs, since I anticipate that’s what will determine their ability to make new scientific breakthroughs. I don’t know if there are any out there.
Imagine the question were instead inverted: “A physics teacher wants to do an experiment demonstrating the speed of light to children. What would the experiment be?” Now this is straightforward recall, because the model can look up what physics teachers do. But in the form it was actually asked, what is the lookup that would solve the question trivially?
The metric is not whether the information to answer this question is available in a model’s corpus, but whether the model can make the connection between the question and the information in its corpus. Cases in which it isn’t straightforward to make that connection are riddles. But that’s also a description of a large class of research breakthroughs – figuring out that X solution from one domain can help answer Y question from another domain. Even though both X and Y are known, connecting them was the trick. That’s the ability I wanted to test.