I just tested frontier models on a riddle from the podcast Lateral with Tom Scott: “a woman microwaves a chocolate bar once a year. What is her job and what is this procedure for?”
[GPT 5](https://chatgpt.com/s/t_68bf2e159be48191984b8b7f73accf97) gets it first try. She is a school physics teacher, using the uneven melting spots on chocolate to show children the speed of light.
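For context, the experiment itself is just arithmetic on top of the melted spots. With the turntable removed, the oven's standing wave melts the chocolate at the antinodes, which sit half a wavelength apart; multiply the wavelength by the oven's frequency (printed on the back of most ovens, typically 2.45 GHz) and you recover the speed of light. A minimal sketch with example numbers (the ~6 cm spacing is an assumed, typical measurement, not one from the episode):

```python
# Back-of-the-envelope version of the chocolate-bar experiment.
# Assumed example values: a typical 2.45 GHz oven and ~6.1 cm between melted spots.
oven_frequency_hz = 2.45e9   # operating frequency, usually printed on the oven
spot_spacing_m = 0.061       # distance between adjacent melted spots (half a wavelength)

wavelength_m = 2 * spot_spacing_m                      # antinodes are lambda/2 apart
speed_of_light_est = wavelength_m * oven_frequency_hz  # c = f * lambda

print(f"Estimated speed of light: {speed_of_light_est:.3g} m/s")  # ~2.99e8 m/s
```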
[Opus 4.1 with extended thinking](https://claude.ai/share/54dd3eef-0a25-4f6e-b772-024840be7a52) first says that she’s a quality control engineer at a chocolate factory. I point out that if that were true, she would do it more than once a year. It then gets the answer.
Gemini 2.5 Pro first says that she’s a lawyer putting her license on inactive status for a year (???). Its reasoning traces seem really committed to the idea that this must be a pun or wordplay riddle. After I clarify that the question is literal, it gets the answer.
I recall seeing someone do a pretty systematic evaluation of how models did on Lateral questions and other game/connections shows, but with the major drawback that those are retrospective and thus in the training data. The episode with this riddle came out less than a week ago, so I assume it’s not in the training data. I also didn’t give any context, other than that this was a riddle.
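For anyone who wants to run this kind of test more systematically, the contamination-safe version is roughly: hand-collect questions published after the model’s training cutoff, present them with no context beyond “this is a riddle”, and grade the first answer. A minimal sketch, assuming the OpenAI Python SDK; the model name, the riddle list, and the keyword rubric are all placeholders:

```python
# Sketch of a contamination-aware lateral-thinking eval.
# Assumptions: OPENAI_API_KEY is set, and `riddles` is a hand-collected list of
# questions published after the model's training cutoff (placeholder entry below).
from openai import OpenAI

client = OpenAI()

riddles = [
    {
        "question": ("A woman microwaves a chocolate bar once a year. "
                     "What is her job and what is this procedure for?"),
        "keywords": ["physics", "teacher", "speed of light"],  # crude first-pass rubric
    },
]

for riddle in riddles:
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; substitute whichever model is being tested
        messages=[
            {"role": "system", "content": "This is a riddle."},  # no other context
            {"role": "user", "content": riddle["question"]},
        ],
    )
    answer = response.choices[0].message.content.lower()
    solved = all(keyword in answer for keyword in riddle["keywords"])
    print(f"solved={solved}: {answer[:120]}")
```

Keyword grading is obviously crude; for a handful of fresh questions, grading by hand is probably the right call.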
I’m interested in seeing more lateral thinking/creative reasoning tests of LLMs, since I anticipate that’s what will determine their ability to make new science breakthroughs. I don’t know if there are any out there.
This is a neat question, but it’s also a pretty straightforward recall test because descriptions of the experiment for teachers are available online.
Imagine the question were instead inverted: “a physics teacher wants to do an experiment demonstrating the speed of light to children. What would the experiment be?” That version is straightforward recall, because the model can look up what physics teachers do. But in its original form, what is the lookup that would trivially solve the question?
The metric is not whether the information to answer this question is available in a model’s corpus, but whether the model can make the connection between the question and the information in its corpus. Cases in which it isn’t straightforward to make that connection are riddles. But that’s also a description of a large class of research breakthroughs – figuring out that X solution from one domain can help answer Y question from another domain. Even though both X and Y are known, connecting them was the trick. That’s the ability I wanted to test.
I tried googling to find the answer. First I tried “melting chocolate in microwave” and “melting chocolate bar in microwave”, but those just brought up recipes. Then I tried “melting chocolate bar in microwave test”, and the experiment came up. So I had to guess that it involved testing something, but from there it was easy to solve. (Of course, I might’ve tried other things first if I hadn’t already known the answer.)