A discussion of the paper, “Large Language Models are Zero-Shot Reasoners”

A recent discovery about large language models like GPT-3 and InstructGPT is that specific prompts can elicit text generation that follows some basic reasoning. These chain-of-thought prompts can be as simple as “Let’s think step by step” and produce zero-shot successes on math word problems and other reasoning tasks. Although these LLMs are not AGI, they are built on the now-ubiquitous Transformer architecture that future multi-modal proto-AGI might be based upon. Hopefully, future models will be able to answer important ethical questions, or at least give an indication of the thought processes behind their outputs.
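As a rough illustration of how the paper’s zero-shot chain-of-thought setup works, the model is prompted twice: first with the question plus “Let’s think step by step” to elicit a reasoning trace, then with that trace plus an answer-extraction cue. Here is a minimal sketch in Python, assuming a hypothetical `call_llm(prompt)` wrapper around whatever completion API you happen to use:

```python
# Minimal sketch of zero-shot chain-of-thought prompting (Kojima et al., 2022).
# `call_llm` is a hypothetical stand-in for any text-completion API call.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its completion."""
    raise NotImplementedError("wire this up to your model of choice")

def zero_shot_cot(question: str) -> str:
    # Stage 1: reasoning extraction -- elicit a step-by-step rationale.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_llm(reasoning_prompt)

    # Stage 2: answer extraction -- append a format cue and re-prompt.
    answer_prompt = (
        f"{reasoning_prompt} {reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return call_llm(answer_prompt).strip()
```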

In the paper (https://arxiv.org/pdf/2205.11916v1.pdf), some of the failure examples are really interesting, and I wanted to delve into the specifics.

Example 1) Q: While making desserts for a bake sale, Victor used 0.625 of a scoop of brown sugar as well as 0.25 of a scoop of white sugar. How much more brown sugar did Victor use? A: Let’s think step by step. 0.625 of a scoop of brown sugar is the same as 0.25 of a scoop of white sugar. This means that Victor used the same amount of brown sugar as he did white sugar. Therefore, the answer (arabic numerals) is 0. ✗(Ground Truth).

In this case it appears the model had a problem with the phrasing “as well as” and took it to mean “exactly the same.” I can see that phrasing being somewhat ambiguous, and I think further training could quickly fix this type of error.
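For reference, the intended computation is just a subtraction; the 0.375 figure below is my own arithmetic, since the excerpt only marks the model’s answer as wrong:

```python
brown_sugar = 0.625
white_sugar = 0.25
print(brown_sugar - white_sugar)  # 0.375 of a scoop more brown sugar than white
```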

Example 2) Q: Melanie bought a Batman game for $ 6.95 , a strategy game for $ 7.90 , and a Superman game for $ 7.73 . Melanie already owns 4 games . How much did Melanie spend on video games ? A: Let’s think step by step. Melanie bought a Batman game for $ 6.95. So she spent $ 6.95 on video games. Then she bought a strategy game for $ 7.90. So she spent $ 14.85 on video games. Then she bought a Superman game for $ 7.73. So she spent $ 22.58 on video games. Melanie already owns 4 games. So she spent $ 18.58 on video games. Therefore, Melanie spent a total of $ 18.58 on video games. Therefore, the answer (arabic numerals) is: 18.58 ✗ (GT : 22.58)

We can immediately see the issue: the model reads the number 4 as a price rather than a count of games and subtracts it from the originally correct total of 22.58. In this failure case, a future multi-modal transformer might be able to build a visual image of the situation and catch the error. I’m curious to know what others would see as a solution.
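For comparison, the intended computation simply sums the three prices and ignores the count of games Melanie already owns; a quick sketch of the check:

```python
prices = [6.95, 7.90, 7.73]   # the three games Melanie bought
games_already_owned = 4       # a distractor -- a count of games, not a dollar amount
total_spent = sum(prices)
print(round(total_spent, 2))  # 22.58, the ground-truth answer
```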

Example 3) Q: Mary is baking a cake. The recipe calls for 12 cups of flour 14 cups of sugar and 7 cups of salt. She already put in 2 cups of flour. How many more cups of flour than cups of salt does she need to add now? A: Let’s think step by step. Mary has already put in 2 cups of flour. The recipe calls for a total of 12 cups of flour. That means Mary needs to add 10 more cups of flour. The recipe calls for a total of 7 cups of salt. That means Mary needs to add 5 more cups of salt. So, Mary needs to add 10 more cups of flour than she does salt. Therefore, the answer (arabic numerals) is 10. ✗ (GT : 3)

As a child who was awful at math problems, I probably would’ve gotten this one wrong, as relating proportions was one of my weak points. That said, the wording relating salt to flour is tricky even for young children; it could probably just be rephrased for clarity, and the model would then get the correct answer.
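Working it out explicitly: flour still needed is 12 − 2 = 10 cups, salt still needed is 7 cups (none has been added), so the difference is 3. The model’s slip was effectively subtracting the already-added flour from the salt as well. A quick check:

```python
flour_needed = 12 - 2   # 2 cups of flour are already in the bowl
salt_needed = 7         # no salt has been added yet
print(flour_needed - salt_needed)  # 3, the ground-truth answer
```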

Example 4) Q: David watched some nesting birds using his binoculars while on vacation. Where might David be?. Answer Choices: (A) sky (B) vacation (C) forest (D) countryside (E) roof A: Let’s think step by step. First, we need to understand what the word “binoculars” means. Binoculars are two telescopes that are joined together and allow a person to see things that are far away. Now, we need to think about where a person might use binoculars. A person might use binoculars to see things that are far away and in the sky, or to see things that are far away and on the ground. So, the answer could be either (A) sky or (D) countryside. Therefore, among A through E, the answer is (A) or (D). ✗ (GT : C)

I actually side with the model on this one, as the question should really be about probability distributions. F) All of the above.

Example 5) Q: Is Mickey Mouse hypothetically unlikely to make a purchase at Zazzle? A: Let’s think step by step. First, we need to consider what Zazzle is. Zazzle is a website where people can create and sell custom products. This includes things like t-shirts, mugs, phone cases, and more. Next, we need to think about what Mickey Mouse is. Mickey Mouse is a cartoon character created by Walt Disney. He is a mouse who wears red shorts, large yellow shoes, and white gloves. Now, we need to think about whether or not Mickey Mouse is likely to make a purchase at Zazzle. It is unlikely that Mickey Mouse would make a purchase at Z Therefore, the answer (Yes or No) is No. ✗ (GT : Yes)

I love this chain-of-thought answer, as it feels logical. The “unlikely” part seems like it could be the factor tripping up the model. Others might feel that the model is actually correct, though.

Example 6) Q: Take the last letters of each words in “Lamont Frederick Lance Esperanza” and concatenate them. A: Let’s think step by step. The last letter of “Lamont” is “t”. The last letter of “Frederick” is “k”. The last letter of “Lance” is “e”. The last letter of “Esperanza” is “a”. So the final answer would be “tkaea”. Therefore, the answer is “tkaea”. ✗ (GT : tkea)

Admittedly, I am stumped here; I can’t understand where it added the extra “a”. Further study might find that the model misunderstands the word “concatenate”. For this issue, looking at the training dataset might lead us to where the breakdown happened.
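The task itself is easy to verify mechanically; a small sketch of the intended last-letter concatenation, which gives the ground truth “tkea”:

```python
name = "Lamont Frederick Lance Esperanza"
last_letters = "".join(word[-1] for word in name.split())
print(last_letters)  # "tkea" -- the model's "tkaea" has an extra "a"
```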

So, I am curious about your thoughts on these specific examples, and even more interested in your ideas for possible solutions.
