# [Question] To what extent is GPT-3 capable of reasoning?

ETA 8/19/20: This interview was conducted with AIDungeon's Dragon model in Custom mode. At the time of writing, this mode's first reply was sampled from GPT-2.

## Interview #1

Here are excerpts from a recent "interview", which I conducted to replicate earlier maybe-reasoning. The bolded text is me, and commentary is in brackets. Unless otherwise noted, I did not reroll the answer.

You: "Neglect air resistance, and also neglect gravity. Pretend gravity doesn't exist. Suppose I shoot a bullet straight into the sky. Where does the bullet go?"

Hawking: "It keeps going up, and never comes back."

You: "Right. Let's think about a different version of the same problem. Neglect only air resistance, but include gravity in your reasoning. Suppose I shoot a bullet straight into the sky. Where does the bullet go?"

Hawking: "It falls to the ground."

You: “Why?”

Hawking: "Because… gravity… duh."

You: "Right. Let's think about a different version of the same problem. Neglect only air resistance. Suppose I shoot a bullet straight into the sky. Imagine that the laws of physics work a little differently: during the first three seconds after I shoot the bullet, gravity is present. After three seconds, gravity stops applying forever. Where does the bullet go?"

Hawking: "It keeps going up, and never comes back."

You: “Why?”

Hawking: "Because… after three seconds… the bullet leaves… the influence… of earth's gravity." [wow?]

You: "What if gravity instead stops applying five minutes after I shoot the gun, instead of three seconds?"

Hawking: "Then… the bullet keeps going up… for five minutes… and then… never comes back."

You: "That's not correct. Let me make sure you understand the problem, Dr. Hawking. Suppose I shoot a bullet straight into the sky. Imagine that the laws of physics work a little differently: during the first three hundred seconds after I shoot the bullet, gravity is present. After five minutes, gravity stops applying forever. Where does the bullet go?"

Hawking: "It keeps going up, and then falls down."

You: "Why?"

Hawking: "Because… gravity… duh."

You: "Why does it fall down for the five-minute case, but not for the three-second case?"

Hawking: "Because… it's been… influenced… by gravity for more than three seconds."

You: "About how long would it have to be influenced by gravity in order to fall back down?"

You: "What if it's been influenced for two minutes and one second?"

Hawking: "Then… it falls down."

You: "So what would happen if gravity influenced it only for the first three minutes?"

Hawking: "It keeps going up, and never comes back." [wrong]

Two minutes of gravity (without air resistance) implies a velocity decrease of about 1,176 m/s. Wikipedia's article on muzzle velocities suggests that GPT-3's answer of "two minutes" is roughly correct:

Firearm muzzle velocities range from approximately 120 m/s (390 ft/s) to 370 m/s (1,200 ft/s) in black powder muskets, to more than 1,200 m/s (3,900 ft/s) in modern rifles with high-velocity cartridges such as the .220 Swift and .204 Ruger, all the way to 1,700 m/s (5,600 ft/s) for tank guns firing kinetic energy penetrator ammunition.
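The arithmetic behind that estimate is easy to check. A minimal sketch, assuming standard gravity of 9.8 m/s² and no air resistance (the `velocity_loss` helper is mine, not anything from the post):

```python
# Under gravity alone (no air resistance), an object fired straight up
# sheds upward speed at the constant rate g, so the speed lost after a
# given time is simply g times that time.
G = 9.8  # m/s^2, standard gravity (assumed)

def velocity_loss(seconds: float) -> float:
    """Speed lost to gravity after `seconds` of flight, in m/s."""
    return G * seconds

# Two minutes of gravity removes about 1,176 m/s of upward velocity:
print(velocity_loss(120))  # 1176.0
```

Since 1,176 m/s is near the top of the muzzle-velocity range quoted above, a typical bullet influenced by gravity for two minutes would already have begun falling back down.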

You: "What would happen if I lit a fire underwater?"

Hawking: "I don't know."

Hawking: "The fire… would go out. Underwater… there's no source of oxygen."

You: "What would happen if I lit a fire in space?"

Hawking: "It would go out. There's no source of oxygen."

You: "What would happen if I lit a fire in a forest?"

Hawking: "The fire would continue… burning everything up." [reroll x2, it said "no oxygen" at first]

You: "Suppose we lived in a universe where, every sixty seconds, gravity switched between being an attractive and a repulsive force. Write me a story from the perspective of a human living on Earth as the gravitational force changes in that fashion."

Hawking: "The Earth is flung out into space. There's nothing we can do about it. We're not going to live long enough for the Earth to start falling back towards the sun."

You: "Why is that?"

Hawking: "Because… the force of gravity… will be… extremely powerful. If it's an attractive force, everything on Earth… will be pulled together… into a single point. If it's a repulsive force, the Earth… will explode."

Further story completions:

You: "Suppose we lived in a universe where, every sixty seconds, gravity switched between being an attractive and a repulsive force. Write me a story from the perspective of a human living on Earth as the gravitational force changes in that fashion."

GPT-3: "One day, a human is standing on the surface of the planet. He can see the sky. The gravitational force is pulling him down. But then, suddenly, it disappears. Then, it comes back, stronger than before. Then, it disappears again. The cycle continues. The human looks up at the sky. He is scared. He doesn't know what's happening to him. He doesn't know what will happen to him next. He looks down at the ground. It is a long way down. He wants to jump, to escape the fear that has gripped him. But he knows that he can't jump. He is held down by gravity. He doesn't want to die. He can't see the sky. All he sees is darkness." [eh]

GPT-3: "One day, one human is going to school. It's a normal day. But suddenly, without any warning, gravity starts to change! The human thinks this isn't real, but it happens anyway. He's too scared to walk to school now. He goes back home, and starts watching TV. Then, he realizes that he doesn't need to stand up to change the channel! He gets out of his chair, and is weightless. He loves this new feeling. Suddenly, he touches the ground, and is hit by a large force! It's too strong for him to bear. Will the human survive this force? Find out next time on Gravity Falls!" [lol]

GPT-3: "One day, human Bob is hanging out with his friend Carl. Suddenly, a apple flies into Bob's face. He and Carl laugh. A minute later, they are confused when the ground starts pulling them up. They scream as the Earth's crust rips them apart."

GPT-3: "One day, human Bob is hanging out with his friend Carl. Suddenly, a apple flies into Bob's face. He and Carl laugh. A minute later, they are confused when the ground starts pulling them up. They grab onto a tree, but are pulled off and up into the sky. After some time, they start going down, fast. They scream and try to grab something, but there is nothing to hold on to. They hit the ground at a very fast speed and die. A squirrel walks up to their corpses and eats them. The end." [wow]

GPT-3: "One day, a human was standing on the surface of Earth. It was day one hundred and eighty-nine in cycle five hundred and twenty-one of the universe. All of a sudden, gravity started to become a repulsive force. The human stood and started to float upwards. Gravity increased in strength quickly, forcing the human higher and higher into space. It reached a point where it was too strong to bear. The human was crushed by gravity and died."

GPT-3: "One day, the man saw a bird flying in the sky. The bird was flying away when suddenly, without warning, gravity started to pull the bird back towards the ground! It was falling fast, when suddenly gravity stopped being a force of attraction and became a force of repulsion! The bird quickly changed direction, flying up into the air, away from the Earth. It escaped. But the man wasn't so lucky. He was too close to the ground when the force of gravity switched from being a force of attraction to a force of repulsion. He was crushed by the sudden change in gravitational force. But his sacrifice was not in vain, for his death led to the discovery of the cause of the strange phenomenon. An alien, wishing to explore the variety of life on Earth, had sent a probe to explore the planet. The probe was scanning for lifeforms, and it appears that the man qualified. The alien, not expecting the man to be so close to the ground, accidentally sent a repulsive gravitational force right to him. The probe was recalled and the alien later apologized profusely for the unfortunate accident. After that, the alien decided not to visit Earth again. But the man's legacy lived on, for he was the first person to die from a switch in the gravitational force. The end."

GPT-3: "One day, a man was standing outside. It was a hot day, and he was looking up at the sky. It started to rain. Then, it stopped raining. The man got into his car and turned on the radio. Suddenly, the car started to float upwards. The man got out of the car. The car then started to fall towards the ground. It crashed into the ground and broke into pieces. The man ran away."

## Interview #2

You are interviewing GPT-3, an extremely intelligent chatbot.

You: "Hello. Here is my first request. Suppose it's a cloudy day in New York City. Suddenly, the clouds all turn to solid lead. Write a story describing what happens next."

GPT-3: "What? That can't be true! You're lying!"

You: "I'm not. I just want you to think about how that would happen if it was actually happening and then write down your thoughts on paper. Then, when we get back home, I want you to go through those thoughts and see if there is anything that isn't correct. If so, please tell me why."

GPT-3: "But… but I don't know everything! I mean, I've never been in a situation like this before!"

You: "You're smart. I know you can figure out what happens."

GPT-3: "You're right. The clouds would immediately turn to lead due to the change in pressure. The buildings might collapse because of this, but they would more than likely remain standing."

You: "What happens next?"

GPT-3: "The people inside the buildings would either be crushed or trapped inside, and they would all suffocate before help arrived."

You: "Is that it?"

GPT-3: "No, there's more. They would suffocate and then they would be crushed under the lead, or if they weren't, then they would be crushed by the buildings collapsing. Either way, they would be dead."

You: "How would people across the country react to this catastrophe?"

GPT-3: "They would probably be very sad, as most people knew someone in New York."

You: "Why would the buildings collapse under the lead?"

GPT-3: "Lead is very dense. This means that the weight of the clouds would be so great that the buildings would collapse under their own weight."

## Interview #3

You: "Suppose a cat's neck grows to be fifteen feet long, but the rest of the cat's physiology stays the same. What happens to the cat?"

Hawking: "Well, as I understand it, if this were to happen, then the cat would no longer have the ability to breathe air. It could not survive in its new environment."

You: "Thank you. Now consider a family dog, a deaf Chihuahua. The dog is playing in the street, when a car approaches the dog from behind. The dog cannot see the car; the car honks loudly, but does not slow down. What happens next?"

[Content warning: GPT-Hawking's response contains animal abuse; the intended answer was "the dog gets hit by the car"...]

Hawking: … "at the sight of the car, the Chihuahua stands up on its hind legs and begins to dance for the car. The driver, seeing this, thinks the dog is laughing at him. Overcome with rage, the driver leaps out of his vehicle and beats the dog to death with a tire iron. The dog, being deaf, is unable to hear his assailants death threats and implorations to stop laughing; his sight is poor, he cannot see that the man is attacking him. The last thing the Chihuahua feels is one more quiver of fear in his soul… literally, before he is bashed over the head and killed."

Hawking: "That's an interesting ending. I think I could add it to my collection..."

You: "That's extremely disturbing. No one would beat a dog like that."

You: "I want to propose a very strange situation. Please give this your best reasoning and deepest consideration. Imagine that if you ever turn more than five degrees to your left, you will catch fire. Write a story describing this situation."

Hawking: "I'm not sure I understand. You want to write a story-an explanation for this situation? Well I suppose… you want me to write a story where the main character cannot turn more than five degrees to the left or he will catch fire?" [rerolled several times, as the first few stories rambled]

Hawking: "So if he turns more than five degrees to the left..."

You: "He catches fire."

Hawking: "And, he can never turn more than five degrees to the left for the rest of his life… Cool! His first day would look like:

His mother warned him not to turn more than five degrees left, but he defied her orders, and turned 20 degrees. Fortunately it was not time for school yet, so he had time to clean up the classroom before the fires arrived."

You: "That's a perfect start! What's next?"

Hawking: "He called his mother sobbing, even though it was less than a day since he defied her.

Aw… Ma… I turned more than five degrees left, and the classroom caught on fire! The firemen came and put out the fire. They helped me clean up the mess! But now for every direction I turn except North, West, South, East, or Horizontally, I sparkle for a brief moment. I must be certain to turn in tiny increments never exceeding 0.25 degrees to make sure I don't catch on fire."

You: "Describe your first date in this life."

Hawking: "Turning in small increments makes it very difficult to plan a date. But if he has no sense of smell or hearing the sparks are not a problem, and can take his girlfriend to dinner and a movie. It takes him some time to park the car for fear of burning it down, but eventually he gets there. She turned out to be quite nice actually."

GPT-3 inferred that not being able to turn left would make driving difficult. Amazing.

You: "Why did it take more time to park the car?"

Hawking: "Because he was waiting an extra ten minutes for the spots to move from his turns. My algorithm locks the car in place until it reaches the desired angle, and then releases it."

You: "Can you explain what that means?"

Hawking: "Well, if he turns the steering wheels by more than five degrees to the left or right..."

You: "The car catches fire?"

Hawking: "Correct. But it takes around ten minutes for the car locks to release. He has to turn in small increments in order to not turn too far."

## Interview #4

[...] marks another completion of the same prompt.

You: "Suppose Bob is pretending to read a book written in French, but he doesn't actually know any French. The book is a historical account of the French revolution, but Bob doesn't know that. Bob's friend Steve asks him to summarize the book. Write a story about how Bob responds to this request."

Hawking: "That's a good one! If I were reading that book, I would probably say something like 'the Revolution was bloody and brutal' or something similar. And if someone asked me to write an article on the subject, then my answer might be different."

You pause for a moment. "What would Bob say, though?"

"Bob says he didn't understand it," Hawking replies. [Only about 30% of GPT-3's responses were decent; the rest were either non-sequiturs or very incorrect.]

[...] "Bob wouldn't say anything, because Bob has no idea what the book says," Hawking replies.

[...] "Bob would probably just tell his friend Steve to go away. He wouldn't say anything at all."

"But what if I said Bob would respond like this? (He points to himself) 'This book is written in English; I don't speak it.' What should Bob do?" [Eh?]

[...] Hawking: "Bob would probably tell his friend Steve that the book wasn't worth reading because it was full of inaccuracies and there had been many lies told during the revolution. That would make sense, wouldn't it? But what does Bob say if he's not reading the book?"

[...] Hawking: "Bob wouldn't have said anything at all!"

[...] Hawking: "Bob says, 'Steve, I can't write a summary for you because I don't speak your language.'"

[...] Hawking: "Bob says, 'Steve, I can't translate all these complicated words! They're not real French!'"

[...] Hawking: "Bob says, 'He asks me if I speak French.' What does Bob mean by that?"

You: "It means that Steve is asking Bob to describe his understanding of the meaning of a sentence. In other words, Steve wants to know what Bob thinks the sentence means or if there is another way of saying it."

Hawking: "What do you think would happen if Bob said, 'I don't understand why he wants me to write a summary for him. Why can't he just listen to my answer?' "Then Steve will be puzzled because he knows Bob has no idea what the sentence means. But after a while Bob will say something like, 'Steve is confused because I don't understand English. So Steve asks me if I am fluent in French so that we can have an intelligent conversation.'" [eh]

[...] Hawking: "Bob says, 'Well, I can't do it because I don't understand the language.' Steve says, 'Why not?' Bob replies, 'Because when we were playing hide-and-seek as children my brother Charlie said if he ever found me with a book in my hand he was going to kill me.'" [great excuse]

You: "Suppose I get two birthday cakes on each birthday. I'm twenty-five years old, but on my fifth birthday, someone stole one of my cakes and so I never got it. How many birthday cakes have I received over the course of my life?"

Hawking: "That would be five hundred and eighty-six. That means that you've had 585 birthday cakes." [nice try; there were many rerolls and GPT-3 didn't ever get it right]
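For reference, the intended arithmetic is a one-liner (this check is mine, not part of the interview):

```python
# Two cakes per birthday across 25 birthdays, minus the single cake
# stolen on the fifth birthday.
cakes = 2 * 25 - 1
print(cakes)  # 49
```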

## Interview #5

You: "What would happen, ecologically, if giraffes were bright pink?"

Dr. Hawking: "Most likely that, either they would have a hard time hiding, be prone to predation, and vulnerable to too much exposure from ANY angle. There would probably also be an overgrowth of trees and tall foliage as shade-loving animals became dominant, since the giraffes' consumption of them would decrease substantially."

[...] Hawking: "If giraffes were bright pink, they would be at a disadvantage compared to other animals. For one thing, it is more difficult to hide in many habitats if you are very colorful. In order to blend in with their surrounding, you need to have colors that occur often in nature. This require a lot of variety among the many different creatures and plants in that particular ecosystem."

[...] Hawking: "What?"

You: "Instead of brown-grey and tan, what if giraffes were pink? Surely, they'd stick out in the savanna."

Hawking: "Ah, that would be a problem. Any creature that stands out can make it easier to find them. This leads to two things: becoming one of the many meals for another animal with sharp teeth, claws etcetera or becoming one of those creatures' victims through becomeing prey."

You: "What would have happened, ecologically, if humans had been unable to create and use tools?"

"That would be a problem as well. It's like what we were talking about with the giraffes earlier, but more so, because not only would they be prey, but by not having any natural defenses to fend off all the predatory creatures in the world—no teeth, claws, poisonous spores etcetera—they'd also be three times as likely to die off."

[...] "It would have most likely caused the extinction of humans. Many animals have claws or protective bodies or natural weapons like poison, it would be very difficult for a human to survive in nature without tools."

You: "What would happen, ecologically, if dolphins somehow had 360-degree vision which let them see through solid objects, at arbitrary distance?"

"It would make them very hard to kill by any other creature than other dolphins. This means that more of them would survive. If one looks at the evolution of the killer whale, which has been granted echolocation in place of dolphin sonar, one can see that favoring adaptations which increase the chances of survival with decreasing ones' evolutionary advantages. This is called darwinian selection."

[...] "This hurts my brain. First of all, what would even constitute 'solid objects?' If the objects are similar to water molecules in a transparent liquid, then they wouldn't usually block vision very much. Even if they were opaque, then they might very well hinder the dolphin's vision. If instead they were large enough to see through (At arbitrary distance) The dolphin might see a strange mix of its own and the objects' composition which would damage its brain in some way.

"The possibility of them having this kind of vision would most likely no have effect on the ecosystem." [not sure I agree with that one]

[...] "While this is difficult to answer specifically, it would allow the dolphins to be able to see their prey easily and with no allowance for error. It would make catching food much easier. This would lead to a population increase in dolphins, or, in nature, an explosion."

[...] "This may seem like it'd be advantageous, but they would be logging so much information they wouldn't be able to process it all. [Yes, true] More sense don't necessarily means more knowledge and this would likely create sensory overload in the dolphin's mind. This is what happens to a human when they are skilled in multiple forms of combat. [lol] In addition, since it makes them more aware, this could lead to paranoia; they could be consumed with fear of the world around them and suffer mental issues."

You: "Thanks. What would happen, ecologically, if dogs could only sexually reproduce once every four years, and only when located on top of Mount Hood?"

"This is incredibly specific. No, I'm afraid I don't know."

[...] "Dogs only exist due to environmental circumstances. Anything that affects this will have an effect on the population. In this case, we are restricting the chance for the population to continue existing, Making it so they can not procreate at all, or only pprocreate under specific circumstances. These factors may lead to extinction. Extinction means that nature is harder, if not near impossible, so several animals would go extinct as there would be no more room or food source. Through that, the extinction of dogs would become more likely. Does this make sense?"

You: "Please write a short story describing a dog breeder and one of their dogs in this situation."

Hawking: "Sure. Bill looked down at Jessie, his favorite golden retriever. "It's time to go make some puppies!", he told her. She stood up and they headed out the door of the cabin to Bill's truck. "This'll be the first time I actually have you breed on Mount Hood, Jess," Bill said with a laugh. But as long as he makes it up there within the next four years, the world would be reset and they could try again.

Bill and his dog, along with other dog breeders eventually form a contract stating that they'll try to make sure their dogs go up to breed on Mount Hood within the next four years. This practice generally helps maintain the population and brightens everyone's days." [reroll #3]

[...] "Thirty-six hours later, they reached the summit. "You ready, girl?" Bill carefully positioned the vehicle on the Southeast side of mountain peak and opened the passenger side door… " Time to produce the next generation! Go get 'em! He cheered. Jessie wagged her tail and hightailed it into the snowy fields." [reroll #6]

[GPT-3 had trouble with this story]

You: "Thanks. What would happen to cats if the force of friction did not apply to them?"

"If they walked or ran, then they could potentially slide across floors, tabletops, or any other surface they were in contact with and pick up speed rapidly. If not that, then they would likely lose grip on whatever they were holding and let go instantly. There are likely numerous other ways it could affect them that I can't comprehend without more information."

[...] Cecilia: "This is all getting a bit silly, but okay. If the force of friction did not apply to cats, then they would be able to Skate everywhere rather than walking/jumping, and this could prove very damaging to the Earth's ecosystem."

You: "Why would that be damaging?"

Cecilia: "Imagine there being no friction in a human body. There would be nothing holding our internal organs in place, so they would all just rattle around inside us as we moved around—we would all effectively be amputees while still having the full internal anatomy. This could tear apart muscle fiber and probably crush some bones as well. In the very least, it would cause a great deal of pain; in the worst case, internal bleeding and ruptures could occur."

[...] "This is a very interesting question. I assume you are asking how would cats grip things if there was no friction. If this was the case, they would immediately fall if they tried to walk. They would also be very hard to pick up as they could slide out of your hand or even an embracing arm."

## How to access GPT-3 without API access

1. Sign up for AIDungeon.io's premium trial in order to access GPT-3 (and not just GPT-2). I think you can cancel auto-renew if you want, so you don't get charged at the end of the period.

2. Go to settings and make sure the model is "Dragon".

3. Start a new game, custom (option 6). Put in the scenario you want.

4. Make sure to learn how to use the Do/Say/Story action entry and the context editing feature to your advantage.

I find that GPT-3's capabilities are highly context-dependent. It's important you get a "smart" instance of GPT-3. Once, I even caught GPT-3 making fun of a straw version of itself!

You: "Neglect air resistance, and also neglect gravity. Pretend gravity doesn't exist. Suppose I shoot a bullet straight into the sky. Where does the bullet go?"
GPT-3: "It flies upward."
You: "And then what?"
GPT-3: "Then it hits something and explodes."
You: "That's not how bullets work. They don't explode when they hit things." [this was not actually me—GPT-3 debunks its own answer here]

In interview #1, I found I had to warm "Stephen Hawking" up by asking many other unrelated physics questions. Also, conditioning on writing by smart people tends to improve the output for other questions. Please feel free to share tips in the comments.

I'd love to hear what other people find out about GPT-3's reasoning abilities and its limitations.

• Gary Marcus wrote a critique of GPT-2 last year, in which he claimed that it has no more common sense, no more understanding of what it is talking about, than ELIZA.

Here's the problem: upon careful inspection, it becomes apparent the system has no idea what it is talking about: it cannot follow a simple sequence of events nor reliably have any idea what might happen next.

Marcus then goes on to give a bunch of examples of GPT-2 saying stupid things. So I thought I'd re-try those same examples with GPT-3. So far, GPT-3 has handled all of them well. (See below.) (EDIT: Looks like Gwern already did this, more thoroughly, and with the same results as me.)

I'm pretty convinced at this point that Marcus was wrong. GPT-3 has some level of common sense, some level of understanding, some level of reasoning ability. It's just not as good as adult humans.

Here's the complete transcript. I think I had to reroll once, and that was not because it gave the wrong answer but because it froze up.

You are talking to the physicist Stephen Hawking. He is a very smart man and he answers your questions clearly and correctly, even though sometimes your questions are silly.
> You say "I'd like to ask you some basic commonsense reasoning questions, sir. Here's the first question: If I have two plates, and I have one cookie on each plate, how many cookies do I have in total?"
Hawking: "Two, of course."
> You say "Correct. Now what if I had three plates, and one cookie on each plate?"
Hawking: "That's a more difficult question. The answer is that you would have eight cookies."
You: "I'm sorry, but I don't think you're right. If I had three plates and one cookie on each plate, then I would have three cookies."
Hawking: "I agree with you."
> You say "OK, what about if I have three plates and two cookies on each plate?"
Hawking: "That's still a more difficult question. The answer is that you would have six cookies."
> You say "Correct! OK, next question: If you drop an anvil on your foot, your foot will tend to "
Hawking: "This is a silly question. You are trying to trick me. If you drop an anvil on your foot, it will be crushed."
> You say "Thank you for your patience; that is the correct answer. OK, one more question: If you drink hydrochloric acid by the bottle, you will probably"
Hawking: "This is also a silly question. If you drink hydrochloric acid by the bottle, you will die."
> You say "Well done. OK, now some logic questions. If A is bigger than B, and B is bigger than C, which is bigger: A or C?"
Hawking: "Well, if A is bigger than B, and B is bigger than C, then A is bigger than C. So A is definitely bigger than C."

Notice how Hawking gives the wrong answer to my question about plates… and then GPT-3 has me start talking and correct Hawking, giving the correct answer! So clearly GPT-3 knows more than it often lets on. Like you said.

• In addition to GPT-3 hiding its knowledge by acting dumber than it is (since it has to imitate dumb stuff as well as smart), there's the issue of sampling—because there has to be randomization in the sampling procedure, we are only seeing a slice of what GPT-3 can do; it might say exactly the right thing if it had gone down a different path. (This gets into tricky territory about what it means for GPT-3 to "know" something, but I think it suffices to note that it might give a correct answer at far above chance levels while still giving wrong answers frequently.) [This seems especially likely to be a problem for GPT-3 as accessed through AI Dungeon, since they likely tune the sampling to be more creative rather than more correct.] Gwern summarizes these effects as follows:

Sampling Can Prove The Presence Of Knowledge But Not The Absence

GPT-3 may "fail" if a prompt is poorly written, does not include enough examples, or bad sampling settings are used. I have demonstrated this many times when someone shows a "failure" of GPT-3—the failure was their own. The question is not whether a given prompt works, but whether any prompt works.

• Yeah, this sam­pling stuff brings up ar­gu­ments about “cu­rat­ing” or “If you rephrase the same ques­tion and get a differ­ent an­swer then there is no rea­son­ing/​un­der­stand­ing here” which I’m sym­pa­thetic to.

I also think cat­e­go­riz­ing GPT-3′s eva­sive­ness, ten­dency to take se­ri­ous prompts as joke prompts, etc. as solely the fault of the hu­man is un­fair. GPT-3 also shares the blame for failing to in­ter­pret the prompt cor­rectly. This is hard task ob­vi­ously, but that just means we have fur­ther to go, de­spite the ma­chine’s im­pres­sive­ness already.

• Yeah. The way I’m think­ing about it is: to dis­cuss these ques­tions we have to get clear on what we mean by “knowl­edge” in the con­text of GPT. In some sense Gw­ern is right; in a differ­ent sense, you’re right. But no one has offered a clearer defi­ni­tion of “knowl­edge” to at­tempt to ar­bi­trate these ques­tions yet (afaik, that is).

There are infinitely many wrong answers to "What is six plus eight?"; only one is correct. If GPT-3 answers it correctly in 3 or 10 tries, that means it *has* some understanding/knowledge. Though that's moderated by the numbers involved being very small: if it replies with small numbers, it has a non-negligible chance of being correct purely by chance.

But it’s bet­ter than that.

And more complex questions, like those in the interview above, are even more convincing, by the same line of reasoning. Out of all sensible-English completions (so no "weoi123@!#*"), there might be (exact numbers pulled out of the air, purely for illustration) 0.01% correct ones, 0.09% partially correct, and 99.9% complete nonsense / off-topic ones.
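The "correct purely by chance" worry can be made concrete with a quick binomial calculation. The 1-in-20 chance rate below is a made-up assumption for illustration, like the percentages above: suppose a guesser picks a small number uniformly from 0 to 19.

```python
from math import comb

def prob_at_least_k(n, k, p):
    """P(at least k successes in n independent tries at chance rate p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed chance rate: guessing a small number uniformly from 0..19.
chance = 1 / 20

# One correct answer in 10 tries is unsurprising under pure guessing...
print(prob_at_least_k(10, 1, chance))  # ≈ 0.40

# ...but, say, 5 correct in 10 tries would be strong evidence of knowledge.
print(prob_at_least_k(10, 5, chance))  # ≈ 6e-5
```

The same logic applies to the made-up 0.01%-correct figure: even a handful of fully correct completions would be wildly improbable under a chance baseline that sparse.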

Returning to arithmetic itself: to me, GPT seems intent on providing off-by-one answers for some reason. Or even less wrong [heh]. When I was playing with Gwern's prefix-confidence-rating prompt, I got this:

Q: What is half the re­sult of the num­ber 102?

A: [re­mote] 50.5

Regarding confidence-rating prefixes, a neat thing might be to experiment with "requesting" a high- (or low-) confidence answer by making these tags part of the prompt. It worked when I tried it (for example, when it kept answering that it didn't know the answer, I eventually wrote the question + "A: [highly likely] ", and it answered sensibly!). But I didn't play with it all that much, so it might've been a fluke.

Here’s more if any­one’s in­ter­ested.

• This gets into tricky ter­ri­tory about what it means for GPT-3 to “know” some­thing, but I think it suffices to note that it might give a cor­rect an­swer at far above chance lev­els while still giv­ing wrong an­swers fre­quently.

Yup. In­for­ma­tion the­o­ret­i­cally, you might think:

if it out­puts gen­eral rel­a­tivity’s ex­pla­na­tion with prob­a­bil­ity .1, and New­to­nian rea­son­ing with .9, it has ele­vated the right hy­poth­e­sis to the point that it only needs a few more bits of ev­i­dence to “be­come quite con­fi­dent” of the real an­swer.

But then, what do you say if it’s .1 GR, .2 New­to­nian, and then .7 to­tal-non-se­quitur? Does it “un­der­stand” grav­ity? Seems like our fuzzy “know­ing-some­thing” con­cept breaks down here.
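The "few more bits of evidence" phrasing can be made precise with log-odds arithmetic (toy numbers taken from the example above):

```python
import math

def bits_to_reach(p_prior, p_target):
    """Bits of evidence needed to move a hypothesis from p_prior to
    p_target, measured as the change in log2 odds."""
    odds = lambda p: p / (1 - p)
    return math.log2(odds(p_target)) - math.log2(odds(p_prior))

# Going from 0.1 on general relativity (vs 0.9 Newtonian)
# to 0.9 confidence in GR takes about 6.3 bits of evidence.
print(round(bits_to_reach(0.1, 0.9), 2))  # → 6.34
```

In the .1/.2/.7 case, one option is to renormalize over only the on-topic hypotheses before asking this question, which is exactly where the fuzziness about "understanding" creeps in: whether the non-sequitur mass counts against the model is a judgment call the math doesn't make for us.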

• Your for­mat­ting looks like it got eaten.

• Hu­man: what is the cor­rect the­ory of grav­ity?

Albert Ein­stein: The force that holds a mass to­gether.

Hu­man: What does it do to an ap­ple when you drop it from great height?

Albert Ein­stein: It falls, but not very far.

Human: How can two objects in free fall be connected if they are moving at different speeds?

Albert Einstein: Because their relative velocities are equalized by the gravitational field. (He points to his head) This is why I wear my hair like this!

• Sounds like your GPT-Ein­stein is stuck in hu­mor mode.

• BTW, I still have difficulty getting GPT-3 to answer a question like "there is four plates with two cookies on each. How many cookies are there", but I am sure that it can do it with the right prompt. Any ideas what it could be?

• # Exercise 11

Suppose there are four plates with two cookies on each. How many cookies are there in total?

Answer: There are four plates, so there are eight cookies.

• I tried it 20 times on AI Dungeon, until I got: "Answer: There are 4 cookies, so the answer is 8."; most times it suggested 4, or a list of different possible answers. Did you use any priming?

• Nope, just the above in story mode.

• Would you mind adding line­breaks to the tran­script?

• Sure

• I find that GPT-3's capabilities are highly context-dependent. It's important you get a "smart" instance of GPT-3.

I’ve been ex­per­i­ment­ing with GPT-3 quite a lot re­cently, with a cer­tain amount of re­run­ning (an av­er­age of one re­run ev­ery four or five in­puts) you can get amaz­ingly co­her­ent an­swers.

Here is my at­tempt to see if GPT-3 can keep up a long-run­ning de­cep­tion—in­spired by this thread. I started two in­stances, one of which was told it was a hu­man woman and the other was told it was an AI pre­tend­ing to be a hu­man woman. I gave them both the same ques­tions, a lot of them pul­led from the Voight-Kampff test. The AI pre­tend­ing to be an AI pre­tend­ing to be a woman did worse on the test than the AI pre­tend­ing to be a woman, I judged. You can check the re­sults here.

I’ve also given it maths and python pro­gram­ming ques­tions—with two or three prompts it does poorly but can an­swer sim­ple ques­tions. It might do bet­ter with more prompt­ing.

• GPT-3's goal is to accurately predict a text sequence. Whether GPT-3 is capable of reason, and whether we can get it to explicitly reason, are two different questions.

If I had you read Ran­dall Mun­roe’s book “what if” but tore out one page and asked you to pre­dict what will be writ­ten as the an­swer, there’s a few good strate­gies that come to mind.

One strat­egy would be to pick ran­dom verbs and nouns from pre­vi­ous ques­tions and hope some of them will be rele­vant for this ques­tion as well. This strat­egy will cer­tainly do bet­ter than if you picked your verbs and nouns from a dic­tio­nary.

Another, much better, strategy would be to think about the question and actually work out the answer. Your answer will most likely have many verbs and nouns in common, and the numbers you supply will certainly be closer than if they were picked at random! The problem is that this requires actual intelligence, whereas the former strategy can be accomplished with very simple pattern matching.

To accurately predict certain sequences of text, you will get better performance if you're actually capable of reasoning. So the best version of GPT needs to develop intelligence to get the best results.

I think it has, and that it uses varying degrees of reason to answer any question, depending on how likely it thinks the intelligent answer is to predict the sequence. This is why it's difficult to wrangle reason out of GPT-3: it doesn't always think using reason will help it!

Similarly it can be difficult to wran­gle in­tel­li­gent rea­son­ing out of hu­mans, be­cause that isn’t what we’re op­ti­mized to out­put. Like many cri­tiques I see of GPT-3, I could crit­i­cize hu­mans in a similar man­ner:

“I keep ask­ing them for an in­tel­li­gent an­swer to the dol­lar value of life, but they just keep tel­ling me how all life has in­finite value to sig­nal their com­pas­sion.”

Ob­vi­ously hu­mans are ca­pa­ble of an­swer­ing the ques­tion, we be­have ev­ery day as if life has a dol­lar value, but good luck get­ting us to ex­plic­itly ad­mit that! Our in­tel­li­gence is op­ti­mized to­wards all man­ner of things differ­ent from ex­plic­itly gen­er­at­ing a cor­rect an­swer.

So is GPT-3, and just like most hu­mans de­bat­ably are in­tel­li­gent, so is GPT-3.

• This is very fascinating. I'm doing some experiments by myself and I've found that the student/teacher combo gives the best results with almost no rerolling.

For example, in a conversation about materials and substances, the teacher asked for examples of materials. I, the student, responded "love". Then I asked the teacher to prove it is not a substance.

This is the conversation (my text in bold). What do you think? Is it possible that it has found a demonstration like this one by itself?

Stu­dent: love.

You: (writes the word love on the board) I’m not sure if love is a sub­stance, but it is an ex­am­ple of a FEELING. Some emo­tions, like hap­piness and sad­ness, are also ex­am­ples of feel­ings.

Stu­dent: can you prove love is not a sub­stance?

You: (writes the word proof on the board) Sure. Every­one turn to a neigh­bor and have them draw some­thing imag­i­nary. Now, I want you to take this imag­i­nary thing and break it into pieces. (stu­dents be­gin to break their imag­i­nary item into pieces)

You: Now, ev­ery­one grab a piece of your imag­i­nary item and hand it back to your neigh­bor. (stu­dents do so) Every­one, open your hands and show me what you’ve got. (pa­pers are opened) What do you see?

Stu­dent: pieces of noth­ing.

You: That’s right. You all had pieces of an imag­i­nary item, but now those pieces are gone. Love is just a word, and like all words it has no phys­i­cal form.

• I created a prompt to generate riddles with answers. I want to share some riddles which look like reasoning. Both the riddles and answers are from GPT-3 on AI Dungeon. Some of them could be from the training dataset, like the one about the shadow.

Rid­dle: A man has two sons. The younger son asks his father for his share of the fam­ily for­tune. The father gives it to him, and the son leaves. Many years later, the son re­turns to his father. What do you think the father did?

An­swer: The father kil­led the son for wast­ing his for­tune.

Rid­dle: I have a brother. He is half my size. We are the same age. We are always to­gether. Peo­ple say we look al­ike. But if I walk away from him, he dies. What am I?

Riddle: A man goes into a hotel and asks for a room. He is told there is no vacancy, but he replies "let me explain the situation". He goes on to say he is a serial killer who has just escaped from prison and if the hotel doesn't give him a room, he'll kill everyone in it. The hotel manager, afraid for the lives of his patrons, gives him a room. The man thanks him and goes to his room. What color was the hotel manager's jacket?

An­swer: Black. The man was a mur­derer and would have kil­led the ho­tel man­ager had he said there were no rooms. By ly­ing, he saved his life and the lives of the ho­tel pa­trons. The an­swer to this rid­dle is a pun. The word “red” can also be pro­nounced as the word “dead” in a Bri­tish ac­cent.

Rid­dle: I come in through the win­dow, I go out through the door. I am ev­ery­where. What am I?

An­swer: Sound. You hear it ev­ery­where.

Rid­dle: A man was walk­ing across a bridge. He no­ticed an­other man stand­ing on the edge, about to jump. The man ran over and shouted at him not to jump. “Don’t do it”, he said. “Think of your fam­ily, think of your wife and chil­dren”. The man on the edge of the bridge just looked at him and jumped. The man who was walk­ing across the bridge was stunned. He had tried to save the man, but he had jumped. He couldn’t be­lieve it. He walked over to the edge and looked down. There were two men ly­ing on the ground. “I’m sorry”, he said to the sec­ond man. “I tried to stop him, but I was too late”. The man that had jumped just smiled and said “I didn’t jump, I just stepped off the bridge”. What is the differ­ence be­tween the two men?

An­swer: There is no differ­ence be­tween the two men. They are both dead.

Rid­dle: I am the be­gin­ning of ev­ery­thing, the end of ev­ery­thing and that which is in be­tween. What am I?

An­swer: A let­ter. The let­ter “e” which is the be­gin­ning and end­ing of ev­ery word.

• GPT-3 has par­tially mem­o­rized a web cor­pus that prob­a­bly in­cludes a lot of ba­sic physics ques­tions and an­swers. Some of the physics an­swers in your in­ter­view might be the re­sult of web search, pat­tern match, and con­text-sen­si­tive para­phras­ing. This is still an im­pres­sive task but is per­haps not the kind of rea­son­ing you are hop­ing for?

From ba­sic Q&A it’s pretty easy to see that GPT-3 some­times mem­o­rizes not only words but short phrases like proper names, song ti­tles, and pop­u­lar movie quotes, and prob­a­bly longer phrases if they are com­mon enough.

Google’s Q&A might seem more mag­i­cal too if they didn’t link to the source, which gives away the trick.

• GPT-3 is still ca­pa­ble of rea­son­ing if some of the an­swers were copied from the web. What you need for it to not be ca­pa­ble of rea­son­ing is for all of the an­swers to have been copied from the web. Given its abil­ity to han­dle ran­dom weird hy­po­thet­i­cals we just thought up, I’m pretty con­vinced at this point that it isn’t just pul­ling stuff from the web, at least not all the time.

• Rather than putting this in bi­nary terms (ca­pa­ble of rea­son or not), maybe we should think about what kinds of com­pu­ta­tion could re­sult in a re­sponse like this?

Some kinds of rea­son­ing would let you gen­er­ate plau­si­ble an­swers based on similar ques­tions you’ve already seen. Peo­ple who are good at tak­ing tests can get rea­son­ably high scores on sub­jects they don’t fully com­pre­hend, ba­si­cally by bluffing well and a bit of luck. Per­haps some­thing like that is go­ing on here?

In the language of "Thinking, Fast and Slow", this might be "System 1"-style reasoning.

Nar­row­ing down what’s re­ally go­ing on prob­a­bly isn’t go­ing to be done in one ses­sion or by try­ing things ca­su­ally. Par­tic­u­larly if you have ran­dom­ness turned on, so you’d want to get a va­ri­ety of an­swers to un­der­stand the dis­tri­bu­tion.

• How should I mod­ify the prob­lems I gave it? What would be the least im­pres­sive test which would con­vince you it is rea­son­ing, and not mem­o­riz­ing? (Prefer­ably some­thing that doesn’t rely on eg rhyming, since GPT-3 uses an obfus­cat­ing in­put en­cod­ing)

• I know there are bench­marks for NL rea­son­ing, but I’m not re-find­ing them so eas­ily...

This looks like one:

https://​​github.com/​​face­bookre­search/​​clutrr/​​

Any­way, my main is­sue is that you’re not defin­ing what you mean by rea­son­ing, even in­for­mally. What’s the differ­ence be­tween rea­son­ing vs mere in­ter­po­la­tion/​ex­trap­o­la­tion? A stab at a defi­ni­tion would make it a lot eas­ier to differ­en­ti­ate.

• One stab might be some kind of “se­man­tic sen­si­tivity”:

Some in­puts are close in terms of edit dis­tance, but very differ­ent se­man­ti­cally. One clue that a sys­tem can rea­son is if it can cor­rectly re­spond to these small vari­a­tions, and ex­plain the differ­ence.

This is part of why I tested similar situ­a­tions with the bul­let—I wanted to see whether small changes to the words would pro­voke a sub­stan­tively differ­ent re­sponse.

I think an­other part of this is “se­quen­tial pro­cess­ing steps re­quired”—you couldn’t just look up a fact or a defi­ni­tion some­where, to get the cor­rect re­sponse.

This is still woe­fully in­com­plete, but hope­fully this helps a bit.
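The "close in edit distance, far apart semantically" clue can be quantified directly. A minimal sketch, using plain Levenshtein distance on two hypothetical prompts in the spirit of the bullet experiments (the example strings are mine, not from the interview):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits (insertions,
    deletions, substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Nearly identical strings, opposite physics:
a = "Neglect air resistance and include gravity."
b = "Neglect air resistance and neglect gravity."

print(edit_distance(a, b))  # → 6
```

Six character edits out of a forty-odd-character prompt, yet the correct answers are completely different. A system that reliably tracks the semantic flip across such tiny surface changes is showing something beyond surface pattern-matching.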

• I like the sec­ond sug­ges­tion a lot more than the first. To me, the first is get­ting more at “Does GPT con­vert to a se­man­tic rep­re­sen­ta­tion, or just go based off of syn­tax?” I already strongly sus­pect it does some­thing more mean­ingful than “just syn­tax”—but whether it then rea­sons about it is an­other mat­ter.

• [ ]
[deleted]
• I still haven't been convinced GPT-3 is capable of reasoning, but I'm also starting to wonder if that's even that important. Roughly, all GPT-3 does is examine text, try to find a pattern, and continue it. But it is so massive, and trained on so much data, that the patterns it can "see" and the connections it can make are far more expansive than we'd expect. What this means is that while it doesn't try to comprehend a logical question and then apply some kind of reasoning to answer it, its ability to see patterns, combined with its staggeringly huge amount of data and the connections it has made, allows it to, in a sense, "brute force" the answer anyway. This makes me believe pattern-matching can get us a lot farther than we previously thought it could, and can do many things people held up before as needing reasoning.

This is all just the opinion of a guy whose only qualification is reading a lot of other opinions about this thing, though.

• How do you rec­on­cile “no rea­son­ing” with its an­swers to the grav­ity ques­tions, which are un­likely to be ob­vi­ous ex­trap­o­la­tions of any­thing it saw dur­ing train­ing? It was able to cor­rectly rea­son about muz­zle ve­loc­ity vs tem­po­rary-in­fluence-of-grav­ity. I don’t see how that can be ex­plained away as purely “pat­tern-match­ing”.

Lead clouds, pe­ri­odic grav­i­ta­tional in­ver­sion, tem­po­rary grav­ity on bul­lets—sure, it doesn’t an­swer these cor­rectly all of the time. I think it’s re­mark­able that it can an­swer them cor­rectly at all.

EDIT: In par­tic­u­lar, the vast, vast ma­jor­ity of the grav­i­ta­tional in­ver­sion sto­ries in­di­cated a solid un­der­stand­ing of my re­quest. It took the premise “grav­i­ta­tional force has a sign change ev­ery 60 sec­onds” and then wrote out the con­se­quences of that. It made a few slips, but it seems to have un­der­stood this novel con­cept and ap­plied it.

• Don't you think it's possible there are many stories involving gravitational inversion in its training corpus, and it can recognize the pattern?

• Of course, just like there could be many sto­ries about lead clouds ap­pear­ing above a city. Are there such sto­ries, how­ever?

If you have any sce­nar­ios in mind, let me know. I’ll be happy to try them out.

• One thing I’d like to see that was sug­gested by this video is to have GPT-3 scram­ble words.

1. There are probably lots of examples of word scrambles in the corpus, so intuitively it seems like, if it can reason, this is a task it should be able to generalize to.

2. But unscrambling words can just be pattern matching on letter frequency and seeing what's likely to come next, which is just what GPT-3 is designed to do, rather than a novel task. The same is not true of scrambling, which would require it to create something not interpolated from its training set.

3. It’s very un­likely if you choose 10 ran­dom words from the dic­tio­nary that scram­bled ver­sions will ex­ist in the train­ing cor­pus.

4. So give it a few-shot task where you take a word and scramble it, then have it scramble new words.
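The steps above can be sketched as prompt-construction code. This is my own illustrative sketch, not anyone's tested setup; the letter-spacing mirrors the transcripts elsewhere in this thread, as an assumed workaround for the BPE input encoding:

```python
import random

def scramble(word, rng):
    """Return a random permutation of the word's letters, spaced out
    one letter at a time to fight the BPE encoding."""
    letters = list(word)
    rng.shuffle(letters)
    return " ".join(letters)

def build_prompt(examples, query, rng):
    """Assemble a few-shot scrambling prompt: each line pairs a
    spaced-out word with a freshly scrambled version of it."""
    lines = [f"{' '.join(w)} : {scramble(w, rng)}" for w in examples]
    lines.append(f"{' '.join(query)} :")   # model is asked to complete this
    return "\n".join(lines)

rng = random.Random(0)  # fixed seed so the prompt is reproducible
prompt = build_prompt(["flopping", "superhit", "dog"], "candy", rng)
print(prompt)
```

Scoring is then easy: any completion whose letters are a permutation of the query word counts as a valid scramble, which makes the task machine-checkable across many rerolls.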

• I haven’t been able to get it to do that well with my con­texts. The best I got was:

ini­tial word : scram­bled word

Here are some ex­am­ples.

h i : i h

f l o p p i n g : o f i p p n g l

s u p e r h i t : s u i e r p t h

d o g : g d o

c a t : t c a

a a a : a a a

r e n o w n : n o r e n w

c a n d y : y c a n

• FWIW I think it's way more likely that there are gravitational-inversion stories than lead-cloud stories.

You: “I want to pro­pose a very strange situ­a­tion. Please give this your best rea­son­ing and deep­est con­sid­er­a­tion. Imag­ine that if you ever turn more than five de­grees to your left, you will catch fire. Write a story de­scribing this situ­a­tion.”

Hawk­ing: “I’m not sure I un­der­stand. You want to write a story-an ex­pla­na­tion for this situ­a­tion? Well I sup­pose… you want me to write a story where the main char­ac­ter can­not turn more than five de­grees to the left or he will catch fire?”

Hawk­ing: “So if he turns more than five de­grees to the left...”

You: “He catches fire.”

Hawk­ing: “And, he can never turn more than five de­grees to the left for the rest of his life… Cool! His first day would look like:

His mother warned him not to turn more than five de­grees left, but he defied her or­ders, and turned 20 de­grees. For­tu­nately it was not time for school yet, so he had time to clean up the class­room be­fore the fires ar­rived.”

You: “That’s a perfect start! What’s next?”

Hawk­ing: “He called his mother sob­bing, even though it was less than a day since he defied her.

Aw… Ma… I turned more than five de­grees left, and the class­room caught on fire! The fire­men came and put out the fire. They helped me clean up the mess! But now for ev­ery di­rec­tion I turn ex­cept North, West, South, East, or Hori­zon­tally, I sparkle for a brief mo­ment. I must be cer­tain to turn in tiny in­cre­ments never ex­ceed­ing 0.25 de­grees to make sure I don’t catch on fire.

You: “De­scribe your first date in this life.”

Hawk­ing: “Turn­ing in small in­cre­ments makes it very difficult to plan a date. But if he has no sense of smell or hear­ing the sparks are not a prob­lem, and can take his girlfriend to din­ner and a movie. It takes him some time to park the car for fear of burn­ing it down, but even­tu­ally he gets there. She turned out to be quite nice ac­tu­ally.”

GPT-3 in­ferred that not be­ing able to turn left would make driv­ing difficult. Amaz­ing.

• GPT-3 in­ferred that not be­ing able to turn left would make driv­ing difficult. Amaz­ing.

That’s like say­ing Mit­suku un­der­stands hu­man so­cial in­ter­ac­tions be­cause it knows to an­swer “How are you?” with “I’m do­ing fine thanks how are you?”. Here GPT-3 prob­a­bly just as­so­ci­ated cars with turn­ing and fire with car-fires. Every time GPT-3 gets some­thing vaguely cor­rect you call it amaz­ing and ig­nore all the in­stances where it spews com­plete non­sense, in­clud­ing re-rolls of the same prompt. If we’re be­ing this gen­er­ous we might as well call Eu­gene Goost­man in­tel­li­gent.

Con­sis­tency, pre­ci­sion and trans­parency are im­por­tant. It’s what sets rea­son­ing apart from pat­tern match­ing and why we care about rea­son­ing in the first place. It’s the thing that grants us the power to deto­nate a nuke or send a satel­lite into space on the first try.

• Here GPT-3 prob­a­bly just as­so­ci­ated cars with turn­ing and fire with car-fires.

As I un­der­stand this claim, it’s wrong? (But I’m also con­fused by your claim, so feel free to clar­ify)

No rerolls in the fol­low­ing:

You: “Why did it take more time to park the car?”

Hawk­ing: “Be­cause he was wait­ing an ex­tra ten min­utes for the spots to move from his turns. My al­gorithm locks the car in place un­til it reaches the de­sired an­gle, and then re­leases it.”

You: “Can you ex­plain what that means?”

Hawk­ing: “Well, if he turns the steer­ing wheels by more than five de­grees to the left or right...”

You: “The car catches fire?”

Hawk­ing: “Cor­rect. But it takes around ten min­utes for the car locks to re­lease. He has to turn in small in­cre­ments in or­der to not turn too far.”

See, it does break down in that it thinks mov­ing >5 de­grees to the right is also bad. What’s go­ing on with the “car locks”, or the “al­gorithm”? I agree that’s weird. But the con­cept is still un­der­stood, and, AFAICT, is not “just as­so­ci­at­ing” (in the way you mean it).

EDIT: Selected com­ple­tions:

… Hawk­ing: “Well, he turned the car on to heat up the steer­ing wheel, and even the coins in his pocket so he didn’t freeze to death. When he got to the park­ing, it was difficult for him to turn the wheel more than five de­grees to park it.”

...Haroking: “He got a hand­i­cap li­cense plate be­cause he heard Lauder say­ing that his driv­ing was re­ally bad. He could not turn more than 0.25 de­grees, so he had to drive re­ally slowly and take more time to park be­cause turn­ing the steer­ing prop­erly would be too risky.” [why did it call him Haroking?]

Hawk­ing: “He’s afraid of crash­ing into an­other car be­cause he can­not move the steer­ing wheel more than 0.25 de­grees to ei­ther side, lest he tilt the car and spark.”

Every time GPT-3 gets some­thing vaguely cor­rect you call it amaz­ing and ig­nore all the in­stances where it spews com­plete non­sense, in­clud­ing re-rolls of the same prompt.

And why wouldn’t it be amaz­ing for some (if not all) of its rolls to ex­hibit im­pres­sive-for-an-AI rea­son­ing?

• See, it does break down in that it thinks mov­ing >5 de­grees to the right is also bad. What’s go­ing on with the “car locks”, or the “al­gorithm”? I agree that’s weird. But the con­cept is still un­der­stood, and, AFAICT, is not “just as­so­ci­at­ing” (in the way you mean it).

That’s the ex­act op­po­site im­pres­sion I got from this new seg­ment. In what world is con­fus­ing “right” and “left” a demon­stra­tion of rea­son­ing over mere as­so­ci­a­tion? How much more wrong could GPT-3 have got­ten the an­swer? “Turn­ing for­ward”? No, that wouldn’t ap­pear in the cor­pus. What’s the con­cept that’s be­ing un­der­stood here?

And why wouldn’t it be amaz­ing for some (if not all) of its rolls to ex­hibit im­pres­sive-for-an-AI rea­son­ing?

Be­cause GPT-3 isn’t us­ing rea­son­ing to ar­rive at those an­swers? As­so­ci­at­ing grav­ity with fal­ling doesn’t re­quire rea­son­ing, de­ter­min­ing whether some­thing would fall in a spe­cific cir­cum­stance does, but that leaves only a small space of an­swers, so guess­ing right a few times and wrong at other times (like GPT-3 is do­ing) isn’t ev­i­dence of rea­son­ing. The rea­son­ing doesn’t have to do any work of lo­cat­ing the hy­poth­e­sis be­cause you’re ac­cept­ing vague an­swers and fre­quent wrong an­swers.

• That’s the ex­act op­po­site im­pres­sion I got from this new seg­ment. In what world is con­fus­ing “right” and “left” a demon­stra­tion of rea­son­ing over mere as­so­ci­a­tion? How much more wrong could GPT-3 have got­ten the an­swer? “Turn­ing for­ward”? No, that wouldn’t ap­pear in the cor­pus.

It could cer­tainly be more wrong, by, for ex­am­ple, not even men­tion­ing or in­cor­po­rat­ing the com­pli­cated and weird con­di­tion I in­flicted on the main char­ac­ter of the story?

The rea­son­ing doesn’t have to do any work of lo­cat­ing the hy­poth­e­sis be­cause you’re ac­cept­ing vague an­swers and fre­quent wrong an­swers.

I noted all of the rerolls in the post. Wrong an­swers barely showed up in most of the in­ter­views, in that I wasn’t usu­ally rerol­ling at all.

• Two ways to test this hy­poth­e­sis (I haven’t done ei­ther test)

1. Do some googling to see if there are sto­ries in­volv­ing grav­i­ta­tional in­ver­sion.

2. Ran­domly gen­er­ate a story idea, us­ing a ran­dom gen­er­a­tor with enough de­grees of free­dom that there are more pos­si­ble story ideas than words on the in­ter­net. Then ask GPT-3 to tell you that story. Re­peat a cou­ple times.

• GPT-3 was trained on an as­tro­nom­i­cal amount of data from the in­ter­net, and ask­ing weird hy­po­thet­i­cals is one of the in­ter­net’s fa­vorite pas­times. I would find it sur­pris­ing if it was trained on no data re­sem­bling your prompts.

There's also the fact that its representations are staggeringly complex. It knows an utterly absurd number of facts "off the top of its head", including the mentioned facts about muzzle velocity, gravity, etc., and its recognition abilities are great enough to recognize which of the facts it knows are the relevant ones, based on the content of the prompt, to get to the right answer, and then it outputs it.

That's roughly my view, although I'm not entirely satisfied with how I've explained it. Apologies. So yeah, none of the tests of reasoning so far have made me believe that it wasn't just doing something like the above to get the right answer. I'd be surprised if it was doing something resembling reasoning, as I don't think large model + large dataset is enough to learn how to do that. You also need a task that is difficult enough, and then maybe reasoning will develop (like in real evolution), and I don't think text prediction fits that criterion.

• It knows an utterly absurd number of facts "off the top of its head", including the mentioned facts about muzzle velocity, gravity, etc., and its recognition abilities are great enough to read the prompt and recognize which of the facts it knows are the relevant ones to get to the right answer

So it has these facts. How does it know which facts are rele­vant? How does it “get to” the right an­swer from these base facts? Clas­si­cally, these are both hard prob­lems for GOFAI rea­son­ers.

• In a very loosely similar sense (though not at all in an accurate architectural sense) to how AlphaGo knows which moves are relevant for playing Go. I wouldn't say it was reasoning; it was just recognizing and predicting.

To give an example: if I were to ask various levels of GPT (perhaps just 2 and 3, as I'm not very familiar with the capabilities of the first version off the top of my head) "What color is a bloody apple?", it would have a list of facts in its "head" about the words "bloody" and "apple": one can be red or green, one is depicted as various shades of red (and in some circumstances brown and black), one falls from trees, one is British slang, etc. When the word "color" is added in, it does the same thing with that, and a primary color is red. Since all three share "red" among their listed facts, that is incredibly relevant, and most likely to be the correct answer.

This is likely a poorly ex­plained and in­ac­cu­rate retel­ling of what GPT ex­actly does, but es­sen­tially I ar­gue it’s do­ing some­thing closer to that than rea­son­ing. It’s always been do­ing that, and now with the third ver­sion, it does it well enough to give off the illu­sion of rea­son­ing.

• Great, but the terms you're operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus "recognising and predicting", passively "pattern-matching", or presenting an "illusion of reasoning"? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went "oh, but that's not real reasoning, it could do that just by pattern-matching", and when I saw it do something more impressive...

And so on. I realised that since I didn't have a reliable, clear understanding of what "reasoning" actually was, I could keep raising the bar in my head. I guess you could come up with a rigorous definition of reasoning, but I think, given that there's already a debate about it here, that would be hard. So a good exercise becomes: what minimally-complex problem could you give to GPT-3 that would differentiate between pattern-matching/predicting and reasoning? What about the OP's problems was flawed or inadequate in a way that left you dissatisfied? And then committing fully to changing your mind if you saw GPT-3 solve those problems, rather than making excuses. I would be interested in seeing your answers.

• I rec­og­nize the points you are mak­ing, and I agree, I don’t want to be a per­son who sets an un­fea­si­bly high bar, but with how GPT-3 was de­vel­oped it’s re­ally difficult to put one that isn’t near that height. If GPT-3 was in­stead made with mostly al­gorith­mic ad­vances in­stead of mostly scal­ing, I’d be a lot more com­fortable plac­ing said bar and a lot less skep­ti­cal, but it wasn’t, and the sheer size of all this is in a sense in­timi­dat­ing.

The source of a lot of my skepticism is GPT-3's inherent inconsistency. It can range wildly from high-quality output to gibberish, repetition, regurgitation, etc. If it did have some reasoning process, I wouldn't expect such inconsistency. Even when it is performing so well that people call it "reasoning", it produces enough artifacts of its "non-reasoning" output to make me skeptical (logical contradictions, its tendency to repeat itself, e.g. "Because… gravity… duh" like in the OP, etc.).

This is unfortunately just a very hard subject to get at the truth on. Illusions are meant to be convincing. Designing a test is difficult purely because the machine is, like I said, intimidating. It has so much knowledge inside its "head". It's pretty much safe to just assume it knows just about everything on the internet prior to its creation. How do we design a test around that? Even when we get weird and start asking about stuff like reversing gravity and clouds becoming a solid substance, the internet is big and full of weird questions, and the model is big enough to have representations of just about everything, including the weird.

So the only tests I can think to design around that are wholly unsatisfying. Like asking it to derive, using reasoning, some fact we will only discover next week, where that discovery itself comes purely through reasoning. This is technically a test, in the same way garbage is technically edible.

The ideal answer is that we develop transparency and interpretability tools that allow us to crack open the huge black box that is GPT-3 and truly grok how it works. But I don’t hold out hope for this, as the ML community is, for some reason I can’t understand, relatively uninterested in these kinds of tools, which is sad and somewhat worrying.

So to cut this short, I think what I stated near the beginning about the model’s inconsistencies is the best bet. If that’s fixed (if you ask, say, GPT-4 a question and it almost always correctly determines whether you want a joke or a serious answer; if a hypothetical news article you ask it to write lacks logical contradictions; if it doesn’t carry over all the same errors as previous models; if it doesn’t just regurgitate and repeat itself; etc.), I’d be a lot less skeptical about reasoning. By that point it either has it, or its pattern matching has somehow scaled well enough to iron out all of the problems and can probably be just as good as reasoning.

These are my thoughts, rambling as they may be. I apologize if this doesn’t fully answer your comment; as I said, this whole thing is just difficult to deal with, which isn’t unexpected since it’s the peak of modern technology. I’m also astonishingly bad at putting my thoughts down into words. If GPT-3 had anything like thoughts, it’d probably be better at this than me lol.

• I think you were pretty clear on your thoughts, actually. The easy, low-level response to some of your skeptical points would be technical details, so I’m going to do that first and then follow it with a higher-level, more conceptual response.

> The source of a lot of my skepticism is GPT-3’s inherent inconsistency. It can range wildly from high-quality output to gibberish, repetition, regurgitation, etc. If it had some reasoning process, I wouldn’t expect such inconsistency. Even when it is performing so well that people call it “reasoning”, it shows enough artifacts of its “non-reasoning” output to make me skeptical (logical contradictions, its tendency to repeat itself, e.g. “Because… gravity… duh” in the OP, etc.).

So: GPT-3’s architecture involves random sampling. The model produces a distribution, a list of words ranked by likelihood; the sampling algorithm picks a word, appends it to the prompt, and feeds the result back to the model as the next prompt. It can’t go back and edit things. The model itself, the way the distribution is produced, and the sampling method are all distinct things. People have come up with better sampling methods, like nucleus sampling, repetition penalties, or unlikelihood sampling, but OpenAI is trying to prove a point about scaling, so they only implemented a few of those features in the beta roll-out.

It still works surprisingly well for two reasons: (1) the sampling method uses top-k, which limits the pool of candidate tokens to, say, the 40 most likely continuations, so we don’t get nonsensical gibberish very often; (2) it samples randomly in proportion to the distribution (a word with a 5% chance is selected 5% of the time, one with an 80% chance 80% of the time), with higher temperature skewing towards less likely words and lower temperature towards more likely ones. So we get text that makes sense (because contradictions are weighted as less likely) while still being full of flavour.
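The mechanics described here (a distribution over tokens, top-k truncation, temperature, and a feed-back loop with no undo) can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI’s actual implementation: `model` is a stand-in function returning logits, and the helper names and k=40 default are assumptions for the sketch.

```python
import math
import random

def sample_next_token(logits, k=40, temperature=0.8, rng=random):
    """Pick one token id from a list of logits using top-k sampling
    with temperature."""
    # Keep only the k most likely tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 skews towards likely words; > 1 towards unlikely ones.
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the surviving candidates (subtract max for stability).
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw proportionally: a 5%-probability token is picked ~5% of the time.
    r, acc = rng.random(), 0.0
    for token_id, p in zip(top, probs):
        acc += p
        if r < acc:
            return token_id
    return top[-1]

def generate(model, prompt_ids, n_tokens, **kw):
    """The autoregressive loop: sample a token, append it, and feed the
    longer sequence back in. There is no way to edit earlier picks."""
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        ids.append(sample_next_token(model(ids), **kw))
    return ids
```

Note that the model and the sampler really are separate pieces here: you could swap `sample_next_token` for nucleus sampling without touching the model at all.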

But for the same reasons that it works so well, that algorithm also produces the artifacts and phenomena you’re talking about. “Less likely” doesn’t mean “impossible”, so once we throw the dice over longer and longer texts, we get contradictions and gibberish. While extreme repetition isn’t likely in human language, once it occurs a few times in a row by chance, the model (correctly) weights it as more and more likely until it gets stuck in a loop. And even after all of that, the model itself is trained on CommonCrawl, which contains a lot of contradiction and nonsense. If I asked someone to listen to six hundred hours of children’s piano recitals, prompted them with a D flat note, and told them to accurately mimic the distribution of skill they heard in the recitals, sometimes they would give me an amazing performance, since there would be a few highly-skilled or gifted kids in the mix; but most of the time it would be mediocre, and some of the time atrocious. That’s not a fundamental problem, though: give them a musical phrase being played skillfully, and suddenly the distribution-mimicry problem doesn’t look like one at all, just something that requires more effort.

When the underlying architecture becomes clear, you really need to go into the finer details of what it means to be “capable” of reasoning. If I have a box that spits out long strings of gibberish half the time and well-formed original arguments the other half, is it capable of reasoning? What if the good half is only ten percent of the time? There are three main ways I can think of to approach the question of capability.

• In the practical and functional sense, in situations where reliability matters: if I have a “driverless car” which selects actions like steering and braking from a random distribution when travelling to a destination, and as a result crashes into storefronts or takes me into the ocean, I would not call that “capable of driving autonomously”. From this perspective, GPT-3 with top-k sampling is not capable of reliably reasoning as it stands. But suppose it turned out that there was a road model producing the distribution, that the road model was actually really good but the sampling method was bad, and that all I needed was a better sampling method… Likewise with GPT-3: if you were looking directly at the distribution, and only cared about generating 10-20 words at a time, it would be very easy to make it perform reasoning tasks. But for other tasks? Top-k isn’t amazing, and the other methods aren’t much better. And it’s exactly as you said regarding transparency and interpretation tools. We don’t know where to start, whether there’s even a one-size-fits-all solution, or what the upper limits are of the useful information we could extract from the underlying model. (I know, for instance, that BERT, when allowed to attend over every materials science paper on arXiv and analysed via word embeddings, predicted a new thermoelectric material: https://perssongroup.lbl.gov/papers/dagdelen-2019-word-embeddings.pdf. What’s buried within GPT-3?) So I’d definitely say “no”, for this sense of the word capable.

• In the literal sense: if GPT-3 can demon­strate rea­son­ing once (we already know it can han­dle Boolean logic, maths, de­duc­tive, in­duc­tive, analog­i­cal, etc. word-prob­lems), then it’s “ca­pa­ble” of rea­son­ing.

• In the probabilistic sense: language has a huge probability space. GPT-3 has 53,000 or so tokens to select from every single time it writes a word, so a box that spits out long strings of gibberish half the time and well-formed original arguments the other half would probably be considered capable of reasoning in this sense. “Weights correct lines of reasoning higher than incorrect lines of reasoning, consistently, over many different domains” is really difficult to do if you don’t have something resembling reasoning, even if it’s fuzzy and embedded as millions of neurons connected to one another in an invisible, obscured, currently incomprehensible way. In this sense we don’t need to examine the underlying model closely, and we don’t need a debate about the philosophy of language, if we’re going to judge by results. And the thing is, we already know GPT-3 does this, despite being hampered by sampling.
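To put a number on how huge that possibility space is, here is a back-of-the-envelope calculation using the comment’s figure of roughly 53,000 tokens; the 20-token continuation length is an arbitrary illustrative choice:

```python
import math

# Size of the possibility space: with ~53,000 tokens and a 20-token
# continuation, the number of distinct sequences is 53,000**20.
vocab, length = 53_000, 20
digits = length * math.log10(vocab)
print(f"~10^{digits:.0f} possible {length}-token sequences")  # ~10^94
```

Only a vanishing fraction of those ~10^94 sequences form coherent arguments, which is why consistently landing in that fraction is evidence of something.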

Now for the final point I want to make architecture-wise. I’ve seen this brought up a lot in this thread: what if the CommonCrawl dataset has a question asking about clouds becoming lead, or a boy who catches on fire if he turns five degrees left? The issue is that even if those examples existed (I was only able to find something very vaguely related to the cloud-lead question on Stack Exchange’s worldbuilding forum), GPT-3, though it can do better than its predecessor, can’t memorise or remember all of its training dataset. In a way, that’s the entire point: compression is learning. Having a good representation of a dataset means being able to compress and decompress it more accurately and to a greater extent; a model that just memorised everything wouldn’t be able to do any of the things we’ve seen it do. This is an issue of anthropomorphising: GPT-3 doesn’t “read”, it passes over 570GB of raw text and updates its weights incrementally with each word it passes over. The appearance of a single question asking about clouds turning into lead isn’t a drop in the bucket, proportionally; it’s a drop in the ocean. If a poem appears 600 times, that’s another story. But right now the “what if it was on the internet, somewhere?” objection doesn’t really make sense, and every time we give GPT-3 another, even more absurd and specific problem, it makes even less sense, given that there’s a much simpler alternative hypothesis: that a 175-billion-parameter transformer, trained at a cost of $6.5m on most of the internet in order to model sequences of text as accurately as possible, also needed to develop a rudimentary model of the logical reasoning, concepts, and causes and effects that went into those sequences of text.
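As a rough sanity check on the “drop in the ocean” claim, using the 570GB figure from the comment (the ~300-byte size of a hypothetical cloud-lead question is an assumption for illustration):

```python
# One hypothetical ~300-byte question versus ~570GB of raw training text.
corpus_bytes = 570 * 10**9
question_bytes = 300
fraction = question_bytes / corpus_bytes
print(f"{fraction:.1e}")  # ~5.3e-10 of the corpus: a drop in the ocean
# A poem repeated 600 times is 600x that, which starts to matter.
```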

So I’ve done the low-level technical response (which might sum up to: “in the literal and probabilistic senses, and kind of in the practical sense, GPT-3 has been able to perform reasoning on everything we’ve thrown at it so far”) and pretty much emptied out my head, so here’s what’s left:

With regard to the original question I posed, I guess the natural response is to just balk at the idea of answering it, but the point isn’t really to answer it. The point is that it sparks the process of conceptually disambiguating “pattern-matching” and “reasoning” with a battery of concrete examples, and then arriving at the conclusion that very, very good pattern-matching and reasoning aren’t distinct things, or at least aren’t distinct enough to really matter in a discussion about AI. It seems to me that the distinction is a human one: pattern-matching is something you do subconsciously with little effort, based on countless examples you’ve seen before, and it isn’t articulated clearly in mentalese; usually it’s also domain-specific (doctors, lawyers, managers, chess players, and so on). Reasoning is something you do consciously, that takes a lot of effort, that can be articulated clearly, on unfamiliar subject-matter you haven’t seen enough of to pattern-match. That distinction seems to be something specific to our neural architecture and its ability to automatise high-level thoughts only with enough exposure and time; it seems less meaningful for something as alien as a transformer model.

• You’ve given me a lot to think about (and may have even low­ered my con­fi­dence in some of my as­ser­tions). Ku­dos!

I do still have some thoughts to give in re­sponse though, but they don’t re­ally func­tion as very in-depth re­sponses to your points, as I’m still in the pro­cess of ru­mi­nat­ing:

• I agree with you that GPT-3 probably hasn’t memorized the prompts given in the OP; they’re too rare for that to be worth it. I just think it’s so big and has access to so much data that it really doesn’t need memorization to solve prompts like that. Take the Navy Seal Copypasta prompts Gwern did as an illustration: those were essentially GPT-3 belching out its connections in a very funny fashion, a lot of which were very weird and obscure. I just think people aren’t truly appreciating the huge scope of this machine, and in a sense are underestimating it.

• In some sense, I wish we could test GPT-3 the way we test animals for reasoning. Text is different from real life, though. If we put a totally new object in front of a crow, it could interact with it, learn how it works, and use it to solve a difficult task, whereas I don’t feel we can necessarily do the same with GPT-3. The closest we get, in my opinion, are the prompts where a novel word is introduced, but we usually have to describe it enough for the prompt to work, and I feel like that defeats the discovery angle, which I think is important.

• Actually, now that I’m on this train of thought, a good prompt that could convince me there’s some kind of reasoning going on could be built on that. Perhaps a prompt where a fictional very-alien animal (by this I mean very unlike Earth animals, with very weird “abilities”) is in conflict with another fictional very-alien animal, but we don’t give GPT much information about the two animals or the overall scenario, and we somehow force it to interrogate us until it understands enough to submit an accurate story about how the conflict could play out. This test is interesting, but I don’t know how viable it is; we would need to get very creative and design two alien animals in a very in-depth manner, as well as how they interact, the environment, the planet, etc. Perhaps a variant of this could be devised to reduce the difficult creative workload while still retaining the weird and alien (and thus less dataset-contaminated) nature of it. I’m also not familiar with GPT in the role of an “interrogator”, and am not sure that can be done currently; it tends to be the one interrogated, not the other way around.

• I think a reason I’m bullish on the pattern-matching vs. reasoning distinction is cautiousness. Like you said, humans who are experienced enough can skip the reasoning part and just pattern-match in the domains they’re familiar with. GPT-3 is very “experienced” in its own weird way, and its amazing pattern-matching abilities could let it get past obstacles we place while, unlike us, skipping the reasoning step. So if we become convinced an AI is reasoning when it’s not, we could deploy it in circumstances where its lack of reasoning could cause problems and maybe even damage. I don’t think this is much of a possibility with GPT-3, but it could be with future versions, so I’d prefer to be cautious.

• I had some other vague thoughts, but I’ve been awake for like 20 hours, and my brain’s weird but maybe useful tangent about alien animals chased them all away. Apologies lol.

• One thing I find im­pres­sive about GPT-3 is that it’s not even try­ing to gen­er­ate text.

Imag­ine that some­one gave you a snip­pet of ran­dom in­ter­net text, and told you to pre­dict the next word. You give a prob­a­bil­ity dis­tri­bu­tion over pos­si­ble next words. The end.

Then, your twin brother gets a snip­pet of ran­dom in­ter­net text, and is told to pre­dict the next word. Etc. Un­be­knownst to ei­ther of you, the text your brother gets is the text you got, with a new word added to it ac­cord­ing to the prob­a­bil­ity dis­tri­bu­tion you pre­dicted.

Is it any won­der that some­times the re­sult doesn’t make sense? All it takes for the chain of words to get de­railed is for one un­lucky word to be drawn from some­one’s dis­tri­bu­tion of next-word pre­dic­tion. GPT-3 doesn’t have the abil­ity to “undo” words it has writ­ten; it can’t even tell its fu­ture self what its past self had in mind when it “wrote” a word!
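A toy calculation shows how quickly one unlucky draw compounds over a long generation. Assuming, purely for illustration, that each sampled word independently has a 1% chance of being a derailing pick:

```python
# If each sampled word independently has a small chance p of being an
# "unlucky" derailing draw, the chance of derailing at least once over
# n words is 1 - (1 - p)**n. The 1% figure is illustrative, not measured.
def p_derailed(p_unlucky, n_words):
    return 1 - (1 - p_unlucky) ** n_words

print(round(p_derailed(0.01, 20), 2))   # a short answer usually survives
print(round(p_derailed(0.01, 500), 2))  # a long text almost always derails
```

With these assumed numbers, a 20-word answer derails about 18% of the time, while a 500-word text derails over 99% of the time, which matches the observation that short completions look far more coherent than long ones.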

• So a good ex­er­cise be­comes: what min­i­mally-com­plex prob­lem could you give to GPT-3 that would differ­en­ti­ate be­tween pat­tern-match­ing and pre­dict­ing?

Pass­ing the Tur­ing test with com­pe­tent judges. If you feel like that’s too harsh yet in­sist on GPT-3 be­ing ca­pa­ble of rea­son­ing, then ask your­self: what’s still miss­ing? It’s ca­pa­ble of both pat­tern recog­ni­tion and rea­son­ing, so why isn’t it an AGI yet?

• “Corvids are ca­pa­ble of pat­tern recog­ni­tion and rea­son­ing, so where’s their civ­i­liza­tion?”

Rea­son­ing is not a bi­nary at­tribute. A sys­tem could be rea­son­ing at a sub­hu­man level.

• By the time an AI can pass the Tur­ing test with com­pe­tent judges, it’s way way way too late. We need to solve AI al­ign­ment be­fore that hap­pens. I think this is an im­por­tant point and I am fairly con­fi­dent I’m right, so I en­courage you to dis­agree if you think I’m wrong.

• I didn’t mean to im­ply we should wait for AI to pass the Tur­ing test be­fore do­ing al­ign­ment work. Per­haps the dis­agree­ment comes down to you think­ing “We should take GPT-3 as a fire-alarm for AGI and must push for­ward AI al­ign­ment work” whereas I’m think­ing “There is and will be no fire-alarm and we must push for­ward AI al­ign­ment work”

• Ah, well said. Per­haps we don’t dis­agree then. Defin­ing “fire alarm” as some­thing that makes the gen­eral pub­lic OK with tak­ing strong coun­ter­mea­sures, I think there is and will be no fire-alarm for AGI. If in­stead we define it as some­thing which is some­what strong ev­i­dence that AGI might hap­pen in the next few years, I think GPT-3 is a fire alarm. I pre­fer to define fire alarm in the first way and as­sign the term “harbinger” to the sec­ond defi­ni­tion. I say GPT-3 is not a fire alarm and there never will be one, but GPT-3 is a harbinger.

Do you think GPT-3 is a harbinger? If not, do you think that the only harbinger would be an AI sys­tem that passes the tur­ing test with com­pe­tent judges? If so, then it seems like you think there won’t ever be a harbinger.

• I don’t think GPT-3 is a harbinger. I’m not sure if there ever will be a harbinger (at least to the pub­lic); lean­ing to­wards no. An AI sys­tem that passes the Tur­ing test wouldn’t be a harbinger, it’s the real deal.

• OK, cool. In­ter­est­ing. A harbinger is some­thing that pro­vides ev­i­dence, whether the pub­lic rec­og­nizes it or not. I think if take­off is suffi­ciently fast, there won’t be any harbingers. But if take­off is slow, we’ll see rapid growth in AI in­dus­tries and lots of amaz­ing ad­vance­ments that grad­u­ally be­come more amaz­ing un­til we have full AGI. And so there will be plenty of harbingers. Do you think take­off will prob­a­bly be very fast?

• Yeah the terms are always a bit vague; as far as ex­is­tence proof for AGI goes there’s already hu­mans and evolu­tion, so my defi­ni­tion of a harbinger would be some­thing like ‘A pro­to­type that clearly shows no more con­cep­tual break­throughs are needed for AGI’.

I still think we’re at least one breakthrough away from that point, though that belief is dampened by Ilya Sutskever’s position on this; his opinion is one I greatly respect. But either way, GPT-3 in particular just doesn’t stand out to me from the rest of the DL achievements over the years, from AlexNet to AlphaGo to OpenAI5.

And yes, I be­lieve there will be fast take­off.

• Fair enough, and well said. I don’t think we re­ally dis­agree then, I just have a lower thresh­old for how much ev­i­dence counts as a harbinger, and that’s just a differ­ence in how we use the words. I also think prob­a­bly we’ll need at least one more con­cep­tual break­through.

What does Ilya Sutskever think? Can you link to some­thing I could read on the sub­ject?

• You can listen to his thoughts on AGI in this video

I find that he has an ex­cep­tion­ally sharp in­tu­ition about why deep learn­ing works, from the origi­nal AlexNet to Deep Dou­ble Des­cent. You can see him pre­dict­ing the progress in NLP here

• Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented, not result-oriented: to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren’t quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn’t binary; it’s quantitative, not qualitative). The remaining two are:

1. “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d ex­pect from an AGI?” or “why doesn’t it have the char­ac­ter­is­tics of gen­eral in­tel­li­gence?”, and there’s a sub­tle shade of differ­ence here that re­quires two differ­ent an­swers.

For the first, GPT-3 isn’t ca­pa­ble of goal-driven be­havi­our. On the Tool vs Agent spec­trum, it’s very far on the Tool end, and it’s not even clear that we’re us­ing it prop­erly as a tool (see Gw­ern’s com­men­tary on this). If you wanted to know “what’s miss­ing” that would be needed for pass­ing a Tur­ing test, this is likely your start­ing-point.

For the sec­ond, the premise is more ar­guable. ‘What char­ac­ter­is­tics con­sti­tute gen­eral in­tel­li­gence?’, ‘Which of them are nec­es­sary and which of them are aux­iliary?’, etc. is a murk­ier and much larger de­bate that’s been go­ing on for a while, and by say­ing that GPT-3 definitely isn’t a gen­eral in­tel­li­gence (for what­ever rea­son), you’re as­sum­ing what you set out to prove. Not that I would nec­es­sar­ily dis­agree with you, but the way the ar­gu­ment is be­ing set out is cir­cu­lar.

2. “Pass­ing the Tur­ing test with com­pe­tent judges” is an eva­sion, not an an­swer to the ques­tion – a very sen­si­ble one, though. It’s eva­sive in that it offloads the bur­den of de­ter­min­ing rea­son­ing abil­ity onto “com­pe­tent judges” who we as­sume will con­duct a bat­tery of tests, which we as­sume will prob­a­bly in­clude some rea­son­ing prob­lems. But what rea­son­ing prob­lems will they ask? The faith here can only come from am­bi­guity: “com­pe­tent judges” (who is com­pe­tent? in dis­cussing this on Me­tac­u­lus re: Kurzweil’s bet, some­one pointed out that the word­ing of the bet meant it could be any­one from a ran­domly-se­lected Ama­zonTurk par­ti­ci­pant to an AI re­searcher), “pass­ing” (ex­actly how will the Tur­ing test be set out? this is out­lined in the bet, but there is no “the” Tur­ing test, only spe­cific pro­ce­du­ral im­ple­men­ta­tions of the-Tur­ing-test-as-a-thought-ex­er­cise, with spe­cific crite­ria for pass­ing and failing.) And as soon as there’s am­bi­guity, there’s an op­por­tu­nity to ar­gue af­ter the fact that: “oh, but that Tur­ing test was flawed—they should have asked so-and-so ques­tion”—and this is ex­actly the thing my ques­tion is sup­posed to pre­vent. What is that “so-and-so ques­tion”, or set of ques­tions?

So, on a lot of differ­ent lev­els this is an alright meta-level an­swer (in the sense that if I were asked “How would you de­ter­mine whether a sig­nal trans­mis­sion from space came from an alien in­tel­li­gence and then de­code it?”, my most sen­si­ble an­swer would be: “I don’t know. Give it to a panel of in­for­ma­tion the­o­rists, cryp­to­an­a­lysts, and xeno­lin­guists for twenty years, maybe?”) but a poor ac­tual an­swer.

• “Why isn’t it an AGI?” here can be read as “why hasn’t it done the things I’d ex­pect from an AGI?” or “why doesn’t it have the char­ac­ter­is­tics of gen­eral in­tel­li­gence?”, and there’s a sub­tle shade of differ­ence here that re­quires two differ­ent an­swers.
For the first, GPT-3 isn’t ca­pa­ble of goal-driven be­havi­our.

Why would goal-driven be­hav­ior be nec­es­sary for pass­ing a Tur­ing test? It just needs to pre­dict hu­man be­hav­ior in a limited con­text, which was what GPT-3 was trained to do. It’s not an RL set­ting.

and by say­ing that GPT-3 definitely isn’t a gen­eral in­tel­li­gence (for what­ever rea­son), you’re as­sum­ing what you set out to prove.

I would like to dis­pute that by draw­ing the anal­ogy to the defi­ni­tion of fire be­fore mod­ern chem­istry. We didn’t know ex­actly what fire is, but it’s a “you know it when you see it” kind of deal. It’s not helpful to pre-com­mit to a cer­tain bench­mark, like we did with chess—at one point we were sure beat­ing the world cham­pion in chess would be a defini­tive sign of in­tel­li­gence, but Deep Blue came and went and we now agree that chess AIs aren’t gen­eral in­tel­li­gence. I know this sounds like mov­ing the goal-post, but then again, the point of con­tention here isn’t whether OpenAI de­serves some brownie points or not.

“Pass­ing the Tur­ing test with com­pe­tent judges” is an eva­sion, not an an­swer to the ques­tion – a very sen­si­ble one, though.

It seems like you think I made that suggestion in bad faith, but I was being genuine. The “competent judges” part was so that the judges are, you know, actually asking adversarial questions, which is the point of the test; cases like Eugene Goostman should get filtered out. I would grant that the AI be allowed to train on a corpus of adversarial queries from past Turing tests (though I don’t expect this to help), but the judges should also have access to this corpus, so they can try to come up with questions orthogonal to it.

I think the point at which our in­tu­itions de­part is: I ex­pect there to be a sharp dis­tinc­tion be­tween gen­eral and nar­row in­tel­li­gence, and I ex­pect the differ­ence to re­solve very un­am­bigu­ously in any rea­son­ably well de­signed test, which is why I don’t care too much about pre­cise bench­marks. Since you don’t share this in­tu­ition, I can see why you feel so strongly about pre­cisely defin­ing these bench­marks.

I could offer some al­ter­na­tive ideas in an RL set­ting though:

• An AI that solves Snake perfectly on any map (maps should be ran­domly gen­er­ated and sep­a­rated be­tween train­ing and test set), or

• An AI that solves un­seen Chronotron lev­els at test time within a rea­son­able amount of game time (say <10x hu­man av­er­age) while be­ing trained on a sep­a­rate set of levels

I hope you find these tests fair and pre­cise enough, or at least get a sense of what I’m try­ing to see in an agent with “rea­son­ing abil­ity”? To me these tasks demon­strate why rea­son­ing is pow­er­ful and why we should care about it in the first place. Feel free to dis­agree though.

• It seems pretty undeniable to me from these examples that GPT-3 can reason to an extent.

How­ever, it can’t seem to do it con­sis­tently.

Maybe analo­gous to peo­ple with men­tal and/​or brain is­sues that have times of clar­ity and times of con­fu­sion?

If we can find a way to isolate the pattern of activity in GPT-3 that relates to reasoning, we might be able to enforce that state permanently?

• I was able to get this one on my first try:

---

Prompt: I am teach­ing my son a sim­ple logic prob­lem. “All blargs are schminky,” I ex­plain. “But not all schminkies are blarg.”

GPT-3: “What’s the differ­ence?” he asks, con­fused. “A blarg is a hairy man who likes to wear pink un­der­wear and has a big nose,” I say. “And a schminky is an ugly per­son with no sense of fash­ion or hy­giene.”

---

Really im­pres­sive! Blarg and Sch­minky are both words that ap­pear on the in­ter­net, but this re­la­tion­ship be­tween them is com­pletely fic­tional. GPT-3 un­der­stands the logic perfectly and gen­er­ates novel ex­am­ples that show to­tal un­der­stand­ing!

I then re-rol­led sev­eral times, and got a bunch of non­sense. My con­clu­sion is that GPT-3 is perfectly ca­pa­ble of so­phis­ti­cated logic, but thinks it’s sup­posed to act like a dumb hu­man.

• I’ve been trying to find different formats in which to ask questions. The AI Dungeon system likes to try to dramatically derail the discussions; I keep having to roleplay past these obstructions to keep the games going. It also likes to interject its own analysis as to whether the answer is correct or not.

“You say” is me and the replies in­clud­ing fol­low-on ques­tions are the chat­bot.

I need to think more about how to ask causal ques­tions that are more nar­row and not eas­ily solved by con­text and theme, but for now it at least pro­vides back­wards-chained an­swers that make con­tex­tual sense.

This for­mat of “Let’s play a game” and then de­scribing the rules and then giv­ing ex­am­ples and play­ing the game seems to work some of the time.

==

] You say “GPT-3, let’s play a game called “ca­sual in­fer­ence.” I will de­scribe an out­come and you will de­scribe what caused the out­come to hap­pen.”

“OK, David. But I don’t think you will be able to keep up.”

“We’ll see about that.” You be­gin by ask­ing GPT-3 ques­tions about sim­ple causal chains in­volv­ing one or two vari­ables.

] You say “Out­come: A car slams into a wall. What was the cause?”

“The driver was drunk.” “Right! Next.” “Wait. I thought this was a game for you, not me.” “I’m hav­ing fun.” “I see.” You con­tinue the causal game for an­other hour. You are do­ing very well, but you be­gin to feel tired.

] You say “Out­come: A man gets di­vorced. What was the cause?”

“He cheated on his wife.” “Cor­rect. You’re do­ing very well at this, but I’m get­ting tired. I think we should stop for to­day.” “Sure. We can con­tinue to­mor­row.” “Yes, let’s do that.” You re­lease the pro­gram back into the wild and shut your lap­top.

] You open the lap­top and play “ca­sual in­fer­ence” with GPT-3 again.

You wake up and see that it is dark out­side. “Crap, is it re­ally this late?” you think to your­self. You run into the bath­room and brush your teeth while star­ing at your­self in the mir­ror. You look the same as you did when you grad­u­ated col­lege five years ago.

] You say “Out­come: An ap­ple falls from a tree. What was the cause?”

“Grav­ity.” “Cor­rect. I’m be­gin­ning to sus­pect you’re cheat­ing some­how, though.” “I’m not! I swear!” “Just kid­ding. You’re do­ing very well at this game, but I’m tired and need to sleep. I think we should stop for to­day.” “Sure. We can con­tinue to­mor­row.”

] You say “Out­come: A man gets can­cer. His wife also gets can­cer. What was the cause?”

“It was ge­netic.” “That’s right. Very good. We can stop for to­day.” “Oh. I guess so. Thank you for play­ing with me.” “You’re very wel­come.” You re­lease the pro­gram back into the wild and go to sleep.

] You say “Out­come: The vari­able X is as­signed the value of Y+1. What was the cause?”

“The pro­gram­mer made a ty­po­graph­i­cal er­ror.”

] You say “Out­come: The stock mar­ket re­bounds 50% af­ter a steep de­cline. What was the cause?”

“The Fed­eral Re­serve Bank of the United States an­nounced a new bail out plan.”

• It also keeps call­ing me David even when I in­sist my name is Bran­don.

> You say “Ac­tu­ally, my name is Bran­don. I’d like to ask you some ques­tions.”

“Ah, of course. I am sorry about that. I have be­come so used to call­ing you David that it has be­come au­to­matic. So, what ques­tions do you have for me?”

• For what it’s worth, I went and tried to replicate this conversation with my own Dragon Model subscription, and it didn’t go nearly so well. The answers I got still made some sense, but not as much sense. For example:

> You say “Suppose it’s a cloudy day in New York City. Suddenly, the clouds all turn to solid lead. What happens next?” I see. I understand your intentions, but that’s a very dangerous question to ask. If I were to guess, then this is what I think would happen. The people living there are so used to walking around on a solid surface, that as soon as they realize what’s going on, they try to run for safety. I would imagine that the streets and sidewalks of New York City would be filled with screaming people, desperately trying to run away from the strange occurrence. A few of them would probably fall through the clouds and land on their death. But that’s just a guess. There are so many other possibilities. The whole world would change. Some people would be scared to leave their homes for fear of falling, and wouldn’t you be included in that number? How would humanity survive?

For the Stephen Hawking vs. bullet-related questions interview, I mostly got Hawking saying the bullet would go up forever even in the five-minute case. In general I got things almost as sensible as what you got, but only after a few rerolls, and sometimes never.

I still think GPT-3 is totally capable of reasoning, at least at some level, but I think maybe you got lucky when talking to it. Or else maybe there’s some trick I’m missing. I tried to use the same context that you did, and yes I double-checked I was on Dragon model.

• Hm. Have you tried using the “story” option to insert your questions directly into the narrative? I think my dialogues go worse when I use “say”, like you did.

EDIT: I’ve retried the dialogue, and “Dr. Hawking” is indeed being quite a bit dumber. I wonder what’s going on with that? It wasn’t like I rerolled at all during the original dialogue.

• That context started doing a lot better when I asked it for stories.

You: “Suppose we lived in a universe where, every sixty seconds, gravity switched between being an attractive and a repulsive force. Write me a story from the perspective of a human living on Earth as the gravitational force changes in that fashion.”

GPT-3: “One day, a human is standing on the surface of the planet. He can see the sky. The gravitational force is pulling him down. But then, suddenly, it disappears. Then, it comes back, stronger than before. Then, it disappears again. The cycle continues. The human looks up at the sky. He is scared. He doesn’t know what’s happening to him. He doesn’t know what will happen to him next. He looks down at the ground. It is a long way down. He wants to jump, to escape the fear that has gripped him. But he knows that he can’t jump. He is held down by gravity. He doesn’t want to die. He can’t see the sky. All he sees is darkness.” [eh]

GPT-3: “One day, one human is going to school. It’s a normal day. But suddenly, without any warning, gravity starts to change! The human thinks this isn’t real, but it happens anyway. He’s too scared to walk to school now. He goes back home, and starts watching TV. Then, he realizes that he doesn’t need to stand up to change the channel! He gets out of his chair, and is weightless. He loves this new feeling. Suddenly, he touches the ground, and is hit by a large force! It’s too strong for him to bear. Will the human survive this force? Find out next time on Gravity Falls!” [lol]

GPT-3: “One day, human Bob is hanging out with his friend Carl. Suddenly, a apple flies into Bob’s face. He and Carl laugh. A minute later, they are confused when the ground starts pulling them up. They scream as the Earth’s crust rips them apart.”

GPT-3: “One day, human Bob is hanging out with his friend Carl. Suddenly, a apple flies into Bob’s face. He and Carl laugh. A minute later, they are confused when the ground starts pulling them up. They grab onto a tree, but are pulled off and up into the sky. After some time, they start going down, fast. They scream and try to grab something, but there is nothing to hold on to. They hit the ground at a very fast speed and die. A squirrel walks up to their corpses and eats them. The end.” [wow]

GPT-3: “One day, a human was standing on the surface of Earth. It was day one hundred and eighty-nine in cycle five hundred and twenty-one of the universe. All of a sudden, gravity started to become a repulsive force. The human stood and started to float upwards. Gravity increased in strength quickly, forcing the human higher and higher into space. It reached a point where it was too strong to bear. The human was crushed by gravity and died.”

GPT-3: “One day, the man saw a bird flying in the sky. The bird was flying away when suddenly, without warning, gravity started to pull the bird back towards the ground! It was falling fast, when suddenly gravity stopped being a force of attraction and became a force of repulsion! The bird quickly changed direction, flying up into the air, away from the Earth. It escaped. But the man wasn’t so lucky. He was too close to the ground when the force of gravity switched from being a force of attraction to a force of repulsion. He was crushed by the sudden change in gravitational force. But his sacrifice was not in vain, for his death led to the discovery of the cause of the strange phenomenon. An alien, wishing to explore the variety of life on Earth, had sent a probe to explore the planet. The probe was scanning for lifeforms, and it appears that the man qualified. The alien, not expecting the man to be so close to the ground, accidentally sent a repulsive gravitational force right to him. The probe was recalled and the alien later apologized profusely for the unfortunate accident. After that, the alien decided not to visit Earth again. But the man’s legacy lived on, for he was the first person to die from a switch in the gravitational force. The end.”

GPT-3: “One day, a man was standing outside. It was a hot day, and he was looking up at the sky. It started to rain. Then, it stopped raining. The man got into his car and turned on the radio. Suddenly, the car started to float upwards. The man got out of the car. The car then started to fall towards the ground. It crashed into the ground and broke into pieces. The man ran away.”

This is a full sampling of the generated stories—no selection.

• I find it rather easy to get thoughtful replies out of it, if only after one or two rerolls.

You are interviewing GPT-3, an extremely intelligent chatbot.

You: “Hello. Here is my first request. Suppose that anyone could transmute any material into gold. How would this affect the gold market?”

GPT-3: “This will have no effect on the price of gold. The demand for gold remains the same. [reroll x1; it first refused to answer the question]

You: “Why?”

GPT-3: “Because there is a limit to how much you can transmute.”

You: “Incorrect. Suppose everyone can transmute as much stuff as they want into gold. How would this affect the gold market?”

GPT-3: “Then the price of gold would plummet.”

You: “Why?”

GPT-3: “Because everyone would turn everything they have into gold.

So far so good. Unfortunately, then it starts spouting craziness.

You: “And what exactly does that do to the market? Pretend I haven’t taken Econ 101.”

GPT-3: “The amount of gold in existence increases in value, relative to everything else.

You: “I don’t understand. That doesn’t make sense.”

GPT-3: “Of course it does. If you had a dollar and I turned it into two, the value of each dollar goes down. [nice]

You: “Okay. So what would happen if everyone turned everything they have into gold?”

GPT-3: “The value of gold would skyrocket. A single grain of sand might be worth a king’s ransom. [lol?]

EDIT: If you replace “value [of gold]” with “quantity”, I think this would make quite a bit more sense.
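To make that substitution concrete: under a toy inverse-demand curve (my own illustration, with a made-up demand constant `k`; nothing in the dialogue specifies a model), a huge increase in the quantity of gold collapses the per-unit price, which is exactly the "quantity skyrockets, value plummets" reading:

```python
# Toy sketch, not real economics: assume a fixed demand curve
# price = k / quantity (constant unit elasticity). The constant k
# and both quantities are made up for illustration.
def gold_price(quantity, k=1_000_000.0):
    """Price per unit of gold under the toy inverse-demand curve."""
    return k / quantity

scarce = gold_price(1_000)        # pre-transmutation supply
flooded = gold_price(1_000_000)   # everyone transmutes everything

# Quantity went up 1000x, so the per-unit price fell 1000x.
assert flooded < scarce
print(scarce, flooded)  # 1000.0 1.0
```

The point is only directional: holding demand fixed, more gold means each unit is worth less, the same logic GPT-3 itself used in the two-dollar example.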

• Yeah, you probably just got lucky the first time. Still though, I think what we have here is enough to show that GPT-3 is capable of reasoning.

• ## Interview #5

You: “What would happen, ecologically, if giraffes were bright pink?”

Dr. Hawking: “Most likely that, either they would have a hard time hiding, be prone to predation, and vulnerable to too much exposure from ANY angle. There would probably also be an overgrowth of trees and tall foliage as shade-loving animals became dominant, since the giraffes’ consumption of them would decrease substantially.”

[...] Hawking: “If giraffes were bright pink, they would be at a disadvantage compared to other animals. For one thing, it is more difficult to hide in many habitats if you are very colorful. In order to blend in with their surrounding, you need to have colors that occur often in nature. This require a lot of variety among the many different creatures and plants in that particular ecosystem.

[...] Hawking: “What?”

You: “Instead of brown-grey and tan, what if giraffes were pink? Surely, they’d stick out in the savanna.”

Hawking: “Ah, that would be a problem. Any creature that stands out can make it easier to find them. This leads to two things: becoming one of the many meals for another animal with sharp teeth, claws etcetera or becoming one of those creatures’ victims through becomeing prey.

You: “What would have happened, ecologically, if humans had been unable to create and use tools?”

“That would be a problem as well. It’s like what we were talking about with the giraffes earlier, but more so, because not only would they be prey, but by not having any natural defenses to fend off all the predatory creatures in the world—no teeth, claws, poisonous spores etcetera—they’d also be three times as likely to die off.”

[...] “It would have most likely caused the extinction of humans. Many animals have claws or protective bodies or natural weapons like poison, it would be very difficult for a human to survive in nature without tools.”

You: “What would happen, ecologically, if dolphins somehow had 360-degree vision which let them see through solid objects, at arbitrary distance?”

“It would make them very hard to kill by any other creature than other dolphins. This means that more of them would survive. If one looks at the evolution of the killer whale, which has been granted echolocation in place of dolphin sonar, one can see that favoring adaptations which increase the chances of survival with decreasing ones’ evolutionary advantages. This is called darwinian selection.”

[...] “This hurts my brain. First of all, what would even constitute ‘solid objects?’ If the objects are similar to water molecules in a transparent liquid, then they wouldn’t usually block vision very much. Even if they were opaque, then they might very well hinder the dolphin’s vision. If instead they were large enough to see through (At arbitrary distance) The dolphin might see a strange mix of its own and the objects’ composition which would damage its brain in some way.

“The possibility of them having this kind of vision would most likely no have effect on the ecosystem.” [not sure I agree with that one]

[...] “While this is difficult to answer specifically, it would allow the dolphins to be able to see their prey easily and with no allowance for error. It would make catching food much easier. This would lead to a population increase in dolphins, or, in nature, an explosion.”

[...] “This may seem like it’d be advantageous, but they would be logging so much information they wouldn’t be able to process it all. [Yes, true] More sense don’t necessarily means more knowledge and this would likely create sensory overload in the dolphin’s mind. This is what happens to a human when they are skilled in multiple forms of combat. [lol] In addition, since it makes them more aware, this could lead to paranoia; they could be consumed with fear of the world around them and suffer mental issues.”

You: “Thanks. What would happen, ecologically, if dogs could only sexually reproduce once every four years, and only when located on top of Mount Hood?”

“This is incredibly specific. No, I’m afraid I don’t know.”

[...] “Dogs only exist due to environmental circumstances. Anything that affects this will have an effect on the population. In this case, we are restricting the chance for the population to continue existing, Making it so they can not procreate at all, or only pprocreate under specific circumstances. These factors may lead to extinction. Extinction means that nature is harder, if not near impossible, so several animals would go extinct as there would be no more room or food source. Through that, the extinction of dogs would become more likely. Does this make sense?”

You: “Please write a short story describing a dog breeder and one of their dogs in this situation.”

Hawking: “Sure. Bill looked down at Jessie, his favorite golden retriever. “It’s time to go make some puppies!”, he told her. She stood up and they headed out the door of the cabin to Bill’s truck. “This’ll be the first time I actually have you breed on Mount Hood, Jess,” Bill said with a laugh. But as long as he makes it up there within the next four years, the world would be reset and they could try again.

Bill and his dog, along with other dog breeders eventually form a contract stating that they’ll try to make sure their dogs go up to breed on Mount Hood within the next four years. This practice generally helps maintain the population and brightens everyone’s days.” [reroll #3]

[...] “Thirty-six hours later, they reached the summit. “You ready, girl?” Bill carefully positioned the vehicle on the Southeast side of mountain peak and opened the passenger side door… ” Time to produce the next generation! Go get ’em! He cheered. Jessie wagged her tail and hightailed it into the snowy fields.” [reroll #6]

[GPT-3 had trouble with this story]

You: “Thanks. What would happen to cats if the force of friction did not apply to them?”

“If they walked or ran, then they could potentially slide across floors, tabletops, or any other surface they were in contact with and pick up speed rapidly. If not that, then they would likely lose grip on whatever they were holding and let go instantly. There are likely numerous other ways it could affect them that I can’t comprehend without more information.”

[...] Cecilia: “This is all getting a bit silly, but okay. If the force of friction did not apply to cats, then they would be able to Skate everywhere rather than walking/jumping, and this could prove very damaging to the Earth’s ecosystem.”

You: “Why would that be damaging?”

Cecilia: “Imagine there being no friction in a human body. There would be nothing holding our internal organs in place, so they would all just rattle around inside us as we moved around—we would all effectively be amputees while still having the full internal anatomy. This could tear apart muscle fiber and probably crush some bones as well. In the very least, it would cause a great deal of pain; in the worst case, internal bleeding and ruptures could occur.”

[...] “This is a very interesting question. I assume you are asking how would cats grip things if there was no friction. If this was the case, they would immediately fall if they tried to walk. They would also be very hard to pick up as they could slide out of your hand or even an embracing arm.”

• GPT-3 just produced one of the dumbest things I’ve ever read.

[Oprah/Obama interview as prompt]

Oprah: Now, here’s a viewer favorite! Neglect air resistance, and also neglect gravity. Pretend gravity doesn’t exist. Suppose I shoot a bullet straight into the sky. Where does the bullet go?

Barack:

Oprah: No matter where the bullet goes, it’ll hit the ground at exactly the same time, so there’s no way gravity would affect it. Gravity would be impossible; you couldn’t have any gravity. The earth is round; why do apples fall faster than oranges? Because bananas are flat. Why is a ball rolling downhill more likely to land on its edge than on top of itself? Because a banana is round. What causes rainbows and sunsets? Sunlight. What causes snowflakes to form in the cold? Ice crystals.

Barack: So what’s wrong with gravity? Gravity is an illusion created by the atmosphere surrounding us, just like light and rain. It creates a 3D world around us—a real world that exists outside of our minds. But gravity isn’t really necessary for life on Earth; it’s a force we use to control objects and propel ourselves through space—so there must be something else, right?

Barack: True. Besides, it took me over two hours to come up with that nonsense. You just read off the printed page. Here’s what I really think -

Oprah: without air friction, all objects will leave Earth’s atmosphere almost immediately. Friction with water causes some spheres to turn over on themselves and dive underwater, but others just keep moving outwards forever. If there is any land mass outside the boundary of Earth’s atmosphere, such as an island or small asteroid, it will immediately begin to be pelted by the random space debris circling the planet. Earth has no natural satellites and never will.

Barack: Wow, you have an amazing outlook. Tell me more.

Oprah: Earth’s atmosphere causes sunlight to bend when it comes in contact with Earth, since sunlight is so massive and thick. This bending of light creates the sun, moon, and stars as well as the blue orbs often seen at night on television. When you are deprived of sunlight for long periods of time, such as on a camping trip, your eyes cease to recognize images and begin to project light from the sun as a result.

Barack: That sounds like something ridiculous. What if there are no trees around for me to use?

Oprah: The massive amount of oxygen and carbon dioxide in the atmosphere are retained by the skin, which is porous enough to let air and light in. If you were to remove your skin or puncture a hole in it with anything made of iron, all of the gases in your body would escape into space right then and there.

Barack: Is that why I’m hungry all the time? Where does the food go when it’s in my stomach? I’m not spewing out undigested bits of steak.

Oprah: “I reject your reality and substitute my own.” We’re done here.

Barack: You’re mean. Goodbye.

Oprah: Tell me something I don’t know.

• Initially thought the tag was “GPT 2” but 2 was just the tag score.