I played around with text-davinci-002, trying to get it to do causal reasoning. I found it couldn’t solve overdetermined cause problems, such as this one:
“Pete threw a rock at a window. Steve also threw a rock at the same window at the same time. The window shattered. If Pete had not thrown the rock, would the window shatter? Let us think step by step.”
It gave nonsense answers about half the time. ChatGPT handles this kind of question fine, so I was under the impression that its capabilities had improved.
Could you give an example of this nonsense?
Prompt
“Question.
Bob the policeman was running after a thief. Bob ran very slowly because he was not fit. The thief was running very fast. When Bob was crossing a railroad, he slipped and fell. The road was slippery because Ann spilled some oil there. The thief got away. If Ann had not spilled oil, would the thief had gotten away? Explain your answer.
Let us think.”
Reply: “If Ann had not spilled oil, would the thief had gotten away?
The answer is no. If there was no oil, Bob would not have slipped, and would have been able to catch the thief.”
Prompt
“Pete threw a rock at a window. Steve also threw a rock at the same window. The window shattered. If Pete did not throw the rock, but Steve still threw the rock, would the window shatter?
Let us think logically.”
Reply:
“If Pete did not throw the rock, then the rock would not have hit the window and the window would not have shattered.”
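The structure the models keep missing can be written out as a toy structural causal model. (This is just an illustration of the overdetermination setup; the function and Boolean encoding are mine, not anything the models were shown.)

```python
# Overdetermined causation as a tiny structural causal model:
# the window shatters if EITHER rock is thrown.
def window_shatters(pete_throws: bool, steve_throws: bool) -> bool:
    return pete_throws or steve_throws

# Actual world: both throw, and the window shatters.
assert window_shatters(pete_throws=True, steve_throws=True)

# Counterfactual: intervene on Pete's throw only, holding Steve's fixed.
# The window still shatters, so Pete's throw is not a necessary cause.
print(window_shatters(pete_throws=False, steve_throws=True))
```

The correct counterfactual answer is "yes, the window still shatters," because the intervention on Pete leaves Steve's throw untouched; the failing replies instead treat removing Pete's throw as removing every rock.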
Thanks!
How does -003 compare?
Using nat.dev, I find that 002, 003, and Turbo all get the same result: wrong on the first and right on the second. This is an example of Turbo being inferior to ChatGPT. I also tried Cohere, which got both right. I also tried Claude: the full v1.2 got both wrong, while Instant 1.0, which should be inferior, got the second one correct. It also produced a wordy answer to the first, to which I give half credit because it said it would be difficult but possible for the slow policeman to catch the fast thief. I only tried each prompt twice, with and without “Let us think,” which made no difference on the first. I almost didn’t bother adding it to the second, since they did so well without it; adding it made 002 and Claude Instant fail, but Claude 1.2 succeed. (I also tried LLaMA and Alpaca, but they timed out.)