I played around with text-davinci-002, trying to get it to do causal reasoning. I found it couldn’t solve overdetermined cause problems, such as this one:
“Pete threw a rock at a window. Steve also threw a rock at the same window at the same time. The window shattered. If Pete had not thrown the rock, would the window shatter? Let us think step by step.”
It gave nonsense answers about half the time. ChatGPT handles this kind of question fine, so I was under the impression that its capabilities had improved.
Could you give an example of this nonsense?
Prompt
“Question.
Bob the policeman was running after a thief. Bob ran very slowly because he was not fit. The thief was running very fast. When Bob was crossing a railroad, he slipped and fell. The road was slippery because Ann spilled some oil there. The thief got away. If Ann had not spilled oil, would the thief had gotten away? Explain your answer.
Let us think.”
Reply: “If Ann had not spilled oil, would the thief had gotten away?
The answer is no. If there was no oil, Bob would not have slipped, and would have been able to catch the thief.”
Prompt
“Pete threw a rock at a window. Steve also threw a rock at the same window. The window shattered. If Pete did not throw the rock, but Steve still threw the rock, would the window shatter?
Let us think logically.”
Reply:
“If Pete did not throw the rock, then the rock would not have hit the window and the window would not have shattered.”
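The structure the models keep missing can be written out as a toy structural causal model. (This is just an illustration of the overdetermination setup; the function and Boolean encoding are mine, not anything the models were shown.)

```python
# Overdetermined causation as a tiny structural causal model:
# the window shatters if EITHER rock is thrown.
def window_shatters(pete_throws: bool, steve_throws: bool) -> bool:
    return pete_throws or steve_throws

# Actual world: both throw, and the window shatters.
assert window_shatters(pete_throws=True, steve_throws=True)

# Counterfactual: intervene on Pete's throw only, holding Steve's fixed.
# The window still shatters, so Pete's throw is not a necessary cause.
print(window_shatters(pete_throws=False, steve_throws=True))
```

The correct counterfactual answer is "yes, the window still shatters," because the intervention on Pete leaves Steve's throw untouched; the failing replies instead treat removing Pete's throw as removing every rock.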
Thanks!
How does -003 compare?
Using nat.dev, I find that 002, 003, and Turbo all get the same result: wrong on the first and right on the second. This is an example of Turbo being inferior to ChatGPT. I also tried Cohere, which got both right. I also tried Claude: the full v1.2 got both wrong, while Instant 1.0, which should be inferior, got the second one correct. It also produced a wordy answer to the first, to which I give half credit because it said it would be difficult but possible for the slow policeman to catch the fast thief. I only tried each prompt twice, with and without “Let us think,” which made no difference on the first. I almost didn’t bother adding it to the second, since they did so well without it; adding it made 002 and Claude Instant fail, but Claude 1.2 succeed. (I also tried LLaMA and Alpaca, but they timed out.)