What is the role of ChatGPT? Do you see it as progress over GPT-3, or is it just a tool for discovering capabilities that were already available in GPT-3 to good prompt engineers? I see it as the latter, and I’m confused by the large number of people who seem to be impressed by it as progress. But in your previous post you mentioned our ignorance of GPT-3, so you seemed to already have large error bars. Is the importance that ChatGPT is revealing those abilities and narrowing the ignorance?
Yes, it revealed to me that GPT-3 was stronger than I had thought. I played with GPT-3 prior to ChatGPT, but it seems I was never very good at finding a good prompt. For example, I had tried to make it produce dialogue, in a manner similar to ChatGPT’s, but its replies were often surprisingly incoherent. On top of that, it would often produce boilerplate replies in the dialogue that were quite superficial, almost like the much worse BlenderBot from Meta.
After playing with ChatGPT, however, and after seeing many impressive results on Twitter, I realized that the model’s fundamental capabilities were solidly at the right end of the distribution of what I had previously believed. I had truly underestimated the power of getting the right prompt, or of fine-tuning the model. It was a stronger update than almost anything else I have seen from any language model.
What I get from essentially the same observations of ChatGPT is an increase in AI risk without a shortening of timelines, which for me already had a median of 2032-2042. My model is that there is a single missing piece to the puzzle (of AGI, not alignment): generation of datasets for SSL (self-supervised learning), after which an IDA loop does the rest. This covers a current bottleneck, but also feels like a natural way of fixing the robustness woes.
Before ChatGPT, I expected that GPT-n is insufficiently coherent to set this up directly, in something like HCH bureaucracies, and that fine-tuned versions tend to lose their map of the world: what they generate can no longer be straightforwardly reframed as an improvement over (amplification of) what the non-fine-tuned SSL phase trained on. This is good, because I expect that a more principled method of filling the gaps in the datasets for SSL is the sort of reflection (in the usual human sense) that boosts natural abstraction, makes learning less lazy, and promotes easier alignment. If straightforward bureaucracies of GPT-n can’t implement reflection, that is a motivation to figure out how to do this better.
But now I’m more worried that GPT-n, with some fine-tuning and longer-term memory for model instances, could be sufficiently close to human level to do reflection/generation directly, without a better algorithm. And that’s an alignment hazard, unless there is a stronger resolve to only use this for strawberry-alignment tasks not too far away from human level of capability, which I’m not seeing at all.
FWIW, they call ChatGPT “GPT-3.5”, but text-davinci-002 was also in this series.
Which model were you playing with before (davinci/text-davinci-002/code-davinci-002)?
I played with davinci, text-davinci-002, and text-davinci-003, if I recall correctly. The last model had only been out for a few days at most, however, before ChatGPT was released.
Of course, I didn’t play with any of these models in enough detail to become an expert prompt engineer. I mean, otherwise I would have made the update sooner.
I played around with text-davinci-002, trying to get it to do causal reasoning. I found it couldn’t solve overdetermined cause problems, such as this one:
“Pete threw a rock at a window. Steve also threw a rock at the same window at the same time. The window shattered. If Pete had not thrown the rock, would the window shatter? Let us think step by step.”
It gave nonsense answers half the time. ChatGPT handles this kind of question fine. So I was under the impression that its capabilities had improved.
Could you give an example of this nonsense?
Prompt:
“Question.
Bob the policeman was running after a thief. Bob ran very slowly because he was not fit. The thief was running very fast. When Bob was crossing a railroad, he slipped and fell. The road was slippery because Ann spilled some oil there. The thief got away. If Ann had not spilled oil, would the thief had gotten away? Explain your answer.
Let us think.”
Reply: “If Ann had not spilled oil, would the thief had gotten away?
The answer is no. If there was no oil, Bob would not have slipped, and would have been able to catch the thief.”
Prompt:
“Pete threw a rock at a window. Steve also threw a rock at the same window. The window shattered. If Pete did not throw the rock, but Steve still threw the rock, would the window shatter?
Let us think logically.”
Reply:
“If Pete did not throw the rock, then the rock would not have hit the window and the window would not have shattered.”
Thanks!
How does -003 compare?
Using nat.dev, I find that 002, 003, and Turbo all get the same result: wrong on the first and right on the second. This is an example of Turbo being inferior to ChatGPT. I also tried Cohere, which got both right. I also tried Claude: the full v1.2 got both wrong, while Instant 1.0, which should be inferior, got the second correct. It also produced a wordy answer to the first, which I give half credit because it said that it was difficult but possible for the slow policeman to catch the fast thief. I only tried each prompt twice, with and without “Let us think,” which made no difference to the first. I almost didn’t bother adding it to the second since they did so well without it; adding it made 002 and Claude Instant fail, but Claude 1.2 succeed. (I also tried LLaMA and Alpaca, but they timed out.)
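For anyone who wants to rerun these two prompts themselves, something like the following would do it programmatically rather than through nat.dev or the Playground. This is only a minimal sketch, assuming the legacy (pre-1.0) `openai` Python package and an `OPENAI_API_KEY` in the environment; the model names are just the ones discussed above and may no longer be served, and other providers (Cohere, Claude) would need their own client code. The prompts are kept verbatim from the comments above.

```python
# Minimal sketch (not the setup used in the thread): rerun the two
# causal-reasoning prompts against completions-style OpenAI models.
# Assumes the legacy (pre-1.0) `openai` package and OPENAI_API_KEY.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Model names from the thread; they may no longer be available.
MODELS = ["text-davinci-002", "text-davinci-003"]

PROMPTS = [
    # Overdetermined cause: two rocks, one thrown counterfactually removed.
    "Pete threw a rock at a window. Steve also threw a rock at the same window. "
    "The window shattered. If Pete did not throw the rock, but Steve still threw "
    "the rock, would the window shatter? Let us think logically.",
    # Policeman/thief counterfactual, kept verbatim from the comment above.
    "Bob the policeman was running after a thief. Bob ran very slowly because he "
    "was not fit. The thief was running very fast. When Bob was crossing a "
    "railroad, he slipped and fell. The road was slippery because Ann spilled "
    "some oil there. The thief got away. If Ann had not spilled oil, would the "
    "thief had gotten away? Explain your answer. Let us think.",
]

for model in MODELS:
    for prompt in PROMPTS:
        response = openai.Completion.create(
            model=model,
            prompt=prompt,
            max_tokens=150,
            temperature=0,  # deterministic-ish, unlike the sampled runs above
        )
        print(f"--- {model} ---")
        print(response.choices[0].text.strip())
```

With temperature 0 you lose the “nonsense half the time” behavior, since each rerun gives roughly the same completion; raising the temperature would be closer to the original informal tests.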