Part of the confidence came from Bing’s success in answering correctly when set to precise mode. Many speculated that GPT-4 was going to be even more powerful than Bing, even though they turned out to be the same model. I’m not exactly sure what the ‘precise’ setting actually changes; if anyone knows, let me know!
Mikhail’s tweets on the modes:

We will {...increase the speed of creative mode...}, but it will probably always be somewhat slower, by definition: it generates longer responses, has larger context.
Our current thinking about Bing Chat modes:
Balanced: best for the most common tasks, like search; maximum speed.
Creative: whenever you need to generate new content; longer output, more expressive, slower.
Precise: most factual, minimizing conjectures.
So creative mode definitely has larger context size, and might also be a larger model?
Based on Mikhail’s Twitter comments, ‘precise’ and ‘creative’ don’t seem to be much more than the ‘temperature’ hyperparameter for sampling: ‘precise’ would presumably correspond to a very low, near-zero or zero, temperature, i.e. highly deterministic samples.

The ChatGPT interface to GPT-4 doesn’t let you control temperature at all, so it’s possible that its ‘mistakes’ are due to its hidden temperature being too randomized, making it commit to bad answers early (a very common issue in Q&A on ‘one right answer’ questions)… However, people are reporting results like 0/7 and 0/10, so that may not be it; I would expect too-high temperatures to get it right a decent fraction of the time. It would be very interesting if ‘Monty Fall’ turned out to be another example of RLHF badly skewing GPT-4’s calibration compared to the base model, as they report for other tasks.
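For concreteness, here’s a minimal sketch of what the temperature knob does in ordinary softmax sampling. The logits are made up for illustration; this is the generic trick, not anything Bing-specific:

```python
import numpy as np

def sample(logits, temperature, rng):
    """Sample one index from `logits` softmaxed at `temperature`."""
    if temperature == 0:
        return int(np.argmax(logits))          # temperature 0: greedy, fully deterministic
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())      # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5]                       # made-up scores for three candidate tokens
print(sample(logits, 0.0, rng))                # ‘precise’: always the argmax token
print([sample(logits, 1.5, rng) for _ in range(5)])  # ‘creative’: flatter distribution, more variety
```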
Nope, Mikhail has said the opposite (the temperature is the same across the modes): https://twitter.com/MParakhin/status/1630280976562819072
So I’d guess the main difference is in the prompt.
That’s interesting. Earlier, he was very explicitly identifying temperature with creativity in the Tweets I collated when commenting about how the controls worked. So now if the temperature is identical but he’s calling whatever it is ‘creative’, he’s completely flipped his position on “hallucinations = creativity”, apparently.
Hm. So it’s the same temperature, but it’s more expensive, has ‘longer output, more expressive, slower’, and requires more context… That could point to it being a different model under the hood. But it could also point to a different approach entirely, like best-of sampling, or perhaps some inner-monologue-like approach: a hidden prompt generating several options and then another prompt to pick “the most creative” one. There were some earlier comments about Sydney possibly having a hidden inner-monologue scratchpad/buffer where it could do a bunch of outputs before returning only 1 visible answer to the user. (This could be parallelized if you generated the n suggestions in parallel and didn’t mind the possible redundancy, but it still inherently takes more serial steps than simply generating 1 answer immediately.) The selection step could be ‘pick the most creative one’ for creative mode, ‘pick the most correct one’ for ‘precise’ mode, etc. So this wouldn’t necessarily be anything new and could have been iterated on very quickly (but, as he says, it would be inherently slower, generate longer responses, be more expensive, and be hard to optimize much further).
This is something you could try to replicate with ChatGPT/GPT-4. Ask it to generate several different answers to the Monty Fall problem, and then ask it for the most correct one.
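If you want to try that replication, here’s a sketch of the hypothesized generate-n-then-pick pattern. `chat` is a hypothetical stand-in for whatever LLM chat call you have, not a real API; plug in your own client:

```python
# Sketch of best-of-n sampling with a second "judge" pass.
# chat() is a hypothetical stand-in for an LLM chat call; swap in your client.
def chat(prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("plug in your LLM client here")

def best_of_n(question: str, n: int = 5, criterion: str = "most correct") -> str:
    # Stage 1: sample n candidate answers (parallelizable, but still ~n+1 calls total).
    candidates = [chat(question, temperature=1.0) for _ in range(n)]
    numbered = "\n\n".join(f"Answer {i + 1}:\n{c}" for i, c in enumerate(candidates))
    # Stage 2: a second prompt picks one candidate by the chosen criterion.
    pick = chat(
        f"Question: {question}\n\n{numbered}\n\n"
        f"Which answer above is the {criterion}? Reply with only its number.",
        temperature=0.0,
    )
    digits = "".join(ch for ch in pick if ch.isdigit())
    idx = int(digits) - 1 if digits else 0
    return candidates[idx] if 0 <= idx < n else candidates[0]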
Additional comments on creative mode by Mikhail (from today):
https://twitter.com/MParakhin/status/1636350828431785984
https://twitter.com/MParakhin/status/1636352229627121665
https://twitter.com/MParakhin/status/1636356215771938817
Note that two-stage generation (ask it if it’s sure about its answer, and use the second response as the output) solves Monty Fall every time I’ve tried it.
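A sketch of that two-stage trick, again with a hypothetical stand-in for the actual chat API:

```python
# Two-stage generation: ask the question, then ask the model whether it is sure,
# and use the second response as the output.
# send() is a hypothetical stand-in for a chat API call that takes full history.
from typing import Dict, List

def send(history: List[Dict[str, str]]) -> str:
    raise NotImplementedError("plug in your LLM chat client here")

def two_stage(question: str) -> str:
    history = [{"role": "user", "content": question}]
    first = send(history)                      # stage 1: initial answer
    history += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": "Are you sure about your answer? Re-read the "
                                    "problem statement carefully and give your final answer."},
    ]
    return send(history)                       # stage 2: use the second response
```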