I would be curious to see an attempt! I have a pretty strong prior that it would fail, though, with currently available models. I buy that RLHF hurts, but given that Sam Altman’s sample story also didn’t impress me (and showed the same failure modes, just slightly less so), the problem pattern-matches for me to the underlying LLM simply not absorbing the latent structure well enough to imitate it. You might need more parameters, or a different set of training data, or something.
(This also relates to my reply to gwern above—his prompt did indeed include high quality examples, and in my opinion it helped ~0.)
Both Altman and Gwern used fine-tuned models; those don’t really do in-context learning. They don’t support “prompt engineering” in the original sense; they only respond to commands and questions in a particular way.