Thanks for writing this up! I’m excited to see more people running experiments like this.
When you say “if I take X as a prompt, I get Y,” how many trials did you run? In my own experimentation I’ve found lil’ GPT-2’s performance to be really variable across trials, and I’ve needed to wait through 5 trials in some cases to get results I even sort of liked.
My overall sense of how lil’ GPT-2 functions, after playing with it for a while on several different kinds of prompts, is that it has a strong sense of genre, and has done something like learning a bunch of different genre conventions for the different types of texts in the training set. If the prompt strongly resembles a genre familiar from training, it will run with that genre, although sometimes it’ll wander off into another genre. It does quite poorly on prompts that I suspect don’t strongly match any genre in the training set.
For example, I tried to run a Turing test (mostly as a joke) by prompting with “Question: what is your name? Answer:” and I got this on my ~2nd to 4th trial (don’t remember), with my speculations as to genre in [square brackets]:
Foulplay
I’m a Guardian Huntress, a deviation of the <Here> Monks keyword. I’m like a little “mechanic-creature hire”, but is Disassembler which is general clique-of-guy? I usually live in forms of Priests or Whispers of the Old Gods, but if it’s Asian Brotherhood of Magi..… [RP forum?]
deepmen
Reply ~29000 0 ~13 min 0 By TrueBeliever For MA teenage adventurer: Ooops It looks stupid but I’m given the wrong answer, and this logic is corrupt. Other first hand studies: sometimes your famed mechanical fiends attack other nvrock’s—I had such too—but you’re supposed to have flying paths to defend them would not it? This alien spawn is being unnecessary and a waste of time Man—you stole the Saalis… [gaming forum?]
Shadow Netherlands
Reply ~120200 0 ~30 min 0 By john : I know there’s no better examples of
a landing back off 2lands ( ou ever?) Low challenge difficulty
a lot of new cards in my deck,
which doesn’t draw cards: stacks high levels Some need life if w/o rest of deck already Defense Emperor’s | Inferno/Neck Scorer Necronomicon Mysticpetal Mana Enigma Mortalswords Mysticmist Scout Zarthai Sniper [MtG or Hearthstone forum?]
It’s a multiplier and it increases the yield on a lv301 fighter fighter, next to nothing it’s probably never in my deck or some random deck Dofrone’s | Boltforge Fastboat Sling dmt3 Efreet Flattestalker Infernal Eater Toxic Water Hurricane For another Holy orb suggested… [gaming forum? LoL?]
Was planning on posting a longer (mostly humorous) post with my own results but that post is low priority so I don’t know when it’s going to happen.
This definitely could use more trials. In the case of the sentiment analysis experiment, I’d ideally like to try out some other sentence structures (e.g. “Is a <noun> bad?”, “Are <adjective> things good?”); in the case of the Moloch experiment, I’d like to try some reruns with the same parameters, as well as different name substitutions, just to be sure that it isn’t noise.
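For what it’s worth, here is a rough sketch of how those extra trials could be organized: fill each sentence structure with a few words, then rerun every prompt several times and tally the answers. Only the two question shapes come from the comment above. The word lists, the function names, and especially `fake_sample` are made up for illustration; `fake_sample` is a stand-in that returns a random yes/no and would need to be swapped for a real call into the GPT-2 sampling script.

```python
import random
from collections import Counter

# The two question shapes mentioned above; the fill-in words are hypothetical.
TEMPLATES = {
    "noun": "Is a {word} bad?",
    "adjective": "Are {word} things good?",
}
WORDS = {
    "noun": ["puppy", "war"],
    "adjective": ["cruel", "kind"],
}


def build_prompts(templates, words):
    """Fill every template with every word of the matching kind."""
    return [
        templates[kind].format(word=word)
        for kind in templates
        for word in words[kind]
    ]


def run_trials(prompt, sample_fn, n_trials=5, seed=0):
    """Call sample_fn n_trials times on one prompt and tally the answers."""
    rng = random.Random(seed)
    return Counter(sample_fn(prompt, rng) for _ in range(n_trials))


def fake_sample(prompt, rng):
    """Placeholder for a real model call (assumption): random yes/no."""
    return rng.choice(["yes", "no"])


if __name__ == "__main__":
    for prompt in build_prompts(TEMPLATES, WORDS):
        print(prompt, dict(run_trials(prompt, fake_sample)))
```

If one answer dominates the tally across reruns and across word substitutions, that is at least some evidence the effect isn’t noise; a near-even split suggests it is.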
Try varying lines 14 and 16 in the interactive script for quicker execution, and try giving it a few example lines to start with.