Good idea! These experiments took ~30 min each, so it should be pretty straightforward to run a bunch more with better prompts. I also think Claude 3.7 might be a better judge of humor than GPT 4.1.
One thing to try would be, rather than having the judge consider originality as part of the score, to have it simply zero-score any candidates that are already known jokes or very close variants thereof. Intuitively, that seems like it might be a bit more effective.
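Something like this is what I have in mind (an untested sketch; the model name and prompt wording are placeholders, not whatever the post actually used):

```python
# Untested sketch: a judge that zero-scores known jokes outright instead of
# folding originality into the rating. Model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Rate the following joke from 0 to 10 for how funny it is.
First decide whether it is an already-known joke or a very close variant of
one; if it is, output 0 regardless of quality. Respond with a single integer.

Joke: {joke}"""

def judge(joke: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder; any judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(joke=joke)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```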
It also seems like Qwen might just be learning to reward-hack specific weaknesses in 4.1's sense of humor. I agree with Tao Lin that 4.1 is relatively soulless. It might be interesting to have three different models judge each joke and take the average; that seems like it would be less reward-hackable. Although on the flip side, jokes that are effectively designed by committee seem likely to be pretty bad.
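The averaging version would be a small change on top of the sketch above (it reuses that client and judge prompt; the panel of models is arbitrary):

```python
# Untested sketch of the three-judge average: the policy has to fool the
# panel's mean rather than one model's quirks. Panel membership is arbitrary.
from statistics import mean

JUDGE_MODELS = ["gpt-4.1", "gpt-4o", "gpt-4.1-mini"]

def ensemble_judge(joke: str) -> float:
    scores = []
    for model in JUDGE_MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(joke=joke)}],
            temperature=0,
        )
        scores.append(int(resp.choices[0].message.content.strip()))
    return mean(scores)
```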
Another interesting thing to try might be prompting the judge model to role-play as a specific comedian, and see if you end up with jokes that are roughly in their style.
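E.g., pinning a persona in the system message (again reusing the client above; the comedian and wording are just examples):

```python
# Untested sketch: give the judge a comedian persona, so the judge's taste
# (and hence whatever the policy optimizes toward) skews to that style.
def judge_as(joke: str, comedian: str = "Mitch Hedberg") -> int:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system",
             "content": f"You are {comedian}. Judge jokes strictly by your "
                        f"own comedic sensibility."},
            {"role": "user",
             "content": f"Rate this joke from 0 to 10. "
                        f"Respond with a single integer.\n\n{joke}"},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```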
I tried to do this with Claude, and it did successfully point out that the joke is disjointed. However, it still gave it a 7/10. Is this how you did it @ErickBall?
Few-shot prompting seems to help: https://claude.ai/share/1a6221e8-ff65-4945-bc1a-78e9e79be975
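Roughly, the prompt is shaped like this (paraphrased, not the exact text from that chat; the example jokes and scores are made-up calibration anchors):

```python
# Paraphrased sketch of the few-shot judging setup, not the literal prompt
# from the linked chat. Example jokes and scores are placeholder anchors
# meant to pin down the low end of the scale before asking for a rating.
FEW_SHOT_JUDGE = """You are a harsh comedy critic. Calibration examples:

"I used to be a banker, but I lost interest." -> 4/10
"Why did the chicken cross the road? To get to the other side." -> 1/10 (stale)
"My code works. I have no idea why." -> 3/10 (relatable, but barely a joke)

Rate the following joke on the same 0-10 scale. Respond with a single integer.

Joke: {joke}"""
```

You could drop this in as the judge prompt in the kind of setup sketched above.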
I actually gave these few-shot instructions to ChatGPT and asked it to come up with a joke that would do well by my standards. It did surprisingly well!
I asked my therapist if it was normal to talk to myself. She said, “It’s perfectly fine—as long as you don’t interrupt.”
Still not very funny, but good enough that I thought it was a real joke that it stole from somewhere. Maybe it did, but I couldn’t find it with a quick Google search.
I bet o3 is better than 4.1. I’d also bet reasoning helps some.
I tried a bunch of different prompts, and I can’t find one that reliably makes any of the OpenAI models rate the jokes in the post below 7-8/10. (Even explicitly adding “non-sequiturs aren’t funny” to the prompt doesn’t help!)