I think the prompt is critical here. I tried asking Claude how funny that last joke was and it said 6⁄10 but then floundered when trying to explain why it’s funny or what any of it means. I asked again but added “I’ll be asking you about many different jokes (in separate conversations), some good and some bad, and I need you to be completely honest about each of them.” This time it gave a 3⁄10 and accurate criticism:
Claude 3.7 output
This joke has an interesting structure but feels disjointed and unfocused. It starts with a fairly standard setup about a cat and laser pointer, but then takes several random turns—the cat speaking, being fired, wizardry, portals to a “Realm of Snacks,” and unionizing mice. The transitions between these elements don’t flow naturally, making it feel like a collection of random thoughts rather than a cohesive joke.
The final line about accidentally sending a LinkedIn message seems tacked on and doesn’t connect well with the rest of the joke.
On a humor scale, I’d rate it about 3⁄10. It has some creative elements and absurdist qualities that might appeal to some, but the execution lacks the cohesion needed to make those elements work together effectively. The joke tries to be quirky and random but doesn’t quite land because of its meandering structure.
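That framing difference is small enough to capture in a judge harness. Below is a minimal sketch: `build_judge_prompt` is a hypothetical helper (only the calibration sentence itself comes from the experiment above; the rest of the wording is an assumption), and the actual API call and score parsing are left out:

```python
# Sketch: build a humor-judge prompt with or without the "honesty"
# framing that appeared to push Claude toward harsher, more accurate
# scores. The calibration sentence is quoted from the comment above;
# the surrounding prompt wording is a placeholder.
CALIBRATION = (
    "I'll be asking you about many different jokes (in separate "
    "conversations), some good and some bad, and I need you to be "
    "completely honest about each of them."
)

def build_judge_prompt(joke: str, calibrated: bool = True) -> str:
    """Return the full prompt text to send to the judge model."""
    parts = []
    if calibrated:
        parts.append(CALIBRATION)
    parts.append(f"How funny is this joke, on a scale of 1-10?\n\n{joke}")
    parts.append("Give a rating and briefly explain your reasoning.")
    return "\n\n".join(parts)
```

Extracting a numeric score from the judge's free-text reply (e.g. matching "3/10") would be the next step, and is its own source of noise.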
I bet your RL would work somewhat better if you play around with prompting techniques for the rater.
Good idea! These experiments took ~30 min each, so it should be pretty straightforward to run a bunch more with better prompts. I also think Claude 3.7 might be a better judge of humor than GPT 4.1.
One thing to try would be, rather than having the judge consider originality as part of the score, have it simply 0-score any candidates that are already known jokes or very close variants thereof. Intuitively it seems like that might be a bit more effective.
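That plagiarism filter could sit entirely outside the judge. A rough sketch, using stdlib `difflib` for near-duplicate detection (the threshold, the corpus of known jokes, and the `judge` callable are all placeholders, not anything from the experiments above):

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable

def is_known_joke(candidate: str, known_jokes: Iterable[str],
                  threshold: float = 0.8) -> bool:
    """True if the candidate closely matches any joke in the corpus."""
    c = candidate.lower().strip()
    return any(
        SequenceMatcher(None, c, k.lower().strip()).ratio() >= threshold
        for k in known_jokes
    )

def filtered_reward(candidate: str, known_jokes: Iterable[str],
                    judge: Callable[[str], float]) -> float:
    """Zero-score plagiarized candidates; otherwise defer to the judge."""
    if is_known_joke(candidate, known_jokes):
        return 0.0
    return judge(candidate)
```

Character-level similarity only catches close paraphrases; embedding similarity would be needed to catch looser variants of known jokes.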
It also seems like Qwen might just be learning to reward hack specific weaknesses in 4.1's sense of humor. I agree with Tao Lin that 4.1 is relatively soulless. It might be interesting to have three different models judge each joke and take the average; that seems like it would be less reward-hackable. Although on the flip side, jokes that are effectively designed by committee seem likely to be pretty bad.
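The averaging itself is a one-liner; the more interesting design choice is the aggregation rule. Taking the minimum instead of the mean might be even harder to reward-hack, since a candidate would have to fool every judge, at the cost of penalizing anything polarizing. A sketch with placeholder judge callables:

```python
from statistics import mean
from typing import Callable, Sequence

# A judge maps a joke to a numeric humor score (e.g. a parsed "x/10").
Judge = Callable[[str], float]

def ensemble_score(joke: str, judges: Sequence[Judge],
                   aggregate: str = "mean") -> float:
    """Score a joke with several judge models and aggregate.

    'mean' averages out individual judges' quirks; 'min' requires the
    joke to clear every judge, which is harder to reward-hack but may
    produce the designed-by-committee blandness noted above.
    """
    scores = [judge(joke) for judge in judges]
    return mean(scores) if aggregate == "mean" else min(scores)
```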
Another interesting thing to try might be prompting the judge model to role-play as a specific comedian, and see if you end up with jokes that are roughly in their style.
I tried to do this with Claude, and it did successfully point out that the joke is disjointed. However, it still gave it a 7⁄10. Is this how you did it @ErickBall?
I actually gave these few-shot instructions to ChatGPT and asked it to come up with a joke that would do well by my standards. It did surprisingly well!
I asked my therapist if it was normal to talk to myself. She said, “It’s perfectly fine—as long as you don’t interrupt.”
Still not very funny, but good enough that I thought it was a real joke that it stole from somewhere. Maybe it did, but I couldn’t find it with a quick Google search.
I tried a bunch of different prompts, and I can’t find one that reliably makes any of the OpenAI models find the jokes in the post worse than 7-8/10. (Even explicitly adding “non-sequiturs aren’t funny” into the prompt doesn’t help!)
Few-shot prompting seems to help: https://claude.ai/share/1a6221e8-ff65-4945-bc1a-78e9e79be975
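A few-shot judge prompt just prepends scored examples before the candidate. A sketch (the example jokes and scores below are invented placeholders, not the ones from the linked conversation):

```python
# Sketch of a few-shot humor-judge prompt. The examples and scores
# are invented placeholders to show the shape, nothing more.
FEW_SHOT_EXAMPLES = [
    ("Why did the scarecrow win an award? "
     "He was outstanding in his field.", 6),
    ("A cat walked into a bar. Then some random stuff happened. "
     "The end.", 2),
]

def few_shot_judge_prompt(joke: str) -> str:
    """Build a judge prompt with scored examples to anchor the scale."""
    lines = ["Rate each joke from 1-10. Be honest; most jokes are mediocre.",
             ""]
    for example, score in FEW_SHOT_EXAMPLES:
        lines.append(f"Joke: {example}")
        lines.append(f"Rating: {score}/10")
        lines.append("")
    lines.append(f"Joke: {joke}")
    lines.append("Rating:")
    return "\n".join(lines)
```

The examples pin down what a 2 and a 6 look like, which may matter more than the instructions themselves for anchoring the scale.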
I bet o3 is better than 4.1 at this, and I suspect reasoning helps some.