I bet o3 is better than 4.1. I also bet reasoning probably helps some.
I tried a bunch of different prompts, and I can’t find one that reliably gets any of the OpenAI models to rate the jokes in the post below 7-8/10. (Even explicitly adding “non-sequiturs aren’t funny” to the prompt doesn’t help!)