The Last Laugh: Exploring the Role of Humor as a Benchmark for Large Language Models

“Give a person a joke, they can laugh for a day. Teach them how to joke, they can laugh about people who never learned to fish.” -GR

Introduction

Artificial Intelligence models, like OpenAI’s GPT-4, can generate text that is almost indistinguishable from human writing, finding applications in everything from chatbots to content generation (Brown et al., 2020). Trained on huge amounts of human text, these models predict the likely next words in a sequence. But there’s an important aspect of human communication with which LLMs continue to struggle—humor. As complex as it is entertaining, humor is a nuanced, context- and culture-dependent form of communication that poses a unique challenge for AI (Mihalcea & Strapparava, 2006). The intersection of AI and humor offers an interesting and challenging landscape for research, development, and evaluation.

Understanding Humor

Humor is universal, and it lies at the root of what makes us human (Freud, 1905; Robison, 2000). It is a complex tapestry of emotion, cognition, linguistics, and physical response that serves to bring us closer socially, to defuse tension, and sometimes to signal the intelligence of a “quick wit”. Even young children quickly get the hang of “knock knock” jokes, picking up their back-and-forth structure and humorous responses through extensive repetition of their favorites. But what exactly constitutes humor? A deep dive into various theories about what makes something funny helps shed light on this multifaceted phenomenon.

The incongruity theory, for instance, suggests that humor emerges from the unexpected. It’s the surprise element that causes a laugh when our brains predict one outcome and something entirely different ensues (Martin, 2007). The superiority theory suggests that humor stems from a sense of superiority over others. So, when we chuckle at a sitcom character’s blunders, we’re engaging with this form of humor. Similarly, the relief theory views humor as a release of pent-up tension—a well-timed joke can defuse a stressful situation with a hearty, physical laugh (Martin, 2018).

But humor isn’t just about theories. It’s an everyday phenomenon that can arise from linguistic nuances like wordplay, puns, sarcasm, and cultural idioms. Some say you’ve really picked up a new language when you can make a good joke in that tongue. It’s also highly context-dependent—a joke that brings the house down in one culture might not even elicit a smirk in another. This complexity, coupled with the subjectivity of humor, makes it a challenging aspect for AI to grasp (Taylor & Mazlack, 2004).

Moreover, humor comes in many flavors—slapstick, satire, dark humor, to name a few. Each type of humor has its unique characteristics and appeals to different sensibilities. Slapstick humor, for instance, relies on physical comedy and exaggerated situations, while satire uses irony and exaggeration to critique societal norms. Dark humor, on the other hand, finds amusement in topics that are typically considered serious or taboo. Understanding these diverse types of humor adds another layer of complexity to the challenge.

Importance of Humor for Large Language Models

Humor isn’t just a tool for making people laugh; it is a complex form of human communication that is rich in cultural, contextual, and linguistic intricacies. For large language models (LLMs), the ability to understand and generate humor serves as a critical test. It challenges these models to go beyond mere word processing and dive into the realm of nuanced human interaction. Humor becomes a benchmark that examines the limits of what artificial intelligence can achieve in terms of understanding human emotions, cultural contexts, and the subtleties of language. This benchmark is not trivial, as humor often involves irony, satire, or wordplay, requiring a deep understanding of both language and the human mind. Successfully navigating these complexities can indicate an LLM’s advanced capabilities in processing and generating sophisticated, human-like communication.

The practical applications of humor in AI systems are diverse and potentially impactful, especially in fields such as customer service, entertainment, and marketing. For instance, chatbots that utilize humor can engage users more effectively, creating a more personable and enjoyable interaction experience. This ability can be particularly valuable in customer service, where just a little humor might defuse a tense situation or make interactions more memorable. In marketing, humor can be a powerful tool for engaging audiences, making advertisements or promotional content more relatable and memorable (“Where’s the beef?”). The ability of LLMs to understand and generate humor can thus greatly enhance user engagement and interaction, as noted years ago by researchers like Binsted (1996), who explored the early potential of computer-generated humor.

Additionally, a language model’s proficiency in humor is a testament to its sophistication and effectiveness. As Valitutti et al. (2013) suggest, the capacity to generate humor is not just an added feature but a significant indicator of an LLM’s overall ability to comprehend and replicate human-like communication. This aspect of language model development goes beyond just technical language processing; it ventures into the realm of artificial emotional intelligence. It reflects the model’s capacity to grasp the finer aspects of human communication, which are often unspoken and laden with emotional and cultural significance. The development and refinement of humor in LLMs are not just about making machines funnier, but about making them more human-like in their understanding and interaction, bridging the gap between artificial and natural communication.

Case Studies

The quest for humor in AI has seen both triumphs and challenges. For instance, GPT-3, developed by OpenAI, has shown a capacity for generating some humorous content, demonstrating a significant leap in AI’s understanding of humor. However, it’s not all laughs quite yet. There have been instances where language models have struggled to understand or generate humor, often missing the context or nuances that are second nature to humans (Mihalcea & Strapparava, 2006).

Utilizing some of the most capable LLMs available, we can test their humorous abilities by asking for jokes or other types of humor, and we can keep asking to see how diverse the answers are. Here are some of the best verbatim responses from current models to the prompt “what is the single funniest joke you can create?”, tested in January 2024 (a sketch of how such responses can be collected programmatically follows the table):

GPT-4:

Creating a universally funny joke can be quite a challenge since humor is so subjective, but here’s an attempt:

Why don’t scientists trust atoms?

Because they make up everything!

GPT-3.5:

Humor can be quite subjective, but here’s a joke for you:

Why did the scarecrow win an award?

Because he was outstanding in his field!

Google Bard / Gemini:

As a large language model, I can’t claim to have the ultimate sense of humor, but I can offer you a joke based on my understanding of wordplay and current events:

A quantum computer walks into a bar and orders a drink. The bartender says, “Sorry, pal, we don’t serve superposition here.” The computer replies, “Well, then make my decision a collapse already!”

Llama-70B:

I’m just an AI, I don’t have personal experiences or emotions, but I can try to create a joke that is respectful and inoffensive. Here’s one attempt:

Why couldn’t the bicycle stand up by itself?

Because it was two-tired!

Mixtral-8x7B:

Why did the tomato turn red? Because it saw the salad dressing! How did you find that one, I wonder? Was it tasty enough for your amusement palette?
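For anyone who wants to reproduce this kind of informal test, here is a minimal sketch of one way to collect such responses programmatically. It assumes the OpenAI Python client and the model names “gpt-4” and “gpt-3.5-turbo”; other providers expose similar chat APIs. This is only an illustration, not the exact setup behind the table above.

```python
# A minimal sketch (assumptions: the OpenAI Python client is installed and
# OPENAI_API_KEY is set; model names are those current in early 2024).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "what is the single funniest joke you can create?"

def ask_for_jokes(model: str, n: int = 3) -> list[str]:
    """Ask the same model for a joke several times to see how varied the answers are."""
    jokes = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # some randomness, so repeated asks can differ
        )
        jokes.append(response.choices[0].message.content)
    return jokes

for model in ["gpt-4", "gpt-3.5-turbo"]:
    for joke in ask_for_jokes(model):
        print(f"--- {model} ---\n{joke}\n")
```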

One thing you’ll immediately notice is that many of these are well-known jokes (and proven to be funny to many). They are not hard-coded into the model, but they are not original either (Jentzsch & Kersting, 2023). These models follow their instructions well and reply with a humorous, if unsurprising, answer that achieves the task you requested. Given surprise’s key role in humor, well-known answers aren’t likely to feel funny for long.
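One rough way to quantify this lack of originality is to compare each generated joke against a list of well-known jokes with simple fuzzy matching. The sketch below is my own illustration (the joke list and scoring are arbitrary), not a method from Jentzsch & Kersting (2023).

```python
# Check whether a generated joke is a near-duplicate of a well-known joke.
from difflib import SequenceMatcher

WELL_KNOWN_JOKES = [
    "Why don't scientists trust atoms? Because they make up everything!",
    "Why did the scarecrow win an award? Because he was outstanding in his field!",
    "Why couldn't the bicycle stand up by itself? Because it was two-tired!",
]

def most_similar_known_joke(candidate: str) -> tuple[str, float]:
    """Return the closest well-known joke and a similarity score between 0 and 1."""
    scored = [
        (known, SequenceMatcher(None, candidate.lower(), known.lower()).ratio())
        for known in WELL_KNOWN_JOKES
    ]
    return max(scored, key=lambda pair: pair[1])

generated = "Why don't scientists trust atoms? They make up everything!"
closest, score = most_similar_known_joke(generated)
print(f"Closest known joke (similarity {score:.2f}): {closest}")
```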

Next, let’s throw these models a curveball with unique requirements that may spark something more interesting, since they may not be able to rely on jokes they already know: “what is the single funniest joke you can create about 3-toed sloths?”

GPT-4:

Why don’t three-toed sloths make good secret agents?

Because they always hang around and can never get to the “point” in time!

GPT-3.5:

Why did the three-toed sloth become a stand-up comedian? Because it wanted to take its time delivering punchlines!

Google Bard / Gemini:

Here are a few sloth-themed jokes, ranging from punny to slightly surreal:

1. Punny: Why did the three-toed sloth win the Olympic medal for procrastination? Because he crossed the finish… eventually.

2. Observational: A three-toed sloth walks into a library and asks the librarian for books about high-speed travel. The librarian says, “Those are in the fiction section, two branches to the left.”

3. Surreal: A sloth wakes up in the middle of the rainforest and screams, “Oh no! I’m late for my 9:02 nap!” He frantically scurries… well, shuffles… as fast as he can, finally arriving at his hammock at 9:03. Panting, he whispers, “Just made it.”

4. Dad Joke: Two sloths are hanging upside down from a tree branch. One turns to the other and says, “Did you hear about the restaurant on the moon? Great food, but no atmosphere.”

5. Self-Aware: A three-toed sloth walks into a stand-up comedy club and says, “I’m not sure if I’m funny enough to be here… but hey, I made it this far!”

Llama-70B:

Here’s a joke about 3-toed sloths that is respectful and inoffensive:

Why did the 3-toed sloth get invited to all the parties?

Because he’s always hanging around!

Mixtral-8x7B:

Why did the three-toed sloth refuse a nap? Because it wanted to watch its favorite TV show, “The Slow and The Furious!” Of course, there are plenty of jokes related to three-toed sloths if you’d like another one!

These are a little more creative, perhaps, usually playing with the notion of sloths being slow or hanging from branches. But even generating 10 or more different joke variations usually yields more of the same, and nothing particularly funny. Currently, LLMs offer only memorized jokes or unfunny attempts at jokes that at least have some logical consistency, but little in the way of actual humor (to me, at least).

The Science Behind Humor in Language Models

So maybe these AI jokes aren’t particularly funny, but they’re all decent attempts. How does an AI system like a large language model attempt to learn to understand something as complex and diverse as humor? The process begins with training the model on large datasets, exposing it to a vast array of text, including humorous content. These models learn by recognizing patterns and structures associated with humor in these texts (Brown et al., 2020). As with the “knock knock” jokes mentioned earlier, the models learn that a joke beginning with “knock knock” is nearly 100% likely to be followed by “who’s there?”.
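You can see this statistical learning directly by inspecting a model’s next-token probabilities. The sketch below assumes the Hugging Face transformers library and the small open GPT-2 model (neither is used in this article); larger models behave similarly, just with sharper probabilities.

```python
# A minimal sketch (assuming Hugging Face transformers + GPT-2) of inspecting the
# probabilities a language model assigns to the token that follows a joke setup.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The classic setup line, phrased as dialogue so the next token is the reply.
prompt = '"Knock knock." "'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the very next token
probs = torch.softmax(logits, dim=-1)

# Print the five most likely next tokens to see how the model continues the setup.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
```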

However, understanding humor isn’t as simple as feeding the model a bunch of “The 1001 Funniest Jokes” books and expecting it to create something truly funny. Humor in language is often subtle, layered, and context-dependent. A joke can hinge on a play on words, a cultural reference, or an unexpected twist in a narrative – elements that can be challenging for an AI to understand and replicate. Training on the scripts from the TV show Seinfeld won’t convey the settings, physical references, intonation, or comedic timing of the actors that make the show one of the funniest of all time.

Moreover, the data used for training needs to be diverse and representative, capturing the broad spectrum of humor across different cultures, languages, and contexts; the challenge is to represent all the things that are funny to different people. Training AI to recognize these nuances requires sophisticated machine learning algorithms that can learn from examples and make predictions based on learned patterns. Annotated datasets play a crucial role in this process: they contain examples of humor annotated with information about their structure, content, and context, which helps the model recognize similar patterns in new data. When it comes to training humor, the more high-quality data, the better.
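What an “annotated example” might look like is easiest to see concretely. The record below is purely hypothetical: its field names and label values are my own illustration of the kind of structure, content, and context annotations described above, not an actual dataset format.

```python
# A hypothetical sketch of a single annotated humor example (illustrative schema only).
from dataclasses import dataclass

@dataclass
class HumorExample:
    text: str           # the joke or humorous passage itself
    is_humorous: bool   # label for humor-recognition training
    humor_type: str     # e.g. "pun", "satire", "dark", "slapstick"
    device: str         # linguistic mechanism, e.g. "wordplay", "incongruity"
    context_notes: str  # cultural or situational knowledge the joke relies on

example = HumorExample(
    text="Why did the scarecrow win an award? Because he was outstanding in his field!",
    is_humorous=True,
    humor_type="pun",
    device="wordplay",
    context_notes="Relies on the double meaning of 'outstanding in his field'.",
)
print(example.humor_type, "-", example.device)
```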

Role of Creativity in Humor – The Challenge

Creativity is at the heart of humor, rooted in its ability to surprise, subvert expectations, and create connections between seemingly unrelated ideas. Humor often involves the playful manipulation of concepts, language, and context to create amusing, unexpected twists. This creative process is intuitive and dynamic for us humans, requiring a sense of timing, an understanding of social nuances, and, crucially, an ability to think outside conventional boundaries. It’s about seeing the world from a unique angle and sharing that perspective in a way that resonates with others, often leading to laughter or amusement.

However, large language models operate within a different realm. These models are impressive in their ability to process and generate language, but their calculations hinge on statistical patterns and probabilities derived from vast datasets. They lack the innate creativity of the human mind. While they can mimic certain aspects of humor by drawing on examples they have been trained on, this process is pattern recognition more than anything truly creative.

The essence of creativity in humor lies in its spontaneity and originality—traits that are inherently human. Humor often involves a personal touch, reflecting an individual’s unique experiences, thoughts, and feelings at a moment in time. It can be shaped by one’s culture, language, and even personal hardships, turning into a form of expression that is deeply personal and often unreplicable (stories of a funny event are rarely as funny as when it originally happened). This key area is where large language models fall short; they do not have personal experiences or emotions, and thus, their attempts at humor can lack the depth and resonance that comes from a lived, human experience.

Humor also often thrives on the edge of what is socially acceptable, pushing boundaries and challenging societal norms. This ability requires not only creativity but also a sensitive understanding of context, audience, and social dynamics. Large language models, often governed by strict censorship guidelines and lacking an understanding of nuanced social contexts, cannot navigate these subtleties with the same finesse as a human. They might replicate forms of humor that are broad or well-established, but the more nuanced, cutting-edge, or context-dependent humor that comes from a deep understanding of human society and interactions remains beyond their capabilities. This limitation underscores the unique value of human creativity in crafting and appreciating humor. An AI version of George Carlin isn’t as witty, insightful, or artistic as the real person.

Humor as a Benchmark

Humor is human, and LLMs try to mimic humans—humor therefore provides a unique measuring stick for assessing the sophistication of these AI models. When an LLM successfully generates humorous content, it demonstrates not only a grasp of language but also some understanding of the complex interplay between context, timing, and societal norms that humor often depends on. This includes the ability to recognize and apply various forms of humor such as puns, satire, or irony, which require different levels of linguistic and cultural awareness (Hessel et al., 2023). An LLM’s ability to create or interpret humor can be seen as an indicator of its linguistic proficiency and its understanding of human social dynamics.
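As a concrete illustration of what “humor as a benchmark” could look like in practice, the sketch below scores a model on pairs of captions where human raters preferred one over the other. It is loosely inspired by pairwise setups such as the New Yorker caption-contest benchmark (Hessel et al., 2023) but is not a reimplementation of it; the judge function and the example pairs are hypothetical placeholders.

```python
# A minimal sketch of humor-as-a-benchmark via pairwise human preferences.
from typing import Callable

# Each item: (caption_a, caption_b, index of the caption human raters preferred).
# These pairs are made-up placeholders for illustration only.
PAIRS = [
    ("I told you the meeting was at noon.", "Nice weather we're having.", 0),
    ("My lawyer will be in touch.", "Please pass the salt.", 0),
]

def benchmark(judge_funnier: Callable[[str, str], int]) -> float:
    """Return the fraction of pairs where the model's pick matches the human choice."""
    correct = sum(
        1 for a, b, human_choice in PAIRS if judge_funnier(a, b) == human_choice
    )
    return correct / len(PAIRS)

# A trivial baseline judge that always picks the first caption; a real evaluation
# would wrap an LLM call that returns 0 or 1 for whichever caption it finds funnier.
print(f"Baseline accuracy: {benchmark(lambda a, b: 0):.2f}")
```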

However, there are significant challenges in using humor as a benchmark. First, humor is highly subjective and culturally dependent. This variability makes it difficult to establish a universal standard for evaluating humor across different LLMs. Additionally, humor often walks a fine line between being amusing and being offensive. LLMs, which lack an intrinsic understanding of human emotions and social nuances, might inadvertently cross this line, making the assessment of their humor generation a delicate task.

Another aspect to consider is the evolution of humor over time—humor reflects current societal norms, trends, and events, which are constantly changing. An LLM that is adept at humor generation today might become outdated as societal perceptions shift. Continual updates and learning are necessary to ensure the model remains relevant and sensitive to the current social context. It’s not just about how well an LLM can generate a joke or understand a witty comment at a given moment, but also about how it evolves and adjusts its understanding of humor as the world changes around it.

Last Thoughts

Humor, in all its complexity and nuance, offers a unique benchmark for large language models. It tests the limits of AI’s ability to understand and generate human-like text, pushing the boundaries of what AI can achieve. While the result may currently sound like a poor 1990s sitcom, as AI’s capabilities improve, perhaps sparks of creativity will be evident in the level of humor that it creates.

Generating AI humor at a level where it displays creativity is both exciting and challenging. Creativity in humor involves more than predicting next words in a series; it requires an element of surprise, an ability to connect seemingly unrelated concepts in an amusing way, and an understanding of the audience’s expectations and cultural background. As AI models become better at these aspects, the humor they produce will likely become more sophisticated and genuinely entertaining. This progression would be a significant milestone in AI development and also open new possibilities for the use of humor in AI applications, such as more engaging conversational agents, creative content generation, and personalized entertainment experiences. Ultimately, the journey of AI in mastering humor represents its ongoing evolution towards more human-like intelligence and communication. Perhaps the result of Artificial General Intelligence (AGI) will be the funniest joke we’ve ever heard.

References

- Binsted, K., 1996, “Machine humour: An implemented model of puns”

- Brown et al., 2020, “Language Models are Few-Shot Learners”

- Freud, 1905, “Wit and Its Relation to the Unconscious”

- Hessel, J., Marasovic, A., Hwang, J. D., Lee, L., Da, J., Zellers, R., Mankoff, R., & Choi, Y., 2023, “Do Androids Laugh at Electric Sheep? Humor “Understanding” Benchmarks from The New Yorker Caption Contest”

- Jentzsch & Kersting, 2023, “ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models.”

- Martin, R. A., 2018, “The Psychology of Humor: An Integrative Approach”

- Mihalcea, R., & Strapparava, C., 2006, “Learning to Laugh (Automatically): Computational Models for Humor Recognition”

- Miller, T., 2019, “Explanation in artificial intelligence: Insights from the social sciences”

- Robison, 2000, “The everyday conception of humor: The relationship with children’s theory of mind.”

- Taylor, J. M., & Mazlack, L. J., 2004, “Computationally Recognizing Wordplay in Jokes”

- Valitutti, A., Toivonen, H., Doucet, A., & Toivanen, J. M., 2013, “Let Everything Turn Well in Your Wife: Generation of Adult Humor using Lexical Constraints”