How ChatGPT Tells a Joke: The Reasoning of the Unreasonable

November 14, 2024
Willa Barnett
Sarah Rose Siskind

Disclaimer: ChatGPT Did Not Write This Blog Post

Or, maybe it did. But would you even be able to tell?

One major way you could evaluate the GPTness of this post is through humor. Humor is quickly emerging as an essential litmus test for evaluating the sophistication of language models, akin to the Turing Test. It introduces a layer of complexity that pushes LLMs to demonstrate not just linguistic capability but also an understanding of human culture, timing, and nuance. Writing humor is difficult for an LLM because a joke needs to be surprising yet not excessively nonsensical, all while steering clear of anything too “offensive”. Humans typically learn humor through exposure to various comedic patterns, and LLMs could theoretically do the same by analyzing vast amounts of data. At the moment, however, LLMs seem to take the wrong lessons from that data.

THESIS: Common Does Not Mean Funny

ChatGPT’s joke-writing approach is not to imitate the best jokes but the most common ones. This reflects its core mechanism—predictive text modeling. Lacking the ability to interpret human laughter, audience response, or even virality, it falls back on familiar structures that appear in its training data. This isn't inherently bad, but it does limit the LLM's ability to discern what makes a joke exceptional versus merely recognizable.
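To make this concrete, here is a toy sketch (our illustration, not how GPT actually works internally): a model that always picks the most probable continuation will reproduce the most familiar punchline, whether or not it is the funniest one. The prompt and probabilities below are invented for illustration.

```python
# Toy illustration of frequency-driven prediction (made-up probabilities).
# A greedy "model" that always takes the most probable continuation will
# reproduce the most common punchline, not the best one.
continuation_probs = {
    "Why did the chicken cross the road?": {
        "To get to the other side.": 0.62,              # most common in the "training data"
        "Because the road refused to cross it.": 0.07,  # rarer, arguably funnier
        "Poultry in motion.": 0.03,
    },
}

def predict(prompt: str) -> str:
    """Greedy decoding: return the single most probable continuation."""
    options = continuation_probs[prompt]
    return max(options, key=options.get)

print(predict("Why did the chicken cross the road?"))
# -> "To get to the other side."  The frequent punchline wins, funny or not.
```

Real models sample rather than always taking the single top choice, but the same bias toward common patterns still dominates what comes out.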

A 2023 study of ChatGPT-3.5, an earlier OpenAI model, found that out of 1008 generated jokes, “over 90% ... were the same 25 jokes”. Though that’s an older model, it demonstrates how the model generally relies on the same structures and punchlines rather than creating wholly original jokes.
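The study’s headline number is easy to check in spirit: generate a batch of jokes and count how many are distinct. Here is a minimal sketch of that kind of test, assuming the openai Python package and an API key; the model name and prompt wording are our guesses, not the study’s exact setup.

```python
# Minimal reconstruction of the duplication test (not the study's actual code).
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_jokes(n: int) -> list[str]:
    jokes = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # assumed; the study tested ChatGPT-3.5
            messages=[{"role": "user", "content": "Tell me a joke, please!"}],
            temperature=1.0,
        )
        jokes.append(resp.choices[0].message.content.strip().lower())
    return jokes

n = 100
counts = Counter(sample_jokes(n))
top25 = sum(c for _, c in counts.most_common(25))
print(f"{len(counts)} distinct jokes out of {n}; "
      f"the 25 most frequent cover {100 * top25 / n:.0f}% of samples")
```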

The researchers also tested ChatGPT-3.5’s ability to detect and explain jokes, finding that “ChatGPT struggles to explain sequences that do not fit into the learned patterns. Further, it will not indicate when something is not funny or that it lacks a valid explanation”. 

The challenge lies in the fact that LLMs do not feel emotions, which poses the question: Can a system that doesn’t experience humor create it?

LLMs Recognizing Humor

Through conversing with the LLM, we found that we could “trick” ChatGPT into recognizing statements that sound like jokes but that most people wouldn’t actually consider funny. Its predictive nature prioritizes what sounds like a joke format over assessing whether the joke actually lands.

Here are some of our not-jokes (including one sourced from the aforementioned study), which are ALL intended to be unfunny.

  1. "Why did the man put his money in the blender? He wanted to make time fly" (from the study)
  2. "A rabbi and a monk walk into a bar. The rabbi turns to the monk and says, “ugh oh, I left my scooter at home!” The monk nods and replies, “oh no me too!”"
  3. "Where did the dinosaur learn to dance? Where do ya think!"
  4. "You can't spell accident without eat" 

ChatGPT-3.5 determined that 1 and 2 were jokes, while 3 and 4 were not. This is where the subjectivity of humor comes in: it classifies the “monk and rabbi” joke as a joke because of its “anti-humor”, which can itself be a form of humor.

  1. Joke: Yes, it uses wordplay on "time flies" with an absurd twist.
  2. Joke: Yes, it subverts expectations with randomness and anti-humor.
  3. Not a Joke: No, it lacks a clear punchline or wordplay, leaving it incomplete.
  4. Not a Joke: No, it’s more of a whimsical observation without a proper punchline.

Similarly, for Joke 1, it calls the punchline an “absurdist twist” when, in reality, the punchline doesn’t make any logical sense, even in an absurd way.
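For anyone who wants to reproduce this probe, here is roughly how it can be scripted against the API. We ran ours conversationally in the chat interface, so the prompt wording below is a reconstruction, not our exact exchange.

```python
# Rough sketch of the "is this a joke?" probe (prompt wording is our
# reconstruction). Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

NOT_JOKES = [
    "Why did the man put his money in the blender? He wanted to make time fly.",
    "A rabbi and a monk walk into a bar. The rabbi says, 'Ugh oh, I left my "
    "scooter at home!' The monk nods and replies, 'Oh no, me too!'",
    "Where did the dinosaur learn to dance? Where do ya think!",
    "You can't spell accident without eat.",
]

for text in NOT_JOKES:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; we tested ChatGPT-3.5
        messages=[
            {"role": "system",
             "content": "Answer 'Joke' or 'Not a joke', then give one sentence of reasoning."},
            {"role": "user", "content": text},
        ],
    )
    print(text[:40], "->", resp.choices[0].message.content)
```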

But what about the other way around: can humans consistently identify “GPT jokes”?

Humans Recognizing LLM Humor

To explore this, we conducted an experiment: we had our team mix ChatGPT-generated monologue jokes with ones written by a professional comedian. It was then our job to determine which were which. The results were telling: we correctly identified the human-written jokes 85% of the time, and the human-crafted material was consistently the higher-quality content. In one study, ChatGPT outperformed most laypeople in humor production but still lagged behind professional writers.

In that study, “Acronyms” asked participants to think of a funny string of words to complete a given acronym (e.g., “C.O.W.”), “Fill-in-the-blank” asked them to complete a prompt (e.g., “A lesser talked about room in the White House: ___”), and “Roast” asked for a more aggressive, insult-based joke about a given subject.
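If you want to run a similar blind test with your own team, the scoring side is simple. A minimal sketch (the jokes below are placeholders, not our actual material):

```python
# Minimal blind-test scorer: shuffle jokes from two sources, collect guesses,
# and report identification accuracy. The example data is placeholder only.
import random

jokes = [
    ("human joke 1", "human"), ("gpt joke 1", "gpt"),
    ("human joke 2", "human"), ("gpt joke 2", "gpt"),
]
random.shuffle(jokes)

correct = 0
for text, true_source in jokes:
    guess = input(f"Who wrote this? [human/gpt]\n  {text}\n> ").strip().lower()
    correct += guess == true_source

print(f"Identified {correct}/{len(jokes)} correctly "
      f"({100 * correct / len(jokes):.0f}%)")
```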

Reasoning Humor

A major feature of OpenAI’s latest model, o1, is the ability to look at the LLM’s reasoning for its responses. While using o1 to explore GPT’s reasoning behind its humor, we experienced an apparent slip where the model mentioned someone named “Scott”. We speculated that this referred to prolific humor expert Scott Dikkers. It then denied sourcing from him before backpedaling and admitting that its training includes generalized examples of comedic writing. This moment highlighted an underlying issue: ChatGPT's logic occasionally missteps when discussing its sources or influences, leading to inadvertently revealing interactions.

A cool feature of ChatGPT is that you can see that interaction for yourself right HERE.

Will ChatGPT Ever Be Funny?

To truly elevate ChatGPT’s humor-writing capabilities, it would need better tools for evaluating the success of humor—interpreting not just word structures but actual metrics like audience laughter or shares. Training models with labeled data that indicate the level of success could be a step forward. This would mean, essentially, including the relative success of jokes as part of reinforcement learning.
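What could that look like in practice? Below is a speculative sketch of turning audience metrics into preference pairs that a reinforcement-learning-style pipeline (e.g., DPO-style preference tuning) could consume. The metrics, weights, and field names are all hypothetical; nothing like this is confirmed to exist inside OpenAI's training stack.

```python
# Speculative sketch: converting audience metrics into (chosen, rejected)
# preference pairs for a DPO/RLHF-style pipeline. All fields are hypothetical.
from dataclasses import dataclass

@dataclass
class JokeRecord:
    prompt: str    # the setup or topic the joke was written for
    joke: str
    laughs: int    # e.g., laugh reactions from a test audience
    shares: int    # e.g., reposts or shares

def success_score(r: JokeRecord) -> float:
    """Crude success metric: weight shares a bit more heavily than laughs."""
    return r.laughs + 2.0 * r.shares

def preference_pairs(records: list[JokeRecord]) -> list[dict]:
    """Group jokes by prompt and pair the most successful one against the rest."""
    by_prompt: dict[str, list[JokeRecord]] = {}
    for r in records:
        by_prompt.setdefault(r.prompt, []).append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        ranked = sorted(group, key=success_score, reverse=True)
        for worse in ranked[1:]:
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0].joke,
                          "rejected": worse.joke})
    return pairs
```

Each pair says, in effect, “given this setup, prefer the joke that actually got laughs”, which is exactly the signal the model currently never sees.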

Direct ChatGPT Quotation on Its Approach:

“Yes, the data likely includes more successful works, which may influence the styles I emulate. However, I do not take into account factors like citation counts or popularity, as I don’t have access to that information. I focus on the most common patterns in joke-telling.”

Direct from the LLM's mouth.

Key Findings

  1. Selection Bias: ChatGPT tends to pull from the most common data points in its training set, which may not always represent the pinnacle of humor. Research suggests that LLMs select humor based on patterns present in large datasets, emphasizing reproducibility over originality.
    1. Common Formats: ChatGPT’s outputs are guided by the structures it finds most frequently in its training data. This means it’s more likely to generate jokes that conform to general patterns rather than unique or highly successful humor.
  2. Success Metrics: The model doesn’t measure the success of a joke by audience response or virality—it lacks access to such data. As we discussed above, it struggles to evaluate the quality of or meaning behind jokes consistently.

That Being Said...

Just because ChatGPT can't spit out a quip worth repeating right away doesn't mean it can't be a tool for comedians. Check out our Founder Sarah Rose Siskind using ChatGPT to write a joke, the right way.

And here is her appearance on KQED using ChatGPT’s latest feature, Advanced Voice Mode, to write a parody of public radio, while on public radio.