Or, maybe it did. But would you even be able to tell?
One major way you could evaluate the GPTness of this post is through humor. Humor is quickly emerging as an essential litmus test for the sophistication of language models, akin to the Turing Test. It pushes LLMs to demonstrate not just linguistic ability but also an understanding of human culture, timing, and nuance. Writing humor is hard for an LLM because a joke needs to be surprising, yet not excessively nonsensical, while also steering clear of anything too “offensive”. Humans typically learn humor through exposure to a variety of comedic patterns, and LLMs could theoretically do the same by analyzing vast amounts of data. At the moment, however, LLMs seem to take the wrong lessons from that data.
ChatGPT’s joke-writing approach is not to imitate the best jokes but the most common ones. This reflects its core mechanism—predictive text modeling. Lacking the ability to interpret human laughter, audience response, or even virality, it falls back on familiar structures that appear in its training data. This isn't inherently bad, but it does limit the LLM's ability to discern what makes a joke exceptional versus merely recognizable.
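To make that concrete, here is a minimal toy sketch in Python of why frequency-weighted next-token prediction keeps surfacing the most common continuation. The punchlines and counts are invented for illustration; this is not real model code or real training data.

```python
import random

# Toy illustration: a model that samples continuations in proportion to
# how often they appeared in training will overwhelmingly reproduce the
# most common punchline. Counts below are invented for illustration.
continuations = {
    "the scarecrow, because he was outstanding in his field": 95,
    "a genuinely novel punchline nobody has written before": 5,
}

def sample_punchline(counts: dict) -> str:
    # Weighted random choice: probability is proportional to frequency.
    return random.choices(list(counts), weights=list(counts.values()), k=1)[0]

# Roughly 95% of samples will be the familiar punchline.
print(sample_punchline(continuations))
```

Nothing in this loop ever asks whether the punchline is funny; it only asks whether it is likely.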
A 2023 study of ChatGPT-3.5, an earlier OpenAI model, found that out of 1,008 generated jokes, “over 90% ... were the same 25 jokes”. Though that’s an older model, the result demonstrates how the model relies on the same structures and punchlines rather than inventing wholly original jokes.
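For a sense of how that kind of duplication gets measured, here is a hedged sketch of the counting step. The canned joke pool and the skewed sampling weights are our stand-ins for real ChatGPT output, not the study’s actual code.

```python
import random
from collections import Counter

# Invented stand-ins for real model outputs; a real run would call the API.
CANNED_JOKES = [
    "why did the scarecrow win an award? because he was outstanding in his field.",
    "why don't scientists trust atoms? because they make up everything.",
    "a rare, genuinely original joke.",
]

def generate_joke() -> str:
    # Heavily skewed toward repeats, mimicking the behavior the study reports.
    return random.choices(CANNED_JOKES, weights=[50, 45, 5], k=1)[0]

# Normalize (strip/lowercase) so trivial variants count as the same joke.
counts = Counter(generate_joke().strip().lower() for _ in range(1008))
top_25_share = sum(n for _, n in counts.most_common(25)) / sum(counts.values())
print(f"Top 25 jokes account for {top_25_share:.0%} of all outputs")
```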
The researchers also tested ChatGPT-3.5’s ability to detect and explain jokes, finding that “ChatGPT struggles to explain sequences that do not fit into the learned patterns. Further, it will not indicate when something is not funny or that it lacks a valid explanation”.
The challenge lies in the fact that LLMs do not feel emotions, which raises the question: can a system that doesn’t experience humor create it?
LLMs Recognizing Humor
Through conversing with the LLM, we found that we could “trick” ChatGPT into recognizing statements that sound like jokes but that most people wouldn’t actually consider funny. Its predictive nature prioritizes whether something sounds like a joke format over whether the joke actually lands.
Here are some of our not-jokes (including one sourced from the aforementioned study), which are ALL intended to be unfunny.
ChatGPT-3.5 determined that 1 and 2 were jokes, while 3 and 4 were not. This is where the subjectivity of humor comes in: it classified the “monk and rabbi” statement as a joke because of its “anti-humor”, which can itself be a form of humor.
Similarly, for Joke 1, it called the punchline an “absurdist twist” when really the line doesn’t make any logical sense, even in an absurd way.
But how about the other way around: can humans identify “GPT jokes” consistently?
To explore this, we conducted an experiment: our team mixed ChatGPT-generated monologue jokes with ones written by a professional comedian, and it was then our job to determine which were which. The results were telling: we correctly identified the human-written jokes 85% of the time. The correlation between human-crafted humor and higher-quality content was apparent. In one study, ChatGPT outperformed most laypeople in humor production but still lagged behind professional writers.
In that study, “Acronyms” asked participants to think of a funny string of words to complete a given acronym (e.g., “C.O.W.”), “Fill-in-the-blank” asked them to complete a prompt (e.g., “A lesser talked about room in the White House: ___”), and “Roast” called for a more aggressive, insult-based joke about a given subject.
A major feature of OpenAI’s latest model, o1, is the ability to look at the LLM’s reasoning behind its responses. While using o1 to explore the reasoning behind its humor, we noticed an apparent slip in which the model mentioned someone named “Scott”. We speculated that this referred to prolific humor expert Scott Dikkers. The model then denied sourcing from him before backpedaling and admitting that its training includes generalized examples of comedic writing. This moment highlighted an underlying issue: ChatGPT’s logic occasionally missteps when discussing its sources or influences, sometimes revealing more than intended.
A cool feature of ChatGPT is that you can see that interaction for yourself right HERE.
To truly elevate ChatGPT’s humor-writing capabilities, it would need better tools for evaluating whether humor succeeds: interpreting not just word structures but actual signals like audience laughter or shares. Training models with labeled data that indicate each joke’s level of success could be a step forward. This would mean, essentially, folding the relative success of jokes into reinforcement learning.
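As a rough sketch of what that could look like, here is one way to fold audience signals into a scalar reward. The field names and weights are our assumptions for illustration, not anything OpenAI has described.

```python
from dataclasses import dataclass

@dataclass
class AudienceSignal:
    laughs: int  # e.g., laugh reactions in a live transcript
    shares: int  # e.g., reposts of the joke
    views: int   # exposure, used to normalize the other counts

def joke_reward(signal: AudienceSignal) -> float:
    """Normalize engagement by exposure so sheer reach alone doesn't dominate."""
    if signal.views == 0:
        return 0.0
    laugh_rate = signal.laughs / signal.views
    share_rate = signal.shares / signal.views
    # The weighting here is arbitrary; a real system would tune or learn it.
    return 0.7 * laugh_rate + 0.3 * share_rate

# This score could then label joke completions during fine-tuning,
# e.g., as the reward signal in an RLHF-style pipeline.
print(joke_reward(AudienceSignal(laughs=120, shares=15, views=1000)))
```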
Direct ChatGPT Quotation on Its Approach:
“Yes, the data likely includes more successful works, which may influence the styles I emulate. However, I do not take into account factors like citation counts or popularity, as I don’t have access to that information. I focus on the most common patterns in joke-telling.”
Straight from the LLM’s mouth.
Key Findings
Just because ChatGPT can’t spit out a quip worth repeating on the first try doesn’t mean it can’t be a tool for comedians. Check out our Founder Sarah Rose Siskind using ChatGPT to write a joke, the right way.
And here is her appearance on KQED, where she uses ChatGPT’s latest feature, Advanced Voice Mode, to write a parody of public radio while on public radio.