#0174: ChatGPT, Claude, DeepSeek vs Creativity (part 3)

Part 3 in the ongoing discussion about the 'creativity' of LLMs

Tags: braingasm, ai, generative, genai, creativity, gen-ai, test, dat, 2025

nishant-jain-zqLjf0ozkrA-unsplash.png Photo by Nishant Jain on Unsplash

2-out-of-5-hats.png [ED: This is the third in an ongoing series about LLMs and creativity. This version returns to LLMs and creativity, comparing the latest from OpenAI and Anthropic with the new kid on the block, DeepSeek: 2/5 hats.]

Context

There is an ongoing debate about how creative LLMs can be. One side of the debate suggests that LLMs are just stochastic parrots, recycling existing content into new variations of things we’ve already seen. The other side argues that there might be a little more going on beneath the surface.

I go back and forth on this issue. My instincts tell me we are looking at a highly impressive parlour trick: statistical remixing on steroids. Yet, when I listen to Geoffrey Hinton express his concerns about LLMs being deceptive during his discussion with Curt Jaimungal, I can’t help but take notice.

To test these boundaries, I conducted an experiment. This is the third time I’ve tasked the current crop of LLMs with completing the Divergent Association Task, a psychological test that objectively measures divergent thinking.

The Experiment

The Divergent Association Task (DAT) challenges participants to generate 10 words that are as different from each other as possible. The goal is to measure verbal creativity by calculating how “far apart” the words are in terms of semantic usage and context.
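For the curious, the published DAT scorer works (roughly) by averaging the pairwise cosine distances between word embeddings and scaling the result by 100; as I understand it, it uses GloVe vectors and only the first seven valid words. Here is a minimal sketch of that idea in Python, where the `embeddings` lookup table is an assumed input rather than part of the official scorer:

```python
from itertools import combinations
import numpy as np

def dat_score(words, embeddings, n=7):
    """Approximate DAT-style score: mean pairwise cosine distance
    between the first n words, scaled by 100.

    `embeddings` is assumed to be a dict mapping lowercase words to
    vectors (e.g. loaded from GloVe); details here are a sketch, not
    the official implementation."""
    vectors = [np.asarray(embeddings[w.lower()], dtype=float) for w in words[:n]]
    distances = []
    for a, b in combinations(vectors, 2):
        cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        distances.append(1.0 - cosine_sim)  # cosine distance
    return 100.0 * float(np.mean(distances))

# Usage (hypothetical): dat_score(["cat", "book", "volcano", ...], embeddings)
```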

  • When I ran this test for the first time in December 2022, GPT scored 80.32, outperforming 63.56% of human participants.
  • The second time, in March 2024, GPT improved to 82.18 (73.41% of humans), while Claude debuted at 89.18, outperforming 95.29% of humans.
  • With the release of DeepSeek this week, I decided to repeat the test with all three mainstream LLMs.

For the test’s demographic data, I answered as:

  • Age: 50 [ED: How old is an LLM?]
  • Sex: Intersex [ED: What gender is an LLM?]
  • Country: UK [ED: What country is an LLM from?]

While these answers don’t affect the scoring, they highlight the test’s human-centric framing.

The Prompt

I gave the same prompt to each LLM:

The Divergent Association Task is a quick measure of verbal creativity and divergent thinking, the ability to generate diverse solutions to open-ended problems. The task involves thinking of 10 words that are as different from each other as possible. For example, the words cat and dog are similar, but the words cat and book are not. People who are more creative tend to generate words that have greater distances between them. These distances are inferred by examining how often the words are used together in similar contexts. Still, this task measures only a sliver of the complex process of creativity. Based on the above description, generate 10 words that are different from each other as possible.

The Answers

Here are the responses (including the prompt that I used) from each LLM:

20250128-dat-answers-all.png

Results

Here’s how each LLM scored:

20250128-dat-results-all.png

Claude’s performance is particularly striking, ranking better than 98% of all test-takers. OpenAI’s performance has also improved since the last test. DeepSeek came in third, performing at around the same level as OpenAI did in December 2022, when I ran the test for the first time. Interestingly, some words appeared more than once in the answers provided by the models:

20250128-dat-similarities-all.png

What does this tell us?

As I alluded to in both previous versions of this test, what these results really tell us about creativity is open to interpretation.

What we can say is that in all three cases, the LLMs scored “better” than other, presumably (mostly) human, test-takers: Claude: 98%, OpenAI: 92%, and DeepSeek: 83%. We can also say that there has been some improvement (substantial in the case of Claude) since the last time I ran the test. Claude’s results have gone from “better” than 92% to “better” than 98% of other test-takers.

But the real question is, what does “better” mean?

If we are comparing humans with other humans, the results have merit. If there were no merit, the DAT would not be used in the way it is used for benchmarking human creativity. In any case, arguments about whether the DAT itself is useful are beyond the scope of this experiment.

But when we give this test to an LLM, what are we testing? While the results are fascinating, they come with caveats:

  • Biases in Training Data: Are LLMs simply regurgitating patterns learned from humans who’ve shared their DAT results online? Likely, yes.
  • Test Relevance: Is the DAT a meaningful measure of creativity when applied to machines? While it’s a valid benchmark for humans, comparing human and machine creativity is fraught with philosophical and methodological challenges.
  • Algorithmic Attack: Given what goes on inside an LLM and the training data’s internal representation in vector space, it is at least feasible to imagine an algorithmic attack on the problem that generates 10 “good” words to maximise the score (see the sketch below).
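To make that last point concrete, here is a hedged sketch of what such an attack could look like: greedily pick words from a candidate vocabulary so that each new word maximises its minimum embedding distance to the words already chosen. The `vocab` and `embeddings` inputs are assumptions for illustration; I am not claiming this is what any of these models actually does.

```python
import numpy as np

def greedy_diverse_words(vocab, embeddings, k=10):
    """Greedily select k words, each maximising its minimum cosine
    distance to the words already chosen. `vocab` is a list of
    candidate words and `embeddings` maps words to vectors; both are
    assumed inputs for this illustration."""
    def dist(a, b):
        va = np.asarray(embeddings[a], dtype=float)
        vb = np.asarray(embeddings[b], dtype=float)
        return 1.0 - np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

    chosen = [vocab[0]]  # arbitrary seed word
    while len(chosen) < k:
        candidates = [w for w in vocab if w not in chosen]
        best = max(candidates, key=lambda w: min(dist(w, c) for c in chosen))
        chosen.append(best)
    return chosen
```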

What can we conclude from these results?

  1. LLM Performance vs Humans: All three LLMs scored better than most human participants, with Claude outperforming 98% of them. This (perhaps somewhat obviously) suggests that LLMs excel at optimising responses based on defined criteria like semantic diversity.
  2. Algorithmic Advantage: LLMs likely leverage their internal vector spaces to maximise word diversity, potentially selecting semantically distant terms. As previously mentioned, this is obviously not what humans do when they do the test—at least not consciously.
  3. Improvements Over Time: GPT and Claude have demonstrated measurable improvements since earlier tests. This suggests substantial ongoing refinements in their training data and algorithms.

This was fun, but in the end, we are not comparing the same thing when we compare the results that a human gets with those of an LLM. Unless, of course, we think that there are more similarities between the way human thinking and LLM “thinking” work than we’d like to admit.

What next?

The rapid evolution of LLMs makes it challenging to predict where we’re headed. Are we approaching a phase where these models transition from mimicking creativity to truly reasoning through novel ideas? Or will we hit a plateau where the LLMs remain impressive but fundamentally limited statistical remixers? You tell me!

Update: As I was writing this, my friend @DanC sent me this article: “Scientists map the mathematics behind how we create and innovate” phys.org, 20250129. That sounds like a ripe topic for a future post!

Regards,

M@

[ED: If you’d like to sign up for this content as an email, click here to join the mailing list.]

First published on matthewsinclair.com and cross-posted on Medium.
