Ever After: Why Large Language Models Aren’t Designed for Instant Creativity and What to Do about It

LLMs (Large Language Models) work differently from a traditional search engine and our approach to using them needs to reflect the uniqueness and biases in the tool.

March 26, 2024 / 6 min read

Image generated by Bing

LLMs (Large Language Models) are seen as a powerful tool, particularly when it comes to creativity. Both prose and pictures are natural outputs of the tool, threatening the jobs of those who have traditionally generated such content. Further up the value chain, LLMs have been seen as assisting with higher-order creative thinking like product and process design, strategy, and innovation. Unfortunately, a fundamental misunderstanding of the tool causes many people to misuse it and miss out on its true value.

Remember that LLMs have been trained by ingesting large volumes of data and adjusting an enormous set of weights. It’s not wrong to think of one as autocomplete on steroids (a lot of steroids, since it doesn’t just finish a sentence but can generate reams of text). Underlying it is a weighted vector model, and the key term is “weighted.” The more often the model saw something in its training data, the more likely it is to give similar output. This creates a bias toward what it knows better, and what it knows better tends to be more common, more average.
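To make “weighted” concrete, here is a minimal toy sketch in Python. It is not how a real LLM is implemented (a real model learns billions of parameters, not a lookup table), and the phrases and numbers are invented purely for illustration; the point is only that weighted sampling keeps returning the heavyweight answer.

```python
import random
from collections import Counter

# Invented "weights" for completing "And they lived happily . . ."
# Purely illustrative; a real LLM learns these tendencies from its training data.
continuations = {
    "ever after": 0.85,                # seen constantly in training text
    "for many years": 0.08,
    "in quiet defiance of fate": 0.04,
    "until the robots arrived": 0.03,
}

def sample_continuation():
    # Weighted random choice: a higher weight means a higher chance of being picked.
    phrases = list(continuations.keys())
    weights = list(continuations.values())
    return random.choices(phrases, weights=weights, k=1)[0]

# Sample 1,000 completions and count what comes back.
counts = Counter(sample_continuation() for _ in range(1000))
print(counts.most_common())
```

The rarer continuations are still reachable, but left to its defaults the model keeps handing back the most heavily weighted one.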


I am not a great artist. I would struggle to come up with an innovative painting. The Senseless Drawing Bot, on the other hand, is very good at creating certain types of paintings, far better than anything I could produce in that style. Yet, as poor as my artwork is, if you put me up against the Senseless Drawing Bot to make a painting in the style of a Rembrandt, I would beat it every time. While the Senseless Drawing Bot is creative, it is so within a very narrow range, only for a certain type of painting. bitPaintr is another example, great at a certain style, but not so great at others (e.g., don’t ask for photorealism).

In theory, either of these tools could create a Rembrandt, much in the same way that 10,000 monkeys sitting at typewriters for eternity would eventually reproduce the complete works of William Shakespeare. But if your goal was to get the monkeys to write a sonnet in any meaningful time, you’d probably have to give them some guidance along the way.

This is an extreme example, but the same idea applies to ChatGPT and other LLMs. It is optimized for certain tasks and certain outputs, and not others. It has a much wider operating range than those drawing bots; it can produce Rembrandt-like images or write Shakespeare-like sonnets out of the box. But it still has a bias.

Suppose you asked me to write a short story. Chances are I’ll write science fiction because I grew up reading a lot of Asimov and Bradbury. I would read sci-fi magazines, consuming hundreds of sci-fi short stories. I have also been formally trained in physics, math, chemistry, and computer science. I’m not a great fiction writer, but I could come up with something because I have a lot of examples to draw on. Asked to write a short story, I’m far less likely to write a cookie-cutter Hallmark Christmas movie (woman from the big city moves to a small town to do something, meets an annoying man, she succeeds in her goal and at the same time they fall madly in love) because I don’t watch them. So, if you asked me to write ten short stories, you’d get mostly sci-fi and would hardly ever get a Hallmark Christmas movie from me.

Unless, that is, you asked me specifically to write a Hallmark Christmas movie short story. You would have to give me that very specific guidance. Even then, it probably wouldn’t be good at first. I know the plot (as outlined above), but the dialog might not be great. I know how to use a law of physics as a plot point in a sci-fi story because I’ve seen it done dozens of times, but I don’t know how to write compelling dialog between two people falling in love. I don’t have as much training in that area. So even when you explicitly push me to that specific genre, my first draft would be far from great. I could try rewriting it and eventually, after many revisions, I’d have a much better version. A writing coach wouldn’t write the story for me, but could give me hints (e.g., create more tension in this scene, tie this argument back to the prior interaction). With guidance and multiple attempts, it would get better.

When it comes to creativity, this is how we need to approach LLMs. With guidance across multiple attempts.

I asked ChatGPT, “Please complete this sentence at the end of a story: And they lived happily . . .” ChatGPT provided the following response: “ever after, cherishing each moment as if it were the first.” It’s no surprise that it was “ever after.” You were going to answer “ever after” yourself because we’ve all seen it a hundred times. ChatGPT is very biased that way (just like you).

When you ask an LLM to give you suggestions, it’s going to give you an “ever after” response, that is, the most common answer. It’s probabilistic, so it won’t give the exact same answer each time, but it will most likely stay within a narrow range of answers based on where the most weighting is, just like my short stories would tend toward sci-fi because that’s where my training is. You can get me to move out of my default mode (sci-fi) by giving me some direction, and then, through more guidance and direction, you can help me get better answers (the better romance dialog in the example above).

Many people I’ve spoken with use LLMs as a search engine. What’s a good recipe for chicken tikka masala? There’s a pretty narrow range of answers, and a web search will typically surface one from your initial query. An LLM will do it, too, but that’s because there’s essentially one answer (hint: it involves chicken, not salmon).


Now suppose you want creative ideas for a book launch party for a novel set in the Old West. A search engine may find something, but not much. An LLM, upon first request, came up with some general ideas, such as “Have the author do a reading from the book and host a Q&A session,” and some specific but obvious ones like “Choose a venue that reflects the setting of the novel. Look for a rustic barn, an old saloon, or a ranch-style house to create an authentic Old West atmosphere” and “Serve up hearty, Western-inspired fare such as barbecue ribs, chili con carne, cornbread, baked beans, and apple pie. For drinks, offer a selection of whiskey cocktails, sarsaparilla, and root beer floats.” Good answers, but not deeply creative (I’ll bet you would have thought of those inside two minutes).

When I gave it more guidance, “what type of unusual product or services might be relevant for a book launch party for a novel set in the old west,” I started to get answers that I might not have thought of myself. It suggested custom leather goods, an Old West portrait studio, vintage typewriter poetry, a whiskey tasting, a Western-themed fortune teller, and a blacksmith demonstration.

The first answers were more common; they were the “ever after” answers. It was only by pushing further that I got to the more creative ones. (The practicality of doing a blacksmith demonstration in Manhattan might be another issue if I actually had such a book to promote, but this is creativity so we’re in the brainstorming phase.) With additional prompting, we could probably get yet more creative answers.
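For readers who want to script this back-and-forth rather than type it into a chat window, the pattern looks like the sketch below. This is a rough sketch assuming the OpenAI Python client; the model name and prompt wording are placeholders, and any chat-style LLM API follows the same shape: keep the conversation history and push past the first, most common answers with a more specific follow-up.

```python
# A rough sketch of "guidance across multiple attempts" using the OpenAI
# Python client. Model name and prompts are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Keep the whole conversation so each follow-up builds on the last answer.
messages = [
    {"role": "user",
     "content": "Give me ideas for a book launch party for a novel set in the Old West."},
]

first = client.chat.completions.create(model="gpt-4o", messages=messages)
print(first.choices[0].message.content)   # expect the "ever after" answers: saloon venue, barbecue, Q&A

# Push past the defaults with a more specific follow-up prompt.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user",
                 "content": "What unusual products or services might be relevant for that party? "
                            "Skip the obvious venue and food ideas."})

second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)  # more likely to surface the less common ideas
```

The structure matters more than the specific library: the second request only reaches the creative corners because the first answer, and an explicit nudge away from it, are both in the context.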

Unfortunately, most people use it in the recipe-finding sense. They want the fast, satisficing answer because this is how we have used search engines for decades: to get a one-and-done answer. To truly unlock the creativity and innovation in an LLM, we have to recognize that by default it will start with the more pedestrian answer, the “ever after,” because that’s what it was designed to do. Those are the answers that have the highest weighting. To get true creativity, we need to explicitly push it away from the central bias, away from the highest weights, to get further into the creative corners.


I recently spoke with a high school teacher who had an LLM generate an essay based on a prompt. He reviewed it with the class and noted it was a C+ essay. It was correct, both grammatically and in content, but it was pedestrian. Again, this is because LLMs are designed to be pedestrian. (Perhaps in the future there will be dials to ask it for the less common possibilities when responding to a prompt.)

This was borne out in research by Stanford University's Jeremy Utley and GEOLab’s Kian Gohar. They ran a creativity exercise and assumed the teams with access to ChatGPT would outperform the teams that didn’t have it. While the AI-enabled teams produced 8% more ideas (that’s not much), the quality of the ideas was mediocre. They encourage more of a conversation with the LLM. (For those more interested in the takeaways than the research, they have appeared on a number of podcasts recently to talk about it, such as How to Chat with Bots: The Secrets to Getting the Information You Need from AI on Stanford’s Think Fast, Talk Smart: The Podcast.)

This is good news for teachers who worry that students may use LLMs for an instant essay. That first-response essay will likely be C-level in quality, as noted above. But through interaction the student can make it better. That interaction, exploring the subtle plot points and covering symbolism, far more than the typing of words, is where the learning happens. (I’m still not advocating that students outsource their homework to AI, but when they do, and when they use it well, there will still be learning.)

At the end of the day, however magical it may feel (cf. Clarke’s third law), AI is a tool. Like any tool, it has an efficient operating range. To get the most out of the tool, you need to understand that range and its limitations. You can use a hammer to do more than just work with nails, but you need to understand its design, as well as its strengths and weaknesses, to use it in different ways. Likewise, to get the most out of LLMs, particularly when we want the more creative answers, we need to actively push the tool away from its mundane default response. For regular readers, I stand by my claim that Prompt Engineering Jobs are a Mirage and companies won’t hire for them, but that doesn’t mean we won’t all need to use a little bit of prompt engineering in our daily work with LLMs.

By Mark A. Herschberg