The AI Token Myth

AI token counts are easy to measure, but they reveal little about productivity. As token costs fall, the real question is what work becomes possible.

May 19, 2026

Companies are treating tokens as both a productivity metric and a scarce resource. Both instincts are likely wrong. Token usage tells us little about whether AI is being used well, and for most routine work, token cost is likely to become a background utility cost rather than a strategic budget constraint.

As a reminder, tokens are the units used to measure input to and output from LLMs. Text in English (or any other language) is split into tokens, and a typical English word works out to roughly one or two tokens. The more words in your prompt or in the model’s response, the more tokens you use.
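
As a rough illustration, the sketch below brackets a token count from a word count using the one-to-two tokens per word range above. The function name and the whitespace split are assumptions made for illustration; real tokenizers split text differently, and counts vary by model.

```python
def estimate_token_range(text: str) -> tuple[int, int]:
    """Return a (low, high) token estimate for a piece of text.

    Assumes 1 to 2 tokens per whitespace-separated word, the rough
    range quoted above. A real tokenizer will give different counts.
    """
    words = len(text.split())
    return words, 2 * words


prompt = "Summarize the attached quarterly sales report in three bullet points."
low, high = estimate_token_range(prompt)
print(f"{len(prompt.split())} words -> roughly {low} to {high} tokens")
```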

LLM providers like Anthropic charge customers for token usage (typically either per token or through metered limits on tokens per period of time). LLMs are also trained on tokens. Larger models and larger training datasets require more accelerator compute and energy, which is why training cost is tied to scale.

Tokens Aren’t Productivity

Today, many companies mistake token usage for productivity. If they see the team using more tokens, they conclude that more AI is being used and that productivity is going up. This is an example of the streetlamp effect: token usage becomes the metric because it’s easy to measure. (I discussed how the streetlamp effect causes an incorrect focus in hiring in “The Streetlamp Effect in Hiring”; it’s a similar issue here.)

Just because people are using AI and burning tokens doesn’t mean they’re using it effectively. Some are pointing AI at the wrong problems. Others are getting “wrong” answers. I put “wrong” in quotes because the answer may be technically right, but the question was badly phrased, so they didn’t ask the right question (which is subtly different from working on the right problem).

It’s similar to SLOC, which stands for source lines of code. Some companies measured how many new lines of code a software developer produced per day. Some days it’s zero, because the developer is in design meetings. Other days it’s only a few, because the developer is debugging and finding what needs to change. Still other days the developer writes lots of code. SLOC is very easy to measure, but it’s a dubious metric because different activities produce different amounts of code. Debugging should not add lots of SLOC; it should change existing lines of code, ideally minimally. Time spent designing should be encouraged, but it produces no direct SLOC. What matters is not just the number, but the context in which the number is evaluated. Similarly, you don’t measure lawyers by the number of lawsuits filed or marketers by the number of ads served. Those are easy to count, but not necessarily the right metric.

I know of people using AI tools to calculate averages of data sets. Excel can do that already. Simply having AI do it is using a tool for the sake of using the tool. The question is, are we using it for the right things and spending tokens where they will do the most good? For most companies, the answer is: not yet.

Token Usage Efficiency

Even when people are asking the right questions and getting the right answers, they may not be doing so efficiently. For example, humans remember what you said five minutes ago, but most LLMs today do not carry a conversation in memory between requests. Suppose you’re ten prompts into a chat. You have an answer but need to refine it a bit, so you ask a clarifying question or ask it to tweak the output a little. In many chat systems, the prior chat history is resubmitted or otherwise reprocessed each time.

To keep the math easy, suppose each prompt in the chat is 100 tokens, and we’ll just consider input tokens. Your first prompt was 100 input tokens. The second prompt added another 100 input tokens, but since it includes the first prompt, it cost a total of 200 input tokens. When you ask your clarifying question after ten prior prompts, it’s 100 tokens for the clarifying question on top of the 1,000 combined tokens for the prior ten prompts. You just sent 1,100 tokens. The whole session combined was 100 + 200 + 300 + … + 1,100, which comes out to 6,600 tokens.

That’s not a huge number, but long-running chat sessions burn more tokens with every prompt and aren’t necessarily better. In fact, they can be a huge waste. My environmentally minded friends remind me to start new chats regularly.
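
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the same simplifications as the text: every prompt is 100 tokens, the full history is resent each turn, and only input tokens are counted. The function name is made up for illustration.

```python
def cumulative_input_tokens(num_prompts: int, tokens_per_prompt: int = 100) -> list[int]:
    """Input tokens sent on each turn when the full prior history is resent.

    Simplification from the text: every prompt is the same size and only
    input tokens are counted (model replies are ignored).
    """
    return [turn * tokens_per_prompt for turn in range(1, num_prompts + 1)]


# Ten prior prompts plus the clarifying question makes eleven turns.
per_turn = cumulative_input_tokens(11)
print(per_turn[-1])                      # 1100 tokens sent on the final turn
print(sum(per_turn))                     # 6600 tokens across the whole session
print(sum(cumulative_input_tokens(22)))  # 25300 tokens if the chat runs twice as long
```

Because each turn resends everything before it, total input grows roughly with the square of the number of turns: twice as many turns costs close to four times the tokens.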

Frontier models are also adding caching, memory, and other optimization strategies. These strategies will reduce some of today’s token inefficiency and lower the overall cost to companies. That brings us to the second issue: token cost.

Token Cost

Andreessen Horowitz coined the term LLMflation in 2024 to describe the falling cost of tokens. They estimated that, for equivalent model performance, inference costs were falling roughly 10x per year. Think of it like Moore’s Law, popularly summarized as computing capability doubling every eighteen months (a $100 CPU in eighteen months would be twice as powerful as one today at the same cost). In this case, $100 spent next year may buy roughly what $1,000 buys today. There’s no obvious reason to think this won’t continue for some time.
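
A 10x-per-year decline compounds quickly. The sketch below projects the cost of a fixed workload, assuming the rate stays constant. The function name and the flat-rate assumption are illustrative and are not part of the Andreessen Horowitz analysis.

```python
def cost_after(years: float, cost_today: float, annual_drop: float = 10.0) -> float:
    """Projected cost of a fixed AI workload after `years`.

    Assumes the roughly 10x-per-year decline cited above continues
    unchanged, which is a modeling assumption, not a guarantee.
    """
    return cost_today / (annual_drop ** years)


for year in range(4):
    print(f"year {year}: ${cost_after(year, 1000):,.2f}")
```

For comparison, a doubling every eighteen months works out to about 1.6x per year, so the cited token price decline is a much steeper curve.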

The counterargument is the Jevons paradox: when a resource gets cheaper, total consumption of it can rise. As tokens get cheaper, AI systems become more capable and take on bigger tasks, consuming more tokens. This is also true. A chat or API call to an LLM may run longer and involve more complex work, requiring more tokens (especially if it’s creating work output on your machine as Claude Code and Codex do).

METR.org (Model Evaluation and Threat Research) tracks model capability by measuring the length of tasks AI agents can successfully complete. Their research suggests that the task-completion time horizon for frontier AI agents has been doubling approximately every seven months. Today some AI tasks can run for hours without human input, and the task length window continues to grow.
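
To get a feel for a seven-month doubling, here is a small projection. It assumes the trend simply continues, and the two-hour starting horizon is a hypothetical value chosen for round numbers, not a METR figure.

```python
def task_horizon_hours(months_ahead: float, horizon_now_hours: float,
                       doubling_months: float = 7.0) -> float:
    """Projected task length an agent can complete, in hours.

    Assumes the roughly seven-month doubling trend cited above holds;
    horizon_now_hours is a hypothetical starting point.
    """
    return horizon_now_hours * 2 ** (months_ahead / doubling_months)


for months in (0, 7, 14, 21, 28):
    print(f"+{months:2d} months: {task_horizon_hours(months, 2.0):.0f} hours")

# Equivalent yearly growth factor for a seven-month doubling (about 3.3x).
print(f"per year: {task_horizon_hours(12, 1.0):.2f}x")
```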

Marginal Benefits

The question is which force wins: falling token prices or growing token demand. My thinking is that efficiency improvements will outpace increased demand. Think about what most people do. They create and execute marketing plans, source and sell to customers, write code to move data around, create legal briefs based on prior case work, answer customers’ questions, design new product specs, etc. None of that is computationally very expensive. We can get better marketing plans, but there’s a limit. We can better source, research, and reach out to customers, but there’s a limit. We can write more code, and it may be a very high limit, but there’s a limit.
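
One way to see the argument is a toy model in which token demand grows each year until it hits a ceiling while the price per token keeps falling. Every number below is hypothetical, and tying demand growth to the seven-month doubling in task length is a simplifying assumption made here, not a claim from the sources above.

```python
def annual_spend(year: int, tokens_now: float, price_now: float,
                 demand_growth: float, price_drop: float,
                 demand_cap: float) -> float:
    """Token spend in a given year under a capped-demand model.

    All inputs are hypothetical. Demand grows by `demand_growth` per year
    until it reaches `demand_cap`; the price per token falls by
    `price_drop` per year.
    """
    tokens = min(tokens_now * demand_growth ** year, demand_cap)
    price = price_now / price_drop ** year
    return tokens * price


# Hypothetical team: 1 billion tokens per year today at $10 per million tokens.
# Demand grows 3.3x per year but saturates at 50x today's usage.
for year in range(5):
    spend = annual_spend(year, tokens_now=1e9, price_now=10 / 1e6,
                         demand_growth=3.3, price_drop=10.0, demand_cap=50e9)
    print(f"year {year}: ${spend:,.2f}")
```

Under these assumptions spend falls every year even before demand saturates, because the price drops faster than usage grows. Once the cap is reached, the decline is driven by price alone.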

Marketing budgets don’t magically get bigger because we can create plans faster. The same goes for legal activities. There’s a limit to how much accounting and customer support we need. Better LLMs may be able to do that work better or faster, but once we hit the limit, LLM improvements just further lower cost. (I talk about these cost-center jobs in “How to Know If Your Job Is Safe from AI — Part 2: The Economic Drivers of Your Job.”) There may be a while to go before we have all the code we need, in both volume and complexity, but there is some point after which improvements don’t matter much.

As an analogy, consider computer chess. In the 1970s and 1980s, computers slowly improved against human players. Once a machine could beat the best human (1997), how much did improvements matter? For most humans, whether the top computer had a rating 600 points higher or 1,200 points higher was moot; the computer would always win. It only mattered to the top chess players and to the teams whose chess programs were vying to be the best. Teams still compete to make better and better chess-playing software, but for the average human who wants to play against a computer, it doesn’t matter. Most of us aren’t working on cutting-edge complex tasks.

Likewise, in the coming years, for most tasks, we won’t need the cutting-edge LLMs. There will always be some task, like rendering a high-end video or deep research, which will require the latest frontier models. However, for someone building a monthly report at work, we’ll reach a point (whether it’s today, in three months, or in three years) where the model is good enough. Stronger LLMs won’t be needed for that work; they’ll just lower the cost of tokens in the good-enough models.

Open-Weight and Open-Source Options

Finally, open-weight and open-source models change the economics. Historically, many have lagged the leading proprietary models by roughly six to twelve months. Obviously this can change, but it could change either way. Maybe the frontier models pull away, although it’s unlikely they’d keep that edge for long since the general approach would get out (even if not the details) and others would soon follow. More likely the gap will close. Even if it doesn’t, or if it opens up a bit, powerful low-cost models will be widely available. They still have some operating costs (CPU or GPU, as well as energy, hosting, bandwidth and other server-related costs), but at sufficient scale these can be cheaper than using proprietary frontier models.
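
The “at sufficient scale” condition can be framed as a break-even volume. The sketch below is illustrative only: the function name and all three prices are hypothetical placeholders, and real deployments have costs (utilization, staffing, quality differences) that this ignores.

```python
def breakeven_tokens_per_month(fixed_monthly_cost: float,
                               api_price_per_million: float,
                               self_host_price_per_million: float) -> float:
    """Monthly token volume above which self-hosting beats a paid API.

    All three inputs are hypothetical. fixed_monthly_cost covers hardware,
    hosting, and operations; self_host_price_per_million is the marginal
    cost (mostly energy) of serving tokens on that hardware.
    """
    saving_per_million = api_price_per_million - self_host_price_per_million
    if saving_per_million <= 0:
        return float("inf")  # the API is never more expensive per token
    return fixed_monthly_cost / saving_per_million * 1_000_000


volume = breakeven_tokens_per_month(
    fixed_monthly_cost=5_000.0,       # hypothetical server and ops cost
    api_price_per_million=10.0,       # hypothetical proprietary API price
    self_host_price_per_million=0.5,  # hypothetical marginal serving cost
)
print(f"break-even at about {volume / 1e9:.2f} billion tokens per month")
```

Below the break-even volume the fixed costs dominate and the API is cheaper; above it, the per-token saving pays for the hardware.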

The point of all this is that marginal token cost for most routine knowledge work will be near zero. Maybe not today, maybe not tomorrow, but soon, and for the rest of your life, it will be dirt cheap. That means your job won’t involve budgeting for it the way companies budget dollars.

It’s a bit like internet bandwidth in the 1990s. Back then, bandwidth was costly. Capacity grew quickly, and today most people have more bandwidth than they need. If you’re a streaming service or doing large data transfers, bandwidth usage and costs matter, but the average person has more than enough. Additional capacity tends to lower cost.

Planning corporate token budgets as though tokens will remain a scarce strategic resource is probably misguided. The constraint may matter for a year or two, and it will continue to matter at the frontier, but for most routine knowledge work, token access will eventually feel more like bandwidth: abundant, expected, and mostly invisible. The important question is not how many tokens employees use, but what kind of work becomes possible when token access is effectively unlimited.

By Mark A. Herschberg