Chapter 01

Why AI bills surprise everyone

There is a pattern that plays out across engineering teams with remarkable consistency. A team builds a prototype, runs it through some tests, calculates a rough cost per query from the API pricing page, and arrives at a number that feels manageable. Then they go to production. Three months later, someone looks at the cloud bill and asks what happened.

What happened is that the simple cost-per-query calculation missed almost everything that matters. It accounted for the tokens in a typical user query and a typical model response. It did not account for the system prompt that gets prepended to every single request. It did not account for the context that accumulates over multi-turn conversations. It did not account for retry logic that fires whenever the API returns an error. It did not account for internal logging requests, evaluation runs, developer experiments in the shared environment, or the monitoring queries used to check that everything is working. All of these cost tokens. All of these add up.
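To make the gap concrete, here is a minimal sketch that compares the naive per-query estimate with one that includes the system prompt, accumulated conversation context, retries, and internal overhead. Every number and name in it (the `Workload` fields, the 800-token system prompt, the 2% retry rate, the 10% overhead) is a hypothetical assumption chosen for illustration, not a measurement. It also assumes that retried requests are billed in full and it counts input and output tokens together, although providers often price them differently.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All fields are illustrative assumptions, not measured values.
    query_tokens: int          # tokens in a typical user message
    response_tokens: int       # tokens in a typical model reply
    system_prompt_tokens: int  # prepended to every request
    turns: int                 # turns in a typical conversation
    retry_rate: float          # fraction of requests sent again after an error
    overhead_rate: float       # evals, logging, experiments, monitoring,
                               # as a fraction of user traffic


def naive_tokens_per_turn(w: Workload) -> int:
    """The prototype estimate: one query plus one response."""
    return w.query_tokens + w.response_tokens


def actual_tokens_per_turn(w: Workload) -> float:
    """Average tokens per turn once the hidden sources are included."""
    total = 0
    history = 0  # earlier queries and responses resent as context
    for _ in range(w.turns):
        input_tokens = w.system_prompt_tokens + history + w.query_tokens
        total += input_tokens + w.response_tokens
        history += w.query_tokens + w.response_tokens
    per_turn = total / w.turns
    # Assumes retried requests are billed in full.
    return per_turn * (1 + w.retry_rate) * (1 + w.overhead_rate)


workload = Workload(
    query_tokens=50,
    response_tokens=200,
    system_prompt_tokens=800,
    turns=5,
    retry_rate=0.02,
    overhead_rate=0.10,
)

naive = naive_tokens_per_turn(workload)
actual = actual_tokens_per_turn(workload)
print(f"naive estimate: {naive} tokens per turn")
print(f"with hidden costs: {actual:.0f} tokens per turn")
print(f"ratio: {actual / naive:.1f}x")
```

Under these assumed inputs the realistic figure is roughly seven times the naive one, and most of the difference comes from the system prompt and the resent conversation history rather than from retries or overhead.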

The other thing that surprises teams is that costs do not grow linearly. If your user base doubles, your AI costs do not simply double, because usage per user tends to rise as well. Heavier users have longer conversations, which means more context tokens per request. Successful products grow into use cases that were not in the original design, and these tend to require larger models or longer prompts. Costs therefore grow faster than the user count, which is the opposite of the economies of scale that most software infrastructure provides.
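The effect of conversation length can be sketched in a few lines. The function below assumes a chat-style API in which every request resends the system prompt and the full conversation history, which is how context accumulates in a typical multi-turn setup. The function name and the default token counts are hypothetical.

```python
def conversation_input_tokens(
    turns: int,
    system_prompt_tokens: int = 800,  # assumed size, for illustration
    query_tokens: int = 50,           # assumed size, for illustration
    response_tokens: int = 200,       # assumed size, for illustration
) -> int:
    """Total input tokens billed across one conversation.

    Assumes each request carries the system prompt, the whole
    history so far, and the new user query.
    """
    total = 0
    history = 0
    for _ in range(turns):
        total += system_prompt_tokens + history + query_tokens
        history += query_tokens + response_tokens
    return total


light_user = conversation_input_tokens(turns=5)
heavy_user = conversation_input_tokens(turns=10)
print(f"5 turns:  {light_user} input tokens")
print(f"10 turns: {heavy_user} input tokens")
print(f"ratio: {heavy_user / light_user:.2f}x")
```

With these assumed sizes, a conversation that is twice as long consumes nearly three times the input tokens, because the history term grows with the square of the number of turns.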

Understanding AI costs requires a fundamentally different mental model from the one used for compute or storage costs. The unit of cost is a token, not a CPU-hour or a gigabyte, and tokens accumulate in ways that are not intuitive until you have seen them do so.


The Hidden Cost of AI: Token Budgets, Latency, and Where Money Really Goes
