AegisPlane
Operate · 6 min read · April 5, 2025

The Most Expensive AI Mistake: Validating Costs After Calling the Model

By the time you check whether you've exceeded your budget, you've already spent the money. Here's how to enforce AI cost limits before the request leaves your system.

There's a pattern that shows up repeatedly in teams that have been running AI in production for a few months: they set up dashboards to track LLM spending, build alerts for when costs spike, and review invoices carefully after the fact.

All of that is useful. None of it prevents the spike from happening.

Validating costs after calling the model is the most expensive mistake in AI operations. By the time your alert fires, you've already spent the money.

In 30 seconds

Cost monitoring is useful, but it does not prevent overspend. The only reliable way to control a budget is to decide, before execution, whether a request can pass.

What you'll get from this article:

  • A practical pre-execution cost control model
  • A simple formula to estimate cost per request
  • A monthly example to spot drift before the invoice

Simple cost-per-request formula

Use this as your operational baseline:

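request_cost = (input_tokens / 1,000 * input_price) + (output_tokens / 1,000 * output_price)

Here input_price and output_price are the model's prices per 1,000 tokens. Output length is not known before the call, so use an expected or maximum value for output_tokens, then apply a small safety buffer (for example, +10% to +15%) to cover output variability.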
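As a minimal sketch, the formula and buffer can be written as a small Python function. The names ModelPricing and estimate_request_cost, and the prices in the example, are illustrative assumptions and not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    # Prices in USD per 1,000 tokens. The values used below are made up.
    input_price: float
    output_price: float


def estimate_request_cost(
    input_tokens: int,
    expected_output_tokens: int,
    pricing: ModelPricing,
    safety_buffer: float = 0.10,
) -> float:
    """Estimate the cost of one request in USD, padded by a safety buffer."""
    base = (
        input_tokens / 1000 * pricing.input_price
        + expected_output_tokens / 1000 * pricing.output_price
    )
    return base * (1 + safety_buffer)


# Hypothetical model priced at $0.003 (input) and $0.015 (output) per 1,000 tokens.
pricing = ModelPricing(input_price=0.003, output_price=0.015)
cost = estimate_request_cost(800, 120, pricing)
print(f"estimated cost: ${cost:.5f}")
```

With these assumed prices, an 800-token prompt and 120 expected output tokens come to $0.0042 before the buffer and about $0.0046 after a 10% buffer.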

Why post-execution cost checks fail

Most teams reach for post-execution cost tracking because it's easy. Your LLM provider gives you a usage endpoint. You can log token counts. You can build a dashboard in an afternoon.

But post-execution checking has a fundamental problem: the call already happened.

Consider these scenarios:

Scenario 1: The runaway loop. A background job processes documents through an LLM. A bug causes it to process the same document 10,000 times. Your alert fires after 2 hours and $4,000 in spend. You kill the job. The damage is done.

Scenario 2: The adversarial user. A user discovers that your AI assistant generates long responses when asked to "explain in detail." They send 500 requests in an hour. Your rate limit check runs server-side, after the LLM call. Every request is billed.

Scenario 3: The month-end surprise. Individual requests look fine. But a tenant is consistently at 90% of their budget by day 15. Nobody notices until the invoice arrives. There's no mechanism to throttle or warn mid-cycle.

In each case, the check came too late.

The right model: enforce limits pre-execution

The fix is architectural. Cost and rate limit enforcement needs to happen in the request pipeline, before the call goes out to any provider.

This means:

1. Request-level budget checks. Before forwarding the request, calculate the estimated cost (based on input token count, expected output tokens, and model pricing) and compare it against the tenant's remaining budget. If the call would exceed the limit, block it and return a structured error.

2. RPM enforcement at the gateway. Track requests per minute (RPM) per tenant and per model. When a tenant hits their RPM limit, reject the request immediately, before it reaches the provider. No call means no cost.

3. Provider-level and model-level caps. Some providers are cheaper than others. Some models cost 10x more per token. Caps should be configurable at both levels so you can enforce different limits for different parts of your stack.

4. Hard limits vs. overage policies. Hard limits block the request entirely. Overage policies allow the request to proceed but bill the excess, which is useful for enterprise customers who need continuity guarantees. Both need to be enforced pre-execution, just with different outcomes.
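The budget check and the two overage outcomes can be sketched in Python as follows. This is one possible shape, not a reference implementation: OveragePolicy, TenantBudget, Decision, and check_budget are hypothetical names, and the reason strings stand in for whatever structured error your gateway returns.

```python
from dataclasses import dataclass
from enum import Enum


class OveragePolicy(Enum):
    HARD_BLOCK = "hard_block"    # reject once the cap would be exceeded
    BILL_EXCESS = "bill_excess"  # let the request through and flag the excess


@dataclass
class TenantBudget:
    monthly_cap_usd: float
    spent_usd: float
    policy: OveragePolicy


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    overage_usd: float = 0.0


def check_budget(budget: TenantBudget, estimated_cost_usd: float) -> Decision:
    """Decide, before any provider call, whether the request may proceed."""
    projected = budget.spent_usd + estimated_cost_usd
    if projected <= budget.monthly_cap_usd:
        return Decision(True, "within_budget")
    # Portion of this request's estimated cost that falls above the cap.
    excess = min(estimated_cost_usd, projected - budget.monthly_cap_usd)
    if budget.policy is OveragePolicy.BILL_EXCESS:
        return Decision(True, "overage_flagged", overage_usd=excess)
    return Decision(False, "budget_exceeded", overage_usd=excess)


strict = TenantBudget(4500.0, 4499.999, OveragePolicy.HARD_BLOCK)
flexible = TenantBudget(4500.0, 4499.999, OveragePolicy.BILL_EXCESS)
print(check_budget(strict, 0.0046))    # blocked
print(check_budget(flexible, 0.0046))  # allowed, with the excess flagged
```

Both policies run through the same pre-execution check; only the outcome differs. The same function could be called once per cap (tenant, provider, model) if you keep a separate TenantBudget-style record for each level.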

Quick monthly example

Assume this tenant:

  • 1,200,000 requests/month
  • estimated average cost of $0.0042 per request
  • monthly budget of $4,500

Estimate:

1,200,000 × $0.0042 = $5,040

That is $540 (12%) over the $4,500 budget, and you know it in week 1. You can act early:

  • downgrade model on low-complexity routes
  • reduce max output length
  • apply a per-use-case cap

If you don't run this calculation at the start of the cycle, adjustments always arrive late.
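The same arithmetic is easy to automate. The sketch below uses the example tenant's numbers; the function names are illustrative, and the partial-month projection assumes spend is roughly linear across the month, which may not hold for your traffic.

```python
def project_monthly_spend(requests_per_month: int, avg_cost_per_request: float) -> float:
    """Projected monthly spend in USD from volume and average request cost."""
    return requests_per_month * avg_cost_per_request


def required_avg_cost(budget_usd: float, requests_per_month: int) -> float:
    """Average cost per request that would keep the month within budget."""
    return budget_usd / requests_per_month


def project_from_partial_month(spend_so_far_usd: float, days_elapsed: int, days_in_month: int) -> float:
    """Linear extrapolation of spend so far to the full month."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend_so_far_usd / days_elapsed * days_in_month


budget = 4500.0
projected = project_monthly_spend(1_200_000, 0.0042)
overrun = projected - budget
target = required_avg_cost(budget, 1_200_000)

print(f"projected: ${projected:,.2f}")
print(f"overrun:   ${overrun:,.2f} ({overrun / budget:.1%} of budget)")
print(f"target average cost per request: ${target:.5f}")
```

The target figure is the useful one for planning: to fit the $4,500 budget at this volume, the average request has to drop from $0.0042 to $0.00375, which tells you how much the model downgrade or output cap needs to save.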

What the pipeline looks like

A pre-execution cost control pipeline looks roughly like this:

Request arrives
  → Authenticate tenant
  → Check RPM limit (reject if exceeded)
  → Estimate token cost
  → Check budget remaining (reject if exceeded, or flag for overage)
  → Route to provider
  → Execute
  → Record actual cost

The key insight is that the budget check sits before the execution step. It's a gate, not a monitor.
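Here is a compact Python sketch of that gate ordering, assuming a single-process gateway. RpmLimiter and Gateway are hypothetical names, the provider call is stubbed as a function that returns the actual cost, and authentication is reduced to a known-tenant lookup.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class RpmLimiter:
    """Sliding window: at most `limit` requests per 60 seconds per key."""

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.clock = clock
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        window = self.hits[key]
        while window and now - window[0] >= 60.0:
            window.popleft()  # drop requests older than the window
        if len(window) >= self.limit:
            return False
        window.append(now)
        return True


class Gateway:
    def __init__(self, rpm_limit: int, budgets: Dict[str, float]) -> None:
        self.limiter = RpmLimiter(rpm_limit)
        self.remaining = dict(budgets)  # tenant -> remaining budget in USD

    def handle(self, tenant: str, estimated_cost: float,
               call_provider: Callable[[], float]) -> str:
        if tenant not in self.remaining:          # authenticate tenant
            return "rejected: unknown_tenant"
        if not self.limiter.allow(tenant):        # check RPM limit
            return "rejected: rpm_exceeded"
        if estimated_cost > self.remaining[tenant]:  # check budget remaining
            return "rejected: budget_exceeded"
        actual_cost = call_provider()             # only reached if every gate passed
        self.remaining[tenant] -= actual_cost     # record actual cost
        return "ok"


calls = []


def fake_provider() -> float:
    calls.append(1)
    return 0.004  # pretend the call actually cost $0.004


gw = Gateway(rpm_limit=10, budgets={"acme": 0.01})
results = [gw.handle("acme", 0.0046, fake_provider) for _ in range(4)]
print(results, "provider calls:", len(calls))
```

In this run the first two requests go through and the last two are rejected on budget, so the stub provider is called only twice. A production gateway would also need shared state and atomic updates across instances, which this sketch leaves out.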

The operational benefits beyond cost control

When you move to pre-execution enforcement, you get three things that post-execution monitoring can't give you:

Predictability. Spend is capped at the limit, give or take the estimation error on in-flight requests. Budgets become hard constraints, not soft targets.

Incident prevention. Runaway loops are stopped after the first batch that exceeds the limit, not after hours of unchecked spend.

Trust. Enterprise customers can commit to a cost cap and know it will be enforced. That's a commercial differentiator.


The checklist

Before your next billing cycle:

  • Budget limits are enforced at the gateway level, not in application code
  • RPM limits are checked before the LLM call, not after
  • You have separate limits for each provider and each model
  • You have a defined overage policy (hard block or bill excess) per tenant
  • Blocked requests generate an audit record with reason and timestamp
  • You can see real-time budget consumption per tenant, not just end-of-month totals

What to do this week

  1. Implement request-level cost estimation in the gateway.
  2. Define a monthly cap per tenant and a cap per model.
  3. Enable pre-execution blocking in log-only mode for 48 hours, then switch to strict mode.
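For step 3, one way to structure the log-only rollout is a mode switch around the gate's decision, sketched below in Python. EnforcementMode, AuditRecord, and apply_gate are hypothetical names, and the in-memory list stands in for a real audit store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class EnforcementMode(Enum):
    LOG_ONLY = "log_only"  # record what would be blocked, but let it through
    STRICT = "strict"      # actually block


@dataclass(frozen=True)
class AuditRecord:
    tenant: str
    reason: str
    mode: str
    blocked: bool
    timestamp: str  # ISO 8601, UTC


def apply_gate(
    tenant: str,
    violation_reason: Optional[str],
    mode: EnforcementMode,
    audit_log: List[AuditRecord],
) -> bool:
    """Return True if the request may proceed; record every violation."""
    if violation_reason is None:
        return True
    blocked = mode is EnforcementMode.STRICT
    audit_log.append(
        AuditRecord(
            tenant=tenant,
            reason=violation_reason,
            mode=mode.value,
            blocked=blocked,
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
    )
    return not blocked


audit_log: List[AuditRecord] = []
print(apply_gate("acme", "budget_exceeded", EnforcementMode.LOG_ONLY, audit_log))
print(apply_gate("acme", "budget_exceeded", EnforcementMode.STRICT, audit_log))
print(len(audit_log), "audit records")
```

Reviewing the log-only records after 48 hours shows which tenants and routes would have been blocked, so limits can be tuned before strict mode starts rejecting real traffic.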

The invoice surprise is optional. Pre-execution enforcement makes it avoidable.
