The Cost of Running an AI Product: Token Economics Explained
How AI products actually cost money: tokens, context windows, model tiers, and the levers that control your inference bill. Plus how to model unit economics before you scale.
Traditional software has near-zero marginal cost per user. AI products don't - every interaction costs real money in inference. If you don't understand your token economics, you can build a product people love that loses money on every user.
This article explains how AI products actually cost money, the levers that control the bill, and how to model your unit economics before you scale into a surprise.
How models charge
AI models charge per token. A token is roughly 3-4 characters of English text - "hello" is one token, a paragraph is maybe 50-100. Both the text you send (input) and the text the model generates (output) are billed, usually at different rates, with output typically costing more per token than input.
So the cost of one AI interaction is roughly:
(input tokens × input price) + (output tokens × output price)
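The formula above translates directly into a small helper. This is a minimal sketch: it assumes prices are quoted per million tokens (a common convention on provider pricing pages), and the $3 input / $15 output figures are hypothetical placeholders, not any particular model's rates.

```python
def interaction_cost(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one model call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = interaction_cost(2_000, 500, input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # $0.0135
```

Note how the 500 output tokens cost more than the 2,000 input tokens here: at a 5x output premium, output length matters as much as prompt length.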
The prices vary enormously by model - frontier models cost many times more than small fast models. This is the single biggest lever on your bill.
The four levers on your inference bill
1. Model choice
The same task on a frontier model versus a small model can differ in cost by 10-50x. Using a frontier model for simple classification is like hiring a senior consultant to alphabetise files. Match the model to the task: cheap models for simple work, expensive models only where they earn it. This is usually the largest saving available.
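Matching the model to the task is often implemented as a simple routing table. The sketch below is illustrative only: the tier names, task types, and per-million-token prices are all hypothetical, and defaulting unknown tasks to the cheap tier is one design choice (defaulting to the frontier tier trades cost for safety).

```python
# Hypothetical per-million-token prices: (input, output).
PRICES = {
    "small": (0.15, 0.60),
    "frontier": (5.00, 25.00),
}

# Hypothetical routing table: which tier handles which task type.
ROUTES = {
    "classify": "small",
    "extract": "small",
    "draft_report": "frontier",
}


def route(task_type: str) -> str:
    # Unknown tasks fall back to the cheap tier; escalate only where it earns it.
    return ROUTES.get(task_type, "small")


def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# A classification call: 1,000 tokens in, 10 tokens out.
routed = call_cost(route("classify"), 1_000, 10)
frontier = call_cost("frontier", 1_000, 10)
print(f"routed ${routed:.6f} vs frontier ${frontier:.6f} ({frontier / routed:.0f}x)")
```

With these placeholder prices the routed call is roughly 34x cheaper, inside the 10-50x range described above.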
2. Context length
Every token you put in the prompt costs money on every call. Products that stuff huge context (entire documents, long histories) into every prompt pay for it repeatedly. Send only the relevant context - good retrieval that finds the right few chunks beats dumping everything in.
This is why naive "just put the whole knowledge base in the prompt" approaches get expensive fast, and why RAG (retrieve only what's relevant) is both better quality and cheaper.
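A rough comparison of the two approaches, counting input tokens only. Every number here is a hypothetical assumption: a 150,000-token knowledge base, retrieval of the top 5 chunks at about 400 tokens each, a 200-token question, 10,000 calls a month, and $3 per million input tokens. It also ignores the (usually small) cost of running retrieval itself.

```python
def monthly_input_cost(context_tokens: int, question_tokens: int,
                       calls: int, input_price_per_m: float) -> float:
    # The context is re-sent on every call, so it is paid for `calls` times.
    return (context_tokens + question_tokens) * calls * input_price_per_m / 1_000_000


PRICE = 3.0         # hypothetical $ per million input tokens
CALLS = 10_000      # hypothetical calls per month
QUESTION = 200      # tokens in the user's question

WHOLE_KB = 150_000  # hypothetical knowledge base size, in tokens
RETRIEVED = 5 * 400 # top 5 chunks of ~400 tokens each

stuffed = monthly_input_cost(WHOLE_KB, QUESTION, CALLS, PRICE)
rag = monthly_input_cost(RETRIEVED, QUESTION, CALLS, PRICE)
print(f"stuff everything: ${stuffed:,.0f}/month, retrieve top chunks: ${rag:,.0f}/month")
```

Under these assumptions the stuffed prompt costs about $4,506 a month against $66 for retrieval, before counting any quality difference.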
3. Call frequency
How many model calls per user interaction? A naive agent that makes 15 model calls to answer one question costs roughly 15x a single-call approach, and often more, because each step typically re-sends the growing conversation as input. Architect for the minimum number of calls that does the job well.
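A sketch of how multi-call input costs add up when each step re-sends everything so far. The 2,000-token starting prompt and 300 tokens appended per step are hypothetical; real agents vary widely, and some trim or summarise history to avoid this growth.

```python
def agent_input_tokens(steps: int, base_prompt: int, tokens_added_per_step: int) -> int:
    """Total input tokens when each call re-sends the prompt plus all prior step output."""
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context                  # this call pays for the full current context
        context += tokens_added_per_step  # its result is appended for the next call
    return total


single = agent_input_tokens(1, 2_000, 300)
agent = agent_input_tokens(15, 2_000, 300)
print(single, agent, agent / single)  # 2000 61500 30.75
```

Fifteen calls here cost about 31x the input tokens of one call, not 15x, because the context grows at every step.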
4. Caching
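If the same input produces the same output, cache it. Many products see significant overlap in requests. Two kinds of caching help: prompt caching, which major providers offer so that a repeated prompt prefix (a long system prompt, a shared document) is billed at a reduced rate, and result caching, where you store the full response and skip the model call entirely. Both cut the bill meaningfully.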
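Result caching can start as an exact-match lookup keyed on the model name and full prompt. This is a minimal in-memory sketch: `fake_model` is a hypothetical stand-in for a paid API call, and a production cache would typically add expiry and shared storage. It is only appropriate where returning the same answer for the same input is acceptable.

```python
import hashlib


class ResultCache:
    """Exact-match result cache keyed on model name + full prompt."""

    def __init__(self, call_model):
        self._call_model = call_model  # function(model, prompt) -> str
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        # Hash so long prompts don't become huge dictionary keys.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1          # served from cache: no inference cost
            return self._store[key]
        self.misses += 1            # cache miss: pay for one real call
        result = self._call_model(model, prompt)
        self._store[key] = result
        return result


def fake_model(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real (paid) model call.
    return f"answer to: {prompt}"


cache = ResultCache(fake_model)
for q in ["reset password", "pricing", "reset password", "reset password"]:
    cache.get("small", q)
print(cache.hits, cache.misses)  # 2 2
```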
Modelling your unit economics
Before scaling, model the cost per user. The calculation:
- Cost per core interaction. Estimate input + output tokens for one valuable interaction, multiply by the model's prices. (Most providers have a tokenizer and pricing page; a rough estimate is fine.)
- Interactions per active user per month. How often does an engaged user trigger AI?
- Monthly inference cost per active user = cost per core interaction × interactions per active user per month.
- Compare to what you charge (or to the value if it's internal).
A worked example: if a core interaction uses 2,000 input + 500 output tokens on a mid-tier model, that might be ~$0.01-0.03 per interaction. At 200 interactions/month per active user, that's $2-6/user/month in inference. If you charge $30/month, you have healthy gross margin. If you charge $10 and heavy users do 1,000 interactions, each of those users costs $10-30/month in inference, so you break even at best and lose money at worst on your power users.
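The worked example can be checked in a few lines. This sketch uses only the article's own figures ($0.01-0.03 per interaction, 200 or 1,000 interactions, $30 or $10 plans); "margin" here counts inference alone and ignores hosting, support, and other costs.

```python
def monthly_cost_per_user(cost_per_interaction: float, interactions: int) -> float:
    return cost_per_interaction * interactions


def gross_margin(price: float, inference_cost: float) -> float:
    # Share of revenue left after inference only.
    return (price - inference_cost) / price


LOW, HIGH = 0.01, 0.03  # ~$0.01-0.03 per interaction, as in the example

for price, interactions in [(30, 200), (10, 1_000)]:
    lo = monthly_cost_per_user(LOW, interactions)
    hi = monthly_cost_per_user(HIGH, interactions)
    print(f"${price}/mo, {interactions} interactions: cost ${lo:.0f}-{hi:.0f}, "
          f"margin {gross_margin(price, hi):.0%} to {gross_margin(price, lo):.0%}")
```

The $30 plan keeps roughly 80-93% of revenue after inference; the $10 heavy user lands between 0% and -200%.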
The numbers move fast as model prices drop, but the discipline is the same: know the rough per-user cost before you scale.
The "power user" trap
Average cost can hide a dangerous distribution. If 5% of your users generate 50% of the AI calls, your blended economics look fine while your power users lose you money on every interaction.
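A toy illustration of the 5%/50% split. The user base, $10 price, and $0.02 per interaction are hypothetical, chosen so that 5 of 100 users make exactly half the calls.

```python
PRICE = 10.0          # hypothetical subscription, $/user/month
COST_PER_CALL = 0.02  # hypothetical inference cost per interaction

# Hypothetical 100-user base: 95 light users at 50 calls, 5 power users at 950.
usage = [50] * 95 + [950] * 5

total_calls = sum(usage)
top_5pct = sorted(usage, reverse=True)[: len(usage) // 20]
power_share = sum(top_5pct) / total_calls
blended_margin = 1 - (total_calls * COST_PER_CALL) / (PRICE * len(usage))
power_user_cost = max(usage) * COST_PER_CALL

print(f"top 5% of users make {power_share:.0%} of calls")         # 50%
print(f"blended gross margin on inference: {blended_margin:.0%}")  # 81%
print(f"each power user pays ${PRICE:.2f} but costs ${power_user_cost:.2f}")  # $10.00 vs $19.00
```

The blended 81% margin looks healthy, yet every power user loses $9 a month.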
Guard against this:
- Usage-based pricing or limits so heavy users pay proportionally (see AI product pricing)
- Per-user cost monitoring so you can see the distribution, not just the average
- Rate limits that cap runaway usage (see shipping AI safely)
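Per-user monitoring and a spend cap can share one small tracker. This is a hypothetical sketch (the `UsageGuard` class, the $5 cap, and the $0.25 per-call cost are all illustrative); a real system would persist spend, reset it monthly, and likely use a shared store rather than process memory.

```python
from collections import defaultdict


class UsageGuard:
    """Tracks per-user spend and enforces a hypothetical monthly cost cap."""

    def __init__(self, monthly_cap: float):
        self.monthly_cap = monthly_cap
        self.spend: defaultdict[str, float] = defaultdict(float)  # user_id -> $ this month

    def allow(self, user_id: str, estimated_cost: float) -> bool:
        # Refuse a call that would push the user past their cap.
        return self.spend.get(user_id, 0.0) + estimated_cost <= self.monthly_cap

    def record(self, user_id: str, actual_cost: float) -> None:
        self.spend[user_id] += actual_cost

    def top_spenders(self, n: int = 5) -> list[tuple[str, float]]:
        # See the distribution, not just the average.
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]


guard = UsageGuard(monthly_cap=5.0)
allowed = 0
for _ in range(100):
    if guard.allow("power_user", 0.25):
        guard.record("power_user", 0.25)
        allowed += 1
guard.record("light_user", 0.50)
print(allowed, guard.top_spenders(2))  # 20 [('power_user', 5.0), ('light_user', 0.5)]
```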
How the economics improve over time
Two trends work in your favour:
- Model prices keep falling. The cost of a given capability drops over time as the field improves. A model-agnostic architecture lets you ride this - switch to the cheaper model when it's good enough.
- You optimise. Early products are unoptimised. Right-sizing models, improving retrieval, adding caching - these often cut the bill substantially without touching quality.
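One way to stay model-agnostic is to have product code depend on a thin interface and pick the concrete model from configuration. In this sketch, `ChatModel`, `FakeModel`, and the model names are hypothetical; a real implementation would wrap each provider's SDK behind the same `complete` method.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    name: str

    def complete(self, prompt: str) -> str: ...


@dataclass
class FakeModel:
    # Hypothetical stand-in for a provider client.
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt[:20]}"


# Product code looks models up by task, so switching a task to a cheaper
# model is a one-line config change rather than a code change.
MODELS: dict[str, ChatModel] = {
    "summarise": FakeModel("small-v2"),
    "analyse": FakeModel("frontier-v1"),
}


def summarise(text: str) -> str:
    return MODELS["summarise"].complete(f"Summarise: {text}")


print(summarise("quarterly numbers"))
MODELS["summarise"] = FakeModel("small-v3")  # a cheaper model proved good enough
print(summarise("quarterly numbers"))
```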
So early high costs aren't necessarily fatal if the trajectory is favourable. But you need to know the numbers to know whether you're on a good trajectory or a bad one.
When token cost doesn't matter
For some products, inference cost is rounding error - low-frequency, high-value interactions where the user happily pays far more than the cost. A product where each user makes a few high-value AI calls per month and pays $100/month doesn't need to obsess over tokens.
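A quick way to see which kind of product you are is to compute inference as a share of what each user pays. The figures below are hypothetical: 5 calls a month, each with 20,000 input and 2,000 output tokens on a frontier model priced at $15/$75 per million tokens, on a $100/month plan.

```python
def inference_share(calls_per_month: int, input_tokens: int, output_tokens: int,
                    input_price_per_m: float, output_price_per_m: float,
                    monthly_price: float) -> float:
    """Fraction of a user's subscription spent on inference."""
    per_call = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
    return calls_per_month * per_call / monthly_price


# Hypothetical low-frequency, high-value product.
share = inference_share(5, 20_000, 2_000, 15.0, 75.0, 100.0)
print(f"inference is {share:.2%} of revenue per user")
```

Even with long prompts on an expensive model, inference here is about 2% of revenue, which is the "rounding error" case.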
Know which kind of product you are. High-frequency or long-context products must manage token economics carefully. Low-frequency high-value products can largely ignore it. Most fall in between and benefit from basic discipline: right model, right context, basic caching.
What to do next
If you're building an AI product and want to make sure the unit economics work before you scale, book a 30-minute discovery call. We model this as part of architecture.
Read next: Multi-model architecture; AI product pricing: charging when your costs are variable.