How a 12-Engineer Team Cut AI Spend by 43% in One Week
Published: March 2026 · Reading time: 8 minutes
The $14,000 wake-up call
A backend engineering team at a mid-stage SaaS company was building AI-powered features across their product — a document summarizer, a customer support chatbot, a code review assistant, and an internal knowledge search. Four projects, twelve engineers, all defaulting to premium models: Claude Opus 4.6 for code review, Claude Sonnet 4.6 for the chatbot, and GPT-4.1 everywhere else.
Their February invoice across OpenAI and Anthropic landed at $14,200. The previous month was $8,100. Nobody could explain the 75% increase. There was no breakdown by project, no way to tell which feature was responsible, and no visibility into whether the token usage was efficient or wasteful.
The CTO asked two questions:
- "Which project is burning the most money?"
- "Are we using the right models for each use case?"
Nobody had answers. They had logging — every request went to Datadog — but Datadog showed latency and error rates, not cost per call, not cost per project, not whether Claude Opus 4.6 at $5/1M input tokens was overkill for half their queries.
They needed cost governance, not more observability.
Day 1: Install and instrument (45 minutes)
The platform lead installed AgentCost on a Monday morning:
Then added one line to each project's client initialization:
from agentcost.sdk import trace
from openai import OpenAI
from anthropic import Anthropic
# OpenAI projects — one line change
client = trace(OpenAI(), project="knowledge-search")
# Anthropic projects — same pattern
client = trace(Anthropic(), project="code-reviewer")
Four projects, four trace() wrappers. Each engineer changed one line in their codebase. Total instrumentation time: 45 minutes including the PR reviews.
By Monday afternoon, the dashboard was populating:
- doc-summarizer: 2,400 calls/day on GPT-5.2, $38/day
- support-chatbot: 8,100 calls/day on Claude Sonnet 4.6, $112/day
- code-reviewer: 890 calls/day on Claude Opus 4.6, $67/day
- knowledge-search: 12,300 calls/day on GPT-4.1, $41/day
The support chatbot was making 8,100 calls per day — more than anyone expected. But the code reviewer, with only 890 calls, was costing $67/day. Something was wrong.
Day 2: The cost breakdown reveals the problem
The AgentCost dashboard showed the cost-by-model breakdown:
| Project | Model | Daily Calls | Daily Cost | Avg Tokens/Call |
|---|---|---|---|---|
| code-reviewer | claude-opus-4-6 | 890 | $67.40 | 4,200 |
| support-chatbot | claude-sonnet-4-6 | 8,100 | $112.00 | 580 |
| knowledge-search | gpt-4.1 | 12,300 | $41.00 | 320 |
| doc-summarizer | gpt-5.2 | 2,400 | $38.20 | 1,100 |
Two things jumped out immediately:
The code reviewer was sending 4,200 tokens per call to Claude Opus 4.6. At $5/1M input + $25/1M output, every wasted token costs 10x what it would on a cheaper model. The team investigated and found that the system prompt included the entire project's coding standards document (3,100 tokens) in every single request. It was copy-pasted during a prototype and never optimized.
Knowledge search was making 12,300 calls per day on GPT-4.1. At $2/1M input tokens, these were simple factual lookups — "What is our refund policy?", "How do I reset my password?", "What are the office hours?" — queries that GPT-4.1-nano at $0.10/1M input could handle perfectly.
Day 3: The complexity router saves 40%
The team enabled AgentCost's complexity router on the knowledge search project. The router classifies each incoming prompt as SIMPLE, MEDIUM, COMPLEX, or REASONING, and routes to the cheapest model that can handle it:
- SIMPLE (factual lookups, classifications) → gpt-4.1-nano ($0.10/1M input)
- MEDIUM (summarization, standard generation) → gpt-4.1-mini ($0.40/1M input)
- COMPLEX (analysis, long-form) → gpt-4.1 ($2.00/1M input)
- REASONING (math, logic, multi-step) → o4-mini ($1.10/1M input)
The result on knowledge search after one day:
| Complexity | % of Queries | Model | Daily Cost |
|---|---|---|---|
| SIMPLE | 73% | gpt-4.1-nano | $0.90 |
| MEDIUM | 21% | gpt-4.1-mini | $3.50 |
| COMPLEX | 6% | gpt-4.1 | $2.40 |
Knowledge search dropped from $41/day to $6.80/day — an 83% reduction. The simple queries (73% of traffic) were handled by GPT-4.1-nano at 1/20th the cost, with no user-visible quality difference. The team ran a blind evaluation on 200 queries and found GPT-4.1-nano answered simple factual questions with 97% accuracy versus GPT-4.1's 99%.
Day 3 (continued): Fixing the token explosion
The token analyzer flagged the code reviewer's system prompt as a problem:
Token Analyzer Alert
code-reviewer: Context efficiency score 23/100. System prompt consumes 74% of input tokens. Recommendation: Extract static content to a retrieval step or reduce system prompt length.
The team refactored the code reviewer to load coding standards via RAG instead of stuffing them into every prompt. System prompt dropped from 3,100 tokens to 280 tokens. Claude Opus 4.6 stayed as the model — code review quality matters — but the per-call cost dropped dramatically.
Result: code reviewer costs dropped from $67.40/day to $19.80/day.
Day 4: Budget enforcement prevents a repeat
With costs now visible and optimized, the team set up budget enforcement to prevent future surprises:
Project: support-chatbot
Daily budget: $150
Monthly budget: $3,500
Warning threshold: 80%
Auto-downgrade at: 90%
Block at: 100%
They configured the budget gate with auto-downgrade chains:
- At 90% of daily budget → automatically route from Claude Sonnet 4.6 to Claude Haiku 4.5 ($1/1M input — 3x cheaper)
- At 100% → block new requests, notify #ai-costs Slack channel
They also added a policy rule for the code reviewer:
{
"name": "code-reviewer-token-cap",
"conditions": {
"project": "code-reviewer",
"input_tokens": { "gt": 2000 }
},
"action": "warn",
"message": "Code reviewer input exceeds 2000 tokens — check system prompt"
}
This policy fires whenever a code reviewer call sends more than 2,000 input tokens — catching any future regression where someone accidentally bloats the prompt again.
Day 5: Forecasting catches the next spike before it happens
On Friday, the AgentCost forecast showed something concerning:
Budget Exhaustion Alert
support-chatbot: At current trajectory, monthly budget of $3,500 will be exhausted by March 22 (day 22 of 31). Ensemble forecast: $4,180 for the full month.
The team investigated and found that a new feature launch on Wednesday had increased chatbot usage by 35%. Without forecasting, they would have discovered this on the invoice at month-end. With AgentCost, they caught it on day 5 and proactively:
- Enabled the complexity router on the support chatbot (routing simple FAQs to Claude Haiku 4.5 instead of Sonnet 4.6)
- Adjusted the monthly budget to $4,000 to account for the growth
- Set up a reaction rule to notify the team lead if the forecast-to-budget ratio exceeds 110%
The results after one week
| Project | Model(s) | Before (Daily) | After (Daily) | Reduction |
|---|---|---|---|---|
| knowledge-search | GPT-4.1 → nano/mini/4.1 mix | $41.00 | $6.80 | 83% |
| code-reviewer | Claude Opus 4.6 (optimized prompts) | $67.40 | $19.80 | 71% |
| support-chatbot | Sonnet 4.6 → Haiku 4.5/Sonnet mix | $112.00 | $78.40 | 30% |
| doc-summarizer | GPT-5.2 (unchanged) | $38.20 | $34.50 | 10% |
| Total | $258.60 | $139.50 | 46% |
Weekly spend dropped from $1,810 to $977 — a 46% reduction. Projected monthly savings: $3,330/month or $40,000/year.
The breakdown of where the savings came from:
| Optimization | Monthly Savings | How |
|---|---|---|
| Complexity routing on knowledge-search | $1,030 | 73% of queries routed to GPT-4.1-nano |
| System prompt fix on code-reviewer | $1,430 | RAG replaced 3,100-token system prompt in Claude Opus 4.6 calls |
| Complexity routing on support-chatbot | $1,010 | Simple FAQs routed from Sonnet 4.6 to Haiku 4.5 |
| Total | $3,470 |
What they run today
Six weeks later, the team's AgentCost setup includes:
- 4 projects tracked with per-project budgets and forecasting
- Complexity router on 3 of 4 projects (doc-summarizer stays on GPT-5.2 due to quality requirements)
- Budget gate with auto-downgrade chains: Opus 4.6 → Sonnet 4.6 → Haiku 4.5, and GPT-4.1 → GPT-4.1-mini → GPT-4.1-nano
- 5 policy rules preventing token explosions, blocking deprecated models, and requiring approval for o3 usage (at $2/1M input, it needs justification)
- Slack notifications via the reactions engine for budget warnings, cost spikes, and weekly scorecard digests
- Agent scorecards grading each project monthly (knowledge-search earned an A, code-reviewer improved from D to B)
- Goal-aware attribution tagging costs to quarterly OKRs so the CTO can answer "how much did the Q1 AI features cost?"
The platform lead summarized it:
"We went from 'we have no idea what AI costs' to 'we know exactly what each feature costs, we forecast what it will cost next month, and the system automatically prevents overruns.' The complexity router alone paid for the setup time in the first hour."
Try it yourself
AgentCost is open-source (MIT) and self-hosted. One command to start:
Add cost tracking to your code with one line:
Live Demo — see the dashboard with sample data, no install required.
GitHub · Docs · Compare vs Langfuse/Helicone/Portkey
This case study is based on composite scenarios from real-world AI cost patterns observed across engineering teams. Pricing reflects actual model costs as of March 2026: Claude Opus 4.6 ($5/$25 per 1M tokens), Claude Sonnet 4.6 ($3/$15), Claude Haiku 4.5 ($1/$5), GPT-5.2 ($1.75/$14), GPT-4.1 ($2/$8), GPT-4.1-mini ($0.40/$1.60), GPT-4.1-nano ($0.10/$0.40), o3 ($2/$8), o4-mini ($1.10/$4.40).