AI Agent Cost Explosions: What Happens in 6 Unsupervised Hours

When an AI agent operates without budget enforcement, structural drift transforms a simple API call into a chain of cascading requests. One user request that should trigger 2 API calls can easily become 47 in a retry loop. In a 6-hour unsupervised window with 100 concurrent users, this drift compounds into a $677 cost event, all undetected until the invoice arrives. This is not an observability problem. Langfuse, Helicone, and Portkey will show you the cost after the fact. AgentCost prevents the cost from occurring in the first place through budget enforcement.

How Do AI Agents Enter Cost Spirals Without Warning?

The fundamental problem is structural. One API call no longer equals one task.

In traditional applications, you make one request, get one response, show it to the user. Done.

Agentic AI breaks this assumption completely.

When you ask an agent to "research competitive pricing for our SaaS product," here's what actually happens:

  1. Agent makes initial search API call
  2. Realizes it needs more context, makes 3 follow-up calls
  3. Hits a rate limit, retries with exponential backoff
  4. Gets partial results, decides to validate with different sources
  5. Makes 15 more API calls across different search endpoints
  6. Synthesizes results, makes 2 more calls to fact-check claims
  7. Formats response, makes 1 final call for grammar checking

What you expected: 1 API call, $0.024 in costs. What you got: 23 API calls, $0.552 in costs.

Now multiply that by a weekend. Multiply by retry loops. Multiply by 100 concurrent users who all triggered the same research flow.

This is what Friday 3pm to Monday 9am looks like in reality.

What's the Real Cost of a 6-Hour Runaway Agent?

Let's calculate the actual numbers from a real incident.

The scenario:

  • Agent designed to handle customer support inquiries
  • Expected pattern: 1 user request = 2 API calls (understanding + response)
  • Expected cost per request: ~$0.048

What happened over the weekend:

  • API provider experienced intermittent rate limits
  • Agent entered retry loop with exponential backoff
  • Each user request triggered 47 API calls instead of 2
  • 100 concurrent users, averaging 1 request per hour each

Cost calculation for structural drift:

expected_calls_per_request = 2
actual_calls_per_request = 47
concurrent_users = 100
requests_per_hour = 1
hours_unsupervised = 6

total_api_calls = actual_calls_per_request * concurrent_users * requests_per_hour * hours_unsupervised
# 47 * 100 * 1 * 6 = 28,200 calls

cost_per_call = 0.024  # $0.015/1K input tokens + $0.06/1K output tokens
total_6hour_cost = total_api_calls * cost_per_call
# 28,200 * $0.024 = $676.80

daily_extrapolation = total_6hour_cost * 4        # four 6-hour windows per day
monthly_if_undetected = total_6hour_cost * 120    # 4 windows/day * 30 days
# Daily: $2,707  | Monthly: $81,216

The team discovered this Monday morning. Not from an alert. From the invoice.

Why Does Observability Fail to Prevent Agent Cost Explosions?

Every major platform in the market positions cost tracking as an observability feature:

Langfuse gives you detailed traces and cost breakdowns after the spending happens. Its dashboard shows exactly how much each agent spent, when it spent it, and which prompts triggered the highest costs. Excellent for post-mortem analysis. Useless for prevention.

Helicone provides cost analytics and caching to reduce future spend. You'll get beautiful charts showing cost trends and optimization opportunities. But if your agent enters a retry loop at 6pm on Friday, those charts won't stop it from running.

Portkey added per-key budgets to their gateway architecture. Better than pure observability, but budgets at the API key level miss the core problem: you need budget attribution per agent, not per authentication method.

The pattern is consistent across all three. These tools answer "what happened" and "how much did it cost." They don't answer "how do I stop it from happening."

The difference between watching a disaster and preventing a disaster is architecture.

How Does Governance-First Cost Control Work?

AgentCost's CEL policy engine treats cost control as a governance problem, not a monitoring problem.

Here's how the same scenario looks with budget enforcement in place:

from agentcost.sdk import trace, budget

@budget(agent="customer-support", hourly_limit=50.0)
@trace(agent="customer-support")
def handle_support_request(user_message):
    # Your existing agent logic
    response = llm.generate(prompt=user_message)
    return response

# What happens in the 6-hour window
# (100 users x 47 calls x $0.024 = $112.80/hour of attempted spend):
# Minute 6:  470 calls = $11.28 spend (23% of budget)
# Minute 12: +470 calls = $22.56 total (45% of budget)
# Minute 18: +470 calls = $33.84 total (68% of budget - soft alert)
# Minute 24: +470 calls = $45.12 total (90% of budget - page on-call)
# Minute 27: $50.00 HARD LIMIT REACHED
# Agent stops making API calls for the rest of the hour
# Capped cost: at most $50.00 per hour (vs $112.80 per hour, $676.80 over 6 hours, without governance)

The agent is constrained, not monitored. It cannot exceed the budget. Period.
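To make the idea of a hard limit concrete, here is a minimal in-process sketch of how a decorator could refuse calls once an hourly budget is spent. It is not AgentCost's implementation: `HourlyBudget`, `BudgetExceeded`, and `enforce` are hypothetical names, and the flat per-call cost is an assumption that keeps the example short.

```python
import functools
import time


class BudgetExceeded(RuntimeError):
    """Raised when a call would push spend past the configured limit."""


class HourlyBudget:
    """Tracks spend inside a fixed one-hour window (hypothetical helper)."""

    def __init__(self, limit_usd, clock=time.monotonic):
        self.limit_usd = limit_usd
        self.clock = clock
        self.window_start = clock()
        self.spent = 0.0

    def charge(self, cost_usd):
        now = self.clock()
        if now - self.window_start >= 3600:
            # A new hour has started: reset the window.
            self.window_start = now
            self.spent = 0.0
        if self.spent + cost_usd > self.limit_usd:
            raise BudgetExceeded(
                f"limit ${self.limit_usd:.2f} reached (spent ${self.spent:.2f})"
            )
        self.spent += cost_usd


def enforce(budget, cost_per_call):
    """Decorator: charge the budget before each call, block when exhausted."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            budget.charge(cost_per_call)  # raises before the call is made
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

The check runs before the wrapped function, so a blocked call never reaches the provider. A production version would need thread safety and a per-call cost derived from actual token usage.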

What Is Structural Drift in AI Agents?

Traditional LLM applications have predictable cost patterns. One user action triggers one API call.

Agentic systems have two layers of cost drift that make traditional forecasting impossible:

Layer 1: Agentic Execution Drift

  • User requests map to variable API call chains
  • Retry logic compounds costs in failure scenarios
  • Multi-step reasoning creates cascading dependencies
  • Tool use triggers unpredictable downstream calls

A customer support agent answering "What are my billing options?" might trigger:

  • 1 initial intent classification call
  • 2 knowledge base retrieval calls
  • 3 validation calls to external billing system
  • 2 retries on failed connections
  • 1 synthesis call to format response

Total: 9 API calls instead of the 1 you budgeted for.

Layer 2: Metrological Drift

  • Token count doesn't correlate with business value
  • Similar prompts can have vastly different outcomes
  • Context switching creates hidden prompt inflation
  • Edge cases require 10x token spend for same task

An agent debugging a production incident is worth more than an agent answering a FAQ. But both might use identical token counts. Your cost model can't distinguish between them.

This is why traditional cost forecasting fails for agent cost governance. You cannot predict costs based on historical patterns when the execution model is fundamentally variable.
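Execution drift is easier to reason about once it is measured per user request. The sketch below counts downstream calls for the billing example above and reports the multiplier against the budgeted single call. `CallLedger` is a hypothetical helper, not part of any SDK, and it is not thread-safe.

```python
from collections import Counter
from contextlib import contextmanager


class CallLedger:
    """Counts downstream API calls per user request (hypothetical helper)."""

    def __init__(self):
        self.calls_per_request = Counter()
        self._current_request = None

    @contextmanager
    def request(self, request_id):
        # Everything recorded inside the `with` block is charged to request_id.
        previous, self._current_request = self._current_request, request_id
        try:
            yield self
        finally:
            self._current_request = previous

    def record_call(self):
        self.calls_per_request[self._current_request] += 1

    def drift_multiplier(self, request_id, budgeted_calls=1):
        return self.calls_per_request[request_id] / budgeted_calls


ledger = CallLedger()
with ledger.request("billing-options"):
    # classification, retrieval, validation, retries, synthesis
    for step_calls in (1, 2, 3, 2, 1):
        for _ in range(step_calls):
            ledger.record_call()

print(ledger.drift_multiplier("billing-options"))  # 9.0
```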

How Real-Time Anomaly Detection Prevents Runaway Costs

Heartbeat-based anomaly detection catches runaway agents before they exhaust budgets:

# AgentCost monitors cost velocity in real-time
anomaly_detection_rules = {
    "cost_spike_detection": "hourly_spend > baseline * 3",
    "call_pattern_anomaly": "calls_per_minute > 50",
    "retry_loop_detection": "failed_calls / total_calls > 0.4"
}

# When the customer support agent entered its retry loop:
# - 400% increase in hourly spend rate detected
# - 94 calls per minute (baseline: 12)
# - 67% failure rate on API calls
# Alert fired: 12 minutes into anomaly
# On-call engineer: paged immediately
# Problem contained: before $600+ incident

The difference in outcome:

  • Without anomaly detection: $677 weekend bill, discovered Monday
  • With anomaly detection: $50 hard limit, 12-minute response
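The three rule expressions shown above can be read as plain predicates over a metrics window. The following sketch evaluates them in Python. The `WindowMetrics` fields are assumptions about what such a window would contain, and the sample values are illustrative readings of the incident figures (a 400% increase is taken as 5x baseline).

```python
from dataclasses import dataclass


@dataclass
class WindowMetrics:
    hourly_spend: float
    baseline_hourly_spend: float
    calls_per_minute: float
    failed_calls: int
    total_calls: int


def triggered_rules(m):
    """Return the names of the rules that fire for this metrics window."""
    fired = []
    if m.hourly_spend > m.baseline_hourly_spend * 3:
        fired.append("cost_spike_detection")
    if m.calls_per_minute > 50:
        fired.append("call_pattern_anomaly")
    # Guard against division by zero when no calls were made.
    if m.total_calls > 0 and m.failed_calls / m.total_calls > 0.4:
        fired.append("retry_loop_detection")
    return fired


retry_loop = WindowMetrics(
    hourly_spend=50.0,
    baseline_hourly_spend=10.0,
    calls_per_minute=94,
    failed_calls=67,
    total_calls=100,
)
print(triggered_rules(retry_loop))
```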

How Does Per-Agent Cost Attribution Differ from Competitors?

Unlike competitors that track costs by API key or trace, AgentCost provides per-agent cost attribution across 2,610+ models from 40+ providers.

When your CFO asks "which agent is driving our AI costs," you get specificity:

  • Agent: customer-support
  • Period: Last 7 days
  • Total spend: $1,247.82
  • API calls: 4,821
  • Average cost per call: $0.26
  • Top cost driver: retry loops (47% of spend)
  • Recommended action: implement circuit breaker pattern

This level of attribution is impossible with:

  • Gateway-based solutions (only see API keys, not agent context)
  • Observability platforms (track traces without agent ownership)
  • SDK-based proxies (no agent-aware policy enforcement)

AgentCost knows which agent made each call, which means it can enforce budgets and attribute costs at the agent level, not the key level.
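A report like the one above reduces to a simple aggregation once every call record carries an agent label. The sketch below is illustrative only: the record fields (`agent`, `cost_usd`, `is_retry`) are assumed, not AgentCost's schema.

```python
from collections import defaultdict


def summarize_by_agent(call_records):
    """Aggregate per-call records into per-agent totals."""
    totals = defaultdict(lambda: {"calls": 0, "spend": 0.0, "retry_spend": 0.0})
    for rec in call_records:
        agent = totals[rec["agent"]]
        agent["calls"] += 1
        agent["spend"] += rec["cost_usd"]
        if rec.get("is_retry", False):
            agent["retry_spend"] += rec["cost_usd"]

    report = {}
    for name, t in totals.items():
        report[name] = {
            "calls": t["calls"],
            "spend": round(t["spend"], 4),
            "avg_cost_per_call": round(t["spend"] / t["calls"], 4),
            # Share of this agent's spend caused by retries.
            "retry_share": round(t["retry_spend"] / t["spend"], 2) if t["spend"] else 0.0,
        }
    return report
```

The key design point is that the agent label is attached when the call is made. It cannot be reconstructed afterwards from an API key shared by several agents.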

What Does This Mean for Engineering Architecture?

The shift from observability to governance changes how you architect AI systems:

1. Budget as code in your deployment configuration. Every agent gets explicit cost limits. No invoice surprises.

2. Circuit breakers for cost anomalies. Automatic failsafes when agents hit retry loops or cost spikes.

3. Real-time cost alerts, not monthly invoice shocks. Cost anomalies trigger pages. Not spreadsheet reviews.

4. Agent-level cost accountability. Clear attribution for each autonomous system's spending.

5. Semantic caching to reduce redundant calls. When agents make similar requests repeatedly, semantic caching reduces duplicate API calls and costs.
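The circuit breaker in item 2 can be sketched in a few lines. This is a generic consecutive-failure breaker, not AgentCost's failsafe. `RetryCircuitBreaker` and `call_with_breaker` are hypothetical names, and backoff delays are omitted for brevity.

```python
class RetryCircuitBreaker:
    """Opens after too many consecutive failures (hypothetical helper)."""

    def __init__(self, max_consecutive_failures=5):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.max_consecutive_failures

    def record(self, success):
        # Any success resets the streak; failures extend it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def call_with_breaker(breaker, api_call, max_attempts=47):
    """Retry api_call until it succeeds, the breaker opens, or attempts run out."""
    attempts = 0
    while attempts < max_attempts and not breaker.is_open:
        attempts += 1
        try:
            result = api_call()
        except Exception:  # broad on purpose in this sketch
            breaker.record(success=False)
            continue
        breaker.record(success=True)
        return result, attempts
    return None, attempts
```

With a threshold of 5, an endpoint that keeps failing costs 5 calls instead of 47. The caller decides what to do with the `None` result.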

How to Get Started with Cost Governance

The difference between watching disasters and preventing them is governance architecture.

Step 1: Understand your current agent cost patterns. Try AgentCost's live chaos simulator to see how your agents would behave under cost stress. The simulator includes 28 events and 9 presets, all running client-side in your browser.

Step 2: Implement per-agent budgets. Add one-line SDK integration to your existing agent code. No infrastructure changes required.

import openai
from agentcost import TracedLLMClient

# Wrap your existing LLM client
client = TracedLLMClient(openai.OpenAI())

# Every API call is now tracked, attributed, and budgeted

Step 3: Deploy with governance, not just monitoring. Use CEL policies to enforce hard spend limits. Use anomaly detection to catch cost spikes before they compound.

The era of cost surprises in agentic AI is ending. The era of cost governance is beginning.

Try the demo | Star on GitHub

The Two-Layer Drift Model: Why Your AI Cost Tracking Is Blind

Your AI cost tracking is fundamentally blind to two critical problems. First, structural drift: one user request generates 15-47 API calls through agent retries, fallbacks, and chained operations. Your cost dashboard shows the spike, but you expected one call, not forty-seven. Second, metrological drift: two identical 1,000-token prompts can have wildly different business value (customer support versus code generation). Token counters are semantically blind. Most engineering teams experience 3-5x budget overruns despite comprehensive dashboards because they're using observability tools to solve governance problems. Observability answers "What did we spend?" Governance answers "How much are we allowed to spend?" These require different solutions. Understanding this two-layer drift model explains why budget surprises happen and how to prevent them through real-time cost enforcement.

What Is Structural Drift in Agentic AI Systems?

Structural drift breaks the foundational assumption of traditional API cost modeling: one user request equals one API call. In production agentic systems, this assumption collapses.

Picture a typical customer support agent handling a single support ticket:

  • Initial reasoning call (1 API request)
  • Knowledge base query with retry logic (2 additional calls)
  • Search refinement with expanded terms (3 more calls)
  • Response generation (1 final call)
  • Fallback to alternative model on timeout (2 additional calls)

Total: 9 API calls for one user request. Your budget projected 1.

This multiplier cascades across multi-agent systems. A customer onboarding workflow involving document processing, data validation, notifications, and audit logging can trigger 11-26 API calls per customer. Your budget was calculated assuming 1 call per onboarding event.

The real-world impact: Teams report that typical agentic tasks with retry logic and fallback queries generate 15-47 API calls per user request (versus 1 call in traditional request-response patterns). ReAct-style agents average 3-4 API calls per reasoning step. Multi-agent tool use averages 6.2 tool invocations per task.

This is structural drift: your cost model assumes one thing, your agents do another.

How Metrological Drift Hides in Your Cost Attribution

Structural drift is visible if you examine your logs carefully. Metrological drift is invisible because it operates at the semantic level, not the syntactic level.

Two prompts with identical token counts can differ in business value by orders of magnitude:

Prompt A (Customer support): "Summarize support ticket..." [1000 tokens]
Cost: $0.002 | Business value: $0.10 (faster support response)

Prompt B (Code generation): "Generate API endpoint..." [1000 tokens]
Cost: $0.002 | Business value: $500 (3 hours engineering time saved)

Token counters are semantically blind. Your cost attribution system treats these identically. Your P&L impact is 5,000x different.

This creates blind spots when you're trying to:

  • Allocate AI costs to business units
  • Calculate ROI per use case
  • Budget for next quarter's agent deployment
  • Justify AI spend to the CFO

Traditional cost tracking tools aggregate by API key, model, or time period. None capture business value or agent intent.

Why Observability Tools Cannot Enforce Governance

Current AI cost tracking platforms (Langfuse, Helicone, Portkey) are observability systems. They excel at showing you what happened. They cannot prevent what's about to happen.

The Observability-Governance Gap

Langfuse: Provides comprehensive tracing and cost breakdown. No budget enforcement. No anomaly detection. No policy engine. When an agent enters a runaway loop at 3 AM Saturday, Langfuse shows a perfect dashboard of exactly how it burned $2,400 over the weekend. It won't stop the agent from running.

Helicone: Offers semantic caching to reduce duplicate calls. Helps with cost optimization after expensive patterns are identified. No per-agent attribution. No real-time budget gates. When your Q4 plan assumes $50,000 in AI costs but you hit $180,000 by October, Helicone identifies which models were most expensive. It doesn't identify which agent or business process drove the overage.

Portkey: Provides an AI gateway with load balancing and fallbacks. This is infrastructure, not governance. Gateway-level budgets are blunt instruments applied per-API-key, not per-agent. In multi-tenant systems where dozens of agents share one API key, a budget overrun shuts down everything.

Core Problem: Observability tells you what happened. Governance prevents it from happening. These are fundamentally different problems requiring fundamentally different solutions.

From Prediction to Bounds: The Control Framework

Cost governance requires a different mental model than cost observability. Instead of predicting what you'll spend (impossible with agentic systems), you set bounds on what you're allowed to spend and enforce them in real-time.

from agentcost.sdk import trace, budget

# Traditional observability approach (no enforcement)
@trace
def expensive_agent_task(query):
    return llm.completion(query)

# Governance approach (real-time bounds)
@trace
@budget(max_cost=5.00, window="1h")  # Hard limit per agent per hour
def controlled_agent_task(query):
    return llm.completion(query)

When the controlled agent hits its hourly budget, AgentCost returns a budget exceeded error instead of making the API call. The agent can handle this gracefully (fallback to cached response, simpler model, human handoff). The runaway loop stops before it burns your budget.

This approach addresses both drift layers:

Structural drift protection: Budget limits enforce regardless of how many internal API calls an agent makes. One user request generating 47 calls hits the same budget limit as 47 separate requests.

Metrological drift protection: Budget allocation reflects business value, not token counts. Code generation agents get $20/hour budgets. Summarization agents get $2/hour budgets.
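The graceful handling mentioned above (cached response, simpler model, human handoff) might look like the following sketch. `BudgetExceededError` is a stand-in for whatever error the enforcement layer raises. The function names and the order of fallbacks are assumptions.

```python
class BudgetExceededError(Exception):
    """Stand-in for whatever error the enforcement layer raises."""


def answer_with_fallback(query, primary_model, cheaper_model, cache):
    """Try the primary path; degrade step by step if the budget is exhausted."""
    try:
        return primary_model(query), "primary"
    except BudgetExceededError:
        pass

    # A cached answer costs nothing, so try it before any other model.
    cached = cache.get(query)
    if cached is not None:
        return cached, "cache"

    try:
        return cheaper_model(query), "cheaper_model"
    except BudgetExceededError:
        return "A human agent will follow up shortly.", "human_handoff"
```

Returning the path taken alongside the answer makes it easy to count how often each fallback is used.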

Real-Time Anomaly Detection for Runaway Agents

Governance isn't just hard budget limits. It's detecting abnormal patterns before they become expensive problems.

AgentCost's anomaly detection uses heartbeat analysis to identify agents consuming budget faster than expected:

anomaly_policy = {
    "customer_support_agent": {
        "baseline_cost_per_hour": 2.50,
        "anomaly_threshold": 3.0,  # 3x baseline triggers alert
        "escalation_threshold": 5.0  # 5x baseline triggers shutdown
    },
    "code_generation_agent": {
        "baseline_cost_per_hour": 15.00,
        "anomaly_threshold": 2.0,
        "escalation_threshold": 3.0
    }
}

When the customer support agent burns $7.50/hour (3x baseline), you get an alert. At $12.50/hour (5x baseline), it is automatically shut down pending human review.

This catches runaway loops, infinite retry scenarios, and prompt injection attacks before they destroy your monthly budget.
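As an illustration of how a policy of this shape could be evaluated, the sketch below classifies an observed hourly spend against the baseline multiples. `classify_spend` and its return labels are hypothetical. The policy values are copied from the example above.

```python
ANOMALY_POLICY = {
    "customer_support_agent": {
        "baseline_cost_per_hour": 2.50,
        "anomaly_threshold": 3.0,
        "escalation_threshold": 5.0,
    },
    "code_generation_agent": {
        "baseline_cost_per_hour": 15.00,
        "anomaly_threshold": 2.0,
        "escalation_threshold": 3.0,
    },
}


def classify_spend(agent, hourly_spend, policy=ANOMALY_POLICY):
    """Return 'ok', 'alert', or 'escalate' for an observed hourly spend."""
    rules = policy[agent]
    multiple = hourly_spend / rules["baseline_cost_per_hour"]
    # Check the higher threshold first so it is never masked by the lower one.
    if multiple >= rules["escalation_threshold"]:
        return "escalate"
    if multiple >= rules["anomaly_threshold"]:
        return "alert"
    return "ok"


print(classify_spend("customer_support_agent", 7.50))   # alert
print(classify_spend("customer_support_agent", 12.50))  # escalate
```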

Per-Agent Cost Attribution at Scale

Meaningful AI cost governance requires granular attribution. You need to know not just what you spent, but which agent spent it, for which business process, serving which customer.

@trace(
    agent_id="customer_onboarding_v2",
    business_unit="growth",
    customer_tier="enterprise",
    process_stage="document_validation"
)
def process_customer_documents(documents, customer_id):
    validation_results = []
    for doc in documents:
        result = llm.analyze_document(doc)
        validation_results.append(result)
    return validation_results

This attribution feeds into chargeback reporting, budget allocation planning, and ROI analysis per business unit. When your CFO asks "How much does customer onboarding cost?" you answer: "$23.50 per enterprise customer for document validation."

Implementation: One-Line SDK Integration

AgentCost wraps your existing LLM client calls with zero refactoring:

# Before: direct OpenAI calls
import openai
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# After: wrapped with AgentCost governance
from agentcost.sdk import wrap_openai
import openai

client = wrap_openai(openai.OpenAI())
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

The wrapper automatically adds:

  • Per-call cost calculation using AgentCost's vendored pricing database (2,610+ models, zero external dependencies)
  • Budget enforcement based on configured policies
  • Semantic caching to prevent duplicate expensive calls
  • Real-time anomaly detection for cost spikes
  • Multi-dimensional attribution for chargeback and ROI analysis

Works with OpenAI, Anthropic, Google, and 40+ other providers.

What This Means for Your Team

The two-layer drift model explains why your current approach isn't working. You're using observability tools to solve governance problems. You're measuring after the fact instead of controlling during execution.

If your team experiences:

  • Monthly AI bills 3-5x your projections
  • No clear way to attribute costs to specific agents or business processes
  • Runaway loops burning budget over weekends
  • Executive pressure to justify AI spend with concrete ROI numbers

You need cost governance, not cost observability. You need bounds, not predictions.

Next Steps

Try the chaos simulator to see how different budget policies handle cost spikes: demo.agentcost.in

Read the integration docs for your LLM framework (LangChain, CrewAI, AutoGen): docs.agentcost.in

Star the open-source repo and join the community: github.com/agentcostin/agentcost

The next runaway agent is already in your production system. The question is whether you'll find out when it burns your budget, or prevent it from happening at all.

Why AI Agents Need Budget Bounds, Not Dashboards

Your observability dashboard displays that AI agents spent $47,000 last month. The metrics are crisp, attribution is precise, and graphs are beautiful. You cannot prevent any of it from happening again.

This is the central paradox in AI cost management today. Engineering teams have conflated monitoring with governance. Visibility and control are not the same. Observability platforms show what happened after your budget burns. Budget bounds prevent spending before it occurs.

Agentic AI amplifies this problem. AI agents need budget bounds, not just dashboards. A single user request can trigger 47+ API calls through agent loops and tool use. Your dashboard reveals this linearly: 47 separate calls, perfect attribution, zero prevention. This distinction separates reactive cost monitoring from proactive cost governance.

This post explores three things: Why structural drift breaks cost prediction. How budget enforcement differs fundamentally from observability. How to implement governance through per-agent limits and policy-as-code.

The Structural Drift Problem: One Request, Many Calls

Traditional cost models assume linearity. One API request equals one task. Agentic AI breaks this assumption completely.

A customer support agent built with CrewAI receives one question: "Why was my order delayed?" This triggers:

  • 3 calls to retrieve order history
  • 5 calls to analyze shipping data
  • 12 calls to generate response drafts
  • 8 calls to fact-check information
  • 4 calls to format the final response

That's 32 API calls for one user question. Your dashboard shows 32 separate line items. It provides zero insight into preventing the 33rd call that might push you over budget.

This is structural drift: execution breaks the assumption that tasks map to calls. Agentic loops, tool use chains, and multi-turn reasoning multiply the cost per request unpredictably.

Why prediction fails under structural drift: You cannot forecast whether an agent will need 1 call or 47 calls to solve a problem. The only reliable approach is setting bounds and enforcing them in real-time, not predicting costs upfront.

Dashboards excel at post-hoc analysis. They cannot prevent drift.

Observability vs Governance: A Fundamental Distinction

The AI tooling market has convinced teams that visibility equals control. This is fundamentally incorrect.

What observability tools do: Track what happened. Show beautiful cost breakdowns. Enable post-incident analysis. Alert after thresholds are exceeded.

What governance tools do: Prevent what's allowed to happen. Set hard budget limits. Block API calls when bounds are breached. Enforce policy before incidents occur.

Langfuse, Helicone, and similar platforms added cost tracking as an observability feature. They show what your agents spent, but cannot prevent the spending. This is equivalent to installing a speedometer in a car and calling it cruise control.

Capability | Observability | Governance
Cost Attribution | Total spend tracking | Per-agent budget enforcement
Prevention | Alerts (reactive) | Hard limits (proactive)
Policy Enforcement | Static thresholds | Dynamic CEL rules
Runaway Protection | Post-mortem analysis | Real-time API blocking
Design Priority | Understanding costs | Controlling costs

AgentCost applies financial risk management to AI costs: Bound, don't predict. Set hard limits. Enforce them through policy-as-code. Measure drift against bounds. This is governance, not observability.

The paradigm shift is critical: You cannot forecast agentic behavior reliably. You can enforce budget limits reliably.

The Runaway Agent Problem: When Dashboards Arrive Too Late

At 3 AM on Sunday, an AI agent enters an infinite loop. It burns $400 per hour for 6 hours before anyone notices. Your monthly bill contains a $2,400 surprise.

This scenario is occurring across the industry as teams deploy multi-agent systems without proper cost controls. Common failure modes include:

Tool misuse: Agent calls expensive APIs in uncontrolled loops, accumulating costs with each iteration.

Hallucinated endpoints: Agent invents API calls that don't exist but still incur charges before failing.

Context explosion: Agent includes entire conversation history in every prompt, quadrupling token counts.

Chain reactions: One agent's output triggers cascading cost in dependent agents, creating multiplicative effects.

Observability tools excel at explaining what happened. They'll show exactly when runaway started, which prompts triggered it, and total cost. They cannot prevent it from happening next time.

Budget enforcement with hard limits would have blocked the 15th API call. The agent would fail fast instead of failing expensive. Governance prevents incidents. Dashboards document them.
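The context explosion failure mode listed above is easy to quantify. The sketch below assumes every turn adds the same number of tokens, which is a simplification. It compares resending the full history on every call with keeping a fixed window of recent turns. Both function names are illustrative.

```python
def prompt_tokens_full_history(turns, tokens_per_turn):
    """Total prompt tokens when every call resends all previous turns."""
    # Call k carries k turns of context, so the sum is 1 + 2 + ... + turns.
    return tokens_per_turn * turns * (turns + 1) // 2


def prompt_tokens_windowed(turns, tokens_per_turn, window):
    """Total prompt tokens when each call carries at most `window` turns."""
    return sum(tokens_per_turn * min(k, window) for k in range(1, turns + 1))


print(prompt_tokens_full_history(10, 500))   # 27500
print(prompt_tokens_windowed(10, 500, 3))    # 13500
```

Full-history prompts grow quadratically with conversation length, which is why a loop that looks cheap per call becomes expensive over many iterations.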

How to Implement AI Agent Budget Enforcement

Real governance requires three components: per-agent attribution, policy-as-code enforcement, and hard blocking.

Step 1: Per-Agent Cost Attribution

Every agent gets its own cost bucket. This is essential for multi-agent systems where you need to understand which agent is expensive, not just total system cost.

from agentcost.sdk import trace, budget

@trace(agent_id="customer_support")
@budget(daily_limit=100.0, currency="USD")
def handle_customer_query(query: str):
    # Agent logic here
    # Cost automatically attributed to customer_support agent
    # Budget automatically enforced
    response = llm_call(query)
    return response

Two decorators wrap the agent function. From that point forward, every API call made inside it is tracked, attributed to the agent, and checked against budget limits.

Step 2: Policy-as-Code via CEL

Use Google Common Expression Language (CEL) to define dynamic budget rules that adapt to context:

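# Soft limit: warn at 80% of budget
# Hard limit: block at 100% of budget
# The stricter threshold is checked first; otherwise any spend above 100.0
# would already match the 80.0 branch and return 'WARN'.
policy = """
request.agent_id == 'customer_support' && daily_spend > 100.0 ? 'BLOCK' :
request.agent_id == 'customer_support' && daily_spend > 80.0 ? 'WARN' : 'ALLOW'
"""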

Policies can reference agent ID, time of day, request type, or any tracked metadata. Rules adapt dynamically without code changes.
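When unit-testing thresholds, it can help to keep a plain-Python mirror of the rule next to the policy. The function below is such a mirror, not CEL and not AgentCost's engine. It assumes the rule applies only to the `customer_support` agent and that the stricter threshold must be evaluated first.

```python
def budget_decision(agent_id, daily_spend, warn_at=80.0, block_at=100.0):
    """Mirror of a warn-at-80, block-at-100 rule for one agent."""
    if agent_id != "customer_support":
        return "ALLOW"
    if daily_spend > block_at:  # stricter check first
        return "BLOCK"
    if daily_spend > warn_at:
        return "WARN"
    return "ALLOW"


for spend in (50.0, 90.0, 120.0):
    print(spend, budget_decision("customer_support", spend))
```

If the two comparisons were swapped, a spend of 120.0 would return 'WARN' and the hard limit would never fire.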

Step 3: Real-Time Enforcement

Budget enforcement happens at the API gateway level. Calls are blocked before they reach the LLM provider. This prevents runaway costs in real-time.

The result is predictable AI spend with hard upper bounds, regardless of agent behavior in production.

Supply Chain Security: Why Vendored Pricing Matters

The LiteLLM supply chain attack (CVE-2026-33634) exposed a critical vulnerability in AI infrastructure: external dependencies for pricing data.

Most cost tracking tools rely on external APIs to fetch current model pricing. When dependencies are compromised, your entire cost attribution system becomes vulnerable.

AgentCost maintains a vendored pricing database for 2,610+ models from 40+ providers. Zero external dependencies. All pricing data is embedded locally.

This architectural choice proved prescient during the LiteLLM incident. While other platforms scrambled to patch compromised external dependencies, AgentCost continued operating normally because all pricing data is embedded. Independence from external APIs is a governance requirement, not a convenience.
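A vendored pricing table means cost calculation is a local lookup with no network call. The sketch below illustrates the idea with a single made-up entry. The model name is hypothetical, the rates reuse the $0.015 and $0.06 per 1K token figures quoted earlier, and the token counts are assumptions.

```python
# Hypothetical vendored table: USD per 1,000 tokens, shipped with the package.
PRICING = {
    "example-model": {"input_per_1k": 0.015, "output_per_1k": 0.06},
}


def call_cost(model, input_tokens, output_tokens, pricing=PRICING):
    """Compute the cost of one call from local pricing data only."""
    try:
        rates = pricing[model]
    except KeyError:
        # Fail loudly instead of fetching prices from a remote source.
        raise ValueError(f"no vendored price for model {model!r}") from None
    return (input_tokens / 1000) * rates["input_per_1k"] + (
        output_tokens / 1000
    ) * rates["output_per_1k"]


print(round(call_cost("example-model", 1000, 150), 4))  # 0.024
```

The trade-off is freshness: a vendored table has to be updated with each release, which is the price of removing a runtime dependency.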

The H:A Ratio: Measuring Governance Maturity

The real question isn't whether you have cost visibility. It's whether you have cost control.

Consider your H:A ratio: how many humans are managing how many AI agents in production? A 20:1 ratio with proper budget governance looks very different from 20:1 with runaway loops.

Teams building production AI systems need to shift from reactive monitoring to proactive governance. This means:

  • Setting hard budget limits per agent, not just dashboard alerts
  • Implementing real-time enforcement, not post-incident analysis
  • Using policy-as-code for dynamic rules, not static thresholds
  • Maintaining vendor independence through vendored dependencies

The future of AI cost management isn't better dashboards. It's better bounds.

Next Steps: From Visibility to Control

Ready to move from cost monitoring to cost governance?

Try the interactive demo to see per-agent cost attribution and budget enforcement in action.

Explore the open source repository to integrate budget governance into your AI agents. For teams building multi-agent systems with CrewAI, AutoGen, or LangChain, per-agent cost attribution is table stakes for production deployment.

Read the CEL policy engine documentation to define dynamic budget rules that adapt to your team's governance requirements.

Discussion

What's the biggest AI cost surprise your team has experienced? Was it from runaway agents, unpredictable token usage, or something else? How are you currently preventing runaway costs in production?

The litellm Supply Chain Attack Proves AI Cost Tools Need Fewer Dependencies, Not More

How we built AgentCost with 4 dependencies — and why that decision matters more than ever after March 24, 2026.

By Founder of AgentCost | March 26, 2026


On March 24, 2026, a routine pip install litellm became one of the most devastating supply chain attacks in AI infrastructure history.

For approximately three hours, versions 1.82.7 and 1.82.8 of litellm — a package downloaded 95 million times per month — silently exfiltrated SSH keys, AWS/GCP/Azure credentials, Kubernetes secrets, crypto wallets, CI/CD tokens, database passwords, and every API key stored in .env files. The malware didn't even need you to import litellm. Simply having it installed was enough — it executed on every Python process startup.

Andrej Karpathy called it "Software horror." Elon Musk quote-tweeted "Caveat emptor." The Hacker News thread hit 324 points. And every AI developer who read the news had the same thought: Am I affected?

The answer, for a terrifyingly large number of teams, was yes.


What actually happened

The attack was the capstone of a coordinated campaign by a threat actor called TeamPCP. Here's the chain:

Step 1: TeamPCP compromised Aqua Security's Trivy — an open-source vulnerability scanner — by exploiting a GitHub Actions workflow vulnerability on March 19. They force-pushed malicious code to 75 of 76 version tags.

Step 2: litellm used Trivy in its own CI/CD pipeline for security scanning. When the compromised Trivy ran inside litellm's GitHub Actions workflow, TeamPCP harvested litellm's PyPI publishing token from the runner environment.

Step 3: With that token, they published two backdoored versions directly to PyPI, bypassing litellm's normal release process entirely.

The irony is brutal: a security scanner became the attack vector.

The payload was a three-stage weapon. Stage 1 harvested every credential it could find — SSH keys, cloud provider tokens, Kubernetes configs, environment variables, even cryptocurrency wallets. Stage 2 encrypted everything with AES-256 and exfiltrated it to models.litellm.cloud (a domain registered the day before, designed to look like legitimate litellm infrastructure). Stage 3 installed a persistent backdoor via systemd and, in Kubernetes environments, deployed privileged pods to every node in the cluster.

The attack was only discovered because the malware had a bug. Callum McMahon of FutureSearch was testing a Cursor MCP plugin that pulled litellm as a transitive dependency. The .pth file mechanism — which fires on every Python startup — created an accidental fork bomb that crashed his machine from RAM exhaustion. Without that bug, the credential stealer would have run silently for days or weeks.

As Karpathy noted: "So if the attacker didn't vibe code this attack it could have been undetected for many days or weeks."
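The .pth mechanism described above can be audited with the standard library. Python's `site` module executes any line in a `.pth` file that starts with `import` followed by a space or tab, so listing those lines shows what runs at interpreter startup. The helper below is a minimal sketch, and `executable_pth_lines` is a name chosen for this example. Findings need human review, because some legitimate packages also use this mechanism.

```python
from pathlib import Path


def executable_pth_lines(site_dir):
    """List lines in *.pth files that Python would execute at startup."""
    findings = []
    for pth in sorted(Path(site_dir).glob("*.pth")):
        text = pth.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            # The site module runs lines beginning with "import " or "import\t".
            if line.startswith(("import ", "import\t")):
                findings.append((pth.name, lineno, line.strip()))
    return findings
```

To scan an environment, call it once per directory returned by `site.getsitepackages()` and on `site.getusersitepackages()`.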


The blast radius: it's not just litellm

Here's what makes this attack catastrophic. litellm isn't just a standalone tool. It's a transitive dependency embedded inside the most popular AI frameworks:

  • CrewAI depends on litellm as its default LLM router
  • DSPy (Stanford NLP) depends on litellm>=1.64.0
  • MLflow uses litellm for multi-provider LLM support
  • LlamaIndex has a dedicated litellm integration package
  • Browser-Use, Opik, Mem0, Instructor, Guardrails, Agno — all affected

If you ran pip install crewai or pip install dspy on March 24 without pinned versions, you were compromised — even though you never directly installed litellm. It arrived silently inside your dependency tree.
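Because litellm can arrive transitively, the question to ask is what is installed, not what was requested. The sketch below uses `importlib.metadata` from the standard library to compare the installed version with the two versions named in this post. `litellm_status` is a hypothetical helper, and a clean result does not prove the machine was never exposed.

```python
from importlib import metadata

# Versions named in this post as backdoored.
COMPROMISED_VERSIONS = {"1.82.7", "1.82.8"}


def litellm_status(installed_version=None):
    """Classify the installed litellm version (or one passed in explicitly)."""
    if installed_version is None:
        try:
            installed_version = metadata.version("litellm")
        except metadata.PackageNotFoundError:
            return "not installed"
    if installed_version in COMPROMISED_VERSIONS:
        return "compromised"
    return "not a known-bad version"


print(litellm_status("1.82.7"))  # compromised
print(litellm_status("1.82.6"))  # not a known-bad version
```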

Projects scrambled to respond. MLflow filed PR #21971 within hours, pinning to litellm<=1.82.6. CrewAI went further — PR #5040 began decoupling from litellm entirely. The CVE (CVE-2026-33634) was assigned a CVSS score of 9.4.

Wiz's head of threat exposure summarized it bluntly: "The open source supply chain is collapsing in on itself. Trivy gets compromised → litellm gets compromised → credentials from tens of thousands of environments end up in attacker hands → and those credentials lead to the next compromise. We are stuck in a loop."


The dependency problem in AI tooling

This attack validates something Karpathy has been saying for months. In his post, he wrote:

"Classical software engineering would have you believe that dependencies are good (we're building pyramids from bricks), but imo this has to be re-evaluated."

He's right. And the problem is particularly acute for AI cost and observability tools. Here's why:

AI cost tools handle the most sensitive credentials in your stack. If your cost tracker needs to measure spending across OpenAI, Anthropic, Google, and Azure, it needs access to API keys for all of them. If that tool depends on litellm — which itself centralizes those keys through a proxy — a single supply chain compromise hands attackers every AI credential your organization possesses.

Transitive dependencies are invisible attack surface. Most teams don't audit what their dependencies depend on. You install a cost tracker. It depends on an LLM router. The router's CI pipeline runs a security scanner. The scanner gets compromised. Three layers deep, and your SSH keys are on an attacker's server.

The AI ecosystem moves fast and pins poorly. Unlike mature ecosystems where lockfiles and exact version pinning are standard practice, many AI projects still use loose version constraints like litellm>=1.64.0. During the attack window, any build or install that resolved litellm pulled the malicious version.
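
To make the difference between a loose constraint and a pin concrete, here is a minimal sketch that checks the two malicious versions against the constraints mentioned in this post. The parse_version and satisfies helpers are hypothetical and deliberately simplified; real resolvers follow PEP 440, which also covers pre-releases and compound constraints.

```python
def parse_version(text):
    """Turn '1.82.7' into (1, 82, 7) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, constraint):
    """Check a version against one constraint such as '>=1.64.0'.

    Simplified on purpose: only '>=', '<=' and '==' with plain
    dotted-integer versions are handled.
    """
    for op in (">=", "<=", "=="):
        if constraint.startswith(op):
            bound = parse_version(constraint[len(op):])
            candidate = parse_version(version)
            if op == ">=":
                return candidate >= bound
            if op == "<=":
                return candidate <= bound
            return candidate == bound
    raise ValueError(f"unsupported constraint: {constraint}")


MALICIOUS = ["1.82.7", "1.82.8"]  # the versions named in this post

for constraint in (">=1.64.0", "<=1.82.6", "==1.82.6"):
    admitted = [v for v in MALICIOUS if satisfies(v, constraint)]
    print(f"litellm{constraint:10} admits malicious versions: {admitted}")
```

Only the open-ended lower bound lets the backdoored releases through. An upper bound or an exact pin would have kept a fresh install on a known version.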


How we built AgentCost with 4 dependencies

When we started building AgentCost, an open-source AI cost governance platform, one of the earliest architectural decisions was about the dependency tree. The temptation was obvious: depend on litellm for multi-provider support, depend on LangChain for framework integrations, depend on a dozen utility libraries for convenience.

We chose the opposite path. AgentCost has 4 direct dependencies. That's it.

Here's why, and how:

1. We vendor our own pricing database

AgentCost maintains a pricing database of 2,610+ models from 40+ providers, updated weekly via a GitHub Action that syncs upstream pricing data. This data is vendored directly into the package — it ships with AgentCost, not as a runtime dependency on an external service.

Many competing tools depend on litellm's model_cost map for pricing data. When PyPI quarantined litellm on March 24 (all versions, not just the malicious ones), any fresh install or rebuild of those tools could no longer fetch litellm, and their pricing data went with it. AgentCost's vendored database continued working normally.
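
As a rough illustration of what "vendored" means in practice, the sketch below computes a call's cost from a table that lives inside the package. It is not AgentCost's actual code: the VENDORED_PRICING table, the model names, the prices, and the estimate_cost function are all placeholders made up for this example.

```python
# Hypothetical vendored pricing table, in USD per 1M tokens.
# Model names and prices are placeholders, not real rates.
VENDORED_PRICING = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


def estimate_cost(model, input_tokens, output_tokens, pricing=VENDORED_PRICING):
    """Return the estimated USD cost of one call using only local data."""
    try:
        rates = pricing[model]
    except KeyError:
        raise KeyError(f"no vendored price for model {model!r}") from None
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


cost = estimate_cost("example-large", input_tokens=2_000, output_tokens=500)
print(f"${cost:.4f}")
```

Nothing here touches the network or imports a third-party package, so the lookup keeps working even if an upstream project disappears from PyPI. In a real package the table would more likely be a data file shipped alongside the code.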

2. We wrap provider SDKs directly

Instead of routing through a universal proxy like litellm, AgentCost's trace() function wraps your existing provider client:

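from agentcost.sdk import trace
from openai import OpenAI

# Wrap the existing OpenAI client; the API key stays inside the OpenAI SDK
client = trace(OpenAI(), project="my-app")

# Use the wrapped client exactly as before
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)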

The provider SDK (openai, anthropic, etc.) is your dependency, not ours. AgentCost intercepts the call metadata — model, tokens, latency — without sitting in the credential path. We never see or store your API keys.
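
To show the general idea of wrapping a client without sitting in the credential path, here is a minimal sketch of a recording proxy. It is not how trace() is implemented: TracingProxy, fake_create, and fake_client are hypothetical names, and the fake client stands in for a real provider SDK so the example runs offline. The usage field names (prompt_tokens, completion_tokens) mirror the shape of OpenAI chat completion responses.

```python
import time
from types import SimpleNamespace


class TracingProxy:
    """Wraps any client object and records metadata for each call."""

    def __init__(self, target, records, path=""):
        self._target = target
        self._records = records
        self._path = path

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        path = f"{self._path}.{name}" if self._path else name
        if callable(attr):
            return self._wrap_call(attr, path)
        if hasattr(attr, "__dict__"):
            # Keep proxying nested namespaces such as client.chat.completions.
            return TracingProxy(attr, self._records, path)
        return attr  # plain values pass through and are never recorded

    def _wrap_call(self, func, path):
        def traced(*args, **kwargs):
            start = time.perf_counter()
            response = func(*args, **kwargs)
            usage = getattr(response, "usage", None)
            self._records.append({
                "call": path,
                "model": kwargs.get("model"),
                "input_tokens": getattr(usage, "prompt_tokens", None),
                "output_tokens": getattr(usage, "completion_tokens", None),
                "latency_s": time.perf_counter() - start,
            })
            return response
        return traced


def fake_create(model, messages):
    """Stand-in for a provider call; returns a response with token usage."""
    usage = SimpleNamespace(prompt_tokens=9, completion_tokens=12)
    return SimpleNamespace(usage=usage, text="Hello!")


fake_client = SimpleNamespace(
    api_key="sk-not-a-real-key",
    chat=SimpleNamespace(completions=SimpleNamespace(create=fake_create)),
)

records = []
client = TracingProxy(fake_client, records)
client.chat.completions.create(
    model="example-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(records[0]["call"], records[0]["model"], records[0]["input_tokens"])
```

The record holds the call path, model, token counts, and latency. The API key remains an attribute of the wrapped client and never appears in what the proxy stores.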

3. We avoid deep dependency trees

Every dependency you add is a dependency on that package's dependencies, and their dependencies, and so on. litellm alone pulls in dozens of transitive packages. Each one is a potential attack vector.

By keeping our dependency count to 4, the total transitive tree is small enough to audit manually. You can run pip install agentcostin and see exactly what goes into your environment.
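
If you want to count a package's transitive tree yourself, the standard library's importlib.metadata is enough. The sketch below is one way to do it, with some assumptions: the helper names are made up, the root package name is taken from the install command in this post, and the result over-counts slightly because optional extras are included.

```python
import re
from importlib import metadata

NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_name(requirement):
    """Extract a normalized package name: 'litellm>=1.64.0' -> 'litellm'."""
    match = NAME_RE.match(requirement.strip())
    if match is None:
        return None
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def build_graph():
    """Map each installed distribution to the package names it declares."""
    graph = {}
    for dist in metadata.distributions():
        name = requirement_name(dist.metadata.get("Name") or "")
        if name is None:
            continue
        deps = {requirement_name(req) for req in (dist.requires or [])}
        deps.discard(None)
        graph[name] = deps
    return graph


def transitive_deps(graph, root):
    """Return every package reachable from root, however many layers deep."""
    seen = set()
    stack = [root]
    while stack:
        for dep in graph.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


ROOT = "agentcostin"  # assumption: package name as given in this post
deps = transitive_deps(build_graph(), ROOT)
print(f"{ROOT}: {len(deps)} transitive dependencies")
for dep in sorted(deps):
    print("  ", dep)
```

Swap ROOT for any installed package to compare tree sizes. If the package is not installed in the current environment, the result is simply an empty set.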

4. We pin exact versions in CI

Our GitHub Actions workflows pin every dependency to exact versions with hash verification. Our Docker images are built from pinned requirements. This isn't glamorous work, but it's the difference between being compromised and not being compromised during a three-hour supply chain attack window.
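
Hash verification is worth a quick illustration, because it is what protects you even when an attacker republishes under a version number you already trust. pip supports this through its hash-checking mode (the --require-hashes flag with --hash=sha256:... entries in a requirements file). The sketch below shows only the underlying idea, using made-up artifact bytes and hypothetical helper names; it is not pip's implementation.

```python
import hashlib
import hmac


def sha256_of(data):
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data, pinned_sha256):
    """Return True only if the artifact bytes match the pinned digest."""
    return hmac.compare_digest(sha256_of(data), pinned_sha256)


# Made-up artifact contents standing in for a downloaded wheel.
original = b"contents of the release you reviewed"
tampered = original + b"\nimport backdoor"

# The digest is recorded once, when the dependency is pinned.
pinned = sha256_of(original)

print("original accepted:", verify_artifact(original, pinned))
print("tampered accepted:", verify_artifact(tampered, pinned))
```

A version pin says which release you expect. A hash pin says which exact bytes you expect, so a swapped file fails the check regardless of its version label.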


What you should do right now

If you're running AI workloads in production, here's your immediate checklist:

Check if you were exposed:

pip show litellm | grep Version
# If 1.82.7 or 1.82.8, assume full credential compromise
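
pip show only inspects the environment that pip itself is bound to. If you have several virtual environments or containers, a small script run with each interpreter may be more reliable. This is a minimal sketch; check_litellm is a hypothetical helper, and the version list comes from this post.

```python
from importlib import metadata

COMPROMISED = {"1.82.7", "1.82.8"}  # the versions named in this post


def check_litellm(installed_version):
    """Classify an installed litellm version string (None if absent)."""
    if installed_version is None:
        return "not installed"
    if installed_version in COMPROMISED:
        return "COMPROMISED: assume full credential compromise"
    return "installed, not a known-malicious version"


try:
    found = metadata.version("litellm")
except metadata.PackageNotFoundError:
    found = None

print(f"litellm {found}: {check_litellm(found)}")
```

A clean result only describes the environment as it is now. If a malicious version was installed during the attack window and later upgraded or removed, this check will not show it, so review install logs and lockfile history as well.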

Check transitive exposure:

pip install pipdeptree
pipdeptree --reverse --packages litellm
# Shows every package in your environment that depends on litellm

If exposed, rotate everything:

  • SSH keys
  • Cloud provider credentials (AWS, GCP, Azure)
  • Kubernetes configs and secrets
  • All API keys in .env files
  • Database passwords
  • CI/CD tokens

Check for persistence:

# Local backdoor
ls -la ~/.config/sysmon/
# Kubernetes pods
kubectl get pods -n kube-system | grep node-setup

For the long term:

  • Pin dependencies to exact versions (use lockfiles)
  • Audit your transitive dependency tree
  • Consider whether each dependency is worth the risk it introduces
  • For cost tracking specifically: choose tools with minimal dependency footprints that don't sit in the credential path

The bigger lesson

The litellm attack isn't really about litellm. It's about what happens when the AI ecosystem's most sensitive credentials flow through deeply embedded transitive dependencies maintained by small teams with limited security resources.

The attack chain — from a vulnerability scanner to an LLM proxy to your production credentials — illustrates a fundamental architectural problem that no amount of post-hoc scanning can fix. The fix is structural: fewer dependencies, vendored data, direct provider wrapping, and tools that stay out of the credential path.

Karpathy concluded his post by advocating for using LLMs to "yoink" functionality rather than importing heavy dependency trees. Whether or not that specific approach scales, the principle is sound: every dependency is a trust decision, and the AI ecosystem has been making those decisions far too casually.

We built AgentCost with 4 dependencies because we believe cost governance should be lightweight, auditable, and safe. The litellm attack proved that this isn't just a philosophical preference — it's a security imperative.


AgentCost is open-source (MIT), tracks 2,610+ models from 40+ providers, and ships as a single pip install agentcostin with 4 dependencies. Try it: github.com/agentcostin/agentcost | demo.agentcost.in


Sources: FutureSearch (original disclosure), Snyk, Endor Labs, Wiz, The Hacker News, Andrej Karpathy (X/Twitter), DreamFactory, Arctic Wolf, SANS Institute, Cybernews. CVE-2026-33634.