User Feedback on Traces

AgentCost lets you attach thumbs-up/down feedback to any LLM trace. This connects cost data to quality data — the missing link for knowing whether a cheaper model actually works well enough.

Why Feedback Matters for Cost Governance

Cost optimization without quality signals is dangerous. If you switch from GPT-4.1 to GPT-4.1-mini to save money, you need to know whether users notice the difference. Feedback gives you that signal:

  • Per-model quality: "GPT-4.1 gets 92% positive, GPT-4.1-mini gets 78%: the 80% cost savings come with a 14-point drop in positive ratings"
  • Per-prompt-version quality: "V3 of support-bot gets 95% positive vs V2's 82%"
  • Cost-per-positive response: "GPT-4.1-mini costs $0.002 per positive-rated response vs $0.008 for GPT-4.1 — 4x more cost-efficient despite the quality gap"

Quick Start

Submit feedback via API

# Thumbs up
curl -X POST http://localhost:8100/api/feedback \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "abc-123", "score": 1, "comment": "Accurate and helpful"}'

# Thumbs down
curl -X POST http://localhost:8100/api/feedback \
  -H "Content-Type: application/json" \
  -d '{"trace_id": "abc-123", "score": -1, "comment": "Hallucinated a date", "tags": ["hallucination"]}'
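For clients that cannot install the agentcost package, the same request can be built with the Python standard library. This is a minimal sketch, assuming the server from the curl examples is reachable at http://localhost:8100. `build_feedback_request` is a hypothetical helper, not part of AgentCost, and the score check follows the values listed under "Score values".

```python
import json
import urllib.request

# Base URL taken from the curl examples; adjust for your deployment.
BASE_URL = "http://localhost:8100"


def build_feedback_request(trace_id, score, comment=None, tags=None):
    """Build (but do not send) a POST /api/feedback request.

    Hypothetical helper for illustration; the payload fields mirror
    the curl examples (trace_id, score, comment, tags).
    """
    if score not in (-1, 0, 1):
        raise ValueError("score must be -1, 0, or 1")
    payload = {"trace_id": trace_id, "score": score}
    if comment is not None:
        payload["comment"] = comment
    if tags:
        payload["tags"] = list(tags)
    return urllib.request.Request(
        f"{BASE_URL}/api/feedback",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_feedback_request("abc-123", -1, "Hallucinated a date", ["hallucination"])
# To actually send it (requires a running server):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Building the request separately from sending it keeps the payload easy to inspect and unit-test.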

Submit feedback via Python

from agentcost.feedback import get_feedback_service

svc = get_feedback_service()

# Simple thumbs up
svc.submit("trace-abc-123", score=1)

# Detailed feedback
svc.submit(
    "trace-abc-123",
    score=-1,
    comment="Response was too long and included incorrect pricing",
    source="human-review",
    user_id="reviewer-jane",
    tags=["verbose", "inaccurate"],
)

Score values

| Score | Meaning | Use when |
|---|---|---|
| 1 | Positive / thumbs up | Response was correct, helpful, or met expectations |
| 0 | Neutral | Response was acceptable but not noteworthy |
| -1 | Negative / thumbs down | Response was wrong, unhelpful, or harmful |

Analytics

Quality by model

See which models produce the best results in your application:

curl http://localhost:8100/api/feedback/quality/models
[
  {
    "model": "gpt-4.1",
    "total_feedback": 240,
    "positive_pct": 92.1,
    "negative_pct": 4.2,
    "avg_score": 0.879,
    "avg_cost": 0.0034,
    "cost_per_positive": 0.00369
  },
  {
    "model": "gpt-4.1-mini",
    "total_feedback": 180,
    "positive_pct": 78.3,
    "negative_pct": 12.2,
    "avg_score": 0.661,
    "avg_cost": 0.0006,
    "cost_per_positive": 0.000766
  }
]

The cost_per_positive metric tells you the real efficiency story: in this sample, GPT-4.1-mini costs about 4.8x less per positive-rated response ($0.000766 vs $0.00369) despite its lower positive rate.
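The sample figures are consistent with cost_per_positive being avg_cost divided by the fraction of positive feedback. The sketch below reproduces the numbers under that assumption. The formula is inferred from the sample response, not confirmed by AgentCost, and `cost_per_positive` here is an illustrative function.

```python
# Sample rows copied from the response above.
rows = [
    {"model": "gpt-4.1", "positive_pct": 92.1, "avg_cost": 0.0034},
    {"model": "gpt-4.1-mini", "positive_pct": 78.3, "avg_cost": 0.0006},
]


def cost_per_positive(avg_cost, positive_pct):
    """Assumed formula: average cost divided by the positive fraction."""
    if positive_pct <= 0:
        return float("inf")  # no positive responses, so no finite cost
    return avg_cost / (positive_pct / 100.0)


cpp = {r["model"]: cost_per_positive(r["avg_cost"], r["positive_pct"]) for r in rows}
ratio = cpp["gpt-4.1"] / cpp["gpt-4.1-mini"]

for model, value in cpp.items():
    print(f"{model}: ${value:.6f} per positive response")
print(f"gpt-4.1 costs {ratio:.1f}x more per positive response")
```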

Quality by prompt version

Track whether prompt changes improve or degrade quality:

curl http://localhost:8100/api/feedback/quality/prompt/support-bot
[
  {"version": 1, "total_feedback": 50, "positive_pct": 82.0, "avg_score": 0.64},
  {"version": 2, "total_feedback": 120, "positive_pct": 91.7, "avg_score": 0.833},
  {"version": 3, "total_feedback": 30, "positive_pct": 96.7, "avg_score": 0.933}
]
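One way to act on this response is to compare consecutive versions and flag drops in positive_pct. The following is a sketch only: `find_regressions`, `min_feedback`, and `max_drop` are hypothetical names and thresholds chosen for illustration, not AgentCost features.

```python
import json

# Sample response copied from above.
RESPONSE = """[
  {"version": 1, "total_feedback": 50, "positive_pct": 82.0, "avg_score": 0.64},
  {"version": 2, "total_feedback": 120, "positive_pct": 91.7, "avg_score": 0.833},
  {"version": 3, "total_feedback": 30, "positive_pct": 96.7, "avg_score": 0.933}
]"""


def find_regressions(versions, min_feedback=30, max_drop=2.0):
    """Return (previous, current, drop) tuples where positive_pct fell.

    min_feedback and max_drop are illustrative thresholds: versions with
    too little feedback are skipped because their rates are noisy.
    """
    ordered = sorted(versions, key=lambda v: v["version"])
    trusted = [v for v in ordered if v["total_feedback"] >= min_feedback]
    regressions = []
    for prev, cur in zip(trusted, trusted[1:]):
        drop = prev["positive_pct"] - cur["positive_pct"]
        if drop > max_drop:
            regressions.append((prev["version"], cur["version"], round(drop, 1)))
    return regressions


versions = json.loads(RESPONSE)
print(find_regressions(versions))  # empty list: quality improved in every step
```

Version 3 has only 30 ratings, so its 96.7% should be read with more caution than version 2's 91.7% from 120 ratings.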

Trace-level score

Get the aggregated score for a single trace:

curl http://localhost:8100/api/feedback/trace/abc-123/score
{
  "trace_id": "abc-123",
  "total": 5,
  "positive": 4,
  "negative": 1,
  "neutral": 0,
  "score": 0.6
}
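The sample score matches the mean of the individual feedback scores: (4 × 1 + 1 × (−1)) / 5 = 0.6. A minimal sketch of that aggregation follows, assuming this is how the score is derived. The formula is inferred from the sample, and `aggregate_score` is a hypothetical name.

```python
def aggregate_score(scores):
    """Summarize a list of -1/0/1 feedback scores for one trace.

    Assumes the aggregated score is the mean of the individual scores,
    which matches the sample response: (4 - 1) / 5 = 0.6.
    """
    total = len(scores)
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    return {
        "total": total,
        "positive": positive,
        "negative": negative,
        "neutral": total - positive - negative,
        "score": round(sum(scores) / total, 3) if total else 0.0,
    }


summary = aggregate_score([1, 1, 1, 1, -1])
print(summary)
```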

Feedback Sources

Use the source field to distinguish where feedback comes from:

| Source | Description |
|---|---|
| user | End-user thumbs up/down in your application |
| human-review | Manual review by a team member |
| automated | Automated quality checks (regex, format validation) |
| eval | LLM-as-judge or evaluation pipeline |

# Automated quality check.
# `response` is the LLM response object from your application and
# `trace_id` is the ID of the trace that produced it.
if "I don't know" in response.content:
    svc.submit(trace_id, score=-1, source="automated", tags=["refusal"])

API Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/feedback | Submit feedback on a trace |
| GET | /api/feedback | List feedback (filter by project, model, score, source) |
| GET | /api/feedback/summary | Overall feedback stats |
| GET | /api/feedback/quality/models | Quality breakdown by model |
| GET | /api/feedback/quality/model/{model} | Quality for a specific model |
| GET | /api/feedback/quality/prompt/{prompt_id} | Quality per prompt version |
| GET | /api/feedback/trace/{trace_id} | All feedback for a trace |
| GET | /api/feedback/trace/{trace_id}/score | Aggregated score |
| DELETE | /api/feedback/{id} | Delete feedback |

Best Practices

Collect feedback close to the user. Add a thumbs up/down button next to every LLM response in your UI. The closer to the interaction, the higher the response rate.

Use tags to categorize failures. Tags such as hallucination, too-long, off-topic, and formatting help you identify patterns in negative feedback.
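Once tags are in place, a simple count over negative feedback shows which failure modes dominate. The sketch below uses hypothetical in-memory records. The field names follow the submit examples, but the exact shape returned by the list endpoint is an assumption.

```python
from collections import Counter

# Hypothetical feedback records; field names follow the submit examples.
feedback = [
    {"score": -1, "tags": ["hallucination"]},
    {"score": -1, "tags": ["too-long", "formatting"]},
    {"score": 1, "tags": []},
    {"score": -1, "tags": ["hallucination", "off-topic"]},
]


def top_failure_tags(records, n=3):
    """Count tags across negative feedback only, most common first."""
    counts = Counter(
        tag
        for record in records
        if record["score"] < 0
        for tag in record.get("tags", [])
    )
    return counts.most_common(n)


print(top_failure_tags(feedback, n=1))  # → [('hallucination', 2)]
```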

Combine with prompt versioning. When you deploy a new prompt version, monitor the /api/feedback/quality/prompt/{prompt_id} endpoint to catch quality regressions before they accumulate.

Set up automated feedback for known failure patterns. If a response matches a known bad pattern (refusal, empty response, format violation), submit automated negative feedback so it shows up in quality metrics immediately.
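A rule-based checker along these lines could look like the sketch below. The rules, tag names, and the helpers `check_response` and `auto_feedback` are illustrative assumptions. The `submit` argument stands in for `svc.submit`, using the same keyword arguments as the earlier examples.

```python
import json


def check_response(text, expect_json=False):
    """Return failure tags for a response; an empty list means no rule fired.

    The rules and tag names are illustrative assumptions.
    """
    tags = []
    stripped = text.strip()
    if not stripped:
        tags.append("empty")
    elif "i don't know" in stripped.lower():
        tags.append("refusal")
    if expect_json and stripped:
        try:
            json.loads(stripped)
        except json.JSONDecodeError:
            tags.append("format-violation")
    return tags


def auto_feedback(submit, trace_id, text, expect_json=False):
    """Call submit(...) with negative feedback if any rule fired."""
    tags = check_response(text, expect_json=expect_json)
    if tags:
        submit(trace_id, score=-1, source="automated", tags=tags)
    return tags


# Stand-in for svc.submit so the sketch runs without AgentCost installed.
calls = []
auto_feedback(lambda *a, **kw: calls.append((a, kw)), "trace-abc-123", "I don't know.")
print(calls)
```

In an application you would pass `svc.submit` instead of the recording lambda. Keeping the rules in a pure function makes them easy to test before they start affecting quality metrics.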

Review cost-per-positive regularly. The cheapest model isn't always the most cost-efficient — a slightly more expensive model with much higher quality can have a lower cost per successful response.