What Real LLM Monitoring Looks Like (And Why It's Hard)
You're collecting logs, tracking latency, maybe even running evals. But can you answer: Which users are struggling? What patterns are emerging? Is quality actually improving? Here's what real LLM monitoring looks like.
Chris Wood
Founder of qckfx
You're in the #ai-engineering Slack channel when your CEO posts: "Is the AI agent actually getting better? Looking at the numbers for board prep."
You pull up your dashboard. You've got... logs. Some error rates. A chart showing API latency. Maybe even a homegrown eval suite that runs on a Thursday.
You type out a response, delete it, then: "Let me pull some data and get back to you."
Here's the thing: you can't answer this question because the data you need isn't there. You have logs and latency charts, but not the insights to know if quality is improving. And traditional monitoring tools weren't built for this.
Why Traditional Monitoring Doesn't Work
In a normal web app, everything is structured. A user clicks a button - that's an event. They view a page - that's a pageview. An API returns a 500 - that's an error. Wire these into PostHog or Sentry and you're done.
But in an AI application, your core product is language. Unstructured, contextual, high-dimensional language.
You want to know:
- Did the model actually understand what the user meant?
- Are there patterns in what users are asking about?
- Which users are succeeding vs struggling?
- Is quality improving or degrading over time?
- Did retrieval return irrelevant documents?
- Is the agent looping, retrying, or getting stuck?
These aren't things you can measure with status codes and latency percentiles.
The questions that keep you up at night:
- "Why are users rephrasing their questions?"
- "Which 20% of queries are taking 80% of the time?"
- "Is the new prompt actually better, or does it just feel better?"
- "Are we seeing the same issues repeatedly?"
- "Which power users should we talk to?"
Traditional tools give you "200 OK" and "p95 latency: 1.2s". Cool. Useless.
What Good Looks Like
Before we get into how to build this, let me show you what comprehensive LLM monitoring actually looks like: a story of how you'd use it to make your product better.
Monday Morning: You Open the Dashboard
OVERVIEW - Last 24 Hours
3,847 API calls | 892 conversations | 234 users
Avg latency: 1,342ms | Total cost: $42.18
ALERTS
⚠️ Struggling Users cohort growing: 12 → 18 users (+50%)
⚠️ Session length declining in this cohort: 8.2min → 4.3min
That's weird. You click into the "Struggling Users" cohort.
The Investigation Begins
COHORT: Struggling Users (18 users)
Behavioral trend (last 30 days):
Week 1: 6.2 messages/session, 12min avg session
Week 2: 5.8 messages/session, 10min avg session
Week 3: 4.9 messages/session, 7min avg session
Week 4: 3.2 messages/session, 4min avg session ⚠️
Rephrase rate: 3.4x (normal: 1.2x)
Latency: 4,821ms avg (normal: 1,342ms)
Tool call success rate: 67% (normal: 94%)
Sessions are getting shorter and users more frustrated. What are they trying to do?
You click "Show semantic clusters for this cohort"
TOP INTENTS - Struggling Users Cohort
1. Code generation with external APIs (87 interactions)
"Generate code that calls the Stripe API..."
"How do I authenticate with AWS SDK..."
"Write a function that uses the Twilio API..."
Avg latency: 6,234ms | Success rate: 41% ⚠️
Longitudinal: This intent growing 340% over 30 days
[Click to see example interactions]
2. Multi-file refactoring (43 interactions)
"Refactor these 5 files to use TypeScript..."
"Update all components to use the new API..."
Avg latency: 8,127ms | Success rate: 38% ⚠️
Tool loops detected: Agent keeps re-reading same files
Holy shit. Your struggling users aren't struggling because they're bad at using your product. They're trying to do things your product isn't good at yet.
You ask the AI analyst: "Why is external API code generation failing?"
AI ANALYSIS:
Based on 87 interactions in this cluster:
1. Retrieval is failing (evidence: docs/api-reference.md returned in 4/87 cases)
→ Your docs don't cover external API integration patterns
2. Context window exhaustion (evidence: truncated examples in 31/87 interactions)
→ Users paste entire API docs, agent loses important context
3. Tool call patterns show confusion (evidence: search_docs → search_docs → search_docs)
→ Agent doesn't know how to search for external API info
Recommendation: Add RAG over external API docs (Stripe, AWS, Twilio)
Estimated impact: 87 users × 8.2 sessions/week = 713 improved sessions/week
[Data References: api_calls WHERE conversation_id IN (conv_1829, conv_2847, ...)]
Now you know exactly what to build. You add external API documentation to your RAG system.
Two Weeks Later: You Check the Power Users
COHORT: Power Users (31 users, +35% from last month)
Behavioral trend (last 30 days):
Week 1: 18 messages/session, 25min avg session
Week 2: 23 messages/session, 31min avg session
Week 3: 28 messages/session, 38min avg session
Week 4: 34 messages/session, 47min avg session ✓
Session frequency: 4.2 sessions/day (up from 2.8)
Quality score: 4.7/5
What are power users discovering? You click into their semantic clusters:
EMERGING PATTERNS - Power Users
1. Planning docs in markdown (NEW - 156 interactions, last 2 weeks)
"Create a plan.md file with today's objectives..."
"Update my todo.md to mark these tasks complete..."
"Generate a weekly planning doc with priorities..."
Pattern: Using markdown files as project planning layer
Discovered organically by users
Success rate: 96%
[Example interaction showing full workflow]
2. Iterative debugging workflows (412 interactions)
Session pattern: code → test → error → fix → test → fix
Avg iterations: 3.8 before success
Power users accept errors, iterate naturally
3. Component library as knowledge base (89 interactions)
"Based on my existing Button component, create a Card..."
Pattern: Building on their own code patterns
Tool: search_codebase returning user's previous components
This is gold. Your power users are using markdown files for planning and todos. That's not in your docs. That's not in your marketing. But it works really well.
You ask the AI: "Show me a complete example of the planning workflow"
AI: Here's user_2847's typical workflow (quality score: 4.9/5):
Session start (Morning)
├─ "Create plan.md for the authentication refactor project"
├─ Agent generates structured plan with phases
├─ User: "Add acceptance criteria for each phase"
└─ Agent updates plan with detailed criteria
Mid-session (Implementation)
├─ "Based on my plan.md, start with phase 1"
├─ Agent reads plan, implements first phase
├─ "Mark phase 1 complete in plan.md"
└─ Agent updates progress
End of day
├─ "Summarize what's left in plan.md"
└─ Agent provides status report
This pattern appears in 18/31 power users.
Average quality score: 4.8/5 (vs 4.1/5 for sessions without planning)
[Data References: conversations (2847, 2912, 3104, ...)]
[Full interaction logs available]
Now you have a new feature to build and document: first-class planning file support. You add it to your docs, create templates, add syntax highlighting for plan.md files.
The Outliers: Edge Cases That Matter
You notice something in the HDBSCAN noise - 47 interactions (1.2%) that don't fit any cluster.
OUTLIERS - Unclustered Interactions
Analyzing 47 noise points...
Pattern detected in 12 outliers:
Users asking about cost/pricing mid-session
"How much is this costing me right now?"
"What's my token usage today?"
This is a NEW intent (first appeared 3 days ago)
Not covered by existing docs or agent capabilities
Recommendation: Add real-time cost tracking tool
Quick win: Low complexity, high user desire
These are the signals you'd miss without outlier analysis. New user needs emerging before they become trends.
Week Over Week: Did Your Changes Work?
IMPACT ANALYSIS
External API Documentation (Added 2 weeks ago)
├─ Struggling user cohort: 18 → 9 users (-50%) ✓
├─ Code generation success: 41% → 78% (+37pp) ✓
├─ Avg latency for API queries: 6,234ms → 2,891ms (-54%) ✓
└─ Session length recovered: 4.3min → 7.8min (+81%) ✓
Planning File Documentation (Added 1 week ago)
├─ Feature adoption: 18 → 34 users (+89%) ✓
├─ New power users: 31 → 39 users (+26%) ✓
└─ Quality score for planning sessions: 4.8/5 ✓
Cost Tracking Tool (Deployed yesterday)
└─ Already used by 23 users, 4.6/5 satisfaction
This is what monitoring should be. A conversation with your data:
- Discover cohorts you didn't know existed
- Track them longitudinally to see if they're succeeding or struggling
- Drill into semantic clusters to understand what they're trying to do
- Ask questions and get answers backed by data
- Find outliers that signal emerging needs
- Measure impact of your changes
You're having a dialogue with your product's usage data, uncovering insights that drive real product decisions.
Now let's talk about how to get there.
The Simplest Thing You Can Build Today
To get started, you need to capture the data. Here's something you can implement in an afternoon that will immediately make your life better:
A drop-in logging wrapper that captures everything.
Step 1: Set Up Storage
First, a proper Postgres schema that doesn't duplicate data:
import hashlib
import uuid

import psycopg2
from psycopg2.extras import Json


class ConversationStore:
    def __init__(self, db_connection_string: str):
        self.conn = psycopg2.connect(db_connection_string)
        self._create_tables()

    def _create_tables(self):
        with self.conn.cursor() as cur:
            # Conversations group related interactions
            cur.execute("""
                CREATE TABLE IF NOT EXISTS conversations (
                    id UUID PRIMARY KEY,
                    user_id TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT NOW(),
                    metadata JSONB
                )
            """)
            # System prompts are deduplicated
            cur.execute("""
                CREATE TABLE IF NOT EXISTS system_prompts (
                    id UUID PRIMARY KEY,
                    content TEXT NOT NULL,
                    content_hash TEXT UNIQUE NOT NULL
                )
            """)
            # Individual messages
            cur.execute("""
                CREATE TABLE IF NOT EXISTS messages (
                    id UUID PRIMARY KEY,
                    conversation_id UUID REFERENCES conversations(id),
                    role TEXT NOT NULL,
                    content TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT NOW()
                )
            """)
            # API calls track each LLM interaction
            cur.execute("""
                CREATE TABLE IF NOT EXISTS api_calls (
                    id UUID PRIMARY KEY,
                    conversation_id UUID REFERENCES conversations(id),
                    system_prompt_id UUID REFERENCES system_prompts(id),
                    message_ids UUID[] NOT NULL,
                    response_message_id UUID REFERENCES messages(id),
                    model TEXT NOT NULL,
                    latency_ms FLOAT NOT NULL,
                    input_tokens INT NOT NULL,
                    output_tokens INT NOT NULL,
                    tool_calls JSONB,
                    created_at TIMESTAMP DEFAULT NOW(),
                    metadata JSONB
                )
            """)
            # Indexes for querying
            cur.execute("""
                CREATE INDEX IF NOT EXISTS idx_api_calls_created
                ON api_calls(created_at DESC)
            """)
        self.conn.commit()

    def get_or_create_system_prompt(self, content: str) -> str:
        """Deduplicate system prompts by content hash."""
        content_hash = hashlib.sha256(content.encode()).hexdigest()
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT id FROM system_prompts WHERE content_hash = %s",
                (content_hash,)
            )
            result = cur.fetchone()
            if result:
                return str(result[0])
            prompt_id = str(uuid.uuid4())
            cur.execute(
                "INSERT INTO system_prompts (id, content, content_hash) VALUES (%s, %s, %s)",
                (prompt_id, content, content_hash)
            )
        self.conn.commit()
        return prompt_id

    def log_api_call(
        self,
        conversation_id: str,
        system_prompt: str,
        messages: list,
        response: str,
        model: str,
        latency_ms: float,
        input_tokens: int,
        output_tokens: int,
        tool_calls: list = None,
        metadata: dict = None
    ):
        """Log a single API call."""
        system_prompt_id = self.get_or_create_system_prompt(system_prompt)
        message_ids = []
        with self.conn.cursor() as cur:
            # Store messages
            for msg in messages:
                msg_id = str(uuid.uuid4())
                cur.execute(
                    "INSERT INTO messages (id, conversation_id, role, content) VALUES (%s, %s, %s, %s)",
                    (msg_id, conversation_id, msg['role'], msg['content'])
                )
                message_ids.append(msg_id)
            # Store response
            response_id = str(uuid.uuid4())
            cur.execute(
                "INSERT INTO messages (id, conversation_id, role, content) VALUES (%s, %s, 'assistant', %s)",
                (response_id, conversation_id, response)
            )
            # Store API call
            call_id = str(uuid.uuid4())
            cur.execute("""
                INSERT INTO api_calls (
                    id, conversation_id, system_prompt_id, message_ids,
                    response_message_id, model, latency_ms, input_tokens,
                    output_tokens, tool_calls, metadata
                ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            """, (
                call_id, conversation_id, system_prompt_id, message_ids,
                response_id, model, latency_ms, input_tokens, output_tokens,
                Json(tool_calls) if tool_calls else None,
                Json(metadata) if metadata else None
            ))
        self.conn.commit()
Step 2: Wrap Your OpenAI Client
Now make a drop-in replacement that logs everything:
import os
import time
import uuid

from openai import OpenAI
from psycopg2.extras import Json


class MonitoredChatCompletions:
    def __init__(self, client, store):
        self._client = client
        self.store = store

    def create(self, **kwargs):
        # Extract monitoring params (popped so they never reach the OpenAI API)
        user_id = kwargs.pop('user_id', 'anonymous')
        conversation_id = kwargs.pop('conversation_id', None)
        metadata = kwargs.pop('metadata', None)
        messages = kwargs.get('messages', [])
        model = kwargs.get('model', 'gpt-4o')

        # Pull the system prompt out of the messages so it gets deduplicated,
        # matching the system_prompts table from Step 1
        system_prompt = next(
            (m['content'] for m in messages if m.get('role') == 'system'), ""
        )

        # Create conversation if needed
        if not conversation_id:
            conversation_id = str(uuid.uuid4())
            with self.store.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO conversations (id, user_id, metadata) VALUES (%s, %s, %s)",
                    (conversation_id, user_id, Json(metadata) if metadata else None)
                )
            self.store.conn.commit()

        # Make API call
        start = time.time()
        response = self._client.chat.completions.create(**kwargs)
        latency = (time.time() - start) * 1000

        # Log it
        message = response.choices[0].message
        tool_calls_data = None
        if message.tool_calls:
            tool_calls_data = [
                {
                    "id": tc.id,
                    "function": tc.function.name,
                    "arguments": tc.function.arguments
                }
                for tc in message.tool_calls
            ]
        self.store.log_api_call(
            conversation_id=conversation_id,
            system_prompt=system_prompt,
            # Don't store the system message twice; it lives in system_prompts
            messages=[m for m in messages if m.get('role') != 'system'],
            response=message.content or "",
            model=model,
            latency_ms=latency,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            tool_calls=tool_calls_data,
            metadata=metadata
        )
        return response


class MonitoredChat:
    def __init__(self, client, store):
        self.completions = MonitoredChatCompletions(client, store)


class MonitoredLLM:
    def __init__(self, api_key=None, store=None):
        self._client = OpenAI(api_key=api_key)
        self.store = store or ConversationStore("postgresql://localhost/llm_monitoring")
        self.chat = MonitoredChat(self._client, self.store)
Step 3: Use It
# Change this one line:
# client = OpenAI(api_key="your-key")
client = MonitoredLLM(api_key=os.getenv("OPENAI_API_KEY"))

# Everything else stays the same:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"}
    ],
    user_id="user_123"  # Just add this
)
That's it. You're now capturing:
- Every prompt and response
- Token usage and costs
- Latency per call
- User IDs and conversations
- Tool calls and metadata
Step 4: Query Your Data
Now you can answer basic questions:
-- Most expensive users today
SELECT
    c.user_id,
    SUM(ac.input_tokens + ac.output_tokens) AS total_tokens,
    COUNT(*) AS num_calls
FROM api_calls ac
JOIN conversations c ON ac.conversation_id = c.id
WHERE ac.created_at > NOW() - INTERVAL '24 hours'
GROUP BY c.user_id
ORDER BY total_tokens DESC
LIMIT 10;

-- Slowest queries
SELECT
    m.content AS user_query,
    ac.latency_ms,
    ac.model
FROM api_calls ac
JOIN messages m ON m.id = ac.message_ids[array_length(ac.message_ids, 1)]
WHERE m.role = 'user'
  AND ac.created_at > NOW() - INTERVAL '24 hours'
ORDER BY ac.latency_ms DESC
LIMIT 20;

-- Hourly usage
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    COUNT(*) AS calls,
    AVG(latency_ms) AS avg_latency,
    SUM(input_tokens + output_tokens) AS tokens
FROM api_calls
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour;
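Token counts become dollar figures with a little Python on top. A sketch, assuming illustrative per-million-token prices for gpt-4o; substitute your provider's current rates:

import psycopg2

# Illustrative prices per 1M tokens; substitute your provider's current rates.
PRICES = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def cost_per_user(conn, hours: int = 24) -> dict:
    """Estimate spend per user over the last `hours` hours from logged token counts."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT c.user_id, ac.model,
                   SUM(ac.input_tokens) AS input_tokens,
                   SUM(ac.output_tokens) AS output_tokens
            FROM api_calls ac
            JOIN conversations c ON ac.conversation_id = c.id
            WHERE ac.created_at > NOW() - INTERVAL '1 hour' * %s
            GROUP BY c.user_id, ac.model
        """, (hours,))
        totals: dict = {}
        for user_id, model, input_tokens, output_tokens in cur.fetchall():
            price = PRICES.get(model, {"input": 0.0, "output": 0.0})
            totals[user_id] = totals.get(user_id, 0.0) + (
                input_tokens * price["input"] + output_tokens * price["output"]
            ) / 1_000_000
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

# Usage: cost_per_user(psycopg2.connect("postgresql://localhost/llm_monitoring"))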
This is actually useful. You can now:
- Track costs per user
- Find slow interactions
- See usage patterns
- Debug specific conversations
- Calculate ROI
And you built it in an afternoon.
What Production Actually Requires
OK, so you have logging. That's great. But look at that dashboard at the top of this article. To get from "I have logs" to "I have insights", you need to solve some genuinely hard problems.
Semantic clustering means computing embeddings for every user message, storing them in Postgres with pgvector, and running HDBSCAN clustering to find patterns. The challenge? HDBSCAN loads all embeddings into memory. For 1M messages with 384-dim embeddings, that's 1.5GB just for the vectors. HDBSCAN needs significant working memory for clustering at this scale - expect 8-16GB workers running hourly, and parameter tuning is more art than science.
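Here's roughly what that clustering step looks like: a minimal sketch, assuming the sentence-transformers and hdbscan packages, with parameter values that are starting points rather than recommendations.

import hdbscan
from sentence_transformers import SentenceTransformer

def cluster_user_messages(messages: list[str]) -> dict[int, list[str]]:
    """Embed user messages and group them into semantic clusters; label -1 is HDBSCAN noise."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings
    embeddings = model.encode(messages, normalize_embeddings=True)

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=15,  # smallest group worth calling a pattern; tune for your volume
        min_samples=5,        # higher values push more points into noise
    )
    labels = clusterer.fit_predict(embeddings)

    clusters: dict[int, list[str]] = {}
    for label, text in zip(labels, messages):
        clusters.setdefault(int(label), []).append(text)
    return clusters  # clusters[-1] holds the unclustered outliers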
Anomaly detection sounds simple until you realize "abnormal" is context-dependent. High latency might be perfectly fine for complex queries. The solution is statistical models that learn your baseline patterns - what's normal for your users, your queries, your application. Building these models takes ML work: collecting training data, tuning sensitivity, validating against false positives. But once built, they work automatically. This is the kind of thing a platform can invest in once and get right for everyone, so you don't have to become an expert in time-series anomaly detection just to know when something's breaking.
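A stripped-down version of the baseline idea, flagging latencies that fall outside a learned normal range; a real system layers seasonality, per-cohort baselines, and false-positive tuning on top of something like this:

import numpy as np

def flag_latency_anomalies(baseline_ms: list[float], recent_ms: list[float], z_threshold: float = 3.0):
    """Flag recent latencies that sit far outside a baseline learned from a known-good window.

    Compute a separate baseline per intent cluster: normal for a quick lookup is
    very different from normal for a multi-file refactor.
    """
    mean = np.mean(baseline_ms)
    std = np.std(baseline_ms) + 1e-9  # avoid divide-by-zero on flat baselines
    z_scores = (np.asarray(recent_ms) - mean) / std
    return [(latency, float(z)) for latency, z in zip(recent_ms, z_scores) if z > z_threshold]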
Cohort analysis means defining what "success" looks like for your users, which varies wildly by use case. The good news is that behavioral clustering can discover these cohorts automatically - users who have similar session patterns, latency distributions, and interaction styles naturally group together. Then AI can analyze what distinguishes these cohorts: why do power users have different tool call patterns? What are struggling users trying to do that's failing? A platform can surface these insights automatically, highlighting the meaningful differences and emerging patterns. You get the interpretation, not just the clusters.
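A sketch of that behavioral clustering, assuming scikit-learn and a handful of per-user features; the feature list here is illustrative, so use whatever signals you actually log:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def discover_cohorts(user_features: dict[str, list[float]], n_cohorts: int = 4) -> dict[int, list[str]]:
    """Group users by behavior, e.g. [messages/session, minutes/session, rephrase rate, tool success rate]."""
    user_ids = list(user_features)
    X = StandardScaler().fit_transform(np.array([user_features[u] for u in user_ids]))

    labels = KMeans(n_clusters=n_cohorts, n_init=10, random_state=0).fit_predict(X)

    cohorts: dict[int, list[str]] = {}
    for user_id, label in zip(user_ids, labels):
        cohorts.setdefault(int(label), []).append(user_id)
    return cohorts  # inspect each cohort's feature averages to name it ("power users", "struggling", ...)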
Quality evaluation is where most teams start with human labeling. You're already having PMs and engineers spot-check conversations, rate them, and flag issues. The problem is this doesn't scale past a few hundred interactions. Here's where it gets interesting: use those human labels to fine-tune an LLM-as-judge that learns your specific quality criteria. Now you can evaluate thousands of conversations, and use your ongoing human labels to continuously validate that your judge is still accurate.
But "quality" isn't one thing. You need different judges for different concerns. Is the user repeating themselves because the agent didn't understand? Is the user showing frustration in their language? Did they quit the conversation before their problem was resolved? Did the agent hallucinate or cite incorrect sources? Each of these is a separate signal that needs its own evaluation model. A good platform lets you define these metrics specific to your application - what "frustration" looks like in customer support is different from what it looks like in a coding assistant - and spin up custom judges trained on your human labels. Then you can track these metrics over time, break them down by cohort, and actually know if your product is getting better.
All of this runs on infrastructure you need to build and maintain. Message queues between your app and storage to handle bursts. Async workers. Background jobs for embedding computation (continuous), HDBSCAN clustering (hourly), anomaly detection (every 5 minutes), cohort analysis (hourly), and quality evaluation (daily samples). You'll need job schedulers like Airflow or Temporal, proper monitoring for your monitoring systems, PII detection and redaction, data retention policies, regional compliance for GDPR, backups, disaster recovery, and cost management.
Then there's the dashboard itself. Real-time updates, drill-downs from clusters to individual interactions, time range selectors, exports, saved views, materialized views for common queries, caching layers, pre-computed aggregations, configurable alert thresholds, user-defined cohorts, and pluggable evaluation criteria.
The reality: this is a 6-12 month project for a team. You'll need backend engineers for data pipelines and job scheduling, ML engineers for clustering and embeddings, frontend engineers for visualizations, DevOps for infrastructure, and product people to figure out what metrics actually matter. And it never stops. You're constantly tuning clustering parameters, adjusting anomaly thresholds, adding new analyses, scaling infrastructure, and handling edge cases.
So What Should You Do?
Here's where you actually are: You need monitoring. The question is how much and how fast.
If you're early stage, small scale, still figuring out product-market fit, start with the Postgres logging setup I showed you. You can implement it in a week. It'll cost you a few hours per week for manual analysis and $100-500/month for database costs. Export to spreadsheets when you need deeper analysis. This is good enough for under 10k messages/day, and it's valuable right now.
If you've got unique requirements, a team of 5+ engineers, 6-12 months, and this is strategic to your product, build it yourself. You'll get full control, deep integration with your stack, and no vendor lock-in. But be clear-eyed about what you're signing up for: $500k-$1M in engineering time, 1-2 engineers for ongoing maintenance, $2k-$10k/month in infrastructure costs, and the opportunity cost of not building features.
If you need insights now and you're growing past 10k messages/day, buy a platform. You'll get all the analyses out of the box, a dashboard that just works, automatic scaling, and ongoing support. You'll trade some customization and take on vendor dependency, but you'll have answers to your CEO's questions at the next board meeting.
The Bottom Line
LLM monitoring is not optional. You're flying blind without it.
The simple logging setup I showed you is valuable. Implement it today. It takes an afternoon and you'll immediately have better visibility.
But to get to the dashboard at the top of this article - semantic clustering, anomaly detection, cohort analysis, quality trends - that requires serious infrastructure and engineering.
The question isn't "should I monitor my LLM application?"
The question is "should I build it or buy it?"
This is what we're building. The observability layer for the AI era. A way to see what your LLMs are really doing and make them better with data, not vibes.
If you're thinking about this problem, shoot me an email at [email protected]. What are you struggling with? What would make your life easier? What questions are you trying to answer? I'd love to hear what you're building.