A Guide to Clustering LLM Chat Transcripts with HDBSCAN

If you're running an LLM-based application with thousands of conversations, you're sitting on a dataset that contains the answers to critical product questions: What are users actually trying to do? Where is your product failing them? Which user cohorts exist, and what do they need?

Traditional analytics dashboards don't help here. You can't A/B test unstructured conversations. You can't create a funnel when every session is a unique dialogue. This is where clustering chat transcripts with HDBSCAN becomes valuable.

This guide walks through how to use HDBSCAN to discover patterns in your chat data at scale. You'll learn to identify user cohorts, surface common failure modes, and uncover insights about how people actually use your product. We'll cover the algorithm choice, computational challenges, hyperparameter tuning, and maintaining clusters over time as new conversations arrive.

Why HDBSCAN for Chat Transcript Analysis

When analyzing chat transcripts to understand user behavior, traditional clustering algorithms have fundamental limitations.

K-means requires specifying the number of clusters upfront. How many distinct use cases do your users have? You don't know yet—that's what you're trying to discover. K-means also forces every conversation into a cluster, including edge cases and failure modes that should be treated as outliers. It assumes spherical clusters in high-dimensional space, which doesn't match the structure of conversation embeddings.

DBSCAN improves on this by finding arbitrarily shaped clusters and identifying outliers as noise. This is valuable for chat analysis because you want to separate genuine patterns (like "password reset requests") from one-off edge cases. However, DBSCAN requires a fixed epsilon radius parameter, which creates problems for product insights.

Consider your data: password reset conversations are repetitive and form tight, dense clusters. Users say similar things. But exploratory conversations where users are trying to understand what your product can do? Those form sparse, diffuse clusters with high variability. No single epsilon value works well across both patterns.

HDBSCAN solves this by building a hierarchy of clusters at different density levels and extracting stable clusters across scales. For product analysis, this means:

Automatic pattern discovery: You don't specify how many user cohorts exist; HDBSCAN finds them
Outlier detection: Identifies anomalous conversations that may indicate bugs or edge cases
Variable density handling: Captures both tight patterns (repeated tasks) and loose patterns (exploratory behavior)

The algorithm converts your metric space into a density-based space using mutual reachability distance, builds a minimum spanning tree, and extracts a hierarchy based on cluster stability. You don't need to understand the math to use it effectively, but these properties make it particularly well-suited for discovering insights in conversation data.

From Conversations to Insights

The clustering workflow has three steps: embedding conversations, running HDBSCAN, and interpreting results to extract product insights.

from sentence_transformers import SentenceTransformer
import hdbscan
import numpy as np

# Embed conversations with a current top-performing model
model = SentenceTransformer('dunzhang/stella_en_400M_v5', trust_remote_code=True)

conversations = [
    "User: How do I reset my password? Bot: Click on forgot password...",
    "User: I want to build a recommendation engine Bot: Let's discuss your data...",
    # ... thousands more
]

# Stella models use prompts for optimal performance
# For clustering conversations (similarity task), use the s2s prompt
embeddings = model.encode(
    conversations,
    prompt_name="s2s_query",
    show_progress_bar=True
)

# Cluster
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric='euclidean',
    cluster_selection_epsilon=0.0
)

labels = clusterer.fit_predict(embeddings)

Interpreting clusters requires both quantitative and qualitative analysis. We examine cluster sizes, silhouette scores, and outlier percentages, then sample 20-30 conversations per cluster for manual review. Using an LLM to generate cluster summaries helps scale this interpretation step.

# Generate cluster summaries
for cluster_id in set(labels):
    if cluster_id == -1:  # Skip noise
        continue
    
    mask = labels == cluster_id
    sample_convos = np.random.choice(
        conversations[mask], 
        size=min(20, mask.sum()), 
        replace=False
    )
    
    # Use LLM to summarize
    summary = summarize_cluster(sample_convos)
    print(f"Cluster {cluster_id}: {summary}")

Computational Constraints

HDBSCAN has significant memory and time requirements at scale. The algorithm computes a mutual reachability distance graph, which has O(n²) time and space complexity in its naive implementation.

For 100,000 conversations with 1024-dimensional embeddings (Stella default):

Embeddings: approximately 400MB
Distance matrix: approximately 40GB
Peak memory during clustering: approximately 60GB

The HDBSCAN implementation uses optimizations including the Boruvka algorithm and KD-trees, reducing average complexity to O(n log n). However, datasets beyond 500,000 samples still present memory challenges on typical hardware.

Approximate Nearest Neighbors

Rather than computing exact distances, use FAISS or ANNOY to approximate the nearest neighbor graph:

import faiss

# Build approximate NN index
index = faiss.IndexHNSWFlat(embeddings.shape[1], 32)
index.add(embeddings.astype('float32'))

# Configure for approximate computation
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    algorithm='best',
    core_dist_n_jobs=-1  # Parallelize distance computation
)

Sampling Strategies

For datasets exceeding 1 million samples, cluster a representative sample first:

# Sample strategically based on temporal distribution
sample_indices = stratified_sample(timestamps, n=100000)
sample_embeddings = embeddings[sample_indices]

# Cluster the sample
labels_sample = clusterer.fit_predict(sample_embeddings)

# Classify remaining points using the learned structure
labels_full = hdbscan.approximate_predict(clusterer, embeddings)[0]

The approximate_predict method runs significantly faster and leverages the cluster structure learned during training.

Dimensionality Reduction

UMAP can reduce embedding dimensionality before clustering:

import umap

reducer = umap.UMAP(n_components=50, n_neighbors=30, min_dist=0.0)
reduced_embeddings = reducer.fit_transform(embeddings)

labels = clusterer.fit_predict(reduced_embeddings)

This approach works well for embeddings above 1024 dimensions but provides limited benefit for models in the 384-768 dimension range. For Stella models using Matryoshka training, you can simply use a smaller dimension at inference time (e.g., 512 instead of 1024) rather than applying UMAP. The main risk with UMAP is losing information that distinguishes between clusters.

Hyperparameter Configuration

HDBSCAN has relatively few hyperparameters, but they significantly impact results.

min_cluster_size

This parameter determines the minimum number of samples required to form a cluster. Set it based on what constitutes a meaningful cluster for your use case.

Values that are too small produce hundreds of micro-clusters that aren't actionable. Values that are too large result in most points being classified as noise.

For LLM chat analysis:

15-30: Granular use cases (password reset, billing questions, feature requests)
50-100: Higher-level patterns (support vs product questions)

# Evaluate different min_cluster_size values
for mcs in [10, 20, 30, 50, 100]:
    clusterer = hdbscan.HDBSCAN(min_cluster_size=mcs)
    labels = clusterer.fit_predict(embeddings)
    
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_pct = (labels == -1).sum() / len(labels)
    
    print(f"mcs={mcs}: {n_clusters} clusters, {noise_pct:.1%} noise")

Look for the point where you achieve meaningful clusters without excessive noise.

min_samples

This parameter controls how conservative HDBSCAN is when distinguishing clusters from noise. The default equals min_cluster_size, but setting it lower (5-10) reduces the number of points classified as outliers.

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=30,
    min_samples=5  # Less aggressive outlier detection
)

cluster_selection_epsilon

This parameter prevents splitting clusters that are within epsilon distance of each other, effectively merging similar clusters:

# No merging
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=0.0)

# Merge similar clusters
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=0.5)

Start at 0.0 and increase incrementally until you observe meaningful merges. For normalized embeddings, values between 0.3 and 0.7 typically work well.

cluster_selection_method

Two options exist: eom (Excess of Mass) and leaf. The default eom selects the most persistent clusters. The leaf method produces more granular clusters but often generates too many for practical use.

Fine-Tuning Embedding Models

Generic embedding models are trained on diverse text corpora. Chat transcripts contain domain-specific vocabulary, interaction patterns, and semantic structures that differ from training data. Fine-tuning embedding models on your specific data can substantially improve cluster quality.

Choosing a Base Model and Framework

Use Sentence Transformers v3 as your fine-tuning framework. It provides the most mature tooling with multi-GPU training, efficient loss functions, and straightforward training loops.

For base models, start with one of these open source options:

dunzhang/stella_en_400M_v5: Current top performer on MTEB retrieval benchmarks with commercial licensing. Supports Matryoshka dimensions (512-8192) for flexibility.
BAAI/bge-base-en-v1.5: Proven performance with widespread adoption, 768 dimensions.

Both fine-tune well on domain-specific data and run efficiently in production. Avoid using full LLMs (Llama, Mistral) for embeddings—they're 10-50x slower with marginal quality improvements. API-based options like OpenAI or Voyage work for prototyping but become expensive at scale (millions of embeddings) and introduce external dependencies.

Creating Training Data

Training requires pairs or triplets of conversations:

Pairs: (conversation1, conversation2, similarity_score)
Triplets: (anchor, positive, negative)

Bootstrap this data from existing clusters or user feedback:

from sentence_transformers import InputExample

# Use initial clusters as weak labels
train_examples = []
for cluster_id in set(labels):
    if cluster_id == -1:
        continue
    
    cluster_convos = conversations[labels == cluster_id]
    
    # Sample positive pairs within cluster
    for i in range(min(100, len(cluster_convos))):
        a, b = np.random.choice(cluster_convos, 2, replace=False)
        train_examples.append(InputExample(texts=[a, b], label=1.0))
    
    # Sample negative pairs across clusters
    other_cluster = np.random.choice([c for c in set(labels) if c != cluster_id])
    other_convos = conversations[labels == other_cluster]
    a = np.random.choice(cluster_convos)
    b = np.random.choice(other_convos)
    train_examples.append(InputExample(texts=[a, b], label=0.0))

Training Process

Sentence Transformers v3 introduced a modernized training approach with the SentenceTransformerTrainer, replacing the older fit method:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Load base model
model = SentenceTransformer('dunzhang/stella_en_400M_v5', trust_remote_code=True)

# Configure multi-task training
# MultipleNegativesRankingLoss is current best practice for contrastive learning
train_loss = MultipleNegativesRankingLoss(model)

# Optional: Add Matryoshka loss for variable dimensions
# This enables using smaller dimensions (e.g., 512) at inference for speed
train_loss = MatryoshkaLoss(
    model,
    train_loss,
    matryoshka_dims=[512, 768, 1024]
)

# Training arguments with new v3 features
args = SentenceTransformerTrainingArguments(
    output_dir='fine-tuned-chat-model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_steps=100,
    bf16=True,  # Use bfloat16 for faster training
    dataloader_drop_last=True,
)

# Train with new trainer
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # Prepared earlier
    loss=train_loss,
)

trainer.train()
model.save_pretrained('fine-tuned-chat-model')

Fine-tuning improves cluster coherence substantially. Domain-specific fine-tuning typically yields 20-30% improvements in silhouette scores, with the largest gains coming from correctly grouping conversations with similar intent but different wording, and separating conversations that use similar words but have different underlying goals.

Maintaining Insights Over Time

Users generate new conversations daily. Rerunning HDBSCAN on all historical data each time is computationally prohibitive and defeats the purpose of timely insights. You need strategies for incremental clustering that maintain stable cluster identities while adapting to new patterns.

Strategy 1: Prediction on Frozen Clusters

Train HDBSCAN once on a large representative sample. Use approximate_predict for new conversations:

# Initial clustering on historical data
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True)
clusterer.fit(embeddings_historical)

# Daily: classify new conversations
new_embeddings = model.encode(new_conversations)
new_labels, strengths = hdbscan.approximate_predict(clusterer, new_embeddings)

This approach provides fast classification and maintains consistent cluster identities over time. However, it doesn't adapt to new patterns, and clusters can drift as user behavior evolves. This strategy works well when cluster definitions are stable, such as support categories.

Strategy 2: Sliding Window with Cluster Re-identification

Cluster on a rolling window (typically 30-90 days). Use embedding similarity to match new clusters to historical ones:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Maintain cluster centroids over time
cluster_history = []  # List of {timestamp, cluster_id, centroid, stable_id}

# Daily clustering on recent window
labels_today = clusterer.fit_predict(embeddings_recent)

# Compute centroids and match to history
for cluster_id in set(labels_today):
    if cluster_id == -1:
        continue
    mask = labels_today == cluster_id
    centroid = embeddings_recent[mask].mean(axis=0)
    
    # Match to historical clusters
    if cluster_history:
        historical_centroids = np.array([c['centroid'] for c in cluster_history])
        similarities = cosine_similarity([centroid], historical_centroids)[0]
        best_match_idx = similarities.argmax()
        
        if similarities[best_match_idx] > 0.85:
            matched_id = cluster_history[best_match_idx]['stable_id']
        else:
            matched_id = generate_new_cluster_id()
    else:
        matched_id = generate_new_cluster_id()
    
    cluster_history.append({
        'timestamp': today,
        'cluster_id': cluster_id,
        'stable_id': matched_id,
        'centroid': centroid
    })

This approach adapts to new patterns while maintaining interpretability. It requires more complex matching logic but better reflects evolving user behavior.

Strategy 3: Hierarchical Temporal Aggregation

For long-term trend analysis, cluster at multiple time scales:

Daily: Cluster each day's conversations (captures immediate patterns)
Weekly: Cluster the daily cluster centroids (meta-clustering)
Monthly: Cluster the weekly centroids

This creates a temporal hierarchy where new patterns emerge at the daily level and propagate upward if they remain stable.

# Daily clustering
daily_labels = {}
daily_centroids = {}

for date in date_range:
    convos_today = conversations[conversations['date'] == date]
    embeddings_today = model.encode(convos_today)
    labels = clusterer.fit_predict(embeddings_today)
    
    daily_labels[date] = labels
    daily_centroids[date] = compute_centroids(embeddings_today, labels)

# Weekly meta-clustering
all_daily_centroids = np.vstack([daily_centroids[d] for d in last_week])
weekly_labels = clusterer.fit_predict(all_daily_centroids)

Why Python Is Required

The question of running clustering directly in PostgreSQL with pgvector or in vector databases comes up frequently. The computational requirements of HDBSCAN make this impractical.

Computational Requirements

HDBSCAN requires:

Graph construction: Computing k-nearest neighbors for all points
Minimum spanning tree: Building a graph structure with approximately n edges
Hierarchy extraction: Traversing the tree to identify stable clusters
Linear algebra: Heavy matrix operations for distances and similarities

PostgreSQL and vector databases are optimized for different operations:

Point queries: Finding k-nearest neighbors of a single point
Set operations: Filtering and joining tables
Transaction safety: ACID guarantees

They are not optimized for:

All-pairs operations: Computing distances between all points
Complex graph algorithms: Minimum spanning trees, hierarchical decomposition
Iterative refinement: HDBSCAN's internal loops
Shared memory parallelism: Fine-grained locking across complex data structures

Extension Limitations

While theoretically possible to implement HDBSCAN as a PostgreSQL extension in C, several practical issues arise:

Memory management: PostgreSQL would need to load all embeddings into memory simultaneously. Vector databases page embeddings to disk, but HDBSCAN requires random access to the full distance matrix.
State management: HDBSCAN builds complex intermediate data structures including linkage trees and stability scores. SQL's set-based model doesn't map well to iterative graph algorithms.
Parallelization: HDBSCAN's parallel implementation uses shared memory and fine-grained locking. PostgreSQL's process-based parallelism creates significant IPC overhead for these access patterns.
Numerical libraries: HDBSCAN relies on NumPy, SciPy, and scikit-learn's optimized C and Cython implementations. Reimplementing this in PostgreSQL-compatible code represents a multi-year engineering effort.

Hybrid Architecture

The practical solution is a hybrid architecture:

# PostgreSQL: Store and retrieve embeddings
query = """
SELECT conversation_id, embedding 
FROM conversations 
WHERE created_at >= NOW() - INTERVAL '30 days'
"""

df = pd.read_sql(query, conn)
embeddings = np.stack(df['embedding'].values)

# Python: Cluster
labels = clusterer.fit_predict(embeddings)

# PostgreSQL: Store cluster assignments
df['cluster_id'] = labels
df[['conversation_id', 'cluster_id']].to_sql(
    'conversation_clusters',
    conn,
    if_exists='append'
)

Store embeddings in PostgreSQL, Pinecone, or Weaviate for efficient retrieval. Run HDBSCAN in Python where the necessary computational libraries exist. Store cluster assignments back in the database for querying and analysis.

Some vector databases are adding clustering capabilities, but these typically implement simpler algorithms like k-means or hierarchical agglomerative clustering that don't require the complex graph operations HDBSCAN needs.

Practical Guidance for Production

Several patterns emerge when clustering chat transcripts to extract product insights.

Preprocessing for Better Signal

Don't simply concatenate user and bot messages. Clean PII, handle multi-turn context thoughtfully, and emphasize intent signals:

def preprocess_conversation(messages):
    # Remove PII
    cleaned = remove_email_phone(messages)
    
    # Weight first user message (strong intent signal)
    first_user_msg = next(m for m in messages if m['role'] == 'user')
    summary = f"PRIMARY INTENT: {first_user_msg['content']}\n\nCONVERSATION: {cleaned}"
    
    return summary

The first user message often reveals what someone is trying to accomplish. Emphasizing it helps HDBSCAN group conversations by user goal rather than by arbitrary word overlap.

Monitoring for Product Changes

Track cluster centroids over time. Significant centroid movement indicates:

Product changes affecting user behavior (new features being adopted or ignored)
Emerging user needs (new clusters forming)
Degrading experiences (clusters shifting toward failure patterns)

# Track centroid movement weekly
centroid_distances = []
for week in weeks:
    current_centroid = compute_centroid(week)
    prev_centroid = compute_centroid(week - 1)
    distance = np.linalg.norm(current_centroid - prev_centroid)
    centroid_distances.append(distance)

Alert when centroid distances exceed a threshold. This is an early warning system for product issues.

Identifying Ambiguous Cases

HDBSCAN provides membership probabilities. Use them to identify conversations that span multiple use cases:

clusterer = hdbscan.HDBSCAN(min_cluster_size=30, prediction_data=True)
clusterer.fit(embeddings)

# Get soft cluster membership
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)

# Find conversations that belong to multiple clusters
ambiguous_mask = (soft_clusters > 0.3).sum(axis=1) > 1

These ambiguous conversations often reveal complex user journeys or cases where your product's boundaries are unclear.

Validating with Business Metrics

Cluster quality metrics like silhouette score are necessary but insufficient. Validate that clusters predict outcomes:

Churn rates per cluster
Conversion rates per cluster
Support ticket resolution time per cluster
Customer satisfaction scores per cluster

If clusters don't correlate with business metrics, your embeddings or hyperparameters need adjustment. The clusters should surface meaningful differences in user behavior and outcomes.

Getting Started

HDBSCAN addresses the specific challenges of chat transcript analysis: variable density clusters, unknown cluster counts, and noisy outliers. Making it work at scale requires attention to memory management, hyperparameter tuning, and incremental clustering strategies.

The Python requirement reflects the algorithm's need for sophisticated numerical computing libraries. Embrace a hybrid architecture: store conversations and embeddings in PostgreSQL or vector databases, run clustering in Python, and query results through SQL.

Start with a representative sample of conversations (10,000-50,000), iterate on embeddings and hyperparameters, and validate that clusters correspond to real user behaviors. The goal isn't perfect clusters—it's actionable insights about where your product succeeds, where it fails, and what users actually need.

Once you've identified patterns, close the loop:

Route conversations to specialized handlers based on cluster
Flag high-risk clusters for human review
Track cluster trends over time to measure product improvements
Use cluster membership to segment users in your analytics

HDBSCAN provides the signal. Turning that signal into product improvements is where the value emerges.

Working on chat transcript analysis? I'm [@c_h_wood] on Twitter.