
Podcast Vector Search MCP

Semantic search engine for podcast transcripts using pgvector. Features hybrid search, multi-layer caching, and Claude Desktop integration via MCP.

Tech Stack: 6 tools
Timeline: Development
Status: In Progress
Impact: Featured

TL;DR: I built an MCP server that enables semantic search across 6,000+ podcast transcript segments using OpenAI embeddings and Supabase pgvector. It features hybrid search (semantic + keyword), multi-layer caching, and <500ms response times.

The Problem

With 70+ podcast episodes and thousands of conversation segments, finding specific insights became impossible:

  • "What did guests say about product-market fit?"
  • "Find all discussions about fundraising from angels"
  • "When did someone mention a specific framework or methodology?"

Keyword search fails because people express the same idea in countless ways. I needed semantic understanding.

My Approach

I built a three-layer search system:

  1. Embedding Layer: OpenAI's text-embedding-3-small converts queries and content to vectors
  2. Storage Layer: Supabase pgvector stores embeddings with IVFFlat indexes for fast similarity search
  3. Search Layer: Combines semantic similarity with keyword matching for hybrid search
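The storage and search layers above can be sketched as a parameterized pgvector query. This is a minimal illustration; the table and column names ("segments", "embedding") are assumptions, not the project's actual schema.

```python
# Sketch of the storage/search layer: a pgvector cosine-similarity query.
# Table and column names are illustrative, not the project's real schema.

def build_similarity_query(table="segments", embedding_col="embedding", limit=10):
    """pgvector's <=> operator computes cosine distance when the column is
    indexed with vector_cosine_ops; similarity = 1 - distance."""
    return (
        f"SELECT id, content, 1 - ({embedding_col} <=> %(query_vec)s) AS similarity\n"
        f"FROM {table}\n"
        f"ORDER BY {embedding_col} <=> %(query_vec)s\n"
        f"LIMIT {limit}"
    )
```

Ordering by the raw distance (rather than the derived similarity) lets Postgres use the IVFFlat index directly.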

The MCP protocol lets Claude Desktop search directly, enabling conversations like: "Search for discussions about bootstrapping and summarize the main points."

Architecture

[Architecture diagram: Podcast Vector Search MCP]

Key Features

  • Hybrid Search: Combines semantic similarity with keyword matching
  • Multi-layer Caching: Embedding cache + result cache with intelligent TTL
  • Speaker Filtering: Search within specific speaker's content
  • Episode Filtering: Limit search to specific episodes or date ranges
  • Context Expansion: Optionally include surrounding segments
  • Configurable Threshold: Adjust similarity cutoff (default 0.7)
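The filtering features above can be combined as a post-filtering step over search hits. This is a sketch; the hit field names ("similarity", "speaker", "episode") are assumptions for illustration.

```python
# Illustrative post-filtering of search hits. Field names are assumed,
# not taken from the project's actual schema.

def apply_filters(hits, threshold=0.7, speaker=None, episode=None):
    """Keep hits above the similarity cutoff (default 0.7, as in the
    feature list), optionally restricted to one speaker and/or episode."""
    out = []
    for h in hits:
        if h["similarity"] < threshold:
            continue
        if speaker is not None and h["speaker"] != speaker:
            continue
        if episode is not None and h["episode"] != episode:
            continue
        out.append(h)
    return out
```

In practice these filters can also be pushed into the SQL WHERE clause so the database prunes rows before scoring.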

Results & Metrics

Indexed Segments: 6,049
Average Query Time: <500ms
Cache Hit Rate: 70-90%
Embedding Dimensions: 1536
Storage Size: ~50MB
Concurrent Capacity: 50+ searches

What I Learned

The hybrid search fusion was tricky. Pure semantic search sometimes missed exact quotes users were looking for, while pure keyword search missed conceptual matches. I implemented a weighted fusion:

final_score = (semantic_score * 0.7) + (keyword_score * 0.3)

The 70/30 split worked best for our use case, but I exposed it as a configurable parameter.
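The weighted fusion described above reduces to a one-liner; exposing the weight as a parameter matches the configurability mentioned in the text.

```python
def fuse_scores(semantic_score, keyword_score, semantic_weight=0.7):
    """Weighted fusion of semantic and keyword scores. The 0.7 default
    reflects the 70/30 split described above; the weight is configurable."""
    return semantic_score * semantic_weight + keyword_score * (1 - semantic_weight)
```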

Another insight: caching embeddings is more valuable than caching results. The same query phrases repeat often, but with different filters. Caching at the embedding level captures more reuse.
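The embedding-level cache can be sketched as a store keyed only on the normalized query text, so the same phrase with different filters still hits the cache. `embed_fn` here stands in for a real embedding API call; this is an illustration, not the project's implementation.

```python
# Minimal sketch of embedding-level caching: the same query text reuses a
# cached vector even when result-level filters differ. embed_fn stands in
# for a real embedding call (e.g. to an embeddings API).

class EmbeddingCache:
    def __init__(self):
        self._store = {}
        self.hits = 0

    def get(self, query, embed_fn):
        # Key on normalized text only; filters are deliberately excluded
        # so differently-filtered searches share one embedding.
        key = query.strip().lower()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = embed_fn(query)
        return self._store[key]
```

Because the key ignores filters, two searches for the same phrase scoped to different speakers or episodes cost only one embedding call.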

Frequently Asked Questions

What problem does Podcast Vector Search solve?

It enables finding specific insights across thousands of podcast transcript segments using natural language queries. Instead of remembering which episode discussed a topic, you can search semantically.

What technologies power this project?

OpenAI text-embedding-3-small for embeddings (1536 dimensions), Supabase with pgvector for vector storage and similarity search, and MCP protocol for Claude Desktop integration. Available in both Python and TypeScript implementations.

How fast is the search?

Average query time is under 500ms. With cache hits (70-90% of queries), responses are under 50ms. The system handles 50+ concurrent searches without degradation.



Built by Abhinav Sinha

AI-First Product Manager who builds production-grade tools. Passionate about turning complex problems into elegant solutions using AI, automation, and modern web technologies.