AI Thumbnail Generator
Multi-stage AI pipeline for YouTube podcast thumbnails. Uses Claude for hook generation and Gemini for image synthesis with character consistency.
TL;DR: I built an AI pipeline that generates A/B-testable YouTube thumbnails for podcasts. Claude analyzes each episode, creates 5 hook options, and selects the best 3; Gemini then renders the images with consistent character faces and brand colors.
The Problem
YouTube thumbnails make or break click-through rates, but creating good ones is:
- Time-consuming: 30-60 minutes per thumbnail with design tools
- Skill-dependent: Requires knowing design principles, color theory, expressions
- Hit-or-miss: Hard to predict what will perform well
- Inconsistent: Maintaining brand identity across 70+ episodes is tough
I needed a system that could generate multiple professional options quickly for A/B testing.
My Approach
I built a multi-stage generation pipeline:
- Hook Generation (Claude): Analyzes the episode and generates 5 hook options, each using a different psychological approach
- Hook Selection (Claude): Evaluates all hooks and picks the best 3 for testing
- Expression Mapping: Maps hook mood to facial expressions (revelation, authority, controversy)
- Image Generation (Gemini): Renders thumbnails with guest images for face consistency
The key insight was separating conceptual work (what message?) from visual work (how to render it?).
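The four stages above can be sketched as a single function that takes the model calls as plain callables. This is a minimal illustration, not the project's actual code: `Hook`, `run_pipeline`, the callable parameters, and the wording in `EXPRESSIONS` are all hypothetical, and the Claude and Gemini calls are left as injected stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Mood labels come from the stage list above; the expression wording is
# illustrative only, not the project's real prompt text.
EXPRESSIONS: Dict[str, str] = {
    "revelation": "widened eyes, slight forward lean",
    "authority": "steady gaze, composed posture",
    "controversy": "raised eyebrow, skeptical look",
}


@dataclass
class Hook:
    text: str
    mood: str            # must be a key of EXPRESSIONS
    score: float = 0.0   # filled in by the selection stage


def run_pipeline(
    episode_summary: str,
    generate_hooks: Callable[[str], List[Hook]],  # stage 1: would call Claude
    score_hook: Callable[[Hook], float],          # stage 2: would call Claude
    render: Callable[[str], str],                 # stage 4: would call Gemini
    keep: int = 3,
) -> List[str]:
    """Run the conceptual stages first, then hand finished prompts to the renderer."""
    hooks = generate_hooks(episode_summary)
    for hook in hooks:
        hook.score = score_hook(hook)
    best = sorted(hooks, key=lambda h: h.score, reverse=True)[:keep]
    # Stage 3: map each hook's mood to an expression description.
    prompts = [
        f'Thumbnail text: "{h.text}". Expression: {EXPRESSIONS[h.mood]}.'
        for h in best
    ]
    return [render(p) for p in prompts]
```

Because the renderer only ever sees a finished prompt, the hook logic can be tested or swapped without touching image generation, which is the separation described above.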
Architecture
AI Thumbnail Generator - Architecture Diagram
Key Features
- 5 Hook Approaches: Each uses a different psychological trigger
- Objective Selection: Claude evaluates without bias toward its own outputs
- Expression Consistency: Mood maps to specific facial expressions
- Character Persistence: Reference images maintain face identity
- Brand Colors: Saved and reused across episodes
- Session Management: Save/restore incomplete workflows
- Prompts-Only Mode: Generate hooks without image rendering
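Session save/restore could be as simple as serializing the workflow state to a JSON file between stages. The sketch below assumes a hypothetical `Session` shape (the field names are not taken from the project) and uses only the standard library.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class Session:
    # All field names here are assumptions made for illustration.
    episode_id: str
    stage: str = "hooks"  # last completed stage
    hooks: List[str] = field(default_factory=list)
    brand_colors: List[str] = field(default_factory=list)


def save_session(session: Session, path: Path) -> None:
    """Write to a temp file first so an interrupted save cannot corrupt the session."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(session), indent=2), encoding="utf-8")
    tmp.replace(path)


def load_session(path: Path) -> Optional[Session]:
    """Return the saved session, or None if there is nothing to resume."""
    if not path.exists():
        return None
    return Session(**json.loads(path.read_text(encoding="utf-8")))
```

Storing the brand colors in the same file is one way they could be reused across episodes; the real project may keep them elsewhere.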
Results & Metrics
| Metric | Value |
|---|---|
| Hooks Generated | 5 per episode |
| Thumbnails Output | 3 per session |
| Image Resolution | 2048x2048 |
| Reference Images | Up to 5 guests + host |
| Rate Limits | Claude: 50/min, Gemini: 10/min |
| Output Files | thumbnails + prompts.txt + metadata.json |
What I Learned
The hardest part was character consistency. Early versions generated great compositions but the guest's face looked different in each thumbnail. I solved this by:
- Reference image feeding: Pass up to 5 guest photos to Gemini
- Explicit face instructions: "Maintain exact facial features from reference"
- Expression guidance: Specific descriptions like "widened eyes, slight forward lean"
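Put together, the three techniques above amount to building one request that carries both the prompt text and the reference photos. The following is a rough sketch under assumptions: `ImageRequest` and `build_image_request` are hypothetical names, and how the request is actually sent to Gemini is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Sequence

MAX_REFERENCE_IMAGES = 5  # the write-up mentions passing up to 5 guest photos


@dataclass
class ImageRequest:
    prompt: str
    reference_paths: List[str]


def build_image_request(
    hook_text: str, expression: str, reference_paths: Sequence[str]
) -> ImageRequest:
    """Combine the hook, the face instruction, and expression guidance into one request."""
    if not reference_paths:
        raise ValueError("at least one reference image is needed for face consistency")
    refs = list(reference_paths)[:MAX_REFERENCE_IMAGES]
    prompt = "\n".join(
        [
            f'Podcast thumbnail with the text "{hook_text}".',
            "Maintain exact facial features from reference.",
            f"Expression: {expression}.",
        ]
    )
    return ImageRequest(prompt=prompt, reference_paths=refs)
```

For example, `build_image_request("The Truth About AI", "widened eyes, slight forward lean", photos)` keeps at most the first five photos and always includes the explicit face instruction.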
Another challenge was rate limiting. Gemini's image API has strict limits (10/minute), so I added exponential backoff:
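    # Automatic retry with exponential backoff (uses the tenacity library)
    from tenacity import retry, stop_after_attempt, wait_exponential

    @retry(
        stop=stop_after_attempt(5),                          # give up after 5 attempts
        wait=wait_exponential(multiplier=1, min=4, max=60),  # each wait is clamped to 4-60 s
    )
    async def generate_image(prompt: str):
        ...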
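Backoff reacts after a limit has already been hit. A complementary option is to pace calls on the client so that the limit is rarely reached in the first place. The sketch below is not part of the project as described; `Throttle` is a hypothetical helper that spaces calls evenly (10 per minute means one call every 6 seconds), with the clock and sleep functions injectable so it can be tested without waiting.

```python
import asyncio
import time


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=asyncio.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    async def wait(self) -> float:
        """Wait for the next free slot and return the delay that was applied."""
        now = self._clock()
        slot = max(now, self._next_allowed)
        # Reserve the slot before sleeping: there is no await between reading
        # and updating the state, so concurrent tasks cannot claim the same slot.
        self._next_allowed = slot + self.interval
        delay = slot - now
        if delay > 0:
            await self._sleep(delay)
        return delay
```

A caller would `await throttle.wait()` immediately before each image request, keeping the retry decorator as a safety net for limits that are hit anyway.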
Splitting the work this way, with Claude handling the conceptual side and Gemini the visual side, was key: each model excels at different tasks.
Frequently Asked Questions
What problem does this generator solve?
It reduces thumbnail creation from 30-60 minutes to 5 minutes per episode. Instead of manually designing, you get 3 A/B-testable options with psychological hooks and consistent character rendering.
What technologies power this project?
Claude API for hook generation and selection, Gemini API for image synthesis, Pydantic for data validation, and an interactive Python CLI for the workflow.
How good are the generated thumbnails?
Quality is high for podcast-style thumbnails with text overlays and host/guest faces. Complex scenes or multiple elements may require manual refinement. The psychological hooks are based on proven CTR frameworks.
Built by Abhinav Sinha
AI-First Product Manager who builds production-grade tools. Passionate about turning complex problems into elegant solutions using AI, automation, and modern web technologies.