The Problem: Dirty Input vs Clean Data
If you’ve ever built a search feature, you know the pain. Your database has a beautifully curated catalog of albums:
Abbey Road
The Dark Side of the Moon
OK Computer
But users type things like:
beatles abbey rd
dark side moon pink floyd
ok computer radiohead 1997
How do you match these? A simple WHERE name = ? won’t cut it. You need something smarter.
I recently worked on a classification system that needed to match messy invoice line items to a product catalog. The patterns I learned there apply perfectly to music, books, or any catalog matching problem. I ended up using two PostgreSQL extensions: pg_trgm for fuzzy text matching and pgvector for semantic similarity search.
In this post, I’ll walk you through both approaches using a real dataset: 114,000 Spotify tracks from Hugging Face. You’ll be able to follow along with actual data.
The Dataset: Spotify Tracks
We’ll use the Spotify Tracks Dataset from Hugging Face. It contains 114,000+ tracks across 125 genres, with album names, artists, and popularity scores. It’s CC0 licensed (public domain), so you can use it freely.
The dataset has real-world messiness: album names like “Abbey Road (Remastered)”, “The Dark Side of the Moon (2011 Remaster)”, and “OK Computer OKNOTOK 1997 2017”. Perfect for testing our matching approaches.
Two Approaches, Two Extensions
| Approach | Extension | What it does | Best for |
|---|---|---|---|
| Fuzzy matching | pg_trgm |
Compares character sequences (trigrams) | Typos, abbreviations, word order |
| Semantic search | pgvector |
Compares meaning via embeddings | Synonyms, paraphrasing, conceptual similarity |
pg_trgm breaks text into 3-character sequences and measures overlap. “Abbey Road” becomes {" ab", "abb", "bbe", "bey", "ey ", " ro", "roa", "oad", "ad "}. If two strings share many trigrams, they’re similar.
pgvector stores vector embeddings—numerical representations of meaning generated by machine learning models. “Abbey Road” and “The Beatles final album” might have similar vectors even though they share no words.
Let’s set them up.
Database Setup
Extensions and Table
-- Enable extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS vector;
-- Create the catalog table
CREATE TABLE album_catalog (
id SERIAL PRIMARY KEY,
track_id VARCHAR(50),
track_name VARCHAR(500) NOT NULL,
artists VARCHAR(500),
album_name VARCHAR(500),
popularity INTEGER,
album_normalized VARCHAR(500), -- Cleaned version for fuzzy matching
album_embedding vector(768) -- For semantic search
);
Loading the Spotify Dataset
First, download the dataset and load it into PostgreSQL:
# load_spotify_data.py
from datasets import load_dataset
import psycopg2
# Download from Hugging Face (about 50MB)
print("Downloading Spotify dataset from Hugging Face...")
dataset = load_dataset("maharshipandya/spotify-tracks-dataset", split="train")
# Connect to PostgreSQL
conn = psycopg2.connect(
host='localhost',
database='music_catalog',
user='your_user',
password='your_password'
)
cursor = conn.cursor()
print(f"Loading {len(dataset)} tracks into PostgreSQL...")
# Insert tracks (deduplicate by album_name + artists)
seen_albums = set()
inserted = 0
for row in dataset:
album_key = (row['album_name'], row['artists'])
if album_key in seen_albums:
continue
seen_albums.add(album_key)
cursor.execute("""
INSERT INTO album_catalog (track_id, track_name, artists, album_name, popularity)
VALUES (%s, %s, %s, %s, %s)
""", (row['track_id'], row['track_name'], row['artists'], row['album_name'], row['popularity']))
inserted += 1
if inserted % 5000 == 0:
print(f" Inserted {inserted} albums...")
conn.commit()
print(f"Done! Loaded {inserted} unique albums.")
conn.close()
Run it:
pip install datasets psycopg2-binary
python load_spotify_data.py
This gives you around 50,000 unique albums to work with—plenty for testing both approaches.
Indexes
This is where the magic happens. Without proper indexes, both approaches would require full table scans.
-- GIN index for trigram similarity (pg_trgm)
CREATE INDEX idx_album_name_trgm
ON album_catalog
USING gin (album_normalized gin_trgm_ops);
-- IVFFlat index for vector similarity (pgvector)
CREATE INDEX idx_album_embedding
ON album_catalog
USING ivfflat (album_embedding vector_cosine_ops)
WITH (lists = 100);
A note on the lists parameter for IVFFlat: it controls how many clusters the index creates. More lists = faster queries but less accurate. The rule of thumb is lists = sqrt(rows) for small tables, or lists = rows / 1000 for larger ones. For a catalog of 50,000 albums, 100 lists is reasonable.
Approach 1: Fuzzy Matching with pg_trgm
Basic Similarity Query
SELECT
album_name,
artists,
similarity('abbey rd beatles', album_name) AS score
FROM album_catalog
WHERE similarity('abbey rd beatles', album_name) > 0.3
ORDER BY score DESC
LIMIT 5;
Result:
| album_name | artists | score |
|---|---|---|
| Abbey Road (Remastered) | The Beatles | 0.48 |
The similarity() function returns a value between 0 and 1. I typically use 0.3 as a minimum threshold to filter noise, but this depends on your data.
Using the Index with the % Operator
The similarity() function doesn’t use the GIN index directly. For indexed queries, use the % operator:
-- Set the similarity threshold
SET pg_trgm.similarity_threshold = 0.3;
-- This query uses the index
SELECT album_name, artists, similarity('abbey rd', album_normalized) AS score
FROM album_catalog
WHERE album_normalized % 'abbey rd'
ORDER BY score DESC;
Screenshot suggestion: Show EXPLAIN ANALYZE output demonstrating the GIN index being used vs a sequential scan.
Strengths and Weaknesses
pg_trgm works well for:
- Typos: “Abey Road” → “Abbey Road”
- Abbreviations: “abbey rd” → “Abbey Road”
- Missing words: “Dark Side Moon” → “The Dark Side of the Moon”
- Word order variations: “Computer OK Radiohead” → “OK Computer”
pg_trgm struggles with:
- Synonyms: “The Beatles last album” won’t match “Abbey Road”
- Conceptual queries: “90s grunge Seattle” won’t match “Nevermind”
- Completely different phrasing: “Floyd’s prism album” vs “The Dark Side of the Moon”
This is where semantic search comes in.
Approach 2: Semantic Search with pgvector
Generating Embeddings
Before you can search, you need to generate embeddings for your catalog. I use sentence-transformers with the all-mpnet-base-v2 model—it’s free, runs locally, and produces 768-dimensional embeddings.
# generate_embeddings.py
from sentence_transformers import SentenceTransformer
import psycopg2
# Load the model (downloads ~420MB on first run)
print("Loading embedding model...")
model = SentenceTransformer('all-mpnet-base-v2')
# Connect to your database
conn = psycopg2.connect(
host='localhost',
database='music_catalog',
user='your_user',
password='your_password'
)
cursor = conn.cursor()
# Fetch albums without embeddings
cursor.execute("""
SELECT id, album_name, artists
FROM album_catalog
WHERE album_embedding IS NULL
""")
albums = cursor.fetchall()
print(f"Generating embeddings for {len(albums)} albums...")
batch_size = 100
for i in range(0, len(albums), batch_size):
batch = albums[i:i + batch_size]
# Combine album and artist for richer context
texts = [f"{album} by {artist}" if artist else album
for _, album, artist in batch]
# Generate embeddings in batch (much faster)
embeddings = model.encode(texts)
# Store in database
for j, (album_id, album_name, _) in enumerate(batch):
cursor.execute("""
UPDATE album_catalog
SET album_embedding = %s
WHERE id = %s
""", (embeddings[j].tolist(), album_id))
conn.commit()
print(f" Processed {min(i + batch_size, len(albums))}/{len(albums)} albums")
conn.close()
print("Done!")
Run it:
pip install sentence-transformers
python generate_embeddings.py
On a decent CPU, this processes about 500 albums per minute. For 50,000 albums, expect around 90 minutes. Go grab a coffee.
Important: Run this once to populate your catalog. Embeddings are expensive to generate, so you don’t want to do it on every query.
Searching with Embeddings
def search_by_embedding(query, cursor, model, threshold=0.6):
"""Search using semantic similarity"""
# Generate embedding for the query
query_embedding = model.encode(query).tolist()
cursor.execute("""
SELECT
album_name,
artists,
1 - (album_embedding <=> %s::vector) AS similarity
FROM album_catalog
WHERE album_embedding IS NOT NULL
AND 1 - (album_embedding <=> %s::vector) > %s
ORDER BY similarity DESC
LIMIT 5
""", (query_embedding, query_embedding, threshold))
return cursor.fetchall()
# Example
results = search_by_embedding("Beatles final studio album", cursor, model)
# Returns: Abbey Road (even though the query shares few words with the album name)
The <=> operator calculates cosine distance. Since we want similarity (not distance), we use 1 - distance.
Strengths and Weaknesses
pgvector works well for:
- Synonyms and paraphrasing: “Beatles final album” → “Abbey Road”
- Conceptual queries: “Pink Floyd concept album about madness” → “The Wall”
- Natural language: “that Radiohead album from the late 90s” → “OK Computer”
pgvector struggles with:
- Exact matches: It might rank a semantically similar but wrong album higher
- Abbreviations: The model might not understand “rd” = “Road”
- Domain-specific jargon: Unless the model was trained on it
Text Normalization: The Secret Sauce
Here’s what I learned the hard way: both approaches work better with clean input. Raw user input is full of noise that hurts matching accuracy. And the Spotify dataset is full of real-world messiness—album names like “The Dark Side of the Moon (2011 Remaster)” or “OK Computer OKNOTOK 1997 2017”.
The Normalization Pipeline
import re
# Common abbreviations in music/media
ABBREVIATIONS = {
r'\bfeat\.?\b': 'featuring',
r'\bft\.?\b': 'featuring',
r'\bvol\.?\b': 'volume',
r'\bpt\.?\b': 'part',
r'\bv\.?\s*(\d)': r'volume \1',
r'\bst\.?\b': 'saint',
r'\b&\b': 'and',
}
# Noise patterns to remove (common in Spotify data)
NOISE_PATTERNS = [
r'\(remaster(ed)?\s*\d*\)', # (Remastered 2023)
r'\(\d{4}\s*remaster(ed)?\)', # (2011 Remaster)
r'\[deluxe(\s+edition)?\]', # [Deluxe Edition]
r'\(deluxe(\s+edition)?\)', # (Deluxe Edition)
r'\s*-\s*single\b', # - Single
r'\s*\[\d+[-/]\d+\]', # [Disc 1/2]
r'\(anniversary(\s+edition)?\)', # (Anniversary Edition)
r'\(expanded(\s+edition)?\)', # (Expanded Edition)
r'\(bonus\s+track.*?\)', # (Bonus Track Version)
r'\(super\s+deluxe\)', # (Super Deluxe)
r'\(\d{4}\s+re-?issue\)', # (2021 Reissue)
r'\s+OKNOTOK\s+\d{4}\s+\d{4}', # OKNOTOK 1997 2017
r'\s*\(original\s+motion\s+picture.*?\)', # (Original Motion Picture Soundtrack)
]
# Articles to strip from the beginning
LEADING_ARTICLES = ['the', 'a', 'an']
def normalize_album(text):
"""
Clean and normalize an album name for better matching.
Works for both catalog entries and user queries.
"""
if not text:
return ''
# Lowercase everything
cleaned = text.lower().strip()
# Remove noise patterns
for pattern in NOISE_PATTERNS:
cleaned = re.sub(pattern, '', cleaned, flags=re.IGNORECASE)
# Expand abbreviations
for pattern, replacement in ABBREVIATIONS.items():
cleaned = re.sub(pattern, replacement, cleaned, flags=re.IGNORECASE)
# Remove leading articles for better matching
# "The Dark Side" and "Dark Side" should match
for article in LEADING_ARTICLES:
if cleaned.startswith(article + ' '):
cleaned = cleaned[len(article) + 1:]
break
# Normalize whitespace
cleaned = re.sub(r'\s+', ' ', cleaned).strip()
# Remove punctuation except alphanumeric and spaces
cleaned = re.sub(r'[^\w\s]', '', cleaned)
return cleaned
# Examples from the Spotify dataset
print(normalize_album("The Dark Side of the Moon (2011 Remaster)"))
# Output: "dark side of moon"
print(normalize_album("OK Computer OKNOTOK 1997 2017"))
# Output: "ok computer"
print(normalize_album("Abbey Road (Remastered)"))
# Output: "abbey road"
print(normalize_album("Sgt. Pepper's Lonely Hearts Club Band (Deluxe Edition)"))
# Output: "sgt peppers lonely hearts club band"
Populating the Normalized Column
# normalize_albums.py
import re
import psycopg2
# ... (normalize_album function from above) ...
conn = psycopg2.connect(
host='localhost',
database='music_catalog',
user='your_user',
password='your_password'
)
cursor = conn.cursor()
# Update all catalog entries with normalized album names
cursor.execute("SELECT id, album_name FROM album_catalog")
albums = cursor.fetchall()
print(f"Normalizing {len(albums)} albums...")
for album_id, album_name in albums:
normalized = normalize_album(album_name)
cursor.execute("""
UPDATE album_catalog
SET album_normalized = %s
WHERE id = %s
""", (normalized, album_id))
conn.commit()
conn.close()
print("Done!")
Now your fuzzy matches compare apples to apples. “Abbey Road (Remastered)” and “abbey rd” both normalize to something comparable.
Combining Both Approaches
The best results come from using both methods together. My strategy:
- Try fuzzy matching first (fast, uses index)
- If score is low, fall back to embeddings (slower, more accurate for hard cases)
- Use the better result
def search_catalog(query, cursor, model, fuzzy_threshold=0.3, embedding_threshold=0.6):
"""
Hybrid search: fuzzy first, embeddings as fallback
"""
normalized_query = normalize_album(query)
# Step 1: Try fuzzy matching
cursor.execute("""
SELECT
id, album_name, artists,
similarity(%s, album_normalized) AS score,
'fuzzy' AS method
FROM album_catalog
WHERE similarity(%s, album_normalized) > %s
ORDER BY score DESC
LIMIT 1
""", (normalized_query, normalized_query, fuzzy_threshold))
fuzzy_result = cursor.fetchone()
# If fuzzy score is high enough, return it
if fuzzy_result and fuzzy_result[3] >= 0.65:
return fuzzy_result
# Step 2: Try semantic search
query_embedding = model.encode(query).tolist()
cursor.execute("""
SELECT
id, album_name, artists,
1 - (album_embedding <=> %s::vector) AS score,
'embedding' AS method
FROM album_catalog
WHERE album_embedding IS NOT NULL
AND 1 - (album_embedding <=> %s::vector) > %s
ORDER BY score DESC
LIMIT 1
""", (query_embedding, query_embedding, embedding_threshold))
embedding_result = cursor.fetchone()
# Return the better result
if not fuzzy_result:
return embedding_result
if not embedding_result:
return fuzzy_result
# Both have results - return the one with higher confidence
return fuzzy_result if fuzzy_result[3] >= embedding_result[3] else embedding_result
# Examples
print(search_catalog("abbey rd", cursor, model))
# → Fuzzy match: Abbey Road (Remastered) | The Beatles (score: 0.71)
print(search_catalog("beatles last studio album", cursor, model))
# → Embedding match: Abbey Road (Remastered) | The Beatles (score: 0.73)
print(search_catalog("that radiohead album with paranoid android", cursor, model))
# → Embedding match: OK Computer | Radiohead (score: 0.68)
Performance Considerations
pg_trgm
- Index type: GIN is recommended over GiST for read-heavy workloads
- Index size: Roughly 2-3x the size of the indexed column
- Query speed: Sub-millisecond for most queries with proper indexes
- Maintenance: Indexes update automatically on INSERT/UPDATE
-- Check index size
SELECT pg_size_pretty(pg_relation_size('idx_album_name_trgm'));
pgvector
- Index type: IVFFlat is faster but approximate; HNSW is more accurate but uses more memory
- Embedding generation: ~50ms per text with
all-mpnet-base-v2on CPU (~500 albums/minute in batch) - Query speed: 1-10ms depending on table size and index parameters
- Maintenance: Re-index if you add many new rows (>10% of original size)
-- Check if index needs rebuilding
SELECT reltuples FROM pg_class WHERE relname = 'album_catalog';
-- If this is significantly higher than when you created the index, rebuild it
Screenshot suggestion: Side-by-side EXPLAIN ANALYZE comparing fuzzy vs embedding search times on the same query.
When to Use Which
| Scenario | Recommended Approach |
|---|---|
| Typo correction | pg_trgm |
| Abbreviation expansion | pg_trgm (with normalization) |
| Natural language queries | pgvector |
| “Find similar items” | pgvector |
| Autocomplete/typeahead | pg_trgm |
| Multi-language support | pgvector (with multilingual model) |
| Limited compute resources | pg_trgm |
| Cold data (no ML model loaded) | pg_trgm |
For most real-world applications, start with pg_trgm. It’s simpler, faster, and handles the majority of cases. Add pgvector when you need semantic understanding or when fuzzy matching isn’t cutting it.
Choosing an Embedding Model
I used all-mpnet-base-v2 because it’s a solid general-purpose model. But the choice matters:
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
all-MiniLM-L6-v2 |
384 | Fast | Good | Large catalogs, autocomplete |
all-mpnet-base-v2 |
768 | Medium | Better | General purpose |
| Domain-specific models | Varies | Varies | Best for domain | Medical, legal, scientific |
For example, if you were matching pharmaceutical products, you might use SapBERT (cambridgeltl/SapBERT-from-PubMedBERT-fulltext)—a model specifically trained on biomedical terminology.
Browse models at Hugging Face sentence-transformers.
Conclusion
Matching messy input to clean catalogs is a solved problem in PostgreSQL—you just need the right tools:
- pg_trgm for fast, character-level fuzzy matching
- pgvector for semantic, meaning-based similarity
- Text normalization to clean up noise before matching
- Hybrid approach for the best of both worlds
We tested this with 50,000 real albums from the Spotify dataset, and the patterns here work for music catalogs, book databases, product matching, or any scenario where users type imprecise queries against structured data.
All the code in this post runs on standard PostgreSQL with two extensions. No external search engines, no Elasticsearch, no complex infrastructure. Just your database doing what databases do best.
The full code is available as a set of Python scripts you can run against your own PostgreSQL instance. Download the Spotify dataset from Hugging Face, load it up, and start experimenting.
Have questions or want to share your own catalog matching war stories? Drop a comment below.