Invisible link to canonical for Microformats
banner Finding 'Abbey Road' When Users Type 'beatles abbey rd' - Fuzzy and Semantic Search in PostgreSQL

Finding ‘Abbey Road’ When Users Type ‘beatles abbey rd’ - Fuzzy and Semantic Search in PostgreSQL

The DBA’s Guide to Music Matching - pg_trgm and pgvector with 100K Spotify Albums


The Problem: Dirty Input vs Clean Data

If you’ve ever built a search feature, you know the pain. Your database has a beautifully curated catalog of albums:

Abbey Road
The Dark Side of the Moon
OK Computer

But users type things like:

beatles abbey rd
dark side moon pink floyd
ok computer radiohead 1997

How do you match these? A simple WHERE name = ? won’t cut it. You need something smarter.

I recently worked on a classification system that needed to match messy invoice line items to a product catalog. The patterns I learned there apply perfectly to music, books, or any catalog matching problem. I ended up using two PostgreSQL extensions: pg_trgm for fuzzy text matching and pgvector for semantic similarity search.

In this post, I’ll walk you through both approaches using a real dataset: 114,000 Spotify tracks from Hugging Face. You’ll be able to follow along with actual data.


The Dataset: Spotify Tracks

We’ll use the Spotify Tracks Dataset from Hugging Face. It contains 114,000+ tracks across 125 genres, with album names, artists, and popularity scores. It’s CC0 licensed (public domain), so you can use it freely.

The dataset has real-world messiness: album names like “Abbey Road (Remastered)”, “The Dark Side of the Moon (2011 Remaster)”, and “OK Computer OKNOTOK 1997 2017”. Perfect for testing our matching approaches.


Two Approaches, Two Extensions

Approach Extension What it does Best for
Fuzzy matching pg_trgm Compares character sequences (trigrams) Typos, abbreviations, word order
Semantic search pgvector Compares meaning via embeddings Synonyms, paraphrasing, conceptual similarity

pg_trgm breaks text into 3-character sequences and measures overlap. “Abbey Road” becomes {" ab", "abb", "bbe", "bey", "ey ", " ro", "roa", "oad", "ad "}. If two strings share many trigrams, they’re similar.

pgvector stores vector embeddings—numerical representations of meaning generated by machine learning models. “Abbey Road” and “The Beatles final album” might have similar vectors even though they share no words.

Let’s set them up.


Database Setup

Extensions and Table

-- Enable extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the catalog table
CREATE TABLE album_catalog (
    id SERIAL PRIMARY KEY,
    track_id VARCHAR(50),
    track_name VARCHAR(500) NOT NULL,
    artists VARCHAR(500),
    album_name VARCHAR(500),
    popularity INTEGER,
    album_normalized VARCHAR(500),       -- Cleaned version for fuzzy matching
    album_embedding vector(768)          -- For semantic search
);

Loading the Spotify Dataset

First, download the dataset and load it into PostgreSQL:

# load_spotify_data.py
from datasets import load_dataset
import psycopg2

# Download from Hugging Face (about 50MB)
print("Downloading Spotify dataset from Hugging Face...")
dataset = load_dataset("maharshipandya/spotify-tracks-dataset", split="train")

# Connect to PostgreSQL
conn = psycopg2.connect(
    host='localhost',
    database='music_catalog',
    user='your_user',
    password='your_password'
)
cursor = conn.cursor()

print(f"Loading {len(dataset)} tracks into PostgreSQL...")

# Insert tracks (deduplicate by album_name + artists)
seen_albums = set()
inserted = 0

for row in dataset:
    album_key = (row['album_name'], row['artists'])
    if album_key in seen_albums:
        continue
    seen_albums.add(album_key)
    
    cursor.execute("""
        INSERT INTO album_catalog (track_id, track_name, artists, album_name, popularity)
        VALUES (%s, %s, %s, %s, %s)
    """, (row['track_id'], row['track_name'], row['artists'], row['album_name'], row['popularity']))
    
    inserted += 1
    if inserted % 5000 == 0:
        print(f"  Inserted {inserted} albums...")

conn.commit()
print(f"Done! Loaded {inserted} unique albums.")
conn.close()

Run it:

pip install datasets psycopg2-binary
python load_spotify_data.py

This gives you around 50,000 unique albums to work with—plenty for testing both approaches.

Indexes

This is where the magic happens. Without proper indexes, both approaches would require full table scans.

-- GIN index for trigram similarity (pg_trgm)
CREATE INDEX idx_album_name_trgm 
ON album_catalog 
USING gin (album_normalized gin_trgm_ops);

-- IVFFlat index for vector similarity (pgvector)
CREATE INDEX idx_album_embedding 
ON album_catalog 
USING ivfflat (album_embedding vector_cosine_ops)
WITH (lists = 100);

A note on the lists parameter for IVFFlat: it controls how many clusters the index creates. More lists = faster queries but less accurate. The rule of thumb is lists = sqrt(rows) for small tables, or lists = rows / 1000 for larger ones. For a catalog of 50,000 albums, 100 lists is reasonable.


Approach 1: Fuzzy Matching with pg_trgm

Basic Similarity Query

SELECT 
    album_name,
    artists,
    similarity('abbey rd beatles', album_name) AS score
FROM album_catalog
WHERE similarity('abbey rd beatles', album_name) > 0.3
ORDER BY score DESC
LIMIT 5;

Result:

album_name artists score
Abbey Road (Remastered) The Beatles 0.48

The similarity() function returns a value between 0 and 1. I typically use 0.3 as a minimum threshold to filter noise, but this depends on your data.

Using the Index with the % Operator

The similarity() function doesn’t use the GIN index directly. For indexed queries, use the % operator:

-- Set the similarity threshold
SET pg_trgm.similarity_threshold = 0.3;

-- This query uses the index
SELECT album_name, artists, similarity('abbey rd', album_normalized) AS score
FROM album_catalog
WHERE album_normalized % 'abbey rd'
ORDER BY score DESC;

Screenshot suggestion: Show EXPLAIN ANALYZE output demonstrating the GIN index being used vs a sequential scan.

Strengths and Weaknesses

pg_trgm works well for:

  • Typos: “Abey Road” → “Abbey Road”
  • Abbreviations: “abbey rd” → “Abbey Road”
  • Missing words: “Dark Side Moon” → “The Dark Side of the Moon”
  • Word order variations: “Computer OK Radiohead” → “OK Computer”

pg_trgm struggles with:

  • Synonyms: “The Beatles last album” won’t match “Abbey Road”
  • Conceptual queries: “90s grunge Seattle” won’t match “Nevermind”
  • Completely different phrasing: “Floyd’s prism album” vs “The Dark Side of the Moon”

This is where semantic search comes in.


Approach 2: Semantic Search with pgvector

Generating Embeddings

Before you can search, you need to generate embeddings for your catalog. I use sentence-transformers with the all-mpnet-base-v2 model—it’s free, runs locally, and produces 768-dimensional embeddings.

# generate_embeddings.py
from sentence_transformers import SentenceTransformer
import psycopg2

# Load the model (downloads ~420MB on first run)
print("Loading embedding model...")
model = SentenceTransformer('all-mpnet-base-v2')

# Connect to your database
conn = psycopg2.connect(
    host='localhost',
    database='music_catalog',
    user='your_user',
    password='your_password'
)
cursor = conn.cursor()

# Fetch albums without embeddings
cursor.execute("""
    SELECT id, album_name, artists 
    FROM album_catalog 
    WHERE album_embedding IS NULL
""")
albums = cursor.fetchall()

print(f"Generating embeddings for {len(albums)} albums...")

batch_size = 100
for i in range(0, len(albums), batch_size):
    batch = albums[i:i + batch_size]
    
    # Combine album and artist for richer context
    texts = [f"{album} by {artist}" if artist else album 
             for _, album, artist in batch]
    
    # Generate embeddings in batch (much faster)
    embeddings = model.encode(texts)
    
    # Store in database
    for j, (album_id, album_name, _) in enumerate(batch):
        cursor.execute("""
            UPDATE album_catalog 
            SET album_embedding = %s 
            WHERE id = %s
        """, (embeddings[j].tolist(), album_id))
    
    conn.commit()
    print(f"  Processed {min(i + batch_size, len(albums))}/{len(albums)} albums")

conn.close()
print("Done!")

Run it:

pip install sentence-transformers
python generate_embeddings.py

On a decent CPU, this processes about 500 albums per minute. For 50,000 albums, expect around 90 minutes. Go grab a coffee.

Important: Run this once to populate your catalog. Embeddings are expensive to generate, so you don’t want to do it on every query.

Searching with Embeddings

def search_by_embedding(query, cursor, model, threshold=0.6):
    """Search using semantic similarity"""
    # Generate embedding for the query
    query_embedding = model.encode(query).tolist()
    
    cursor.execute("""
        SELECT 
            album_name,
            artists,
            1 - (album_embedding <=> %s::vector) AS similarity
        FROM album_catalog
        WHERE album_embedding IS NOT NULL
          AND 1 - (album_embedding <=> %s::vector) > %s
        ORDER BY similarity DESC
        LIMIT 5
    """, (query_embedding, query_embedding, threshold))
    
    return cursor.fetchall()

# Example
results = search_by_embedding("Beatles final studio album", cursor, model)
# Returns: Abbey Road (even though the query shares few words with the album name)

The <=> operator calculates cosine distance. Since we want similarity (not distance), we use 1 - distance.

Strengths and Weaknesses

pgvector works well for:

  • Synonyms and paraphrasing: “Beatles final album” → “Abbey Road”
  • Conceptual queries: “Pink Floyd concept album about madness” → “The Wall”
  • Natural language: “that Radiohead album from the late 90s” → “OK Computer”

pgvector struggles with:

  • Exact matches: It might rank a semantically similar but wrong album higher
  • Abbreviations: The model might not understand “rd” = “Road”
  • Domain-specific jargon: Unless the model was trained on it

Text Normalization: The Secret Sauce

Here’s what I learned the hard way: both approaches work better with clean input. Raw user input is full of noise that hurts matching accuracy. And the Spotify dataset is full of real-world messiness—album names like “The Dark Side of the Moon (2011 Remaster)” or “OK Computer OKNOTOK 1997 2017”.

The Normalization Pipeline

import re

# Common abbreviations in music/media
ABBREVIATIONS = {
    r'\bfeat\.?\b': 'featuring',
    r'\bft\.?\b': 'featuring',
    r'\bvol\.?\b': 'volume',
    r'\bpt\.?\b': 'part',
    r'\bv\.?\s*(\d)': r'volume \1',
    r'\bst\.?\b': 'saint',
    r'\b&\b': 'and',
}

# Noise patterns to remove (common in Spotify data)
NOISE_PATTERNS = [
    r'\(remaster(ed)?\s*\d*\)',            # (Remastered 2023)
    r'\(\d{4}\s*remaster(ed)?\)',          # (2011 Remaster)
    r'\[deluxe(\s+edition)?\]',             # [Deluxe Edition]
    r'\(deluxe(\s+edition)?\)',             # (Deluxe Edition)
    r'\s*-\s*single\b',                     # - Single
    r'\s*\[\d+[-/]\d+\]',                   # [Disc 1/2]
    r'\(anniversary(\s+edition)?\)',        # (Anniversary Edition)
    r'\(expanded(\s+edition)?\)',           # (Expanded Edition)
    r'\(bonus\s+track.*?\)',                # (Bonus Track Version)
    r'\(super\s+deluxe\)',                  # (Super Deluxe)
    r'\(\d{4}\s+re-?issue\)',               # (2021 Reissue)
    r'\s+OKNOTOK\s+\d{4}\s+\d{4}',          # OKNOTOK 1997 2017
    r'\s*\(original\s+motion\s+picture.*?\)', # (Original Motion Picture Soundtrack)
]

# Articles to strip from the beginning
LEADING_ARTICLES = ['the', 'a', 'an']


def normalize_album(text):
    """
    Clean and normalize an album name for better matching.
    Works for both catalog entries and user queries.
    """
    if not text:
        return ''
    
    # Lowercase everything
    cleaned = text.lower().strip()
    
    # Remove noise patterns
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, '', cleaned, flags=re.IGNORECASE)
    
    # Expand abbreviations
    for pattern, replacement in ABBREVIATIONS.items():
        cleaned = re.sub(pattern, replacement, cleaned, flags=re.IGNORECASE)
    
    # Remove leading articles for better matching
    # "The Dark Side" and "Dark Side" should match
    for article in LEADING_ARTICLES:
        if cleaned.startswith(article + ' '):
            cleaned = cleaned[len(article) + 1:]
            break
    
    # Normalize whitespace
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    
    # Remove punctuation except alphanumeric and spaces
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    
    return cleaned


# Examples from the Spotify dataset
print(normalize_album("The Dark Side of the Moon (2011 Remaster)"))
# Output: "dark side of moon"

print(normalize_album("OK Computer OKNOTOK 1997 2017"))
# Output: "ok computer"

print(normalize_album("Abbey Road (Remastered)"))
# Output: "abbey road"

print(normalize_album("Sgt. Pepper's Lonely Hearts Club Band (Deluxe Edition)"))
# Output: "sgt peppers lonely hearts club band"

Populating the Normalized Column

# normalize_albums.py
import re
import psycopg2

# ... (normalize_album function from above) ...

conn = psycopg2.connect(
    host='localhost',
    database='music_catalog',
    user='your_user',
    password='your_password'
)
cursor = conn.cursor()

# Update all catalog entries with normalized album names
cursor.execute("SELECT id, album_name FROM album_catalog")
albums = cursor.fetchall()

print(f"Normalizing {len(albums)} albums...")
for album_id, album_name in albums:
    normalized = normalize_album(album_name)
    cursor.execute("""
        UPDATE album_catalog 
        SET album_normalized = %s 
        WHERE id = %s
    """, (normalized, album_id))

conn.commit()
conn.close()
print("Done!")

Now your fuzzy matches compare apples to apples. “Abbey Road (Remastered)” and “abbey rd” both normalize to something comparable.


Combining Both Approaches

The best results come from using both methods together. My strategy:

  1. Try fuzzy matching first (fast, uses index)
  2. If score is low, fall back to embeddings (slower, more accurate for hard cases)
  3. Use the better result
def search_catalog(query, cursor, model, fuzzy_threshold=0.3, embedding_threshold=0.6):
    """
    Hybrid search: fuzzy first, embeddings as fallback
    """
    normalized_query = normalize_album(query)
    
    # Step 1: Try fuzzy matching
    cursor.execute("""
        SELECT 
            id, album_name, artists,
            similarity(%s, album_normalized) AS score,
            'fuzzy' AS method
        FROM album_catalog
        WHERE similarity(%s, album_normalized) > %s
        ORDER BY score DESC
        LIMIT 1
    """, (normalized_query, normalized_query, fuzzy_threshold))
    
    fuzzy_result = cursor.fetchone()
    
    # If fuzzy score is high enough, return it
    if fuzzy_result and fuzzy_result[3] >= 0.65:
        return fuzzy_result
    
    # Step 2: Try semantic search
    query_embedding = model.encode(query).tolist()
    
    cursor.execute("""
        SELECT 
            id, album_name, artists,
            1 - (album_embedding <=> %s::vector) AS score,
            'embedding' AS method
        FROM album_catalog
        WHERE album_embedding IS NOT NULL
          AND 1 - (album_embedding <=> %s::vector) > %s
        ORDER BY score DESC
        LIMIT 1
    """, (query_embedding, query_embedding, embedding_threshold))
    
    embedding_result = cursor.fetchone()
    
    # Return the better result
    if not fuzzy_result:
        return embedding_result
    if not embedding_result:
        return fuzzy_result
    
    # Both have results - return the one with higher confidence
    return fuzzy_result if fuzzy_result[3] >= embedding_result[3] else embedding_result


# Examples
print(search_catalog("abbey rd", cursor, model))
# → Fuzzy match: Abbey Road (Remastered) | The Beatles (score: 0.71)

print(search_catalog("beatles last studio album", cursor, model))
# → Embedding match: Abbey Road (Remastered) | The Beatles (score: 0.73)

print(search_catalog("that radiohead album with paranoid android", cursor, model))
# → Embedding match: OK Computer | Radiohead (score: 0.68)

Performance Considerations

pg_trgm

  • Index type: GIN is recommended over GiST for read-heavy workloads
  • Index size: Roughly 2-3x the size of the indexed column
  • Query speed: Sub-millisecond for most queries with proper indexes
  • Maintenance: Indexes update automatically on INSERT/UPDATE
-- Check index size
SELECT pg_size_pretty(pg_relation_size('idx_album_name_trgm'));

pgvector

  • Index type: IVFFlat is faster but approximate; HNSW is more accurate but uses more memory
  • Embedding generation: ~50ms per text with all-mpnet-base-v2 on CPU (~500 albums/minute in batch)
  • Query speed: 1-10ms depending on table size and index parameters
  • Maintenance: Re-index if you add many new rows (>10% of original size)
-- Check if index needs rebuilding
SELECT reltuples FROM pg_class WHERE relname = 'album_catalog';
-- If this is significantly higher than when you created the index, rebuild it

Screenshot suggestion: Side-by-side EXPLAIN ANALYZE comparing fuzzy vs embedding search times on the same query.


When to Use Which

Scenario Recommended Approach
Typo correction pg_trgm
Abbreviation expansion pg_trgm (with normalization)
Natural language queries pgvector
“Find similar items” pgvector
Autocomplete/typeahead pg_trgm
Multi-language support pgvector (with multilingual model)
Limited compute resources pg_trgm
Cold data (no ML model loaded) pg_trgm

For most real-world applications, start with pg_trgm. It’s simpler, faster, and handles the majority of cases. Add pgvector when you need semantic understanding or when fuzzy matching isn’t cutting it.


Choosing an Embedding Model

I used all-mpnet-base-v2 because it’s a solid general-purpose model. But the choice matters:

Model Dimensions Speed Quality Use Case
all-MiniLM-L6-v2 384 Fast Good Large catalogs, autocomplete
all-mpnet-base-v2 768 Medium Better General purpose
Domain-specific models Varies Varies Best for domain Medical, legal, scientific

For example, if you were matching pharmaceutical products, you might use SapBERT (cambridgeltl/SapBERT-from-PubMedBERT-fulltext)—a model specifically trained on biomedical terminology.

Browse models at Hugging Face sentence-transformers.


Conclusion

Matching messy input to clean catalogs is a solved problem in PostgreSQL—you just need the right tools:

  • pg_trgm for fast, character-level fuzzy matching
  • pgvector for semantic, meaning-based similarity
  • Text normalization to clean up noise before matching
  • Hybrid approach for the best of both worlds

We tested this with 50,000 real albums from the Spotify dataset, and the patterns here work for music catalogs, book databases, product matching, or any scenario where users type imprecise queries against structured data.

All the code in this post runs on standard PostgreSQL with two extensions. No external search engines, no Elasticsearch, no complex infrastructure. Just your database doing what databases do best.

The full code is available as a set of Python scripts you can run against your own PostgreSQL instance. Download the Spotify dataset from Hugging Face, load it up, and start experimenting.


Have questions or want to share your own catalog matching war stories? Drop a comment below.