
Vector Search for E-Commerce: Why Keyword Search Breaks at Scale and What to Build Instead

Flexor Engineering
March 19, 2026
13 min read

We've rebuilt search from scratch on catalogs ranging from 30,000 SKUs to over two million, and the pattern is consistent: keyword-based search holds up fine at small scale, then falls apart in ways that are hard to diagnose. Queries return zero results for products that absolutely exist. Synonyms get missed. A user types "running sneakers" and gets back dress shoes because the word "shoes" appears in the description. This is not a configuration problem. It's a fundamental limitation of how inverted-index search works — and vector search is the right tool for fixing it, not a silver bullet you can drop in without thinking.

[Diagram: hybrid search pipeline, simplified. The query "running shoes" splits into a keyword branch (BM25 over an inverted index) and a vector branch (embedding model such as e5-large or ada-002 feeding an HNSW ANN index such as Qdrant). The branches merge via RRF into a top-100 candidate set, which passes through a reranker (cross-encoder plus rules) to produce results.]

Hybrid search pipeline: the same query travels two paths simultaneously — keyword (BM25/inverted index) and vector (embedding model → approximate nearest neighbor index) — then both result sets merge and pass through a reranker before reaching the user. The split happens at query time; index building is a separate offline process.

What BM25 search actually does, and where it breaks

BM25 — the ranking function underneath Elasticsearch, OpenSearch, and most self-hosted Solr setups — scores documents by term frequency and inverse document frequency. Terms that appear often in a document and rarely across the corpus get high weight. It's an elegant algorithm with 30 years of production use behind it, and for a lot of workloads it's exactly the right choice.

The problem shows up clearly at scale. A catalog of 50,000+ SKUs accumulates long-tail queries — niche product names, brand abbreviations, colloquial terms, typos — that the index has never seen as literal strings. BM25 can't infer that "NB 574" and "New Balance 574" describe the same shoe. It doesn't know that "charger" and "charging cable" are related. It's purely lexical: if the token isn't in the index, it contributes nothing to the score.
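To make the lexical limitation concrete, here is a minimal BM25 scorer in plain Python. It's a sketch, not the tuned implementation inside Elasticsearch, but the failure mode is identical: a query token absent from a document contributes exactly zero, no matter how semantically related it is.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25
    (Lucene-style IDF). Purely lexical: unseen tokens score nothing."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue                # token not in doc -> zero contribution
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "new balance 574 running shoes".split(),
    "leather dress shoes".split(),
]
# "nb" never appears literally, so only the "574" overlap is credited;
# "sneakers" matches nothing at all and scores zero everywhere.
print(bm25_scores("nb 574".split(), docs))
print(bm25_scores("sneakers".split(), docs))
```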

The failure modes we see most often: zero-result pages for queries that should return products (synonym blindness); users getting unrelated products because they share one high-frequency token with the query (intent mismatch); and ranked results that look technically correct but put the most relevant item on page three because it uses different terminology than the query. All three kill conversion, and all three get worse as the catalog grows. A store with 5,000 products can paper over these issues with aggressive synonym dictionaries. At 200,000 products, that approach becomes unmaintainable.

What vector search actually does

Vector search doesn't match tokens. It converts both queries and documents into dense numerical vectors — typically 384 to 1536 dimensions depending on the model — and finds documents whose vectors are close to the query vector in that high-dimensional space. "Close" means cosine similarity above some threshold, or more practically, the top-K nearest neighbors.

The embeddings come from a language model trained on massive amounts of text, which means the model has already learned that "sneakers" and "running shoes" live near each other in vector space, that "NB 574" maps to the New Balance brand, and that "charging cable" is related to "USB-C charger." You get this semantic understanding essentially for free, without maintaining synonym dictionaries.

In practice, you pre-compute embeddings for every product in your catalog — title, description, attributes, category path — concatenate or pool them, and store the resulting vector in an index built for nearest-neighbor search. At query time, you embed the search query using the same model, then query the index. The whole thing runs in single-digit milliseconds once the index is warm. The part that's not free: choosing the right embedding model for your domain, keeping embeddings current as the catalog changes, and infrastructure to serve the ANN index at low latency.
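A brute-force sketch of that query-time flow, with toy 3-dimensional vectors standing in for a real model's 384-1536-dimensional output (the embedding step is stubbed out as hand-written vectors):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    """Exact nearest-neighbor search by cosine similarity. A real
    deployment swaps this linear scan for an ANN index (HNSW via
    Qdrant, pgvector, etc.); the interface stays the same."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy catalog: pre-computed "embeddings" keyed by product ID.
index = {
    "nb-574":      [0.9, 0.1, 0.0],   # running shoe
    "dress-shoe":  [0.1, 0.9, 0.0],
    "usb-charger": [0.0, 0.1, 0.9],
}
query_vec = [0.85, 0.15, 0.05]        # hypothetical embed("running sneakers")
print(top_k(query_vec, index, k=2))   # → ['nb-574', 'dress-shoe']
```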

Hybrid search: why you almost always need both

Where pure vector search fails

Pure vector search has real weaknesses in commerce. Exact lookup is one of them. If a user searches for a specific product ID — "WD-40 300011" or "SKU-RF2291-BLK" — BM25 finds it trivially. A vector model may not: it was trained on natural language, not catalog codes, and the embedding for a SKU is essentially noise. You end up with situations where a precise query returns semantically similar but wrong results.

Rare proper nouns are the other big gap. Obscure brand names, niche model numbers, and product names that appear in your catalog but aren't well-represented in the model's training data will have low-quality embeddings. The model doesn't know what "Ridgid 18V" means if it's seen it only a handful of times during pretraining.

Combining BM25 and vector with reciprocal rank fusion

The standard approach is to run both retrievers in parallel and merge the result lists. Reciprocal rank fusion (RRF) is the merge strategy we reach for first: for each document, you compute a score based on its rank in each list — 1/(k + rank), where k is typically 60 — then sum across lists. It's robust to score scale differences between BM25 and cosine similarity, doesn't require tuning weights per query type, and works well across a wide range of catalog configurations. We've compared it against linear combination approaches and it consistently comes out ahead or equivalent with a lot less parameter sensitivity.
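The merge itself is only a few lines. A minimal RRF implementation, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal rank fusion: each document's score is the sum of
    1/(k + rank) over every list it appears in (rank is 1-based,
    k=60 is the conventional default). Robust to the incomparable
    score scales of BM25 and cosine similarity because it only
    looks at positions, never raw scores."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["sku-42", "sku-07", "sku-99"]
vector_hits = ["sku-07", "sku-13", "sku-42"]
# sku-07 wins (ranks 2 and 1); sku-42 is next (ranks 1 and 3);
# appearing in both lists beats a single mid-rank appearance.
print(rrf_merge([bm25_hits, vector_hits]))
# → ['sku-07', 'sku-42', 'sku-13', 'sku-99']
```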

Relative weight and query routing

One refinement worth implementing: route different query types to different blends. Queries that look like exact identifiers (alphanumeric with specific patterns) get routed to BM25-heavy mode. Natural language queries — full sentences, conversational phrasing, questions — skew heavily toward the vector side. This doesn't require a fancy classifier; a few regex rules and query length heuristics get you 80% of the way there. The rest you tune based on real query logs.
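A sketch of such a router. The digit-token heuristic and the length threshold are illustrative assumptions, not tuned values; in practice you'd calibrate both against your own query logs.

```python
def route(query: str) -> str:
    """Route a query to a retrieval blend: 'keyword' (BM25-heavy),
    'vector' (embedding-heavy), or 'hybrid' (both, merged with RRF)."""
    toks = query.strip().split()
    digit_toks = [t for t in toks if any(c.isdigit() for c in t)]
    # Identifier-like: short query where nearly every token carries digits.
    if digit_toks and len(toks) <= 3 and len(digit_toks) >= len(toks) - 1:
        return "keyword"
    # Conversational: long or question-shaped phrasing.
    if len(toks) >= 5 or query.strip().endswith("?"):
        return "vector"
    return "hybrid"

print(route("WD-40 300011"))             # → keyword
print(route("SKU-RF2291-BLK"))           # → keyword
print(route("how to clean suede shoes")) # → vector
print(route("running sneakers"))         # → hybrid
```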

[Diagram: reranking layer, detail. Candidates flow into a cross-encoder (BGE-Reranker or Cohere) that scores each query ⊕ doc pair for relevance (~80-120ms on CPU), then business rules apply as multipliers (stock penalty, margin boost, promo flag, freshness window) before the final results.]

Reranking sits between retrieval and the final result list. The cross-encoder sees the full query-document pair, which is slower but far more accurate than embedding cosine similarity alone. Business rules (margin, stock, promotions) apply as a final multiplier on top of the semantic score.

Reranking: where relevance meets business logic

Retrieval — BM25 and ANN combined — gives you a candidate set, typically the top 50 to 200 documents. Reranking re-scores that candidate set with a more expensive model and applies business rules on top. It's computationally feasible because you're scoring dozens of results, not scanning millions.

  • Cross-encoder rerankers — models like BGE-Reranker or Cohere Rerank see the full query-document pair concatenated, not separate embeddings. This is slower but much more accurate; the model can reason about how a specific query relates to a specific document text, not just average semantic proximity.
  • Personalization signals — click-through rate, purchase rate, and user affinity scores from your recommendation system can be blended in as multiplicative or additive factors on top of the semantic score.
  • Inventory and availability — out-of-stock items should be suppressed or pushed to the bottom. Zero stock = near-zero result rank, regardless of semantic relevance. We've seen stores send high-intent traffic to OOS products because relevance scoring ignored stock.
  • Margin and promotion flags — if a product is on promotion or carries higher margin, surfacing it slightly higher is a legitimate business rule. Keep it subtle enough that it doesn't override genuine relevance, or you'll erode trust.
  • Freshness — new arrivals often perform well with an early traffic boost, but that signal decays as click and purchase data accumulates and the product can rank on its own merits.

The order matters: run retrieval first (fast, broad), rerank the candidates (slower, precise), apply business rules last (deterministic overrides). Mixing business rules into the retrieval phase is tempting but creates debugging nightmares later.
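That ordering can be sketched as follows. `cross_encoder_score` is a stand-in for a real model call, and the multiplier values are illustrative, not recommendations:

```python
def apply_business_rules(item, score):
    """Deterministic multipliers applied AFTER semantic reranking.
    The factors below are illustrative; calibrate against your metrics."""
    if item["stock"] == 0:
        return score * 0.01      # near-zero rank for OOS, never hidden
    if item.get("promo"):
        score *= 1.05            # subtle boost only; don't override relevance
    return score

def rerank(query, candidates, cross_encoder_score):
    # 1) expensive semantic score, feasible because the set is small
    scored = [(c, cross_encoder_score(query, c["text"])) for c in candidates]
    # 2) business rules as a final multiplier on the semantic score
    scored = [(c, apply_business_rules(c, s)) for c, s in scored]
    return [c for c, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

def fake_score(query, text):
    """Stand-in for a real cross-encoder call (token overlap only)."""
    return len(set(query.split()) & set(text.split()))

candidates = [
    {"id": "a", "text": "trail running shoes", "stock": 0},
    {"id": "b", "text": "road running shoes", "stock": 12, "promo": True},
]
# The out-of-stock item sinks despite equal semantic relevance.
print([c["id"] for c in rerank("running shoes", candidates, fake_score)])
# → ['b', 'a']
```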

The cold-start problem for new products

New products have no click data, no purchase history, no reviews. For collaborative filtering-based systems this is a known problem with known solutions. For search, it's a different issue: the product will surface in vector search immediately (it has an embedding from day one), but it has no behavioral signal to boost it in a reranker. Fresh products can rank below stale, less relevant items simply because those items have click data.

We handle this with a freshness window — new products get a static score boost for their first 7 to 14 days in the catalog, calibrated so it doesn't push completely irrelevant items to the top but gives genuinely relevant new products a fair shot at visibility. After the window closes, the product ranks purely on its semantic relevance and behavioral signal.
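A minimal version of that window as a score multiplier. The 1.15 peak and the linear decay are illustrative choices, not calibrated values:

```python
def freshness_boost(days_in_catalog, window_days=14, boost=1.15):
    """Static score multiplier for a product's first window_days days.
    Calibrate the peak so the boost gives relevant new products
    visibility without lifting irrelevant items past relevant ones."""
    if days_in_catalog >= window_days:
        return 1.0  # window closed: rank on relevance + behavior alone
    # full boost on day 0, decaying linearly to nothing at window close
    return 1.0 + (boost - 1.0) * (1 - days_in_catalog / window_days)

print(freshness_boost(0), freshness_boost(7), freshness_boost(30))
```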

Catalog enrichment also matters here. Products with thin content — a title, a SKU, no description — produce weak embeddings. Before indexing, we run a lightweight enrichment pass: pull any available attributes from the PIM, generate a short description if one doesn't exist, append category breadcrumbs and tag synonyms. This alone measurably improves recall for new products. The content quality of your product data is directly proportional to the quality of your search embeddings.
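A sketch of what the enrichment pass produces, i.e. the text that actually gets embedded. The field names assume a hypothetical catalog schema:

```python
def build_embedding_text(product, pim_attrs=None, synonyms=None):
    """Assemble the document text to embed for one product.
    Thin content produces weak embeddings, so concatenate everything
    useful: title, description, attributes, breadcrumbs, synonyms."""
    parts = [product.get("title", "")]
    if product.get("description"):
        parts.append(product["description"])
    for key, value in (pim_attrs or {}).items():
        parts.append(f"{key}: {value}")          # attributes from the PIM
    if product.get("category_path"):
        parts.append(" > ".join(product["category_path"]))  # breadcrumbs
    if synonyms:
        parts.append(" ".join(synonyms))         # tag synonyms
    return ". ".join(p for p in parts if p)

text = build_embedding_text(
    {"title": "NB 574", "category_path": ["Shoes", "Running"]},
    pim_attrs={"color": "grey"},
    synonyms=["sneakers", "trainers"],
)
print(text)
```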

Latency and infrastructure: what you're actually choosing between

ANN indexes — Hierarchical Navigable Small World graphs (HNSW) are the standard — trade recall for speed. An exact nearest-neighbor search over a million vectors is too slow for interactive search; HNSW gets you 95-99% recall at millisecond latency by exploring a graph structure instead of scanning everything. The recall tradeoff is almost always worth it at commerce scale.

On infrastructure options: pgvector is a reasonable choice for catalogs under ~500k vectors where you want to minimize operational complexity and you're already running Postgres. It won't scale to millions of vectors at high QPS without read replicas and aggressive index tuning. Qdrant and Weaviate are purpose-built vector databases with proper HNSW implementations, filtering on metadata, and horizontal scaling — worth the operational overhead once you're past ~200k products or ~100 queries per second. Elasticsearch with dense_vector is a solid choice if you're already running ES and want to avoid a new service; the HNSW implementation is mature and the hybrid BM25+vector path is well-documented. Hosted services (Pinecone, Zilliz) remove infrastructure concerns but add cost per vector and per query that adds up fast on large catalogs.

Latency target: search results under 200ms at p95, with p99 under 400ms. That budget needs to cover embedding the query, running both retrievers, merging, and reranking. If you're adding a cross-encoder reranker, expect it to consume 80-120ms of that budget on a CPU — worth pre-warming the model and considering GPU inference for high-traffic stores.

What to build first: the 80/20 version

Before touching the index architecture, fix your query understanding layer. Strip stopwords (unless they're significant: "off" in "white off shoulder dress" matters). Correct obvious typos — a simple edit-distance spell checker catches the majority of them. Expand abbreviations and normalize brand names. Detect intent: is this a navigational query ("Nike store"), a product query ("white Nike Air Force 1 size 10"), or an informational query ("how to clean suede shoes")? Route them differently. We've seen query preprocessing improvements alone cut the zero-result rate from 8% to under 2% without touching the index at all.
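The edit-distance spell checker can be sketched with classic Levenshtein distance. A production version would also weight candidates by term frequency; this is the minimal distance-only form:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(token, vocabulary, max_dist=1):
    """Snap a query token to the nearest catalog vocabulary term,
    but only within max_dist edits; otherwise leave it alone."""
    if token in vocabulary:
        return token
    best = min(vocabulary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

vocab = {"sneakers", "charger", "balance"}
print(correct("sneekers", vocab))  # → sneakers
print(correct("xyzzy", vocab))     # → xyzzy (nothing close enough)
```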

Once query understanding is solid, add vector search as a parallel path to your existing keyword search. Don't replace BM25 — add to it. Measure NDCG (normalized discounted cumulative gain) and MRR (mean reciprocal rank) on a test set of queries with known-good results before and after. Track conversion per search click, not just click-through rate; CTR can improve while conversion drops if you're surfacing more irrelevant results that look relevant. Run a proper A/B test — hold out 10-20% of traffic — before fully committing. The last thing to invest in is the reranker. Start with RRF + basic business rules. Add a cross-encoder once the baseline hybrid retrieval is stable and you have enough query volume (5,000+ daily searches) to measure the lift accurately.
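Minimal implementations of the two offline metrics, for evaluating a fixed test set of queries with known-good results:

```python
import math

def mrr(runs):
    """Mean reciprocal rank over (ranked_list, relevant_set) pairs:
    the average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc in enumerate(ranked, 1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

def ndcg(ranked, rels, k=10):
    """NDCG@k given graded relevance judgments `rels` (doc -> gain)."""
    dcg = sum(rels.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], 1))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, 1))
    return dcg / idcg if idcg else 0.0

# First relevant hit at rank 2 for the only query -> MRR = 0.5.
print(mrr([(["sku-9", "sku-1"], {"sku-1"})]))
# Putting the highly relevant doc first scores higher than second.
print(ndcg(["a", "b"], {"a": 3, "b": 1}), ndcg(["b", "a"], {"a": 3, "b": 1}))
```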

Common mistakes worth avoiding

Embedding drift is underrated as a failure mode. Your embedding model is a snapshot. If you embedded your catalog with text-embedding-ada-002 in 2024 and you're querying with text-embedding-3-large in 2026, the vectors aren't comparable. Pick a model and stick with it, or build a pipeline to re-embed the full catalog when you upgrade. The same applies to fine-tuned models: fine-tune on your domain-specific data, re-embed everything, then deploy. Don't mix.

Single-language models on multilingual catalogs cause more grief than teams expect. A model trained primarily on English produces poor embeddings for French, German, or Arabic product descriptions. For multilingual stores, use multilingual models (multilingual-e5-large, LaBSE, or paraphrase-multilingual-mpnet-base-v2) and test retrieval quality separately per language. The quality gap between a language the model was heavily trained on versus one it wasn't can be stark.

Over-engineering the index before fixing query preprocessing is probably the most common mistake we see. Teams spend weeks evaluating Weaviate vs Qdrant vs Pinecone while their spell checker is producing garbage input. Query preprocessing sits upstream of all the fancy retrieval — garbage in, garbage out. Fix the input first, measure where your actual recall failures are, then invest in the index layer that addresses those specific failures.
