AI Personalization in E-Commerce: What the Engineering Actually Looks Like
AI personalization is the feature every platform vendor has been pitching for the past two years. We've built these systems from scratch and bolted them onto existing stacks, and the honest version is a lot messier than the demos suggest. Cold start, latency tradeoffs, training data that doesn't match your serving context: the problems are real and they're not small. Here's what we've learned.
A simplified personalization pipeline. Each handoff — from event collection to feature store to model to serving API — carries its own latency and staleness. The gap between "data collected" and "recommendation served" is bigger than most teams expect.
Cold start is not an edge case
We keep seeing teams treat cold start as a footnote. It isn't. On most e-commerce stores, somewhere between 40 and 60 percent of sessions belong to users the system has never seen before. Add in new product launches and catalog churn, and you're regularly serving recommendations for items with no purchase history at all. The model doesn't just lack signal; the absence of signal actively misleads it, because an item with zero interactions is indistinguishable from an item nobody wants.
The fallback strategy matters more than teams give it credit for. Returning global bestsellers is the floor, not the answer. We've had better results layering in category-level affinity from the current session, referral source, and — where you can get it quickly — early browse behavior in the first few page views. It's not glamorous engineering, but it moves the number.
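A minimal sketch of that layered fallback, assuming an in-memory catalog keyed by product id and a `referral_category` field resolved upstream (both the data shapes and the function name are ours, not any library's API):

```python
from collections import Counter

def cold_start_recs(session, catalog, bestsellers, k=3):
    # Layer 1: category affinity from early browse behavior in this session.
    viewed = [pid for pid in session.get("viewed", []) if pid in catalog]
    if viewed:
        top_cat = Counter(catalog[p]["category"] for p in viewed).most_common(1)[0][0]
        in_cat = [p for p in bestsellers
                  if catalog[p]["category"] == top_cat and p not in viewed]
        if in_cat:
            return in_cat[:k]
    # Layer 2: referral source mapped to a category upstream (hypothetical field).
    ref_cat = session.get("referral_category")
    if ref_cat:
        in_cat = [p for p in bestsellers if catalog[p]["category"] == ref_cat]
        if in_cat:
            return in_cat[:k]
    # Floor: global bestsellers. Never return an empty widget.
    return bestsellers[:k]
```

Each layer only fires when it has signal, so the worst case degrades to bestsellers rather than to nothing.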
You can't run inference per request
A recommendation widget that adds 400ms to page load will cost you more in conversion than the personalization gains back. The serving layer needs to be fast — we target under 50ms at p99. That almost always means pre-computed recommendations served from cache, not live inference.
Which means your recommendations are only as fresh as your last batch run. For most use cases — homepage carousels, category sorting, email campaigns — a 15-minute or even hourly lag is fine. Where it breaks down is real-time signals: someone who just abandoned a cart, or who just completed a purchase and you want to upsell immediately. Those require a hybrid approach, and that hybrid is where most of the implementation complexity lives.
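One way to structure that hybrid, sketched with in-memory dicts standing in for the cache and the event stream (the store names, event fields, and staleness threshold are illustrative assumptions, not a prescribed design):

```python
import time

PRECOMPUTED = {}       # user_id -> (recs, computed_at), written by the batch job
REALTIME_EVENTS = {}   # user_id -> latest high-value event from the stream

def serve_recs(user_id, fallback, max_staleness_s=3600):
    # Real-time signals override the batch output: an abandoned cart is
    # worth more than an hour-old ranking.
    event = REALTIME_EVENTS.get(user_id)
    if event and event.get("type") == "cart_abandoned":
        return event["cart_items"]
    # Otherwise serve the precomputed recs, but only while they're fresh.
    cached = PRECOMPUTED.get(user_id)
    if cached and time.time() - cached[1] < max_staleness_s:
        return cached[0]
    # Cache miss or stale entry: fall back to a non-personalized default.
    return fallback
```

The cache lookup path is what keeps p99 low; the event check adds one key lookup, which is cheap enough to run on every request.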
Collaborative filtering vs. embedding models
Collaborative filtering ("users who bought X also bought Y") is the right starting point for stores with dense purchase data. It's well-understood, straightforward to implement, and genuinely works. It falls apart on thin catalogs, products with fewer than a handful of purchases, and anything highly seasonal.
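The core of item-item collaborative filtering fits in a few lines: count co-purchases across orders, then rank each item's neighbors by count. This is a sketch of the idea only; production versions normalize for item popularity and prune low-count pairs:

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_counts(orders):
    # orders: iterable of sets of product ids bought in the same order.
    counts = defaultdict(lambda: defaultdict(int))
    for order in orders:
        for a, b in combinations(sorted(order), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def also_bought(counts, item, k=3):
    # "Users who bought X also bought Y", ranked by co-purchase count.
    neighbors = sorted(counts[item].items(), key=lambda kv: -kv[1])
    return [pid for pid, _ in neighbors[:k]]
```

The sparsity problem is visible right here: an item that appears in two orders has at most a couple of neighbors, and its ranking is basically noise.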
Embedding-based models — Two-Tower and BERT4Rec are the architectures we use most — handle sparse data better because they encode item attributes rather than relying purely on co-purchase history. New products get reasonable recommendations from day one. The tradeoff: they're more expensive to train, harder to debug when something goes wrong, and they need more infrastructure. There's no clean answer on which to choose. It depends entirely on your catalog size, session volume, and honestly, your team's ML experience.
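The serving-side idea behind a two-tower model can be sketched without any training code: users and items map to vectors in the same space, and because the item tower consumes attributes, a brand-new product gets a usable vector before its first sale. Everything below (the averaging stand-in for the item tower, the toy vectors) is illustrative, not either architecture as published:

```python
def embed_item(item, attr_vectors, dim=4):
    # Stand-in for a trained item tower: average the learned vectors of the
    # item's attributes. New products with known attributes embed on day one.
    vecs = [attr_vectors[a] for a in item["attrs"] if a in attr_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def score(user_vec, item_vec):
    # Retrieval score is a dot product, which is what makes approximate
    # nearest-neighbor indexes usable at serving time.
    return sum(u * i for u, i in zip(user_vec, item_vec))

def rank(user_vec, items, attr_vectors, k=2):
    scored = [(it["id"], score(user_vec, embed_item(it, attr_vectors)))
              for it in items]
    return [pid for pid, _ in sorted(scored, key=lambda x: -x[1])[:k]]
```

The debugging cost mentioned above shows up exactly here: when a ranking looks wrong, the explanation is buried in learned vectors rather than in a legible co-purchase count.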
Where lift actually comes from
Homepage and category page sorting give the biggest returns — surfacing items a user is more likely to buy toward the top of a ranked list. Triggered emails are second: abandoned cart, post-purchase, browse abandonment. "Related items" widgets on product pages are the most visible feature and often the first thing teams build, but in our experience the direct conversion impact is modest. Discovery is real, but it's harder to attribute.
If you're building from scratch, don't try to do it all at once. Instrument sessions properly. Build behavioral cohorts. Ship popularity-based recommendations first and measure them. Layer in collaborative filtering once the pipeline is stable. Every team we've seen try to skip straight to the ML layer has had to go back and rebuild the data plumbing anyway.
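The "ship popularity first" step doesn't have to be naive. A time-decayed popularity score handles seasonality and catalog churn better than raw counts; here's a sketch, where the half-life is an assumption you'd tune per store:

```python
import math
import time

def popularity_scores(purchase_events, half_life_days=7, now=None):
    # purchase_events: iterable of (product_id, unix_ts). Each purchase
    # contributes exp(-lambda * age), so last week's sales outweigh last
    # quarter's without any hard cutoff date.
    now = time.time() if now is None else now
    lam = math.log(2) / (half_life_days * 86400)
    scores = {}
    for pid, ts in purchase_events:
        scores[pid] = scores.get(pid, 0.0) + math.exp(-lam * (now - ts))
    return scores
```

This baseline is also the measurement yardstick: every later model has to beat it in an A/B test, not just in offline metrics.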
The failure modes we see most
The pattern we keep seeing: a personalization system that tests fine in staging and shows flat results in production. Nine times out of ten, the training data doesn't match the serving context. The model was trained on logged-in sessions but most production traffic is anonymous. Or the feature pipeline has enough lag that the "personalized" recommendations are stale by the time they're served — essentially random.
The other one: no holdout group. You can't know if personalization is helping if you've never run it against a control. We've audited stores where a recommendation widget had been running for over a year with no A/B framework, and nobody could tell whether it was contributing anything at all. Set up the measurement before you optimize.
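The holdout itself is cheap to build: hash each user into a stable bucket so the control group never accidentally sees personalized output, and log the arm with every impression. A sketch, with the experiment name and split percentage as placeholders:

```python
import hashlib

def in_holdout(user_id, experiment="recs_holdout", holdout_pct=10):
    # Hash user + experiment name: buckets are stable across requests and
    # independent across experiments, with no assignment state to store.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < holdout_pct

def recs_with_holdout(user_id, personalized, control):
    # Control users get the non-personalized baseline; conversion is then
    # compared per bucket, not against last month's numbers.
    return control if in_holdout(user_id) else personalized
```

Deterministic assignment matters more than it looks: random-per-request bucketing leaks personalization into the control group and quietly biases the comparison toward "no effect."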
Next step
Working on a complex commerce system?
We help engineering teams design, build, and scale high-load platforms — with a clear process and predictable delivery.
Let's talk