LLMs in Commerce Backends: Real Use Cases, Integration Patterns, and What Most Teams Get Wrong
LLMs started showing up in our client work in late 2023 mostly as content generation experiments. By mid-2025, we'd built production integrations across a dozen commerce stacks — generating product descriptions at scale, extracting structured attributes from supplier PDFs, classifying support tickets before they hit a human agent. What's become clear from doing this is that the hype and the reality sit in very different places. LLMs are genuinely useful for a specific class of problems in commerce, and actively harmful for another class. The gap between those two is where most teams make expensive mistakes.
Figure: LLM integration architecture for commerce. Product data goes through preprocessing before reaching the LLM API; RAG context (product catalog, policy docs) is retrieved separately and injected into the prompt; output passes through a validation layer before landing as structured JSON or rendered content; a separate async pipeline handles bulk catalog jobs.
What LLMs are actually useful for in commerce
The useful zone is roughly: tasks that involve understanding or generating natural language, where the output has moderate stakes, where a human can review edge cases, and where the alternative is either manual effort or no capability at all. Product description generation, attribute extraction from unstructured supplier data, intent classification for support tickets, and catalog tagging all fit here.
The zone where LLMs cause problems: anything with hard numerical correctness requirements (pricing calculations, inventory forecasting, discount validation), anything safety-critical or fraud-adjacent, and anything where a confident wrong answer is worse than no answer. We've seen teams try to use GPT-4 to generate pricing rules. It produces grammatically perfect output that is arithmetically wrong 5% of the time — which in production translates to customers buying products below cost.
Real-time decisions that happen per request — "should I show this user this product at this price?" — are almost never a good fit for LLM inference. The latency alone rules it out. Fraud detection, return eligibility, shipping calculation: these are deterministic or ML-classification problems, not language understanding problems. An LLM will confidently give you an answer; it will not give you a reliable one. That distinction matters a lot in commerce.
Product content at scale
This is where we've done the most work. A catalog of 50,000 SKUs imported from a supplier typically arrives with titles that are effectively product codes, descriptions in broken English, missing attributes, and no SEO consideration whatsoever. Writing those descriptions manually is not economically viable. LLMs make it viable — but the pipeline around the model matters more than the model itself.
The pattern that works: extract whatever structured data exists (category, brand, dimensions, materials, existing attributes), construct a prompt that specifies tone, length, required elements, and output format (always JSON for structured fields, never free-form HTML), run the model, validate the output against a schema, flag confidence-low outputs for human review. For a client in home goods, we run this on ~2,000 new SKUs per week. The model generates first drafts in a batch job overnight; a single editor reviews the flagged ~5% that have schema violations, hallucinated specs, or tone drift the next morning.
A few things that make this work in practice: few-shot examples in the prompt matter more than model choice. Three well-crafted examples of (product data → good description) push output quality dramatically higher than carefully tuned system prompts alone. Output schema enforcement is non-negotiable — if you let the model return free-form text, you'll get unparseable output in production eventually. And for high-margin categories (luxury goods, professional tools), keep a human in the loop on the first pass. The cost of a hallucinated spec on a $3,000 camera lens is very different from the cost of one on a $15 phone case.
RAG for commerce knowledge
When retrieval-augmented generation beats fine-tuning
RAG — connecting a language model to a searchable knowledge base at inference time — is almost always the right choice for commerce applications over fine-tuning, and the reason is practical: your product catalog, support policies, and return rules change constantly. Fine-tuned models are static. Retraining every time the return window policy changes is not a pipeline you want to maintain.
We've built RAG pipelines for customer-facing chat assistants that pull from three sources in parallel: the product catalog (for product-specific questions), support documentation (for policy questions), and order history (for account-specific questions). The retrieval step matters a lot: you want semantic search over the knowledge base (typically via vector embeddings), not keyword search, because users phrase questions nothing like how policies are written.
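The parallel-retrieval step can be sketched like this. The retriever callables are assumptions standing in for whatever search layer backs each source (a vector store query, a catalog API, an order lookup); only the fan-out-and-merge shape is the point.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def retrieve_context(question: str, retrievers: dict[str, Callable[[str], list[str]]]) -> str:
    """Query every knowledge source in parallel and merge the hits into a
    single context block for the prompt. Sources returning no chunks are
    omitted so the model never sees an empty, misleading section."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        futures = {name: pool.submit(fn, question) for name, fn in retrievers.items()}
        sections = []
        for name, fut in futures.items():
            chunks = fut.result()
            if chunks:
                sections.append(f"## {name}\n" + "\n".join(chunks))
    return "\n\n".join(sections)
```

The merged string is then injected into the prompt ahead of the user's question, with each section labeled by source so the model can cite where an answer came from.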
What RAG doesn't solve
RAG doesn't fix hallucinations on facts not in the retrieved context. If the retrieved chunks don't contain the answer, a capable LLM will often invent a plausible one. The fix is to prompt the model to say "I don't know" or "I couldn't find that in our documentation" when the retrieved context doesn't support an answer, which requires including explicit negative examples in your prompt and testing for coverage gaps in the knowledge base. This is slower to build than it sounds. We've spent more time on retrieval quality and the "I don't know" behavior than on the generation itself.
Customer support automation
Intent classification — figuring out whether an incoming ticket is about a return, a shipping delay, a payment issue, or a product question — is a task LLMs do well, and it's genuinely valuable. Routing the wrong ticket to the wrong team is a silent cost most operations teams know about but can't easily quantify. We've seen classification accuracy above 94% on commerce ticket datasets with a well-prompted LLM, compared to the 78-82% typical of older pattern-matching approaches.
The handoff logic matters more than the classification. You need clear rules: which intents go to automation, which require a human, and which should escalate immediately regardless of channel. Returns involving damaged goods go to a human — always. Disputes and suspected fraud go to a specialist team — always. A simple "where is my order" query for a shipped order can be fully automated. Mixing these up costs more than not automating at all.
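Those handoff rules are deliberately dumb, deterministic code that runs before any automation. A sketch, with illustrative intent names and order flags:

```python
from enum import Enum

class Route(Enum):
    AUTOMATE = "automate"
    HUMAN = "human"
    SPECIALIST = "specialist"

def route_ticket(intent: str, flags: set[str]) -> Route:
    """Hard-coded handoff rules, evaluated before any automation runs.
    `intent` comes from the LLM classifier; `flags` come from deterministic
    checks on the order record (e.g. 'damaged', 'suspected_fraud', 'shipped')."""
    if "suspected_fraud" in flags or intent == "dispute":
        return Route.SPECIALIST      # disputes and fraud: specialist team, always
    if intent == "return" and "damaged" in flags:
        return Route.HUMAN           # damaged-goods returns always get a human
    if intent == "order_status" and "shipped" in flags:
        return Route.AUTOMATE        # simple WISMO query: fully automatable
    return Route.HUMAN               # default to a human on anything unmatched
```

The important design choice is the last line: when the rules don't match, the ticket goes to a person, not to the bot.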
Multi-turn conversation is where things get complicated. You need to persist conversation state across turns — which intent has been identified, what data has been collected, what actions have been taken — and the LLM context window fills up fast if you're including conversation history naively. We use a structured state object that gets summarized back into the context rather than appending raw turn history. It also makes debugging possible: you can inspect the state object and understand exactly where in the flow a conversation went wrong.
Integration patterns: sync, async, and streaming
Synchronous LLM calls belong in a narrow set of use cases: real-time user-facing responses where the user is waiting and the operation can't be batched. Customer support chat, product Q&A widgets, search query expansion. For everything else — content generation, catalog enrichment, bulk classification — async pipelines are the right pattern. You publish a job to a queue, a worker picks it up, calls the API, writes the result, marks the job done. This decouples your web tier from LLM API latency and lets you control throughput.
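The async pattern reduces to a queue and a pool of workers. This sketch uses an in-process `queue.Queue` and threads so it stands alone; in production the queue would be SQS, RabbitMQ, or similar, and results would land in a database rather than a dict.

```python
import queue
import threading

def worker(jobs: queue.Queue, results: dict, call_model) -> None:
    """Drain enrichment jobs from the queue; each job is (job_id, prompt).
    A None sentinel tells the worker to shut down."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, prompt = item
        results[job_id] = call_model(prompt)   # the slow LLM API call
        jobs.task_done()

def run_batch(prompts: dict, call_model, workers: int = 4) -> dict:
    """Publish every job, let the worker pool drain the queue, return results.
    The worker count is how you control throughput against API rate limits."""
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    threads = [threading.Thread(target=worker, args=(jobs, results, call_model))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for job_id, prompt in prompts.items():
        jobs.put((job_id, prompt))
    for _ in threads:          # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    return results
```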
Rate limiting and retry logic deserve more attention than they usually get. LLM APIs have token-per-minute and request-per-minute limits. At scale, you will hit them. Build exponential backoff with jitter into your client from day one, not as an afterthought. Track your token consumption per job type so you can predict costs and set circuit breakers before a runaway job drains your budget. We've seen an unexpected $8,000 charge because a retry loop ran overnight without a cap.
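Capped exponential backoff with full jitter is a few lines; the `RateLimitError` here is an assumed exception type you'd map provider 429 responses onto, not a real SDK class.

```python
import random
import time

class RateLimitError(Exception):
    """Raised by your API wrapper when the provider returns HTTP 429."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 60.0):
    """Retry `fn` on rate-limit errors with capped exponential backoff plus
    full jitter, re-raising after the final attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))   # full jitter
```

Full jitter (sleeping a random fraction of the backoff window) matters at scale: without it, a fleet of workers that got throttled together retries together and gets throttled again.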
Streaming responses — where the API returns tokens as they generate rather than waiting for the full completion — are valuable for chat interfaces where you want the UI to feel responsive. They're tricky to implement correctly: you need to handle partial JSON (if your output is structured), connection drops, and backpressure from the client. For streaming from OpenAI or Anthropic APIs to a browser, server-sent events (SSE) is the simplest transport. Avoid WebSockets for this unless you already have them in your stack; the overhead isn't worth it.
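The partial-JSON problem on the client side can be handled with a small accumulator: buffer tokens as they stream in and only act once the buffer parses. A sketch, assuming the output is a JSON object (a bare streamed number could parse prematurely); the SSE transport itself is omitted.

```python
import json

class StreamAccumulator:
    """Accumulate streamed tokens and expose the parsed object once the
    JSON is complete, so nothing downstream acts on a partial structure."""

    def __init__(self) -> None:
        self.buffer = ""
        self.parsed = None

    def feed(self, token: str) -> bool:
        """Append a token; return True once the buffer parses as JSON."""
        self.buffer += token
        try:
            self.parsed = json.loads(self.buffer)
            return True
        except json.JSONDecodeError:
            return False
```

For a chat UI you'd render the raw buffer as it grows for responsiveness, but only trigger structured behavior (tool calls, card rendering) once `feed` returns True.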
Prompt engineering for structured commerce outputs
Structured output is the core requirement for most commerce use cases. You can't use free-form text where your pipeline expects a JSON object with specific fields. The approaches we've settled on: JSON mode (OpenAI, Anthropic support forced JSON output in different ways), function calling / tool use (more reliable than plain JSON mode for complex schemas), and schema validation on the output with retry on failure.
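The validate-and-retry loop is worth showing, because the useful trick is feeding the validation error back into the next attempt so the model can self-correct. `call_model` and `validate` are injected placeholders, not a specific SDK.

```python
import json

def generate_validated(prompt: str, call_model, validate, max_attempts: int = 3):
    """Call the model, parse and validate the output, and on failure retry
    with the validation error appended to the prompt. `validate(payload)`
    returns '' when valid, else a human-readable error message."""
    last_error = ""
    for _ in range(max_attempts):
        full_prompt = prompt if not last_error else (
            f"{prompt}\nYour previous attempt was invalid: {last_error}. Fix it."
        )
        try:
            payload = json.loads(call_model(full_prompt))
        except json.JSONDecodeError as exc:
            last_error = f"output was not valid JSON: {exc}"
            continue
        last_error = validate(payload)
        if not last_error:
            return payload
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")
```

Raising after the final attempt, rather than returning a best-effort payload, is deliberate: a job that fails loudly goes to the review queue instead of a product page.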
Few-shot examples in the prompt are more reliable than elaborate system prompts for consistency. Include three to five examples that cover the edge cases you care about — short descriptions, long descriptions, products with unusual attributes, products with missing data. The model learns the pattern from examples better than from instructions. When you're dealing with a schema that has optional fields, show examples where those fields are null, not just populated.
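Prompt assembly with few-shot examples, including a deliberate null-field example, might look like this. The product fields and examples are purely illustrative.

```python
import json

# Examples chosen to cover edge cases: a fully populated product, and a
# product with a missing attribute so the model sees null, not omission.
FEW_SHOT = [
    ({"name": "Steel Pan 28cm", "material": "stainless steel"},
     {"title": "28 cm Stainless Steel Pan", "material": "stainless steel"}),
    ({"name": "Mystery Gadget"},
     {"title": "Mystery Gadget", "material": None}),
]

def build_prompt(product: dict) -> str:
    """Assemble instructions, few-shot examples, and the target product."""
    parts = [
        "Convert product data to JSON with keys: title, material.",
        "Use null for any attribute absent from the input.",
    ]
    for src, out in FEW_SHOT:
        parts.append(f"Input: {json.dumps(src)}\nOutput: {json.dumps(out)}")
    parts.append(f"Input: {json.dumps(product)}\nOutput:")
    return "\n\n".join(parts)
```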
Model updates breaking prompts is a real operational concern. OpenAI and Anthropic both ship model updates that can subtly change output behavior, formatting preferences, and how strictly the model follows instructions. We version our prompts in the same way we version code, keep a regression test suite of (input, expected output format) pairs, and run it against each model version before promoting to production. This sounds like overhead until a model update starts inserting markdown formatting into your JSON strings.
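A regression suite for prompt/model pairs can be as simple as a list of (input, required output shape) cases run against the candidate model version. The case data here is hypothetical; the check is format-level (parseable JSON, required keys present) rather than exact-match, since wording legitimately varies between runs.

```python
import json

# (input product data, required output keys) pairs, versioned alongside
# the prompt they exercise. Entries here are illustrative.
REGRESSION_CASES = [
    ({"name": "Desk Lamp", "brand": "Lumo"}, {"title", "description"}),
    ({"name": "USB-C Cable"}, {"title", "description"}),
]

def run_prompt_regression(call_model) -> list[str]:
    """Run every case against a candidate model version; return a list of
    failure descriptions (empty list means safe to promote)."""
    failures = []
    for source, required_keys in REGRESSION_CASES:
        raw = call_model(json.dumps(source))
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            failures.append(f"{source['name']}: output is not JSON")
            continue
        missing = required_keys - payload.keys()
        if missing:
            failures.append(f"{source['name']}: missing keys {sorted(missing)}")
    return failures
```

A model that starts wrapping its JSON in markdown fences, the exact failure mentioned above, fails the "not JSON" check on every case, which is precisely the signal you want before promotion.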
Failure modes specific to commerce
Hallucinated product specifications are the most dangerous failure mode. An LLM generating a product description for a 32GB memory card may confidently write "64GB capacity" if the training data contains many 64GB cards in similar categories. At small scale, a human catches this. At 50,000 SKUs per month, you need automated validation: extract key numeric specs from the generated content and cross-check against the source data. Any discrepancy triggers a human review flag. This is not optional if your output reaches a product page.
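The automated cross-check can be a straightforward extraction-and-diff: pull number-plus-unit tokens from both the source data and the generated copy, and flag anything the model asserted that the source doesn't contain. The unit list here is a small illustrative sample; a real deployment would use a vocabulary tuned to the catalog.

```python
import re

# Illustrative unit vocabulary; extend per category (storage, dimensions, power).
SPEC_RE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:gb|tb|mm|cm|kg|g|w|mah)\b", re.I)

def numeric_specs(text: str) -> set[str]:
    """Extract normalized number+unit tokens like '32gb' or '4.5kg'."""
    return {m.group(0).replace(" ", "").lower() for m in SPEC_RE.finditer(text)}

def spec_mismatches(source_text: str, generated_text: str) -> set[str]:
    """Specs asserted in the generated copy that do not appear in the source
    data. Each one is a potential hallucination; any non-empty result should
    flag the SKU for human review."""
    return numeric_specs(generated_text) - numeric_specs(source_text)
```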
Inconsistent attribute values across the catalog are subtler. If you generate descriptions in batches over time, and the model changes behavior between batches (due to model updates, prompt drift, or different temperature settings), you end up with attributes that use different terminology for the same concept: "machine washable" in one batch, "safe for machine washing" in another, "can be machine washed" in a third. This breaks faceted filtering and makes your catalog data dirty in ways that are hard to detect. Enforce a controlled vocabulary for structured attributes — generate them as enum values, not free text.
Wrong pricing information in generated content is more common than you'd expect. If pricing context leaks into a product description prompt (say, a "currently on sale for $X" snippet in the product data), the model may incorporate that price into the description text. Then the sale ends, the description isn't regenerated, and now you have static copy with a stale price. Strip pricing information from any content that the model sees. Descriptions should never mention specific prices — that's what structured price fields and templating are for.
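Stripping pricing before the prompt is built is a small, boring preprocessing step worth doing explicitly. A sketch, with an assumed set of price-bearing field names and a deliberately simple redaction pattern that a real catalog would need to extend:

```python
import re

# Matches '$9.99', '€ 1,299', '19.99 USD' and similar; extend per locale.
PRICE_PATTERN = re.compile(
    r"[$€£]\s?\d[\d,]*(?:\.\d+)?|\b\d[\d,]*(?:\.\d+)?\s?(?:USD|EUR|GBP)\b"
)
PRICE_KEYS = {"price", "sale_price", "cost_price", "discount"}

def strip_pricing(product: dict) -> dict:
    """Drop price-bearing fields and redact inline price mentions before
    the product data reaches the description prompt."""
    cleaned = {}
    for key, value in product.items():
        if key in PRICE_KEYS:
            continue
        if isinstance(value, str):
            value = PRICE_PATTERN.sub("[price removed]", value)
        cleaned[key] = value
    return cleaned
```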
Choosing a model: the practical version
For most commerce use cases in production today, the decision comes down to OpenAI (GPT-4o and the newer o-series), Anthropic (Claude 3.5/3.7 Sonnet, Claude Opus), Google Vertex AI (Gemini 1.5/2.0 Pro), or self-hosted open-source models. The context window matters for catalog enrichment: you may need to pass long product data or multiple policy documents. GPT-4o and Claude Sonnet both handle 128k tokens well; Gemini 1.5 Pro goes to 1M tokens, which is useful for bulk document processing.
Cost at scale is the number most teams underestimate. A prompt + completion averaging 1,500 tokens, at GPT-4o pricing, running against 100,000 SKUs monthly, costs roughly $450/month at current rates. That's fine. The same workload through Claude Opus is 3-4x more expensive but produces noticeably better output on complex category descriptions. Llama 3.1 70B self-hosted on a GPU instance runs at about $0.40/hour on a spot instance — attractive for high-volume, lower-complexity tasks like classification, where the output quality difference versus frontier models is small.
Data privacy is a genuine constraint for some commerce stacks. If your catalog contains supplier pricing (cost price, not retail), contractual terms, or customer PII anywhere in the product data, you cannot send it to a third-party API without reviewing your data processing agreements. Open-source models self-hosted on your own infrastructure sidestep this entirely. For most pure content generation use cases, the data is not sensitive and this isn't a blocker — but it's worth confirming before signing the model provider contract.
Next step
Working on a complex commerce system?
We help engineering teams design, build, and scale high-load platforms — with a clear process and predictable delivery.
Let's talk