ERP Integration Patterns That Don't Break at Scale
ERP integrations look simple on a whiteboard: system A talks to system B. The problems start when one side is slow, down, or getting hammered by traffic. We've watched a 10-minute ERP maintenance window take down order processing for two hours. Here's what actually goes wrong, and how to build something that holds up.
The core choice is synchronous versus event-driven integration: the former creates tight coupling and shared failure modes; the latter gives you isolation and recovery options.
Why synchronous integrations break
Synchronous integration is easy to reason about: something changes in system A, you call system B. No broker, no queue, no extra moving parts. We get why teams build it this way.
The catch: the two systems become tightly coupled. If the ERP slows down during a traffic spike, your store either waits (requests back up) or times out (errors reach users). If the store's order API has a hiccup, the ERP can't process. Both sides fail together.
At low order volumes this is often fine. Push it harder — flash sale, Black Friday, a bulk catalog import running at the same time as real traffic — and the failure modes stack. We've seen a 10-minute ERP maintenance window produce 400 failed order records and a two-hour manual reconciliation effort. That's not a fluke; it's what synchronous coupling does under pressure.
The message queue pattern
Put a message broker between them — RabbitMQ and Kafka are the two we reach for most often in commerce stacks, depending on throughput needs and what the team already knows.
The flow: order placed in the store → published as a message → ERP consumer reads and processes at its own pace. If the ERP goes down, messages queue up and drain when it comes back. No order is lost, no error surfaces to the user. The store doesn't even know the ERP was unavailable.
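The decoupling in that flow can be sketched with a few lines of Python. This is a minimal illustration using an in-memory queue as a stand-in for the broker; a real deployment would use RabbitMQ or Kafka, and the event name "orders.created" is an assumed convention, not a standard.

```python
import json
import queue

# In-memory stand-in for the broker (RabbitMQ/Kafka in production).
broker = queue.Queue()

def publish_order(order: dict) -> None:
    """Store side: publish and return immediately; no ERP call in the request path."""
    broker.put(json.dumps({"event": "orders.created", "payload": order}))

def drain_to_erp(process) -> int:
    """ERP consumer: processes at its own pace; a backlog simply drains later."""
    handled = 0
    while not broker.empty():
        message = json.loads(broker.get())
        process(message["payload"])
        handled += 1
    return handled

# Orders placed while the "ERP" is down just accumulate in the queue.
publish_order({"id": "A-1001", "total": 49.90})
publish_order({"id": "A-1002", "total": 12.00})

received = []
drain_to_erp(received.append)  # ERP back up: the backlog drains, nothing lost
```

The point of the sketch is the shape, not the library: the store's only dependency is the broker, and the consumer's availability never shows up in the user-facing request path.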
Same thing in reverse for inventory: ERP pushes a stock-change event, the store consumes it asynchronously. A few seconds of delay on a stock update is fine in most cases. A full sync failure that leaves you overselling — that's not. The queue gives you the buffer to absorb spikes and outages without data loss.
Key design decisions
Idempotency
Message queues give you at-least-once delivery in practice. That means a message can come through more than once in a failure or retry scenario. Your consumers need to handle that gracefully — processing the same message twice has to produce the same outcome as processing it once.
For order creation: before inserting, check whether an order with that external ID already exists. For inventory: set the absolute value (stock = N), not a delta (stock += N). Deltas look fine until a retry turns one update into two, and now your stock count is wrong.
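Both rules fit in a few lines. A minimal sketch, using in-memory dicts as the data store (a real consumer would hit a database, ideally with a unique constraint on the external ID):

```python
# In-memory stand-ins for the store's tables.
orders: dict[str, dict] = {}
stock: dict[str, int] = {}

def handle_order_created(msg: dict) -> None:
    # Check-then-insert on the external ID: a redelivered message is a no-op.
    if msg["external_id"] in orders:
        return
    orders[msg["external_id"]] = {"total": msg["total"]}

def handle_stock_changed(msg: dict) -> None:
    # Absolute value, not a delta: replaying the message can't drift the count.
    stock[msg["sku"]] = msg["quantity"]

# At-least-once delivery in action: the same messages arrive twice.
for _ in range(2):
    handle_order_created({"external_id": "A-1001", "total": 49.90})
    handle_stock_changed({"sku": "SKU-7", "quantity": 14})
```

After the double delivery there is still exactly one order and the stock count is 14, not 28 — which is the whole idempotency contract.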
Dead letter queues
Some messages won't process — bad data from the ERP, a downstream timeout, a schema mismatch you didn't account for. Configure a DLQ so failed messages land somewhere visible after N retries instead of just disappearing. Then actually monitor it. A DLQ sitting at zero is a sign things are working. A DLQ growing at 3am is an incident in slow motion.
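The retry-then-DLQ policy looks roughly like this, sketched broker-agnostically. In RabbitMQ you would normally wire this up with queue arguments (a dead-letter exchange) rather than in application code; the retry limit of 3 here is an assumption to tune.

```python
MAX_RETRIES = 3
dead_letters: list[dict] = []  # stand-in for the actual DLQ

def consume(message: dict, process) -> None:
    try:
        process(message)
    except Exception as exc:
        message["retries"] = message.get("retries", 0) + 1
        if message["retries"] >= MAX_RETRIES:
            # Park it somewhere visible instead of retrying forever.
            dead_letters.append({"message": message, "error": str(exc)})
        else:
            consume(message, process)  # requeue; real setups add backoff

def always_fails(msg: dict) -> None:
    raise ValueError("schema mismatch")

consume({"event": "orders.created", "payload": {}}, always_fails)
```

After three failed attempts the message lands in `dead_letters` with the error attached — exactly the artifact your 3am monitoring should be watching.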
Schema versioning
ERPs get updated. The shape of an order object changes. If your consumer hardcodes the old field names, it breaks quietly or loudly — either way, you've got a problem. Include a version field in your messages. Add new fields as optional. Never remove required fields without a migration window where both versions are supported. This is less exciting than the queue architecture, but it's where integrations actually fall apart in year two.
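A version-tolerant consumer can be as simple as the sketch below. Field names and the v1/v2 split are illustrative; the pattern is the explicit version check, required fields accessed directly, and newer fields read with a default so old producers keep working.

```python
def parse_order(message: dict) -> dict:
    version = message.get("version", 1)
    if version not in (1, 2):
        raise ValueError(f"unsupported schema version: {version}")
    order = {
        "external_id": message["external_id"],  # required in both versions
        "total": message["total"],              # required in both versions
    }
    # Hypothetical v2 addition, read as optional: v1 messages still parse.
    order["currency"] = message.get("currency", "USD")
    return order

old = parse_order({"version": 1, "external_id": "A-1", "total": 10})
new = parse_order({"version": 2, "external_id": "A-2", "total": 10, "currency": "EUR"})
```

Rejecting unknown versions loudly (rather than guessing) is deliberate: a message from a future schema goes to the DLQ where someone can look at it, instead of being half-processed.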
A healthy integration: queue depth stays near zero during normal operations, DLQ count stays at 0. Monitoring both in the same dashboard simplifies incident detection.
What to monitor
You can't fix what you can't see. At minimum, track:
- Queue depth over time per topic (a growing queue indicates consumer lag)
- Consumer lag (difference between latest message and last processed message)
- DLQ message count and growth rate
- Processing time per message type (detect slow consumers before they cause backlogs)
- Error rate and error classification (transient vs permanent failures)
Hook this into your alerting. Queue depth climbing steadily for 15 minutes should wake someone up. New messages appearing in the DLQ should create a ticket automatically — don't rely on someone remembering to check a dashboard.
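Those two alert rules reduce to small predicates. A sketch, assuming queue depth is sampled once a minute and the 15-sample window and strict-growth criterion are thresholds you would tune for your own traffic:

```python
def should_page(depth_samples: list[int], window: int = 15) -> bool:
    """Page when queue depth has climbed across every sample in the window."""
    if len(depth_samples) < window:
        return False
    recent = depth_samples[-window:]
    return all(b > a for a, b in zip(recent, recent[1:]))

def should_ticket(dlq_count: int, previous_dlq_count: int) -> bool:
    """Any new DLQ message opens a ticket automatically; no human polling."""
    return dlq_count > previous_dlq_count
```

A steadily climbing queue pages; a flat or oscillating one does not, which filters out the normal burst-and-drain pattern of a healthy consumer.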
When to use webhooks instead
Queues add infrastructure to own and operate. For lower-volume integrations where both systems are reliably up, a webhook-based approach with retry logic and a nightly reconciliation job (comparing order states across systems) can be simpler and genuinely good enough. We've run that setup successfully for stores doing a few hundred orders a day.
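The reconciliation job at the heart of that setup is a straightforward diff. A sketch, assuming each system can export a mapping of external order ID to status (the status vocabulary here is made up for illustration):

```python
def reconcile(store_orders: dict[str, str], erp_orders: dict[str, str]) -> list[tuple]:
    """Compare order status across systems; return anything that disagrees."""
    issues = []
    for order_id, status in store_orders.items():
        if order_id not in erp_orders:
            issues.append((order_id, "missing_in_erp", status, None))
        elif erp_orders[order_id] != status:
            issues.append((order_id, "status_mismatch", status, erp_orders[order_id]))
    # Orders the ERP knows about but the store doesn't are also a problem.
    for order_id in erp_orders.keys() - store_orders.keys():
        issues.append((order_id, "missing_in_store", None, erp_orders[order_id]))
    return issues

issues = reconcile(
    {"A-1": "shipped", "A-2": "paid"},
    {"A-1": "shipped", "A-3": "paid"},
)
```

Run nightly, an empty result is your proof the webhooks kept up; a non-empty one is a bounded list of orders to fix by hand, which at a few hundred orders a day stays manageable.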
Our honest take: if you're processing more than ~5,000 events per hour, or if either system has regular maintenance windows or uptime below 99.9%, go with a queue. Below that, it depends on your team's operational comfort level and how much pain you're willing to take in the edge cases. There's no clean universal answer here.
Next step
Working on a complex commerce system?
We help engineering teams design, build, and scale high-load platforms — with a clear process and predictable delivery.
Let's talk