03 / 12 — API & Backend

FlowCore
API.

A 50-person ops team was losing hours every week to a broken webhook mesh. Silent failures, duplicate processing, and zero visibility. We replaced it with an idempotent handler layer, a replay system, and an observability dashboard that made on-call bearable.

Client: FlowCore (B2B SaaS)
Year: 2025
Timeline: 8 weeks
Role: Backend & Integration
Stack: Node · Postgres · Redis · Datadog
01 — Context

A Webhook Mesh
Nobody Trusted.

FlowCore's ops platform processed thousands of webhooks per day from Stripe, HubSpot, Slack, and GitHub. Over two years, the integration layer had grown organically — different developers, different patterns, no shared schema. The result was a system that worked most of the time, and failed silently the rest.

The engineering team had lost trust in their own pipeline. On-call meant manually replaying events from logs. Billing discrepancies were traced to duplicate processing. The fix couldn't be a patch; it needed to be a rebuild.

02 — Approach

Make the Invisible
Visible First.

Before writing a single line of new handler code, I spent two weeks mapping what was actually happening. Distributed tracing, log aggregation, and a forensic audit of the past 90 days of events revealed three root causes: no idempotency keys, no dead-letter queues, and no shared retry policy.
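To make "shared retry policy" concrete, here is a minimal sketch of what one policy object used by every handler might look like. The names (`RetryPolicy`, `sharedPolicy`, `backoffDelayMs`, `nextStep`) and the specific numbers are illustrative assumptions, not FlowCore's actual configuration.

```typescript
// Hypothetical shared retry policy; every name and number here is illustrative.
interface RetryPolicy {
  maxAttempts: number; // total delivery attempts before dead-lettering
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

const sharedPolicy: RetryPolicy = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 60000,
};

// Exponential backoff: the delay doubles with each failed attempt,
// capped at maxDelayMs. `attempt` is 1 for the first failed attempt.
function backoffDelayMs(policy: RetryPolicy, attempt: number): number {
  const delay = policy.baseDelayMs * 2 ** (attempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}

type NextStep = { kind: "retry"; delayMs: number } | { kind: "dead-letter" };

// Decide what happens after the given attempt has failed: schedule another
// retry, or give up and hand the event to a dead-letter queue.
function nextStep(policy: RetryPolicy, failedAttempt: number): NextStep {
  if (failedAttempt >= policy.maxAttempts) {
    return { kind: "dead-letter" };
  }
  return { kind: "retry", delayMs: backoffDelayMs(policy, failedAttempt) };
}
```

The point of a sketch like this is that the retry decision lives in one place, so no handler can quietly invent its own rules or drop an event without a record.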

  • Week 1–2, Observability layer: Structured logging to Datadog, distributed traces across the entire event lifecycle, and a dashboard the team could actually read during an incident.
  • Week 3–4, Idempotency layer: Postgres-backed idempotency keys for all handlers. Duplicate events now no-op cleanly with a full audit trail. Billing discrepancies: zero since deploy.
  • Week 5–6, Replay system: Dead-letter queue with an admin UI for safe event replay. Automatic backoff, configurable retry windows, and Slack alerting on threshold breaches.
  • Week 7–8, Migration & documentation: Old handlers migrated one by one with zero downtime. Runbook, incident playbook, and onboarding docs so the next developer could reason about the system.
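The idempotency and replay steps can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the production code: the `Set` stands in for the Postgres-backed key store, the array stands in for the dead-letter queue, handlers are synchronous for brevity, and every name (`WebhookProcessor`, `WebhookEvent`, `replayDeadLetters`) is hypothetical.

```typescript
// In-memory stand-ins only; all names below are hypothetical.
interface WebhookEvent {
  id: string;      // provider-supplied event ID, used in the idempotency key
  source: string;  // e.g. "stripe"
  payload: unknown;
}

type Outcome = "processed" | "duplicate" | "dead-lettered";

class WebhookProcessor {
  // Stand-in for a key table. A database-backed version would likely rely on
  // a unique constraint so that the check and the insert are atomic (assumption).
  private seen = new Set<string>();
  // Stand-in for a dead-letter queue.
  private deadLetters: WebhookEvent[] = [];
  // Every outcome is recorded, including duplicates.
  auditLog: string[] = [];
  private handler: (event: WebhookEvent) => void;

  constructor(handler: (event: WebhookEvent) => void) {
    this.handler = handler;
  }

  process(event: WebhookEvent): Outcome {
    const key = `${event.source}:${event.id}`;
    if (this.seen.has(key)) {
      this.auditLog.push(`duplicate ${key}`);
      return "duplicate"; // no-op: the handler is not called again
    }
    try {
      this.handler(event);
      this.seen.add(key); // record the key only after the handler succeeds
      this.auditLog.push(`processed ${key}`);
      return "processed";
    } catch {
      this.deadLetters.push(event); // keep the failed event for later replay
      this.auditLog.push(`dead-lettered ${key}`);
      return "dead-lettered";
    }
  }

  // Replay everything currently dead-lettered. Events that fail again are
  // re-queued by process(); events that succeed are recorded as processed.
  replayDeadLetters(): Outcome[] {
    const pending = this.deadLetters;
    this.deadLetters = [];
    return pending.map((event) => this.process(event));
  }
}
```

Because replay goes through the same `process` path as live traffic, replaying an event that already succeeded is a no-op rather than a second charge, which is what makes replay safe to hand to whoever is on call.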
03 — Results

On-Call Is Boring
Now. Good.

Three months post-launch, the team reported zero duplicate-processing incidents. Delivery reliability went from “probably fine” to a measurable 99.97%. The on-call rotation no longer requires anyone to know the event system intimately — the dashboard tells the story.

99.97%: Webhook delivery reliability (up from ~94%)
0: Duplicate billing events post-launch
8 wks: Audit to full migration, zero downtime
4 hr: Saved per on-call shift, per engineer

“Before this, our on-call engineers were basically manually replaying events from raw logs. Now the dashboard just tells us what happened.”
Dev T., Engineering Lead, FlowCore