FlowCore API.
A 50-person ops team was losing hours every week to a broken webhook mesh. Silent failures, duplicate processing, and zero visibility. We replaced it with an idempotent handler layer, a replay system, and an observability dashboard that made on-call bearable.
- Client: FlowCore (B2B SaaS)
- Year: 2025
- Timeline: 8 weeks
- Role: Backend & Integration
- Stack: Node · Postgres · Redis · Datadog
A Webhook Mesh Nobody Trusted.
FlowCore's ops platform processed thousands of webhooks per day from Stripe, HubSpot, Slack, and GitHub. Over two years, the integration layer had grown organically — different developers, different patterns, no shared schema. The result was a system that worked most of the time, and failed silently the rest.
The engineering team had lost trust in their own pipeline. On-call meant manually replaying events from logs. Billing discrepancies were traced to duplicate processing. The fix couldn't be a patch; it needed to be a rebuild.
Make the Invisible Visible First.
Before writing a single line of new handler code, I spent two weeks mapping what was actually happening. Distributed tracing, log aggregation, and a forensic audit of the past 90 days of events revealed three root causes: no idempotency keys, no dead-letter queues, and no shared retry policy.
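To give a sense of what the duplicate-processing part of that audit looks like, here is a minimal sketch. It assumes the delivery log can be exported as rows carrying a provider name and the provider's own event ID; the `DeliveryRecord` shape and the `findDuplicateProcessing` name are hypothetical and not taken from the FlowCore codebase.

```typescript
// Hypothetical shape of one row in an exported delivery log.
interface DeliveryRecord {
  provider: string; // e.g. "stripe"
  eventId: string; // the provider's own event identifier
  processedAt: string; // ISO 8601 timestamp
}

// Count how often each provider + eventId pair was processed and
// return only the pairs that were processed more than once.
function findDuplicateProcessing(records: DeliveryRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const record of records) {
    const key = `${record.provider}:${record.eventId}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const duplicates = new Map<string, number>();
  counts.forEach((count, key) => {
    if (count > 1) duplicates.set(key, count);
  });
  return duplicates;
}
```

Any key this returns is an event that some handler ran more than once, which is exactly the case an idempotency key is meant to turn into a no-op.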
- Week 1–2: Observability layer. Structured logging to Datadog, distributed traces across the entire event lifecycle, a dashboard the team could actually read during an incident.
- Week 3–4: Idempotency layer. Postgres-backed idempotency keys for all handlers. Duplicate events now no-op cleanly with a full audit trail. Billing discrepancies: zero since deploy.
- Week 5–6: Replay system. Dead-letter queue with an admin UI for safe event replay. Automatic backoff, configurable retry windows, Slack alerting on threshold breaches.
- Week 7–8: Migration & documentation. Old handlers migrated one-by-one with zero downtime. Runbook, incident playbook, and onboarding docs so the next developer could reason about it.
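The idempotency and replay pieces fit together roughly as sketched below. This is an in-memory illustration under stated assumptions, not the production code: the `Set` stands in for a Postgres table with a unique constraint on the key, the array stands in for the dead-letter queue, and every name here (`WebhookProcessor`, `RetryPolicy`, `backoffDelayMs`) is hypothetical.

```typescript
// Hypothetical types; none of these names come from FlowCore's codebase.
type WebhookEvent = { provider: string; eventId: string; payload: unknown };
type Handler = (event: WebhookEvent) => Promise<void>;
type Outcome = "processed" | "duplicate" | "dead-lettered";

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

// Exponential backoff: baseDelayMs * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelayMs(attempt: number, policy: RetryPolicy): number {
  return Math.min(policy.baseDelayMs * 2 ** (attempt - 1), policy.maxDelayMs);
}

class WebhookProcessor {
  // Stand-in for a Postgres table with a unique constraint on the key.
  private readonly claimedKeys = new Set<string>();
  // Stand-in for a dead-letter queue; entries wait here until replayed.
  readonly deadLetters: { event: WebhookEvent; error: string }[] = [];
  private readonly handler: Handler;
  private readonly policy: RetryPolicy;
  private readonly sleep: (ms: number) => Promise<void>;

  constructor(
    handler: Handler,
    policy: RetryPolicy,
    sleep: (ms: number) => Promise<void> = (ms) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms)),
  ) {
    this.handler = handler;
    this.policy = policy;
    this.sleep = sleep;
  }

  async process(event: WebhookEvent): Promise<Outcome> {
    const key = `${event.provider}:${event.eventId}`;
    // A key that is already claimed means the event was seen before: no-op.
    if (this.claimedKeys.has(key)) return "duplicate";
    this.claimedKeys.add(key);

    let lastError = "";
    for (let attempt = 1; attempt <= this.policy.maxAttempts; attempt++) {
      try {
        await this.handler(event);
        return "processed";
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
        // Wait before the next attempt, but not after the final one.
        if (attempt < this.policy.maxAttempts) {
          await this.sleep(backoffDelayMs(attempt, this.policy));
        }
      }
    }

    // Retries exhausted: release the key so a later replay can run,
    // and park the event instead of dropping it silently.
    this.claimedKeys.delete(key);
    this.deadLetters.push({ event, error: lastError });
    return "dead-lettered";
  }

  // Drain the dead-letter queue and push each event through process() again.
  async replayDeadLetters(): Promise<Outcome[]> {
    const pending = this.deadLetters.splice(0);
    const outcomes: Outcome[] = [];
    for (const { event } of pending) {
      outcomes.push(await this.process(event));
    }
    return outcomes;
  }
}
```

Claiming the key before the handler runs, and releasing it only when the event is dead-lettered, keeps a concurrent duplicate from slipping through while still letting a replay succeed later. In a Postgres-backed version the claim would typically be a single insert that relies on the unique constraint rather than a separate check and write.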
On-Call Is Boring Now. Good.
Three months post-launch, the team reported zero duplicate-processing incidents. Delivery reliability went from “probably fine” to a measurable 99.97%. The on-call rotation no longer requires anyone to know the event system intimately — the dashboard tells the story.
99.97% reliability (up from ~94%)
events post-launch
migration, zero downtime
shift, per engineer
“Before this, our on-call engineers were basically manually replaying events from raw logs. Now the dashboard just tells us what happened.”
Dev T., Engineering Lead, FlowCore