03 / 12 — API & Backend

FlowCore
API.

A 50-person ops team was losing hours every week to a broken webhook mesh. Silent failures, duplicate processing, and zero visibility. We replaced it with an idempotent handler layer, a replay system, and an observability dashboard that made on-call bearable.

Client: FlowCore (B2B SaaS)
Year: 2025
Timeline: 8 weeks
Role: Backend & Integration
Stack: Node · Postgres · Redis · Datadog

01 — Context

A Webhook Mesh
Nobody Trusted.

FlowCore's ops platform processed thousands of webhooks per day from Stripe, HubSpot, Slack, and GitHub. Over two years, the integration layer had grown organically — different developers, different patterns, no shared schema. The result was a system that worked most of the time, and failed silently the rest.

The engineering team had lost trust in their own pipeline. On-call meant manually replaying events from logs. Billing discrepancies were traced to duplicate processing. The fix couldn't be a patch — it needed to be a rebuild.

02 — Approach

Make the Invisible
Visible First.

Before writing a single line of new handler code, I spent two weeks mapping what was actually happening. Distributed tracing, log aggregation, and a forensic audit of the past 90 days of events revealed three root causes: no idempotency keys, no dead-letter queues, and no shared retry policy.
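Of those three root causes, the missing shared retry policy is the easiest to picture in code. What follows is a minimal sketch, not the production implementation: the names (`RetryPolicy`, `sharedPolicy`, `backoffDelay`, `afterFailure`) and every number in it are illustrative assumptions. It shows one policy object that every handler consults, with capped exponential backoff and a hand-off to a dead-letter queue once attempts run out.

```typescript
// Hypothetical shared retry policy. All names and values are assumptions
// chosen for illustration, not taken from the FlowCore codebase.
interface RetryPolicy {
  maxAttempts: number; // total delivery attempts before giving up
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

type Decision =
  | { kind: "retry"; delayMs: number }
  | { kind: "dead-letter" };

// One policy for every integration instead of per-handler ad hoc retries.
const sharedPolicy: RetryPolicy = {
  maxAttempts: 5,
  baseDelayMs: 1000,
  maxDelayMs: 60000,
};

// Exponential backoff: base * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelay(policy: RetryPolicy, attempt: number): number {
  const uncapped = policy.baseDelayMs * 2 ** (attempt - 1);
  return Math.min(uncapped, policy.maxDelayMs);
}

// Decide what happens after the given (1-based) attempt has failed:
// schedule another try, or park the event in the dead-letter queue.
function afterFailure(policy: RetryPolicy, failedAttempt: number): Decision {
  if (failedAttempt >= policy.maxAttempts) {
    return { kind: "dead-letter" };
  }
  return { kind: "retry", delayMs: backoffDelay(policy, failedAttempt) };
}
```

Keeping the decision in a pure function means each handler only reports "attempt N failed" and gets back an instruction, so retry behaviour can be tuned in one place.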

Week 1–2 — Observability layer
Structured logging to Datadog, distributed traces across the entire event lifecycle, a dashboard the team could actually read during an incident.

Week 3–4 — Idempotency layer
Postgres-backed idempotency keys for all handlers. Duplicate events now no-op cleanly with full audit trail. Billing discrepancies: zero since deploy.

Week 5–6 — Replay system
Dead-letter queue with an admin UI for safe event replay. Automatic backoff, configurable retry windows, Slack alerting on threshold breaches.

Week 7–8 — Migration & documentation
Old handlers migrated one-by-one with zero downtime. Runbook, incident playbook, and onboarding docs so the next developer could reason about it.
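The idempotency layer from weeks 3–4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code that shipped: an in-memory object stands in for the Postgres table, and `IdempotencyStore`, `handleOnce`, `auditLog`, and the `provider:eventId` key format are all hypothetical names. In Postgres, the `claim` step could be an `INSERT ... ON CONFLICT DO NOTHING` against a unique key column.

```typescript
// Hypothetical sketch of an idempotent webhook handler wrapper.
// An in-memory object stands in for the Postgres-backed key table.
interface WebhookEvent {
  provider: string; // e.g. "stripe"
  eventId: string;  // ID assigned by the provider
  payload: unknown;
}

type Outcome = "processed" | "duplicate";

class IdempotencyStore {
  private claimed: { [key: string]: boolean } = {};

  // Returns true only for the first claim of a key.
  claim(key: string): boolean {
    if (this.claimed[key] === true) return false;
    this.claimed[key] = true;
    return true;
  }

  // Give the key back so a failed event can be retried or replayed.
  release(key: string): void {
    delete this.claimed[key];
  }
}

// Every processed or skipped event leaves a record for later review.
const auditLog: { key: string; outcome: Outcome }[] = [];

function handleOnce(
  store: IdempotencyStore,
  event: WebhookEvent,
  handler: (event: WebhookEvent) => void
): Outcome {
  const key = `${event.provider}:${event.eventId}`;

  // Second and later deliveries of the same event no-op.
  if (!store.claim(key)) {
    auditLog.push({ key, outcome: "duplicate" });
    return "duplicate";
  }

  try {
    handler(event);
  } catch (err) {
    // Handler failed: release the key and let the caller retry.
    store.release(key);
    throw err;
  }

  auditLog.push({ key, outcome: "processed" });
  return "processed";
}
```

Claiming the key before running the handler, and releasing it on failure, is one of several reasonable designs. With a real database, the claim and the handler's side effects would ideally share a transaction so a crash cannot leave a key claimed with no work done.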

03 — Results

On-Call Is Boring
Now. Good.

Three months post-launch, the team reported zero duplicate-processing incidents. Delivery reliability went from “probably fine” to a measurable 99.97%. The on-call rotation no longer requires anyone to know the event system intimately — the dashboard tells the story.

99.97%

Webhook delivery
reliability (up from ~94%)

0

Duplicate billing
events post-launch

8 wks

Audit to full
migration, zero downtime

4 hr

Saved per on-call
shift, per engineer

“Before this, our on-call engineers were basically manually replaying events from raw logs. Now the dashboard just tells us what happened.”
Dev T. — Engineering Lead, FlowCore
