Scaling a Public API and Webhooks Under Bursty Traffic

September 3, 2025

When I say “public API,” I do not mean an internal convenience wrapper that only one team uses. I mean the kind of API that other companies build real workflows on top of. At that point, an API is not just an endpoint list. It is a contract. And once scale arrives, that contract is tested by weird traffic shapes, not only by steady growth.

In our case, normal days were fine. Spiky days were not.

The most painful incidents came from burst patterns:

  • partners syncing large datasets in a narrow window
  • retry storms after transient upstream failures
  • webhook consumers going down at exactly the wrong moment

Those events exposed patterns that looked acceptable at low pressure but collapsed under load: repeated expensive queries, inconsistent endpoint behavior, and webhook delivery that felt more “best effort” than “boringly reliable.”

Versioning is not documentation; it is risk control

The first principle was non-negotiable: versioning had to be explicit and enforced.

For public APIs, versioning is not a docs concern. It is how you evolve behavior without breaking your customers in production. The model I used was:

  1. Backward compatibility by default
  2. Additive changes whenever possible
  3. Deliberate version boundaries for breaking changes
  4. Clear migration guidance before behavior changes land

This keeps the contract stable while still allowing product evolution.
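As a minimal sketch of what "explicit and enforced" can look like at the edge: the client names a version, the server rejects anything it does not recognize, and the default stays on the backward-compatible side. The header name and dated version strings here are illustrative, not the actual scheme from this system.

```python
# Sketch: explicit version enforcement at the routing layer.
# Version names and the header key are illustrative assumptions.

SUPPORTED_VERSIONS = {"2024-01", "2025-06"}  # hypothetical dated versions
DEFAULT_VERSION = "2024-01"                  # oldest supported: safe, backward-compatible default

def resolve_version(headers: dict) -> str:
    """Pick an API version from an explicit header, rejecting unknown ones
    instead of silently falling through to different behavior."""
    requested = headers.get("X-API-Version", DEFAULT_VERSION)
    if requested not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {requested}")
    return requested
```

Rejecting unknown versions loudly is the enforcement half: a typo'd version should fail a partner's integration test, not quietly serve current behavior.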

Make the API predictable before making it clever

Under sustained scale, complexity turns into latency and incidents. So instead of adding more one-off logic, I pushed the opposite:

  • standardized pagination and filtering
  • consistent error shapes and status semantics
  • fewer special branches per integration path

This sounds basic, but it reduced unknown edge behavior. Predictability improved both reliability and support response times, because teams could reason about failures with fewer exceptions and fewer “this endpoint is different” caveats.
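The "consistent error shapes" and "standardized pagination" points can be made concrete with a sketch like the one below. The field names (`code`, `request_id`, `next_cursor`) are my illustrative choices, not the actual contract from this API.

```python
# Sketch: one error envelope and one pagination shape shared by every
# endpoint, so clients never parse endpoint-specific variants.

def error_body(status: int, code: str, message: str, request_id: str) -> dict:
    """A single error shape for all endpoints."""
    return {
        "error": {
            "status": status,         # HTTP status, duplicated for log pipelines
            "code": code,             # stable, machine-readable identifier
            "message": message,       # human-readable; clients should not parse it
            "request_id": request_id, # correlation id for support workflows
        }
    }

def paginate(items: list, cursor: int = 0, limit: int = 50) -> dict:
    """Cursor-style pagination with the same response shape everywhere."""
    page = items[cursor:cursor + limit]
    next_cursor = cursor + limit if cursor + limit < len(items) else None
    return {"data": page, "next_cursor": next_cursor}
```

The payoff is that one client-side error handler and one pagination loop work against every endpoint, which is exactly the "fewer special branches" goal.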

Caching should be correct, not random

Performance work mattered, but only if it stayed honest.

I avoided ad hoc caching and focused on standard HTTP behavior for endpoints that were safe to revalidate:

  • explicit Cache-Control strategies
  • conditional requests with ETags

That allowed clients and intermediaries to reuse responses safely, reduced origin pressure during spikes, and kept the behavior aligned with HTTP semantics instead of hidden app logic.

Webhooks: assume failure and design for recovery

Outbound webhooks are delivery into someone else’s infrastructure. Failure is normal, not exceptional.

So the baseline assumption was at-least-once delivery. That means duplicates can happen, and consumers need idempotency support. From there, outbound delivery was treated as a product:

  • queue-first dispatch
  • retry policies with backoff and jitter
  • tenant isolation to avoid noisy-neighbor failure cascades
  • replay tooling for recovery and support workflows
  • structured telemetry for delivery lifecycle states

This changed webhooks from “fire and hope” into an operable delivery system.
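Two of the pieces above fit in a short sketch: a retry schedule using exponential backoff with full jitter (so a downed consumer coming back does not get hammered by synchronized retries), and consumer-side deduplication by event id, which is what at-least-once delivery demands. Parameter values and the event-id shape are illustrative assumptions.

```python
import random

def retry_delays(base: float = 1.0, cap: float = 300.0, attempts: int = 6) -> list[float]:
    """Exponential backoff with full jitter: each delay is drawn uniformly
    from [0, min(cap, base * 2**n)], spreading retries out over time."""
    return [random.uniform(0, min(cap, base * (2 ** n))) for n in range(attempts)]

def seen_before(event_id: str, seen: set) -> bool:
    """Consumer-side idempotency: drop duplicate deliveries by event id.
    (A real consumer would back this with durable storage, not a set.)"""
    if event_id in seen:
        return True
    seen.add(event_id)
    return False
```

The jitter matters more than the exact curve: without it, every failed delivery to a recovering consumer retries on the same schedule, which recreates the burst you were trying to survive.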

The real outcome: spikes became measurable

The biggest win was not surviving one high-traffic day. The win was turning spikes into a known shape.

Teams could monitor and reason about:

  • queue depth
  • drain time
  • success rates by attempt
  • error class distribution

After these changes, failures were still possible, but they were no longer mysterious. Reliability work became incremental and trackable, not reactive and chaotic.
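The drain-time metric above reduces to simple arithmetic, sketched here under the simplifying assumption of steady enqueue and dequeue rates:

```python
def drain_time_seconds(queue_depth: int, dequeue_rate: float, enqueue_rate: float) -> float:
    """Estimated seconds to empty a backlog, assuming steady rates.
    Returns infinity when the queue is not actually draining."""
    net = dequeue_rate - enqueue_rate  # net drain rate, items per second
    if net <= 0:
        return float("inf")
    return queue_depth / net
```

For example, a backlog of 1,000 deliveries draining at 150/s while 50/s keep arriving clears in about 10 seconds. The useful signal is the infinite case: a non-draining queue is an incident in progress, not a performance blip.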

In short: versioning protected the contract, caching protected origin performance, and queue-based webhooks protected everyone’s sanity.
