Behind the Scenes at Netcore: Designing a System for 435 Million Messages and 10,000+ TPS Events

Public awareness campaigns present a unique set of engineering challenges where success is measured by reach, reliability, and precise timing. Recently, our engineering team supported a large-scale initiative that required sending approximately 435 million messages over a four-day period.

The scale of this operation was immense:

  • Total Messages Delivered: ~435 million.
  • Peak Daily Volume: ~200 million messages within a single 13-hour window.
  • Outbound Throughput: ~4,000 Transactions Per Second (TPS).
  • Inbound Event Traffic: ~10,000 TPS of delivery status webhooks.

To handle this national-level throughput without compromising stability, we moved away from steady-state scaling and designed a system specifically for burst-scale performance.

The Challenge: Managing 20,000+ TPS Burst Traffic

Standard messaging systems are often designed for steady-state loads. However, public awareness campaigns demand a "burst-scale" mindset, where massive volume is compressed into tight daily windows. In this case, the system was required to deliver 435 million messages over four days, with peak daily volumes reaching 200 million messages in a single 13-hour window.

From an engineering perspective, the difficulty was not just outbound delivery, but managing the massive write amplification from incoming webhooks. Each sent message triggers multiple status events (Sent, Delivered, and Read), so an outbound rate of roughly 4,000 TPS translates into an inbound event stream of approximately 10,000 TPS. Without architectural safeguards, this volume would quickly lead to database write saturation and indexing latency spikes.

Architectural Strategy: Producer–Sender Decoupling

To maintain operational stability, the system was designed with a decoupled architecture that separates message generation from the physical delivery process.

Local In-Memory Queuing

The campaign producer generated message payloads at a rate of approximately 5,000 TPS, while the delivery engine was rate-limited to 4,000 TPS to adhere to external API constraints. By using local in-memory queues, the system absorbed this rate differential; a short sketch of the pattern follows the list below.

This decoupling ensures that:

  • Upstream production is never stalled by downstream API latency or temporary network fluctuations.
  • Outbound delivery remains smooth and predictable, maximizing throughput without exceeding API rate limits.
  • The sender always has a consistent buffer of messages, ensuring zero idle time during the 13-hour delivery window.
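
As a rough illustration, here is a minimal Go sketch of this decoupling, assuming a buffered channel as the local queue and golang.org/x/time/rate for the 4,000 TPS ceiling; the Message type and the buildMessage/deliver helpers are placeholders, not the production implementation.

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// Message is an illustrative stand-in for a campaign payload.
type Message struct {
	To   string
	Body string
}

func main() {
	ctx := context.Background()

	// Local in-memory queue: a bounded buffer that absorbs the gap
	// between the ~5,000 TPS producer and the 4,000 TPS sender.
	queue := make(chan Message, 100_000)

	// Producer: generates payloads and hands them to the local queue.
	go func() {
		for {
			queue <- buildMessage() // blocks only if the buffer is completely full
		}
	}()

	// Sender: capped at the external API ceiling (~4,000 TPS) so outbound
	// delivery stays smooth regardless of how fast the producer runs.
	limiter := rate.NewLimiter(rate.Limit(4000), 400)
	for msg := range queue {
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		go deliver(msg) // outbound API call, error handling omitted
	}
}

// buildMessage and deliver are hypothetical helpers used only for illustration.
func buildMessage() Message { return Message{} }
func deliver(msg Message)   { _ = msg }
```

Sizing the buffer generously keeps the sender continuously fed through API latency blips while still capping how much memory the queue can consume.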


Stream Isolation and Priority-Based Handling

Managing 10,000+ inbound events per second requires a specialized data pipeline. The core engineering decision was to treat each status event type as a first-class, independent stream, backed by its own dedicated Pub/Sub topic.

Critical Signal Prioritization

For campaign tracking, Delivered status is the most critical signal. By isolating these events into their own stream (a routing sketch follows this list), the team could:

  • Allocate more consumer capacity specifically for delivery status to ensure real-time reporting.
  • Shield the reporting pipeline from secondary traffic, such as "Read" receipts, which are high-volume but less time-sensitive.
  • Ensure faster database writes for critical data points, keeping dashboards accurate for stakeholders.
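
As a rough sketch of how such routing might look, the example below publishes each incoming status callback to its own topic. It assumes Google Cloud Pub/Sub as the broker; the topic IDs, project ID, and the routeStatusEvent helper are illustrative assumptions rather than the exact production setup.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// Illustrative topic IDs: one dedicated topic per status type so each
// stream can be scaled and throttled independently.
var topicByStatus = map[string]string{
	"sent":      "status-sent",
	"delivered": "status-delivered", // highest-priority stream
	"read":      "status-read",
}

// routeStatusEvent publishes a raw webhook payload to the topic that
// matches its status, keeping the streams fully isolated.
func routeStatusEvent(ctx context.Context, client *pubsub.Client, status string, payload []byte) error {
	topicID, ok := topicByStatus[status]
	if !ok {
		return nil // unknown status types can be dropped or dead-lettered
	}
	result := client.Topic(topicID).Publish(ctx, &pubsub.Message{Data: payload})
	_, err := result.Get(ctx) // block until the broker acknowledges the event
	return err
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "example-project") // project ID is a placeholder
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	_ = routeStatusEvent(ctx, client, "delivered", []byte(`{"msg_id":"123","status":"delivered"}`))
}
```

With one topic per status type, the subscription that feeds delivery reporting can be given more consumers than the Sent and Read subscriptions without any coordination between them.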

Operational Flexibility

Isolating event streams prevents cascading failures. If a backlog occurs in the "Read" event stream, it has zero impact on the processing of "Delivered" events. This allowed for independent throttling and batching across different data types to protect the core database cluster from write saturation.
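
One way a per-stream consumer might implement this batching is sketched below, assuming a Go worker that drains a single status stream; the StatusEvent type, flush callback, and batching parameters are illustrative assumptions.

```go
package pipeline

import (
	"context"
	"time"
)

// StatusEvent is a simplified delivery-status record destined for the reporting database.
type StatusEvent struct {
	MessageID string
	Status    string
	At        time.Time
}

// batchWriter drains one isolated status stream and flushes events in bulk,
// so a backlog in one stream never slows down another and the database sees
// batched inserts instead of per-event writes.
func batchWriter(ctx context.Context, events <-chan StatusEvent, flush func([]StatusEvent) error, size int, interval time.Duration) {
	batch := make([]StatusEvent, 0, size)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	flushNow := func() {
		if len(batch) == 0 {
			return
		}
		// flush must not retain the slice; retries and error handling are omitted here.
		_ = flush(batch)
		batch = batch[:0]
	}

	for {
		select {
		case ev := <-events:
			batch = append(batch, ev)
			if len(batch) >= size {
				flushNow() // size-based flush under heavy traffic
			}
		case <-ticker.C:
			flushNow() // time-based flush keeps dashboards fresh when traffic is light
		case <-ctx.Done():
			flushNow()
			return
		}
	}
}
```

Flushing on both batch size and a timer turns thousands of single-row inserts into a handful of bulk writes during peaks, while keeping reporting close to real time when traffic is light.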

Performance Validation: Knowing Infrastructure Limits

At this scale, engineering success depends on pre-campaign validation rather than mid-campaign reactions. The team utilized rigorous load testing and profiling to define safe operational boundaries.

Stress Testing for 20,000+ TPS

Using k6, the team simulated webhook traffic at rates exceeding 20,000 TPS. This verified that:

  • Message brokers could absorb massive traffic spikes without backlogs growing unbounded.
  • Downstream consumers were capable of processing at least 8,000 database writes per second without performance degradation.
  • Operational boundaries were clearly defined, establishing exactly when backpressure or throttling should be engaged.
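
One possible shape of that cutoff is a webhook handler that accepts events into a bounded buffer and starts shedding load once it fills up. The sketch below is an assumption about how such a threshold could be wired in Go, not a description of the production handler; the buffer size, route, and 429 response are illustrative.

```go
package main

import (
	"io"
	"net/http"
)

// statusEvents is an in-process buffer in front of the broker publish path.
// Its capacity marks the point where the service starts shedding load.
// A separate consumer (not shown) drains it into the message broker.
var statusEvents = make(chan []byte, 50_000)

// handleWebhook ingests delivery-status callbacks. When the buffer is full,
// it replies 429 so the upstream provider retries later instead of the
// service tipping over under a spike.
func handleWebhook(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	select {
	case statusEvents <- body:
		w.WriteHeader(http.StatusAccepted)
	default:
		http.Error(w, "overloaded", http.StatusTooManyRequests) // backpressure threshold reached
	}
}

func main() {
	http.HandleFunc("/webhooks/status", handleWebhook)
	_ = http.ListenAndServe(":8080", nil)
}
```

Making the buffer explicit keeps the throttling point observable and easy to exercise in a load test.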

Runtime Profiling with Go

To optimize the message processing services, the team utilized Go’s pprof tooling (a minimal setup sketch follows the list below). This enabled targeted improvements by:

  • Identifying hot paths in the message processing logic.
  • Pinpointing I/O and serialization bottlenecks that could impede throughput.
  • Ensuring that every code optimization translated directly into measurable throughput gains rather than speculative tuning.
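
For context, wiring pprof into a long-running Go service is typically as small as the sketch below; the port and placement inside main are assumptions for illustration.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose profiling endpoints on an internal-only port alongside the
	// normal consumers and senders started elsewhere in the service.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // placeholder for the service's real run loop
}
```

A 30-second CPU profile can then be pulled with go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30, which is the usual way to surface hot paths and serialization bottlenecks like those described above.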

Conclusion

The success of this campaign demonstrates that burst-scale engineering is a matter of deliberate trade-offs. By prioritizing critical signals, decoupling production from delivery, and rigorously testing infrastructure limits, the team achieved national-level reach with zero emergency interventions. These principles now form the foundation for how we engineer high-impact, large-scale communication systems.