Hot scaling core NATS (no JetStream) for bursty traffic — patterns and operator experience ? #7738

NevinDry · 2026-01-16T16:03:05Z

Hello everyone,

We are running core NATS (no JetStream) for a very high-traffic, low-latency component. Our current setup is a single cluster of 4 seed nodes on AWS Fargate, with Rust applications connecting via the async_nats client.

Our traffic profile is highly bursty and unpredictable: we occasionally see very intense spikes over short periods of time.
To give a sense of scale:

We typically handle up to ~18 million active subscriptions
Around 3,800 concurrent client connections
Subscriptions are evenly distributed across all seed nodes
During peak events, we can see bursts of up to ~1.5× the subscription count within ~10 minutes
At peak, we observe short bursts of ~60,000 messages transiting in ~2 seconds

Context and constraints

We understand that horizontal hot scaling within a single NATS cluster is intentionally limited, due to the full-mesh route topology and the associated N(N-1)/2 cost. In practice, adding nodes under load often increases CPU/memory pressure before it helps.

Today:

We can vertically scale or increase the number of nodes ahead of forecasted peaks
We cannot easily do so during a spike, since this requires restarts
Permanently overprovisioning the cluster is not cost-optimal

Approach we are exploring

We are currently evaluating superclusters with gateways as a way to absorb bursts:

Keep the existing cluster stable
Spin up an additional “burst” cluster on demand when we observe a traffic spike
Connect it via gateways (we assume it automatically joins the pool)
Steer new client connections to the new cluster to relieve the existing one (and load balance)

This seems consistent with how gateways scale, with a lower connection cost (Ni(M-1)) than a full mesh, but we are still early in testing.
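To make the comparison concrete, here is a rough sketch of the two formulas quoted in this post: N(N-1)/2 routes for a full mesh and Ni(M-1) gateway connections per cluster. The cluster sizes (growing to 8 nodes versus adding a 4-node burst cluster) are assumptions chosen for illustration, and the counts are logical links only; a real server may open more than one connection per link.

```python
def full_mesh_routes(n: int) -> int:
    """Logical route links in one fully meshed cluster of n servers: N(N-1)/2."""
    return n * (n - 1) // 2


def gateway_links(cluster_sizes: list[int]) -> int:
    """Gateway links using the Ni(M-1) estimate from the post, summed over clusters."""
    m = len(cluster_sizes)
    return sum(n_i * (m - 1) for n_i in cluster_sizes)


def supercluster_links(cluster_sizes: list[int]) -> int:
    """Intra-cluster routes plus gateway links for a whole supercluster."""
    routes = sum(full_mesh_routes(n) for n in cluster_sizes)
    return routes + gateway_links(cluster_sizes)


# Assumed scenario A: grow the single cluster from 4 to 8 servers.
single_cluster = full_mesh_routes(8)

# Assumed scenario B: keep the 4-node cluster and add a 4-node burst cluster.
two_clusters = supercluster_links([4, 4])

print("single 8-node cluster:", single_cluster)  # 28
print("two 4-node clusters:  ", two_clusters)    # 12 routes + 8 gateway links = 20
```

This only counts links. It says nothing about how much interest-propagation traffic each link carries, which is the part we still need to measure.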

Questions for the community

For operators running core NATS at high throughput, how do you typically design for bursty or unpredictable load?
Are superclusters a common pattern here?
Are there other approaches we should consider (partitioning, for example)?
For those using superclusters in production, are there pitfalls or limitations we should be aware of when using them specifically for on-demand capacity expansion?

On the client side:

How does the Rust async_nats client select servers and rebalance connections (from what we saw, it uses a random pattern)?
In practice, how do people steer new connections toward newly added capacity and load-balance correctly?

We are looking for best practices and real-world patterns.
If this scenario has not been explored much yet, we would be happy to share feedback once we gain more experience with hot scaling NATS clusters.

Thanks in advance for your insights.

ripienaar · 2026-01-16T19:47:26Z

Maintainer

“We cannot easily do so during a spike, since this requires restarts”

Adding nodes to a cluster does not require restarts. Just as clients learn about new members via seed nodes, so do clusters. You can start a new node pointed at a few existing nodes and it will join the fully meshed cluster with no restarts needed.

So the pattern would be to keep a small, stable core of servers around and scale out new ones, pointing them at the core servers as seeds.
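As a rough illustration of that pattern, the sketch below builds the command line for an extra node that solicits routes to two stable seeds. The host names, cluster name, and ports are hypothetical placeholders; the flags are the standard nats-server cluster options, but check them against `nats-server --help` for your version.

```python
import shlex


def join_command(cluster_name, seeds, cluster_port=6222, client_port=4222):
    """Build the argv for a new node that joins an existing cluster via seed routes."""
    routes = ",".join(f"nats://{host}:{cluster_port}" for host in seeds)
    return [
        "nats-server",
        "--port", str(client_port),
        "--cluster_name", cluster_name,                  # must match the existing cluster
        "--cluster", f"nats://0.0.0.0:{cluster_port}",   # where this node accepts routes
        "--routes", routes,                              # stable seeds to connect to
    ]


# "main", "seed-1" and "seed-2" are made-up names for this example.
cmd = join_command("main", ["seed-1", "seed-2"])
print(shlex.join(cmd))
```

The new node only needs to know the seeds; it learns the rest of the mesh from them.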

10 replies

ripienaar Jan 19, 2026
Maintainer

why not foo.$user_id.>?

neilalexander Jan 19, 2026
Maintainer

Or even *.$user_id or *.$user_id.> etc.

NevinDry Jan 20, 2026
Author

In our case, the large number of subjects comes from how the platform is used internally. We expose NATS as a generic messaging layer that multiple teams use independently to build their own client → NATS → backend flows. Each team owns its subject structure and message semantics.

Using something like foo.$user_id.> or *.$user_id.> would effectively force all teams to agree on a shared subject hierarchy and behavior, which creates coupling between otherwise independent implementations. We’ve intentionally avoided that so teams can evolve their use cases without impacting each other.

That trade-off gives us flexibility and isolation, but it does push subscription counts very high, which is why we’re now re-evaluating infrastructure and topology choices at this scale.

ripienaar Jan 20, 2026
Maintainer

I'd say that when weighing the two options (coordinating between teams, or using software in a way that's known to cause problems) there is really only one way: you have to coordinate, or come up with per-team namespacing like $team.$user.> or something similar. That keeps friction low while optimising the interest graph.

Just punting the hard work of having multiple teams onto the software will forever cause failures.
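To show why a namespace like $team.$user.> shrinks the interest graph, here is a small sketch of NATS-style wildcard matching (`*` matches exactly one token, `>` matches one or more trailing tokens). The team and subject names are invented for the example, and the matcher is a simplified illustration, not the server's implementation.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Return True if a dot-separated subject matches a pattern with * and > wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the last pattern token and must cover at least one token.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


# Hypothetical literal subjects one client might otherwise subscribe to one by one.
literal_subjects = [
    "billing.u42.invoice.created",
    "billing.u42.invoice.paid",
    "billing.u42.profile.updated",
]

# One per-team, per-user wildcard subscription covers all of them.
covered = [s for s in literal_subjects if subject_matches("billing.u42.>", s)]
print(len(literal_subjects), "literal subscriptions ->", 1, "wildcard, covering", len(covered))
```

Each team still owns everything under its own prefix, so the only shared convention is the first two tokens.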

NevinDry Jan 20, 2026
Author

I agree this is a trade-off between organizational coordination and system-level efficiency.

For now, we’ll continue evaluating both architectural and platform changes, and we’ll be happy to come back and share feedback once we have more concrete results.

Thanks again for the thoughtful discussion :)

jnmoyne · 2026-01-17T12:21:56Z

Collaborator

I think spinning up another cluster is not necessarily better than starting a new server in the existing cluster: your main issue is rebalancing the existing client connections, regardless of how you add servers. Since you are only using core NATS, there is no problem adding or removing servers 'on the fly'.

To trigger the rebalancing, you would probably have to create a process that runs on a regular basis and looks at how connections are distributed amongst the servers (so it has to know how many servers are currently in the cluster). It can then 'kick' connections from the most crowded servers. When those applications reconnect, they should have learned the complete list of current servers from the gossip and will pick from that list, and that's one way to rebalance. In any case, you want to throttle the rebalancing so that you don't create a 'thundering herd' of thousands of client applications suddenly reconnecting.
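A minimal sketch of the planning side of such a process, assuming you already have per-server connection counts from monitoring. The server names, the 950-connections-per-server split of the ~3,800 connections, and the batch size are all assumptions for illustration; actually closing the connections is left out.

```python
import math


def plan_kicks(conns_per_server: dict[str, int]) -> dict[str, int]:
    """How many connections to kick from each server that is above the mean."""
    total = sum(conns_per_server.values())
    target = math.ceil(total / len(conns_per_server))
    return {name: count - target
            for name, count in conns_per_server.items() if count > target}


def throttle(kicks: dict[str, int], batch_size: int) -> list[dict[str, int]]:
    """Split the kicks into rounds of at most batch_size connections each."""
    remaining = dict(kicks)
    rounds = []
    while remaining:
        budget, this_round = batch_size, {}
        # Drain the most crowded servers first.
        for name in sorted(remaining, key=remaining.get, reverse=True):
            if budget == 0:
                break
            take = min(remaining[name], budget)
            this_round[name] = take
            budget -= take
        for name, take in this_round.items():
            remaining[name] -= take
            if remaining[name] == 0:
                del remaining[name]
        rounds.append(this_round)
    return rounds


# Assumed state: four existing servers at 950 connections, one new empty server.
counts = {"n1": 950, "n2": 950, "n3": 950, "n4": 950, "n5": 0}
kicks = plan_kicks(counts)
rounds = throttle(kicks, batch_size=100)
print(kicks)
print(len(rounds), "rounds")
```

Run one round, wait for the reconnects to settle, then re-read the counts before the next round, since kicked clients pick a server from the list themselves and may not land on the new one.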

2 replies

ripienaar Jan 17, 2026
Maintainer

The CLI already has connection rebalancing built in :)

NevinDry Jan 19, 2026
Author

Hello, thanks for the detailed answer.

“Since you are only using core NATS then no problem adding/removing servers to the cluster on the fly.”

While this is true from a functional standpoint, in our experience adding more servers under high traffic can itself introduce performance issues at our current scale. In particular, we observe that:

Adding servers increases CPU and memory usage per server

Route connections grow O(N²)

Subject interest propagation and the interest graph size grow with cluster size

Beyond a certain point, the overhead introduced by additional servers outweighs the benefit of distributing client connections, and we start to see a measurable negative impact on both CPU and memory.

From your response, my understanding is that the proposed mitigation would be to add servers and then actively rebalance client connections so that each server handles a similar number of connections, thereby evening out resource usage.

While this could help distribute client-side load, our concern is that it does not address the cluster-level overhead introduced by increasing the number of servers. Even with perfectly balanced connections, the cost of routing, interest propagation, and cluster membership still grows with cluster size. As a result, we expect to eventually hit a limit where adding more servers no longer provides net capacity and instead degrades overall performance.

This is the main reason we are questioning whether adding servers to a single cluster is the right scaling lever for us, and why we are exploring alternatives that reduce cluster-wide overhead rather than just redistributing it.
