Network · May 22, 2026
Designing Low-Latency Trading Infrastructure on AWS: Network Architecture
A benchmark-driven case for running latency-sensitive trading workloads in a single AWS Availability Zone.
Introduction
Building a distributed, cloud-native trading platform comes with its own set of infrastructure challenges. One of the first things I had to get right in KubeTrader was the network architecture, because it dictates where workloads run, the latency profile, and a large part of the cloud bill. It also had to be able to evolve with the platform without turning into a bottleneck later.
From the start, the goal was a secure network optimized for the lowest wire latency to the exchange, where the physical path allows.
Bootstrapping KubeTrader alone and on my own funds, I was mainly constrained by three things:
Cost: I had to cut cloud spend as aggressively as I could. On the compute side, that meant leaning heavily on spot nodes (covered in Blog 1). But for data-intensive workloads, most of the hidden cost is in the network, especially connectivity services like NAT gateways, PrivateLink, and cross-AZ transfer. That alone already pushed me toward single-AZ deployments, trading away cross-AZ high availability to avoid those costs.
Mental bandwidth: It's limited, and switching between writing trading logic and debugging infrastructure takes a real toll on focus. I needed the network design to be simple enough to hold in my head and to reload from a couple-minute read.
Extensibility: This is an iterative project meant to evolve over time and in several directions (currently on v2). I don't have the time or the appetite for big rewrites, so the foundational network layout had to be organized and decoupled enough to extend cleanly.
1. The Single-AZ Decision
AWS's Well-Architected Framework explicitly flags single-AZ deployment as an anti-pattern. The Reliability pillar wants workloads spread across Availability Zones, so the failure of one zone can't take you down. For most applications that's correct. But AZs are physically separate, and that separation costs latency: negligible overhead for standard applications, but not for trading.
The other half of the picture is the exchange. In our case that's Binance, and we don't know where it actually sits. Somewhere near Tokyo, but which AZ? Maybe a peered datacenter nearby? No idea. From our side it could be closest to a, c, or d, and there's no reason the path from each zone would be the same.
So the hypothesis: if the paths are uneven, then putting the hot workload in a single AZ, the one closest to Binance, should beat a multi-AZ layout on latency. If the paths turn out roughly equal, the hypothesis is wrong and the HA default wins. Either way it's measurable, which is what the experiment below does.
The experiment
The goal is straightforward: find which AZ in ap-northeast-1 has the lowest wire latency to Binance. We deployed three identical instances across the three available AZs, ingested real-time trades from Binance Spot simultaneously, measured local_ts - event_ts per trade, saved results locally, then pushed both the data and instance metadata to S3 at the end of the run. From there we pulled everything down to align and analyze.
Before the numbers mean anything, three things had to be true.
1. AZ naming is not portable
AWS randomizes the AZ name to physical zone mapping per account. For this one:
| AZ name | AZ ID |
|---|---|
| ap-northeast-1a | apne1-az4 |
| ap-northeast-1c | apne1-az1 |
| ap-northeast-1d | apne1-az2 |
apne1-az3 is missing from this account, worth being aware of. The result is reported by AZ ID, not by name, since the name maps to a different physical zone in someone else's account.
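A minimal sketch of building that name-to-ID mapping programmatically. The input shape mirrors the `AvailabilityZones` list that EC2's DescribeAvailabilityZones returns (with boto3, that would be `ec2.describe_availability_zones()["AvailabilityZones"]`). The helper name `az_name_to_id` is hypothetical, and the sample data is just the table above retyped so the block runs without AWS credentials.

```python
from typing import Dict, List


def az_name_to_id(zones: List[dict]) -> Dict[str, str]:
    """Map account-local AZ names to account-independent AZ IDs."""
    return {zone["ZoneName"]: zone["ZoneId"] for zone in zones}


# Same shape as the "AvailabilityZones" entries from DescribeAvailabilityZones;
# values copied from the table above (this account only).
sample_zones = [
    {"ZoneName": "ap-northeast-1a", "ZoneId": "apne1-az4"},
    {"ZoneName": "ap-northeast-1c", "ZoneId": "apne1-az1"},
    {"ZoneName": "ap-northeast-1d", "ZoneId": "apne1-az2"},
]

mapping = az_name_to_id(sample_zones)
for name, zone_id in sorted(mapping.items()):
    print(f"{name} -> {zone_id}")
```

Tagging every benchmark record with the AZ ID rather than the name keeps results comparable across accounts.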
2. The signal is small, so the measurement has to be finer
AWS cross-AZ baseline sits somewhere in the high microseconds to low milliseconds, so millisecond-resolution timestamps would round the per-AZ delta away entirely. We need microsecond resolution on both ends.
Binance's spot trade WebSocket stream supports a timeUnit=MICROSECOND query parameter, which exposes the exchange-side event timestamp at microsecond resolution. That gives us a usable measurement: local_ts - event_ts per trade, where local_ts is captured on the receiving instance the moment the frame is read off the socket.
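As a rough sketch of the per-trade measurement (not the actual benchmark code): the frame fields `E` (event time), `s` (symbol), and `t` (trade ID) follow Binance's trade stream payload, and `E` is assumed to be in microseconds because the stream was opened with timeUnit=MICROSECOND. The helper names and the sample frame are made up for illustration.

```python
import json
import time


def now_us() -> int:
    """Wall-clock time in integer microseconds since the Unix epoch."""
    return time.time_ns() // 1_000


def trade_latency_us(raw_frame: str, local_ts_us: int) -> dict:
    """Compute local_ts - event_ts for one trade frame."""
    msg = json.loads(raw_frame)
    return {
        "symbol": msg["s"],
        "trade_id": msg["t"],
        "latency_us": local_ts_us - msg["E"],  # E assumed to be in microseconds
    }


# Synthetic frame, for illustration only (not captured data).
sample_frame = '{"e":"trade","E":1747900000000000,"s":"BTCUSDT","t":12345,"T":1747899999999900}'
record = trade_latency_us(sample_frame, local_ts_us=1747900000000300)
print(record)
```

In a real receiver, `now_us()` would be called immediately after the frame is read off the socket and before any JSON parsing, so parsing time doesn't leak into the measurement.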
3. The clocks have to agree
Binance's exposed timestamps are in UTC, but how accurate their clocks actually are to true UTC is another black box, same as their datacenter location. We'll assume they run something reasonable (PTP against PHC-enabled hardware is typical in AWS and at major exchanges). That leaves the question on our side: how accurate is our local clock to true UTC, and how stable is it over the run?
AWS Time Sync exposes a Stratum 1 NTP source at the link-local address 169.254.169.123, which chronyd picks up by default on Amazon Linux. After the benchmark, chronyc tracking on one of the instances (trimmed to the relevant fields):
sh-5.2$ chronyc tracking
Reference ID : A9FEA97B (169.254.169.123)
Stratum : 2
Last offset : +0.000000041 seconds
RMS offset : 0.000000411 seconds
Root delay : 0.000107722 seconds
Root dispersion : 0.000029770 seconds
Two things to read from this. The Reference ID confirms the clock is locked to AWS Time Sync, and the RMS offset of 411 ns is how steadily chrony holds the local clock against that reference. That's stability, not absolute accuracy against true UTC. For absolute accuracy, the conventional NTP bound is:
root_delay/2 + root_dispersion + |offset|
From the values above, that works out to roughly 84 µs. So our single absolute readings of local_ts - event_ts carry around 84 µs of clock uncertainty against true UTC.
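The arithmetic, spelled out with the chronyc values above. One assumption: `Last offset` is used for the |offset| term. Substituting the RMS offset instead changes the result by well under 1 µs.

```python
def ntp_error_bound_s(root_delay_s: float, root_dispersion_s: float, offset_s: float) -> float:
    """Conventional NTP bound: root_delay/2 + root_dispersion + |offset|."""
    return root_delay_s / 2 + root_dispersion_s + abs(offset_s)


# Values taken from the chronyc tracking output above.
bound_s = ntp_error_bound_s(
    root_delay_s=0.000107722,
    root_dispersion_s=0.000029770,
    offset_s=0.000000041,  # "Last offset" (assumption: this is the |offset| term)
)
bound_us = bound_s * 1e6
print(f"clock error bound: {bound_us:.1f} µs")
```

Half the root delay contributes about 54 µs and the root dispersion about 30 µs, so the bound is dominated by the path to the time source, not by chrony's tracking error.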
This was measured on one instance, but all three run the same hardware and sync to the same AWS Time Sync source. So when chrony corrects one instance's clock, it's reacting to the same reference the other two are reacting to. Their offsets to UTC drift around, but they drift together, so the gap between them stays small. That's the part that matters: each absolute reading is fuzzy by around 84 µs, but when we compare AZ to AZ that fuzziness mostly cancels. The ranking we get from this is reliable. The absolute µs numbers aren't.
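A toy illustration of why the shared error cancels in the comparison. Every number here is hypothetical and chosen only to make the effect visible. The model assumes each instance's clock error splits into a common part (shared reference) plus a small per-instance residual.

```python
# Hypothetical values, in microseconds. None of these are measured.
true_latency_us = {"az_a": 300.0, "az_b": 900.0}
shared_offset_us = 60.0                      # common clock error vs. true UTC
residual_us = {"az_a": 0.4, "az_b": -0.3}    # small per-instance wobble

# What each instance would report: true latency plus its total clock error.
measured_us = {
    az: true_latency_us[az] + shared_offset_us + residual_us[az]
    for az in true_latency_us
}

absolute_error_us = measured_us["az_a"] - true_latency_us["az_a"]
true_gap_us = true_latency_us["az_b"] - true_latency_us["az_a"]
measured_gap_us = measured_us["az_b"] - measured_us["az_a"]

print(f"absolute error on az_a: {absolute_error_us:.1f} µs")
print(f"error on the AZ-to-AZ gap: {measured_gap_us - true_gap_us:.1f} µs")
```

The shared term drops out of the difference, leaving only the residuals, which is why the ranking is trustworthy even when the absolute readings are not.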
Results
The benchmark instances ran a stock Amazon Linux 2023 image with no kernel tuning, no NIC optimization, and intentionally basic code. We collected the data from all three instances, aligned trades by trade ID so each trade appears in all three zones, then grouped per (AZ, symbol). We focused on the minimum per-symbol latency, the sample where kernel and app overhead were closest to zero, leaving a value dominated by the network and exchange components.
We're subtracting E (event-emit timestamp), not T (matching-engine trade time), so local_ts - E is close to a pure transport-plus-stack measurement, with no significant Binance-internal processing bundled in. Worth noting: E differs slightly per instance even for the same trade ID, since Binance fans out the frame individually to each subscriber.
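The align-then-aggregate step can be sketched roughly as follows. This is an illustration in plain Python, not the analysis code used for the post: the row layout, the function name `per_az_median_of_minimums`, and the sample rows are all hypothetical, and it assumes the per-AZ summary is the median of the per-symbol minimums. Trades are keyed on (symbol, trade ID) since trade IDs are scoped per symbol.

```python
from collections import defaultdict
from statistics import median


def per_az_median_of_minimums(rows):
    """rows: iterable of (az_id, symbol, trade_id, latency_us) tuples."""
    rows = list(rows)
    all_azs = {az for az, _, _, _ in rows}

    # Keep only trades observed in every AZ, so the zones are compared like for like.
    seen_in = defaultdict(set)
    for az, symbol, trade_id, _ in rows:
        seen_in[(symbol, trade_id)].add(az)
    common = {key for key, azs in seen_in.items() if azs == all_azs}

    # Minimum latency per (AZ, symbol) over the aligned trades.
    minimums = {}
    for az, symbol, trade_id, latency in rows:
        if (symbol, trade_id) in common:
            key = (az, symbol)
            minimums[key] = min(latency, minimums.get(key, latency))

    # Median of the per-symbol minimums within each AZ.
    per_az = defaultdict(list)
    for (az, _symbol), value in minimums.items():
        per_az[az].append(value)
    return {az: median(values) for az, values in per_az.items()}


# Synthetic rows, for illustration only.
rows = [
    ("az-x", "BTCUSDT", 1, 300), ("az-y", "BTCUSDT", 1, 900),
    ("az-x", "BTCUSDT", 2, 280), ("az-y", "BTCUSDT", 2, 950),
    ("az-x", "ETHUSDT", 7, 320), ("az-y", "ETHUSDT", 7, 980),
    ("az-x", "ETHUSDT", 8, 310),  # seen in only one AZ, so it is dropped
]
summary = per_az_median_of_minimums(rows)
print(summary)
```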
Figure: box plot of per-symbol minimum latency by AZ. White line = median; box = IQR; whiskers = 1.5×IQR; dots = individual symbol outliers; dashed line = best AZ min median reference.
The result is clear: apne1-az4 wins by a large margin, the next AZ sits about 650 µs higher, and the third higher still. For Binance spot trade on this endpoint in ap-northeast-1, apne1-az4 is the right home for any latency-sensitive workload on this account.
Reading the absolute number.
The fastest single-symbol reading we saw was around 275 µs. Binance exposes the WebSocket over public endpoints, so this isn't AWS-internal traffic. It's two Tokyo-region hosts talking over the public internet (out our IGW, across whatever path the internet picks, into Binance's edge). The 650 µs gap between apne1-az4 and the next AZ is the extra time the path takes when it starts from a different AWS AZ. Whether that comes from a longer internal path to the IGW, a different egress edge, or different routing on the public side, we can't see. What we can say is that apne1-az4 has the shortest public-internet path to Binance from this account, by a margin that's not noise.
Limitations worth naming.
This is a single collection window. The ranking is robust because the gap is large, but exchange-path latency can move with time of day and routing, so production observability is what confirms it holds over time. Other Binance endpoints could rank differently. We also can't test apne1-az3, it's not exposed in this account, so whether it would be closer still is genuinely unknown.
Verdict: single-AZ co-location beats the multi-AZ default for hot workloads. The hypothesis holds.
Beyond exchange latency
v1 focused on minimizing the network distance to the exchange. v2 shifts some attention inward, to the latency between our own services, and two AWS features we'll explore for that path:
- Cluster placement groups: instances placed on the same low-latency, high-bandwidth network segment within the AZ, which can shave latency on intra-cluster hot paths. Trade-off: more concentrated capacity (occasional InsufficientCapacity on scale-up) and correlated failure inside that segment.
- ENA Express: improves tail latency for high-throughput intra-cluster flows specifically. Trade-off: only available on supported (newer Nitro) instance types, so it factors into instance selection and cost.
2. Deployment Topology & Subnet Architecture
The previous section established that single-AZ is the right model for hot workloads. That doesn't mean the whole network is single-AZ. AWS forces multi-AZ for some services (the Application Load Balancer we use to expose dashboards needs at least two AZs, no way around it), and the non-latency-sensitive workloads could legitimately benefit from cross-zone redundancy.
From that, two workload classes emerge:
| Class | Examples | Latency-sensitive | External egress |
|---|---|---|---|
| Business-critical | Kubestream, trading engine, ticker plant | Yes | Exchange feeds and APIs |
| Supporting | Data platform, research env, CI/CD, observability | No | Image pulls, package updates, dashboards |
The business-critical side was covered in section 1. For the supporting workloads, multi-AZ would be the textbook answer. Honestly, I kept them in single-AZ for v1 to avoid the cross-zone transfer cost, and v2 still does the same until the returns justify the extra spend. The switch is straightforward when it's time.
Security stance
The first stated goal was a secure network, and the topology has to deliver on that before anything else sits on top. The stance is straightforward: minimize the attack surface, assume everything is hostile until proven otherwise, and make security boring by making it the default.
The rules:
- Subnets default to private. No public IP allocation on instance launch unless explicitly required by the workload.
- No SSH keys, no SSH ports. Shell access is via SSM Session Manager only, IAM-audited, no port 22 anywhere.
- Default-deny on ingress. Nothing accepts inbound traffic until a security group rule explicitly opens it.
- No manual security group changes. All network configuration is in Pulumi, versioned in git, deployed via CI.
This section is the network-layer security boundary. IAM, network policies, workload security, and secret handling sit above this layer and are covered elsewhere. Within the network layer, NACLs are available if a coarser default-deny is ever needed, but the security-group posture above has been sufficient so far.
Connectivity concerns
Three connectivity choices shaped the topology, each driven by a different mix of cost and latency.
NAT gateway. Kubestream ingested roughly 1 TB/day of market data in v1, headed for ~5 TB/day in v2. If that traffic flowed through a NAT gateway to preserve the "private subnet" posture, the data-processing charges alone would run around $22k/year, close to the entire v1 cloud spend, and that's just Binance. Extending to other exchanges would multiply it. NAT also adds a hop. In-region NAT lands in the hundreds of microseconds to a couple ms, not free for a market-data path counting microseconds.
So Kubestream egresses directly via IGW from a public subnet. The security rules above do the work that NAT's no-public-IP property would have done. It's not the same property. NAT also removes the host from being routable at L3, which is a real defense-in-depth layer we give up, but the trade is acceptable for a workload that only initiates outbound WebSocket connections to a known exchange.
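A back-of-the-envelope version of the NAT data-processing estimate above. The per-GB rate is an assumption (roughly the ap-northeast-1 list price at the time of writing; check current AWS pricing), as is the decimal TB-to-GB conversion. Hourly NAT charges are ignored.

```python
GB_PER_TB = 1000                    # decimal units (assumption)
NAT_PROCESSING_USD_PER_GB = 0.062   # assumed regional rate, not taken from the post


def nat_processing_usd_per_year(tb_per_day: float,
                                usd_per_gb: float = NAT_PROCESSING_USD_PER_GB) -> float:
    """Yearly NAT gateway data-processing charge for a steady daily volume."""
    return tb_per_day * GB_PER_TB * usd_per_gb * 365


v1_cost = nat_processing_usd_per_year(1.0)   # v1 volume from the text
v2_cost = nat_processing_usd_per_year(5.0)   # projected v2 volume from the text
print(f"v1: ${v1_cost:,.0f}/year, v2: ${v2_cost:,.0f}/year")
```

Under these assumptions the ~$22k/year figure corresponds to the v1 volume of roughly 1 TB/day, and the charge scales linearly with volume.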
Gateway endpoints. The data platform leans on S3 for tiered storage, plus Loki, Dagster logs, and experiment artifacts. Letting any of that egress over the public internet would mean extra latency variance and NAT data-processing for nothing. The fix is the S3 gateway endpoint: free, in-region, no NAT hop. We added a DynamoDB gateway endpoint at the same time since it's also free.
Interface endpoints. A Secrets Manager interface endpoint keeps secret access on the AWS backbone instead of hopping to a public API endpoint. The same pattern fits ECR, SSM Parameter Store, KMS, and others, but interface endpoints carry an hourly cost per AZ plus per-GB processing, so we kept it to Secrets Manager only for now. The goal was to establish the pattern, not pre-pay for every service.
Cases that didn't fit the default
Three workloads needed something other than the default private-subnet placement.
Kubestream into a public subnet. Covered above: the NAT cost and the NAT hop together pushed it out of a private subnet, and the security stance is what keeps it defensible.
Trading engine into private + NAT with a pinned EIP. Binance bound HMAC API keys to a whitelisted set of Elastic IPs; the alternative was rotating keys every few weeks. I couldn't migrate to the new Ed25519 signing scheme at the time, so I needed a fixed set of egress IPs.
On one side, a fixed whitelisted EIP. On the other, Karpenter creating and destroying ephemeral nodes on demand. There's no clean way to bind a static EIP to an autoscaled node pool. Two paths:
- An EventBridge + Lambda setup that listens for node-launch events and re-associates EIPs as nodes appear and disappear.
- Pin the EIPs to a NAT gateway and route the trading engine's egress through it.
I picked the NAT, ate the latency, and called it a temporary measure. The Ed25519 migration in v2 removes the constraint, at which point the trading engine moves to a path more consistent with the latency budget.
The same pattern (single-AZ co-location + EIP-pinned NAT for whitelisted exchange APIs) generalizes to any exchange integration, CEX or DEX, when the exchange demands a static egress IP.
Aurora into its own isolated subnet. Aurora sits in a small DB-only /24 with no NAT, no IGW, and an ingress rule that only allows the workload security group.
The topology
With those constraints and exceptions in place, here's what the network looks like:
A single primary AZ (apne1-az4) holds everything that does real work: a public subnet for Kubestream and edge resources (IGW, NAT, ALB), a private workload subnet for everything else egressing through the NAT, and an isolated data subnet for Aurora. The two other AZs hold the minimum AWS requires: ALB target registrations and an Aurora reader for DB-tier failover. No workload pods schedule outside the primary AZ.
Building the component
This whole layout is a custom Pulumi component. When I started, I leaned on Pulumi Crosswalk for AWS (awsx), which gives you a VPC with sensible defaults in a few lines and is the right tool for a project that follows AWS best practices. The latency and cost choices in this post pull the opposite way: single-AZ for hot workloads, NAT pinned to a specific AZ with a fixed EIP, and public-subnet placement for the ingest path. awsx doesn't give enough flexibility for that: you can't pick the AZ for a single NAT, route tables are auto-managed so custom routes have to be tacked on outside the component, and the subnet layout is one of a handful of presets.
So I wrote our own on top of the Pulumi primitives directly. One upside of owning the component is that extending it is straightforward. It already carries hooks for VPC peering and PrivateLink, no immediate use case in v1 or v2, but cheap to scaffold now and expensive to retrofit later. Future cases worth being ready for: another exchange integration that lives in a different VPC, and BYOC deployments for managed services like ClickHouse Cloud.
Takeaways
The cloud's defaults are a starting point, not a rule. The Well-Architected Framework says single-AZ is an anti-pattern, and for most workloads it's right, but my constraints weren't the ones it was written for. So I measured instead of assuming, and the benchmark made the call.
Everything after that falls out of one measurement: the subnet layout, the private-by-default security stance, the NAT and endpoint choices, the custom Pulumi component. None of it is clever on its own. It's just what's left when you design around what the cloud already hands you (zonal placement, gateway endpoints, primitives you can build on) instead of fighting the defaults to keep an HA checkbox ticked.