Designing Low-Latency Trading Infrastructure on AWS: Network Architecture

A benchmark-driven case for running latency-sensitive trading workloads in a single AWS Availability Zone.

Introduction

Building a distributed, cloud-native trading platform comes with its own set of infrastructure challenges. One of the first things I had to get right in KubeTrader was the network architecture, because it dictates where workloads run, the latency profile, and a large part of the cloud bill. It also had to be able to evolve with the platform without turning into a bottleneck later.

From the start, the goal was a secure network optimized for the lowest wire latency to the exchange, where the physical path allows.

Bootstrapping KubeTrader alone and on my own funds, I was mainly constrained by three things:

Cost: I had to cut cloud spend as aggressively as I could. On the compute side, that meant leaning heavily on spot nodes (covered in Blog 1). But for data-intensive workloads, most of the hidden cost is in the network, especially connectivity services like NAT gateways, PrivateLink, and cross-AZ transfer. That alone already pushed me toward single-AZ deployments, trading away cross-AZ high availability to avoid those costs.

Mental bandwidth: It's limited, and switching between writing trading logic and debugging infrastructure takes a real toll on focus. I needed the network design to be simple enough to hold in my head, and to reload from a couple-minute read.

Extensibility: This is an iterative project meant to evolve over time and in several directions (currently on v2). I don't have the time or the appetite for big rewrites, so the foundational network layout had to be organized and decoupled enough to extend cleanly.

1. The Single-AZ Decision

AWS's Well-Architected Framework explicitly flags single-AZ deployment as an anti-pattern. The Reliability pillar wants workloads spread across Availability Zones, so the failure of one zone can't take you down. For most applications that's correct. But AZs are physically separate, and that separation costs latency: a negligible overhead for standard applications, but not for trading.

The other half of the picture is the exchange, in our case Binance, and we don't know where it actually sits. Somewhere near Tokyo, but which AZ? Maybe a peered datacenter nearby? No idea. From our side it could be closest to zone a, c, or d, and there's no reason the path from each zone would be the same.

So the hypothesis: if the paths are uneven, then putting the hot workload in a single AZ, the one closest to Binance, should beat a multi-AZ layout on latency. If the paths turn out roughly equal, the hypothesis is wrong and the HA default wins. Either way it's measurable, which is what the experiment below does.

The experiment

The goal is straightforward: find which AZ in ap-northeast-1 has the lowest wire latency to Binance. We deployed three identical instances across the three available AZs, ingested real-time trades from Binance Spot simultaneously, measured local_ts - event_ts per trade, saved results locally, then pushed both the data and instance metadata to S3 at the end of the run. From there we pulled everything down to align and analyze.

Before the numbers mean anything, three things had to be true.

1. AZ naming is not portable

AWS randomizes the AZ name to physical zone mapping per account. For this one:

AZ name            AZ ID
ap-northeast-1a    apne1-az4
ap-northeast-1c    apne1-az1
ap-northeast-1d    apne1-az2

apne1-az3 is not available in this account, which is worth being aware of. The result is reported by AZ ID, not by name, since the same name maps to a different physical zone in someone else's account.
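To reproduce the table for another account, the mapping can be read from the EC2 DescribeAvailabilityZones API. The following is a minimal sketch, not the script used for the benchmark: the helper names are hypothetical, and the sample response is trimmed to the two fields it needs.

```python
from typing import Any


def az_name_to_id(response: dict[str, Any]) -> dict[str, str]:
    """Map AZ names to AZ IDs from a DescribeAvailabilityZones response."""
    return {
        zone["ZoneName"]: zone["ZoneId"]
        for zone in response["AvailabilityZones"]
    }


def fetch_az_mapping(region: str = "ap-northeast-1") -> dict[str, str]:
    """Query EC2 for the calling account's mapping (needs boto3 and credentials)."""
    import boto3  # imported lazily so the pure helper above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return az_name_to_id(ec2.describe_availability_zones())


# Trimmed sample response carrying the mapping from the table above.
sample = {
    "AvailabilityZones": [
        {"ZoneName": "ap-northeast-1a", "ZoneId": "apne1-az4"},
        {"ZoneName": "ap-northeast-1c", "ZoneId": "apne1-az1"},
        {"ZoneName": "ap-northeast-1d", "ZoneId": "apne1-az2"},
    ]
}
mapping = az_name_to_id(sample)
print(mapping)
```

Running fetch_az_mapping() in a different account will generally return a different name-to-ID pairing, which is why results keyed by AZ ID are the only ones worth comparing across accounts.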

2. The signal is small, so the measurement has to be finer

AWS cross-AZ baseline sits somewhere in the high microseconds to low milliseconds, so millisecond-resolution timestamps would round the per-AZ delta away entirely. We need microsecond resolution on both ends.

Binance's spot trade WebSocket stream supports a timeUnit=MICROSECOND query parameter, which exposes the exchange-side event timestamp at microsecond resolution. That gives us a usable measurement: local_ts - event_ts per trade, where local_ts is captured on the receiving instance the moment the frame is read off the socket.
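As a rough illustration of that measurement, here is a sketch of the per-trade computation. It assumes the stream was opened with timeUnit=MICROSECOND, so the event time field E arrives in microseconds; the function names and the sample frame are made up for the example and are not the benchmark's actual code.

```python
import json
import time


def now_us() -> int:
    """Local wall-clock time in microseconds since the Unix epoch."""
    return time.time_ns() // 1_000


def trade_latency_us(frame: str, local_ts_us: int) -> dict:
    """Compute local_ts - event_ts for one trade frame.

    Assumes "E" (event time) is already in microseconds.
    """
    msg = json.loads(frame)
    return {
        "symbol": msg["s"],
        "trade_id": msg["t"],
        "event_ts_us": msg["E"],
        "local_ts_us": local_ts_us,
        "latency_us": local_ts_us - msg["E"],
    }


# Made-up frame in the shape of a spot trade event (values are illustrative).
frame = (
    '{"e":"trade","E":1700000000000000,"s":"BTCUSDT","t":42,'
    '"p":"1.0","q":"2.0","T":1699999999999900,"m":false}'
)
row = trade_latency_us(frame, local_ts_us=1700000000000412)
print(row["latency_us"])  # 412
```

In a live reader, now_us() would be called immediately after the frame comes off the socket and before any JSON parsing, so that parsing time does not leak into local_ts.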

3. The clocks have to agree

Binance's exposed timestamps are in UTC, but how accurate their clocks actually are to true UTC is another black box, same as their datacenter location. We'll assume they run something reasonable (PTP against PHC-enabled hardware is typical in AWS and at major exchanges). That leaves the question on our side: how accurate is our local clock to true UTC, and how stable is it over the run?

AWS Time Sync exposes a Stratum 1 NTP source at the link-local address 169.254.169.123, which chronyd picks up by default on Amazon Linux. After the benchmark, chronyc tracking on one of the instances (trimmed to the relevant fields):

sh-5.2$ chronyc tracking
Reference ID    : A9FEA97B (169.254.169.123)
Stratum         : 2
Last offset     : +0.000000041 seconds
RMS offset      : 0.000000411 seconds
Root delay      : 0.000107722 seconds
Root dispersion : 0.000029770 seconds

Two things to read from this. The Reference ID confirms the clock is locked to AWS Time Sync, and the RMS offset of 411 ns is how steadily chrony holds the local clock against that reference. That's stability, not absolute accuracy against true UTC. For absolute accuracy, the conventional NTP bound is:

root_delay/2 + root_dispersion + |offset|

From the values above, that works out to roughly 84 µs. So our single absolute readings of local_ts - event_ts carry around 84 µs of clock uncertainty against true UTC.
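The arithmetic behind that figure, as a small worked example. It plugs in the chronyc values shown above and assumes "offset" means the Last offset field; using the RMS offset instead moves the result by well under 1 µs.

```python
def ntp_error_bound_s(root_delay: float, root_dispersion: float, offset: float) -> float:
    """Conventional NTP bound: root_delay/2 + root_dispersion + |offset|."""
    return root_delay / 2 + root_dispersion + abs(offset)


bound = ntp_error_bound_s(
    root_delay=0.000107722,       # Root delay
    root_dispersion=0.000029770,  # Root dispersion
    offset=0.000000041,           # Last offset (assumed to be the offset term)
)
print(f"{bound * 1e6:.1f} µs")  # 83.7 µs
```

Half the root delay accounts for about 54 µs of the total and the dispersion for about 30 µs, so the offset term itself barely matters here.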

This was measured on one instance, but all three run the same hardware and sync to the same AWS Time Sync source. So when chrony corrects one instance's clock, it's reacting to the same reference the other two are reacting to. Their offsets to UTC drift around, but they drift together, so the gap between them stays small. That's the part that matters: each absolute reading is fuzzy by around 84 µs, but when we compare AZ to AZ that fuzziness mostly cancels. The ranking we get from this is reliable. The absolute µs numbers aren't.

Figure: Benchmark pipeline (stream_trades, ap-northeast-1, Tokyo). Pulumi provisions three c8g.xlarge instances (Graviton4, 4 vCPU, 8 GiB), one per AZ, in VPC 10.0.0.0/16: ap-northeast-1a (subnet 10.0.1.0/24), ap-northeast-1c (subnet 10.0.2.0/24), and ap-northeast-1d (subnet 10.0.3.0/24). An SSM Document (v0.3), triggered via ssm:send-command, runs the same steps on each instance: bootstrap, aws s3 sync scripts, write metadata.json, stream_trades → trades.csv, upload {iid}/ → s3. Results land in s3://kubetrader-benchmark-results-tokyo (one {iid}/ folder per instance, holding metadata.json and trades.csv) and are pulled locally to analyze.

Results

The benchmark instances ran a standard Amazon Linux 2023 image with no kernel tuning, no NIC optimization, and intentionally basic code. We collected the data from all three instances, aligned trades by trade ID so each trade appears in all three zones, then grouped per (AZ, symbol). We focused on the minimum per-symbol latency, the sample where kernel and app overhead were closest to zero, leaving a value dominated by the network and exchange components.

We're subtracting E (event-emit timestamp), not T (matching-engine trade time), so local_ts - E is close to a pure transport-plus-stack measurement, with no significant Binance-internal processing bundled in. Worth noting: E differs slightly per instance even for the same trade ID, since Binance fans out the frame individually to each subscriber.
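The align-then-minimize step described above can be sketched in plain Python. This is not the analysis code used for the post: the row layout, the zone labels, and the latency values are all invented for illustration, and trades are keyed on (symbol, trade ID) on the assumption that trade IDs are only unique within a symbol.

```python
from collections import defaultdict


def min_latency_per_az_symbol(rows: list[tuple[str, str, int, int]]) -> dict:
    """rows: (az_id, symbol, trade_id, latency_us) tuples from all instances.

    Keeps only trades observed in every AZ, then returns the minimum
    latency per (az_id, symbol).
    """
    seen = defaultdict(set)  # (symbol, trade_id) -> AZs that recorded it
    all_azs = set()
    for az, symbol, trade_id, _ in rows:
        seen[(symbol, trade_id)].add(az)
        all_azs.add(az)

    best = {}
    for az, symbol, trade_id, latency in rows:
        if seen[(symbol, trade_id)] != all_azs:
            continue  # trade missing from at least one AZ: drop it everywhere
        key = (az, symbol)
        if key not in best or latency < best[key]:
            best[key] = latency
    return best


# Toy rows with placeholder zone labels and invented latencies.
rows = [
    ("az-a", "BTCUSDT", 1, 900), ("az-b", "BTCUSDT", 1, 1500), ("az-c", "BTCUSDT", 1, 2100),
    ("az-a", "BTCUSDT", 2, 400), ("az-b", "BTCUSDT", 2, 1100), ("az-c", "BTCUSDT", 2, 1800),
    ("az-a", "BTCUSDT", 3, 100), ("az-b", "BTCUSDT", 3, 200),  # never seen in az-c
]
result = min_latency_per_az_symbol(rows)
print(result)
```

Trade 3 is the fastest sample in two zones but is discarded because the third zone never recorded it, which keeps the per-zone minimums comparable on the same set of trades.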

Figure: Latency benchmark, Binance → multi-AZ ingress. Box plots of min and mean latency (µs) for apne1-az4, apne1-az1, and apne1-az2; 316 symbols per AZ, one point per symbol. Chart annotation: fastest baseline, 375 µs.
Latency = local_ts − event_ts (µs).  Min approximates true wire latency; mean includes kernel + userspace overhead.
White line = median.  Box = IQR.  Whiskers = 1.5×IQR.  Dots = individual symbol outliers.  Dashed line = best AZ min median reference.

The result is clear: apne1-az4 wins by a large margin, the next AZ sits about 650 µs higher, and the third higher still. For Binance spot trade on this endpoint in ap-northeast-1, apne1-az4 is the right home for any latency-sensitive workload on this account.

Reading the absolute number.

The fastest single-symbol reading we saw was around 275 µs. Binance exposes the WebSocket over public endpoints, so this isn't AWS-internal traffic; it's two Tokyo-region hosts talking over the public internet (out our IGW, across whatever path the internet picks, into Binance's edge). The 650 µs gap between apne1-az4 and the next AZ is the extra time the public path takes when starting from a different AWS AZ. Whether that comes from a longer internal path to the IGW, a different egress edge, or different routing on the public side, we can't see. What we can say is that apne1-az4 has the shortest public-internet path to Binance from this account, by a margin that's not noise.

Limitations worth naming.

This is a single collection window. The ranking is robust because the gap is large, but exchange-path latency can move with time of day and routing, so production observability is what confirms it holds over time. Other Binance endpoints could rank differently. We also can't test apne1-az3 because it's not exposed in this account, so whether it would be closer still is genuinely unknown.

Verdict: single-AZ co-location beats the multi-AZ default for hot workloads. The hypothesis holds.

Beyond exchange latency

v1 focused on minimizing the network distance to the exchange. v2 shifts some attention inward, to the latency between our own services, and two AWS features we'll explore for that path:

  • Cluster placement groups: instances placed on the same low-latency, high-bandwidth network segment within the AZ, which can shave latency on intra-cluster hot paths. Trade-off: more concentrated capacity (occasional InsufficientCapacity on scale-up) and correlated failure inside that segment.
  • ENA Express: improves tail latency for high-throughput intra-cluster flows specifically. Trade-off: only available on supported (newer Nitro) instance types, so it factors into instance selection and cost.

2. Deployment Topology & Subnet Architecture

The previous section established that single-AZ is the right model for hot workloads. That doesn't mean the whole network is single-AZ. AWS forces multi-AZ for some services (the Application Load Balancer we use to expose dashboards needs at least two AZs, no way around it), and the non-latency-sensitive workloads could legitimately benefit from cross-zone redundancy.

From that, two workload classes emerge:

Class              Examples                                           Latency-sensitive  External egress
Business-critical  Kubestream, trading engine, ticker plant           Yes                Exchange feeds and APIs
Supporting         Data platform, research env, CI/CD, observability  No                 Image pulls, package updates, dashboards

The business-critical side was covered in section 1. For the supporting workloads, multi-AZ would be the textbook answer. Honestly, I kept them in single-AZ for v1 to avoid the cross-zone transfer cost, and v2 still does the same until the returns justify the extra spend. The switch is straightforward when it's time.

Security stance

The first stated goal was a secure network, and the topology has to deliver on that before anything else sits on top. The stance is straightforward: minimize the attack surface, assume everything is hostile until proven otherwise, and make security boring by making it the default.

The rules:

  • Subnets default to private. No public IP allocation on instance launch unless explicitly required by the workload.
  • No SSH keys, no SSH ports. Shell access is via SSM Session Manager only, IAM-audited, no port 22 anywhere.
  • Default-deny on ingress. Nothing accepts inbound traffic until a security group rule explicitly opens it.
  • No manual security group changes. All network configuration is in Pulumi, versioned in git, deployed via CI.

This section is the network-layer security boundary. IAM, network policies, workload security, and secret handling sit above this layer and are covered elsewhere. Within the network layer, NACLs are available if a coarser default-deny is ever needed, but the security-group posture above has been sufficient so far.
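One way to keep the "no port 22 anywhere" rule honest is a periodic audit of the deployed security groups. The sketch below is an assumption about how such a check could look, not part of the KubeTrader codebase. It works on the SecurityGroups list returned by EC2 DescribeSecurityGroups, and the group IDs in the sample are placeholders.

```python
def groups_allowing_ssh(security_groups: list[dict]) -> list[str]:
    """Return IDs of groups with an ingress rule that covers TCP port 22."""
    offenders = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            proto = perm.get("IpProtocol")
            lo, hi = perm.get("FromPort"), perm.get("ToPort")
            # "-1" means all protocols and ports; otherwise check the TCP range.
            in_range = lo is not None and hi is not None and lo <= 22 <= hi
            if proto == "-1" or (proto == "tcp" and in_range):
                offenders.append(group["GroupId"])
                break  # one finding per group is enough
    return offenders


# Placeholder groups in the DescribeSecurityGroups response shape.
groups = [
    {"GroupId": "sg-a", "IpPermissions": []},
    {"GroupId": "sg-b", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]}]},
    {"GroupId": "sg-c", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-d", "IpPermissions": [{"IpProtocol": "-1", "IpRanges": []}]},
]
offenders = groups_allowing_ssh(groups)
print(offenders)  # ['sg-b', 'sg-d']
```

With all changes flowing through Pulumi and CI, a check like this should never fire; its value is in catching drift from a manual console edit.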

Connectivity concerns

Three connectivity choices shaped the topology, each driven by a different mix of cost and latency.

NAT gateway. Kubestream ingested roughly 1 TB/day of market data in v1, headed for ~5 TB/day in v2. If that traffic flowed through a NAT gateway to preserve the "private subnet" posture, the data-processing charges alone would run around $22k/year, close to the entire v1 cloud spend, and that's just Binance. Extending to other exchanges would multiply it. NAT also adds a hop. In-region NAT lands in the hundreds of microseconds to a couple ms, not free for a market-data path counting microseconds.
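The yearly figure follows directly from the per-GB rate. A quick back-of-the-envelope check, using the $0.062/GB Tokyo data-processing rate from this post's pricing figure and assuming decimal terabytes (1 TB = 1000 GB):

```python
NAT_PROCESSING_USD_PER_GB = 0.062  # ap-northeast-1 rate quoted in this post
GB_PER_TB = 1000                   # assumption: decimal terabytes


def nat_processing_cost_per_year(tb_per_day: float) -> float:
    """Yearly NAT data-processing charge, excluding the fixed hourly fee."""
    return tb_per_day * GB_PER_TB * NAT_PROCESSING_USD_PER_GB * 365


for tb in (1, 5):
    print(f"{tb} TB/day -> ${nat_processing_cost_per_year(tb):,.0f}/year")
```

At the v1 volume this gives about $22.6k per year, and at the v2 target of 5 TB/day about $113k per year, before adding a second exchange.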

Figure: AWS VPC network pricing, ap-northeast-1 (Tokyo), rev 2026.04. Interactive calculator showing daily, monthly, and yearly cost at 10 GB, 100 GB, 1 TB, and 10 TB per day for NAT Gateway data processing ($0.062 / GB) and inter-AZ transfer ($0.01 / GB).

Notes: NAT Gateway also carries a fixed hourly charge (~$0.062/hr ≈ $45/mo per gateway) not shown above. Cross-AZ traffic is billed on both sides; effective per-transfer rate is $0.02/GB. Monthly = daily × 30 · Yearly = daily × 365.

Last verified Apr 2026

So Kubestream egresses directly via IGW from a public subnet. The security rules above do the work that NAT's no-public-IP property would have done. It's not the same property. NAT also removes the host from being routable at L3, which is a real defense-in-depth layer we give up, but the trade is acceptable for a workload that only initiates outbound WebSocket connections to a known exchange.

Gateway endpoints. The data platform leans on S3 for tiered storage, plus Loki, Dagster logs, and experiment artifacts. Letting any of that egress over the public internet would mean extra latency variance and NAT data-processing for nothing. The fix is the S3 gateway endpoint: free, in-region, no NAT hop. We added a DynamoDB gateway endpoint at the same time since it's also free.

Interface endpoints. A Secrets Manager interface endpoint keeps secret access on the AWS backbone instead of hopping to a public API endpoint. The same pattern fits ECR, SSM Parameter Store, KMS, and others, but interface endpoints carry an hourly cost per AZ plus per-GB processing, so we kept it to Secrets Manager only for now. The goal was to establish the pattern, not pre-pay for every service.

Cases that didn't fit the default

Three workloads didn't fit the plain "private by default" rule.

Kubestream into a public subnet. Covered above: the NAT cost and the NAT hop together pushed it out of a private subnet, and the security stance is what keeps it defensible.

Trading engine into private + NAT with a pinned EIP. Binance bound HMAC API keys to a whitelisted set of Elastic IPs; the alternative was rotating keys every few weeks or so. I couldn't migrate to the new Ed25519 signing scheme at the time, so I needed a fixed set of egress IPs.

On one side, a fixed whitelisted EIP. On the other, Karpenter creating and destroying ephemeral nodes on demand. There's no clean way to bind a static EIP to an autoscaled node pool. Two paths:

  • An EventBridge + Lambda setup that listens for node-launch events and re-associates EIPs as nodes appear and disappear.
  • Pin the EIPs to a NAT gateway and route the trading engine's egress through it.
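For a sense of what the first path involves, here is a rough sketch of such a Lambda handler. It is purely illustrative and was not built for KubeTrader: the Purpose=exchange-whitelist tag is a hypothetical convention, the optional ec2 parameter exists only so the function can be exercised with a stub, and filtering to the trading node pool, retries, and concurrent launches are all left out.

```python
def handler(event, context, ec2=None):
    """Attach a free whitelisted EIP when an instance enters "running".

    Expects an EC2 Instance State-change Notification event from EventBridge.
    Returns the allocation ID used, or None if nothing was associated.
    """
    detail = event.get("detail", {})
    if detail.get("state") != "running":
        return None
    if ec2 is None:
        import boto3  # available in the Lambda Python runtime

        ec2 = boto3.client("ec2")

    # Hypothetical tag marking the EIPs whitelisted at the exchange.
    addresses = ec2.describe_addresses(
        Filters=[{"Name": "tag:Purpose", "Values": ["exchange-whitelist"]}]
    )["Addresses"]
    free = [a for a in addresses if "AssociationId" not in a]
    if not free:
        return None  # pool exhausted: this node would egress from an unlisted IP

    ec2.associate_address(
        AllocationId=free[0]["AllocationId"],
        InstanceId=detail["instance-id"],
    )
    return free[0]["AllocationId"]
```

Even in this reduced form the weak spots show: the pool can run out during a scale-up, and two nodes launching at once can race for the same address.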

I picked the NAT, ate the latency, and called it a temporary measure. The Ed25519 migration in v2 removes the constraint, at which point the trading engine moves to a path more consistent with the latency budget.

The same pattern (single-AZ co-location + EIP-pinned NAT for whitelisted exchange APIs) generalizes to any exchange integration, CEX or DEX, when the exchange demands a static egress IP.

Aurora into its own isolated subnet. Aurora sits in a small DB-only /24 with no NAT, no IGW, and an ingress rule that only allows the workload security group.

The topology

With those constraints and exceptions in place, here's what the network looks like:

Figure: kubetrader network topology, VPC 10.0.0.0/16 (vpc-kubetrader), ap-northeast-1 (Tokyo). Ingress from the public internet: market data feeds (Binance, Kraken, others) and ALB client traffic. Egress to the public internet: Binance order endpoint via the whitelisted EIP.

Primary AZ ap-northeast-1a:

  • Public subnet 10.0.0.0/24 (kubetrader-public-1a): 256 IPs, maps public IP on launch, holds the only edge resources in the VPC. Contains the Internet Gateway (igw-kubetrader), the NAT gateway (nat-kubetrader-1a, static EIP on the Binance whitelist, egress for private subnets), the internet-facing ALB (alb-kubetrader, fronting KubeTrader dashboards with pod-IP targets in the private workload subnet), and the Kubestream workload (ng-kubestream-public, Karpenter NodeClass selects the public subnet, direct IGW for low-latency exchange ingest). Route table rtb-public-1a: 10.0.0.0/16 → local; 0.0.0.0/0 → igw-kubetrader; S3 prefix list → vpce-s3; DynamoDB prefix list → vpce-dynamodb.
  • Private subnet 10.0.10.0/24 (kubetrader-private-data-1a): 256 IPs, fully isolated (no NAT, no IGW), DB-only ingress from the workload SG. Contains Aurora Serverless v2 (PostgreSQL, aurora-kubetrader, no internet egress). Route table rtb-data-1a: 10.0.0.0/16 → local.
  • Private subnet 10.0.16.0/20 (kubetrader-private-workload-1a): 4096 IPs, EKS node pool capacity, egress via NAT for exchange APIs. Contains the KubeTrader workload (ng-kubetrader-workload, Karpenter NodeClass selects the private subnet). Route table rtb-workload-1a: 10.0.0.0/16 → local; 0.0.0.0/0 → nat-kubetrader-1a; S3 prefix list → vpce-s3; DynamoDB prefix list → vpce-dynamodb; 10.100.0.0/16 → pcx-shared-services (future).

Standby AZ ap-northeast-1c (HA): public 10.0.1.0/24 (ALB registration target, no NAT) and private 10.0.11.0/24 (Aurora reader, failover target). The same pattern is reserved in ap-northeast-1d for future expansion, kept symmetric for capacity headroom.

VPC endpoints (private routing over the AWS backbone, no internet hop): gateway endpoints for S3 and DynamoDB on the public and workload route tables, free; one interface endpoint for secretsmanager, with its ENI in a private subnet.

Security groups: sg-kubetrader-workload (attached to EKS nodes and KubeTrader pods) has an egress-only profile. Ingress: no inbound rules, deny all. Egress: 0.0.0.0/0, all protocols, allow. SSH port 22 is closed; the SSM Session Manager agent initiates outbound, so no inbound port is needed.

A single primary AZ (apne1-az4) holds everything that does real work: a public subnet for Kubestream and edge resources (IGW, NAT, ALB), a private workload subnet for everything else egressing through the NAT, and an isolated data subnet for Aurora. The two other AZs hold the minimum AWS requires: ALB target registrations and an Aurora reader for DB-tier failover. No workload pods schedule outside the primary AZ.
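A CIDR plan like this is easy to sanity-check with Python's standard ipaddress module. The sketch below uses the subnet ranges from the topology figure; the short subnet labels are abbreviations chosen for the example. The subtraction of five addresses reflects the fact that AWS reserves the first four addresses and the last address of every subnet.

```python
import ipaddress
from itertools import combinations

VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public-1a": "10.0.0.0/24",
    "private-data-1a": "10.0.10.0/24",
    "private-workload-1a": "10.0.16.0/20",
    "public-1c": "10.0.1.0/24",
    "private-1c": "10.0.11.0/24",
}
AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def check_plan(vpc, subnets):
    """Verify containment and non-overlap; return usable IPs per subnet."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside {vpc}")
    for (a, net_a), (b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{a} overlaps {b}")
    return {
        name: net.num_addresses - AWS_RESERVED_PER_SUBNET
        for name, net in nets.items()
    }


usable = check_plan(VPC, SUBNETS)
print(usable["private-workload-1a"])  # 4091
```

The /20 workload subnet spans 10.0.16.0 through 10.0.31.255, which leaves the lower /24 blocks free for the public and data subnets in each AZ.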

Building the component

This whole layout is a custom Pulumi component. When I started, I leaned into Pulumi Crosswalk for AWS (awsx), which gives you a VPC with sensible defaults in a few lines and is the right tool for a project that follows AWS best practices. The latency and cost choices in this post pull the opposite way (single-AZ for hot workloads, NAT pinned to a specific AZ with a fixed EIP, public-subnet placement for the ingest path), and awsx doesn't give enough flexibility for that. You can't pick the AZ for a single NAT, route tables are auto-managed so custom routes have to be tacked on outside the component, and the subnet layout is one of a handful of presets.

So I wrote my own on top of the Pulumi primitives directly. One upside of owning the component is that extending it is straightforward. It already carries hooks for VPC peering and PrivateLink: there's no immediate use case in v1 or v2, but they're cheap to scaffold now and expensive to retrofit later. Future cases worth being ready for: another exchange integration that lives in a different VPC, and BYOC deployments for managed services like ClickHouse Cloud.

Takeaways

The cloud's defaults are a starting point, not a rule. The Well-Architected Framework says single-AZ is an anti-pattern, and for most workloads it's right, but my constraints weren't the ones it was written for. So I measured instead of assuming, and the benchmark made the call.

Everything after that falls out of one measurement: the subnet layout, the private-by-default security stance, the NAT and endpoint choices, the custom Pulumi component. None of it is clever on its own. It's just what's left when you design around what the cloud already hands you (zonal placement, gateway endpoints, primitives you can build on) instead of fighting the defaults to keep an HA checkbox ticked.