DNS round-robin is not load balancing

Last year a client asked me to review their "high-availability setup." Turned out the entire thing was two A records in a DNS zone. Two servers, no load balancer, no health checks — just the authoritative nameserver handing out IPs in alternating order. "We have load balancing now," they said. Except they did not. What they had was an outage waiting to surface as soon as one of those backends hiccuped.

What DNS round-robin actually is

You take a domain — say app.example.com — and instead of one A record pointing to one IP address, you add two or three:

; multiple A records for the same domain
app.example.com.  300  IN  A  10.0.1.10
app.example.com.  300  IN  A  10.0.1.11
app.example.com.  300  IN  A  10.0.1.12

When a client resolves the domain, the DNS server returns all three IPs, rotating their order on each query. Most clients simply use the first address in the list, and because the order rotates, traffic is theoretically distributed across all three backends.
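The rotation itself is trivial to model. A minimal sketch of what the authoritative server does on each query (a toy model, not a real DNS implementation):

```python
from collections import deque

class RoundRobinZone:
    """Toy model of an authoritative server rotating A records."""

    def __init__(self, records):
        self._records = deque(records)

    def resolve(self):
        # Return all records, then rotate so the next query
        # sees a different first entry.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer

zone = RoundRobinZone(["10.0.1.10", "10.0.1.11", "10.0.1.12"])
print(zone.resolve())  # ['10.0.1.10', '10.0.1.11', '10.0.1.12']
print(zone.resolve())  # ['10.0.1.11', '10.0.1.12', '10.0.1.10']
```

Everything that follows in this post is about what happens between that rotation and the actual traffic hitting your backends.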

In a lab, this works fine. In production, it falls apart in ways that are difficult to diagnose and expensive to recover from.

The problems nobody mentions until it is too late

No health checking

This is the most critical failure. DNS has absolutely no concept of whether a backend is alive. If 10.0.1.11 goes down — process crash, kernel panic, full disk, network partition — DNS will keep handing out that IP to clients. There is no mechanism to detect the failure and stop advertising the dead backend.

In my experience, this is how most DNS round-robin incidents play out: a backend dies at 2 AM, roughly a third of all users start getting connection timeouts, and the on-call engineer spends twenty minutes figuring out why "some users" are affected while others are fine. The fix is to manually edit the DNS zone and wait for propagation. That is not a resilient architecture. That is hope-driven operations.
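The arithmetic of that incident is easy to reproduce. A toy simulation, assuming clients always connect to the first IP they are handed and never retry:

```python
import itertools

backends = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
dead = {"10.0.1.11"}  # crashed at 2 AM; DNS has no idea

# DNS keeps rotating through all three records, dead or not.
rotation = itertools.cycle(backends)

results = [next(rotation) not in dead for _ in range(9000)]
failure_rate = results.count(False) / len(results)
print(f"failure rate: {failure_rate:.0%}")  # failure rate: 33%
```

One dead backend out of three means a steady third of all connection attempts fail, indefinitely, until a human edits the zone.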

TTL caching destroys distribution

Even when all backends are healthy, DNS round-robin does not distribute traffic evenly. The reason is caching at every layer of the resolution chain.

You set a TTL of 300 seconds on your A records. Your authoritative DNS server dutifully rotates the order. But between your server and the end user sit recursive resolvers — the ones operated by ISPs, corporate networks, Google (8.8.8.8), Cloudflare (1.1.1.1). Each of these caches the response for the full TTL. Every client behind that resolver gets the same cached answer, hitting the same backend, for the entire TTL window.

A large corporate network with 5,000 employees behind a single resolver? All of them go to one backend for five minutes straight. Then they all rotate together to the next one. This is not load balancing — it is load batching.
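The batching effect is easy to model. A sketch of a single recursive resolver caching the authoritative answer for the full TTL (simplified: one cached record instead of a full answer set):

```python
# Toy model: one recursive resolver, thousands of clients behind it.
TTL = 300
backends = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]

cache = {}           # answer cached by the resolver
cache_expiry = 0.0
rotation_index = 0

def resolve(now: float) -> str:
    """What every client behind this resolver sees at time `now`."""
    global cache_expiry, rotation_index
    if now >= cache_expiry:
        # Cache miss: query the authoritative server, which rotates.
        cache["answer"] = backends[rotation_index % len(backends)]
        rotation_index += 1
        cache_expiry = now + TTL
    return cache["answer"]

# Every client resolving within the same TTL window...
answers = {resolve(t) for t in range(0, TTL)}
print(answers)  # ...gets the same single backend: {'10.0.1.10'}
```

The authoritative server rotated exactly once in those five minutes; every client behind the resolver moved as one block.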

And it gets worse. Some resolvers and client libraries ignore TTL entirely. Java's default DNS cache in older JVMs was infinite. Some embedded devices and IoT clients never re-resolve. I have debugged situations where a single client was pinned to a decommissioned IP for weeks because the application cached the DNS result at startup and never looked again.

No session affinity

If your application has any concept of user sessions — login state, shopping carts, multi-step workflows — DNS round-robin will break it. A user might hit backend A for the login request and backend B for the next page load. Unless you have externalized all session state to a shared store, the user gets logged out or loses their data mid-flow.

A proper load balancer can implement sticky sessions through cookies or consistent hashing. DNS has no such capability.
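For contrast, here is the core of consistent-hash affinity, the mechanism load balancers like HAProxy use for cookie-free stickiness. This is a simplified sketch (real implementations weight virtual nodes and handle health state); the point is that it is routing logic, and DNS has nowhere to put routing logic:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Stable 64-bit hash of a string key.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, backends, replicas=100):
        self._ring = sorted(
            (_h(f"{b}#{i}"), b) for b in backends for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def backend_for(self, client_key: str) -> str:
        # First virtual node clockwise from the client's hash.
        idx = bisect.bisect(self._keys, _h(client_key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["app1", "app2", "app3"])
# The same user always lands on the same backend:
assert ring.backend_for("user-42") == ring.backend_for("user-42")
```

Removing a backend from the ring only remaps the users who were on that backend; everyone else keeps their session.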

No connection draining

You need to deploy a new version of your application. With a load balancer, you pull a backend out of rotation, let existing connections finish gracefully, deploy the update, health-check the new version, and add it back. Zero downtime.

With DNS round-robin, you remove an A record and pray. Clients that already resolved the old IP will keep connecting to it for the remainder of the TTL — or longer, if they cache aggressively. There is no way to drain connections gracefully. Your options are to either keep the old version running alongside the new one for an extended period, or accept that some users will hit errors during the transition.

Uneven distribution by design

Even ignoring caching, the distribution is fundamentally uneven. DNS round-robin distributes resolution requests, not actual traffic. A single resolved IP might serve one user making two requests or a power user making two thousand. A proper load balancer can route based on active connections, response times, or server capacity. DNS knows none of this.
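The resolution-versus-traffic distinction shows up immediately in numbers. A toy example, assuming three users who each resolve exactly once but generate very different request volumes:

```python
from collections import Counter
from itertools import cycle

backends = cycle(["app1", "app2", "app3"])

# Each user resolves once (perfectly even resolutions), then sends
# every request to whichever backend that one resolution returned.
users = {"alice": 2, "bob": 3, "carol": 2000}  # requests per user
load = Counter()
for user, requests in users.items():
    backend = next(backends)   # one DNS resolution per user
    load[backend] += requests

print(load)  # Counter({'app3': 2000, 'app2': 3, 'app1': 2})
```

DNS saw a perfect one-resolution-per-backend split; the actual traffic is three orders of magnitude apart.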

What to use instead

Put a real load balancer in front of your backends. The specific technology matters less than having one at all.

HAProxy works well for self-hosted environments. It supports active health checks, multiple balancing algorithms (round-robin, least-connections, consistent hashing), connection draining, and graceful backend removal. It operates at layer 4 or layer 7, giving you flexibility in routing decisions. Nginx is another popular option, though its open-source version only supports passive health checks — it detects failures by observing real traffic errors, not by probing backends independently. Active health checks are available in Nginx Plus (the commercial version) and in Envoy Proxy, which offers them out of the box alongside advanced load balancing features.

A minimal HAProxy configuration that solves every problem listed above:

defaults
    # http mode plus basic timeouts; haproxy warns without them
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend app
    bind *:443 ssl crt /etc/ssl/app.pem
    default_backend servers

backend servers
    # least-connections balancing with active health checks
    balance leastconn
    option httpchk GET /healthz

    server app1 10.0.1.10:8080 check inter 5s fall 3 rise 2
    server app2 10.0.1.11:8080 check inter 5s fall 3 rise 2
    server app3 10.0.1.12:8080 check inter 5s fall 3 rise 2

Nothing exotic here.

This gives you health checks every five seconds, automatic removal of failed backends after three consecutive failures, least-connections balancing, and a single stable IP for your DNS record. When you need to deploy, you can drain individual backends with a single socket command.
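That socket command is HAProxy's Runtime API `set server ... state drain`, sent over the admin socket (this assumes a `stats socket /var/run/haproxy.sock level admin` line in the global section; the socket path is whatever you configured). A small Python helper sketching the interaction:

```python
import socket

def drain_command(backend: str, server: str) -> str:
    """Build the Runtime API command that stops new connections to a
    server while letting established connections finish."""
    return f"set server {backend}/{server} state drain"

def send_command(sock_path: str, command: str) -> str:
    """Send one command over HAProxy's admin socket, return the reply.
    Assumes 'stats socket <path> level admin' in the global section."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall((command + "\n").encode())
        return s.recv(65536).decode()

# e.g. to drain app1 from the config above:
#   send_command("/var/run/haproxy.sock", drain_command("servers", "app1"))
# and "state ready" puts it back into rotation after the deploy.
```

The same thing works one-off from a shell with socat; the point is that draining is a first-class, scriptable operation rather than a zone edit and a prayer.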

Cloud load balancers (AWS ALB/NLB, GCP Load Balancer, Azure Load Balancer) provide the same functionality as a managed service. They add cross-zone balancing, integration with auto-scaling groups, and TLS termination without managing certificates on every backend. If you are running in a cloud environment, there is little reason not to use one.

The key capabilities you gain with any proper load balancer:

  • Health checks: dead backends are removed from rotation within seconds, not minutes or hours.
  • Connection-aware routing: traffic goes to the backend that can actually handle it, not just the next one in a list.
  • Graceful draining: during deploys, existing connections finish before the backend is shut down.
  • Session affinity: when needed, users stay on the same backend for the duration of their session.
  • Observability: you get metrics on request rates, error rates, and latency per backend — data that DNS will never give you.

When DNS round-robin is actually fine

I am not saying DNS round-robin has zero legitimate uses. There are narrow cases where it is acceptable.

Simple stateless services with client-side retry. If the client application is smart enough to try the next IP when one fails, and the service is fully stateless, DNS round-robin can work as a basic distribution mechanism. Many modern HTTP clients and gRPC libraries will do this automatically. But note: this pushes the reliability burden onto every client, and you need to trust that all clients implement retry correctly.
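What "smart enough" means concretely: the client must iterate over every address in the DNS answer instead of giving up on the first. A sketch, with a stubbed connect function standing in for a real socket attempt:

```python
class AllBackendsDown(Exception):
    pass

def connect_with_failover(addresses, try_connect):
    """Try each resolved address in order; return the first that works.
    `try_connect` stands in for a real socket/HTTP connection attempt."""
    last_error = None
    for addr in addresses:
        try:
            return try_connect(addr)
        except ConnectionError as err:
            last_error = err  # dead backend: fall through to the next IP
    raise AllBackendsDown(f"no backend reachable: {last_error}")

# Simulate the 2 AM scenario: the first IP in the answer is dead.
def fake_connect(addr):
    if addr == "10.0.1.11":
        raise ConnectionError(f"{addr}: connection refused")
    return f"connected to {addr}"

print(connect_with_failover(["10.0.1.11", "10.0.1.12"], fake_connect))
# connected to 10.0.1.12
```

Every client needs this loop, with sensible timeouts, for the scheme to hold up; one client library that only tries the first address reintroduces the outage.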

Coarse geographic distribution. DNS-based geographic routing — where resolvers in Europe get European IPs and resolvers in Asia get Asian IPs — is a legitimate use of DNS for traffic management. But this is typically done with GeoDNS or latency-based routing, not simple round-robin, and there is usually a proper load balancer at each geographic endpoint.

Internal service discovery in controlled environments. If you control both the clients and the servers, and you have health-checking built into the client layer (like many service mesh implementations), DNS can serve as a lightweight discovery mechanism. Kubernetes DNS works this way, but it is backed by an entire control plane that updates records in real time.

DNS round-robin is a tempting shortcut. It requires no additional infrastructure, no new software to manage, and it appears to work in testing. But it trades short-term simplicity for long-term fragility. The first time a backend dies and a third of your users go dark for the duration of a DNS TTL, the cost of that shortcut becomes painfully clear.

If your service needs to stay available — and if people are paying you for it, it does — invest the time to set up a proper load balancer. It is one of the highest-return infrastructure decisions you can make.