Migrating Route 53 records from Latency Based Routing to Geolocation routing with zero downtime


Amazon Route 53 is a pretty incredible managed Domain Name System (DNS) service. DNS translates domain names to IP addresses and allows computers to connect to each other. More than just a managed DNS service, Route 53 lets us create records using different routing policies. The routing policy determines how Route 53 responds to a DNS request.

Primer on Routing Policies

Route 53 supports the following routing policies.

  • Simple routing - Simple routing is a standard DNS record. This is what most other DNS services would do.
  • Failover routing - Failover routing policy lets us route traffic to an alternate resource when the primary resource is unhealthy.
  • Weighted routing - Weighted routing lets us associate multiple records with a single domain and attach a weight to each record that determines what proportion of the traffic is sent to it.
  • Geolocation routing (“GBR”) - Geolocation routing policy lets us route traffic to different resources based on the geographic location of where the request is originating from.
  • Latency-based routing (“LBR”) - Latency-based routing policy lets us route traffic to the resource that offers the lowest measured latency.

The routing policy is selected as we create a record in Route 53.

Latency-based routing

Latency-based routing (“LBR”) is a nifty way of improving the performance of your application. With LBR, AWS automatically and transparently routes traffic to the lowest-latency resource across the configured regions. AWS does this by determining which region provides the lowest latency and routing the traffic to that region. If you’re interested in the details behind how AWS does this, do read their documentation on How Amazon Route 53 uses EDNS0 to estimate the location of a user.
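For reference, a single LBR record set, as you would pass it to the Route 53 API, carries a Region and a SetIdentifier alongside the usual record fields. Here’s a minimal sketch with made-up names; I’m using a plain CNAME for brevity, whereas records in front of load balancers would more likely be alias A records:

```python
# One latency-based record set per region, all sharing the same record name.
# The Region field is what makes Route 53 treat this as an LBR record; the
# SetIdentifier only needs to be unique among records with the same name/type.
lbr_record_set = {
    "Name": "api.example.com.",        # hypothetical domain
    "Type": "CNAME",
    "TTL": 60,
    "SetIdentifier": "ue1",
    "Region": "us-east-1",
    "ResourceRecords": [{"Value": "lb-ue1.example.com."}],  # hypothetical target
}
```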

At work, a lot of the services that did not need multi-region routing were configured with simple routing. Services that could handle multi-region traffic but had backing services, and thus could not handle automatic failover, were configured with LBR without health checks. In the case where a failover was required, or a region needed to be disabled, we’d manually mark that domain as invalid, preventing traffic from being routed there.

For multi-region capable services that were stateless and did not have backing services, health checks were configured so that the services would automatically fail over when the health checks failed.

Latent issues

For some time now, we’d been getting sporadic reports about LBR not working as expected, with some requests being served from a region that was worse off (purely in terms of latency). To improve these latencies, one of the teams reached out to us to migrate their existing LBR records to GBR.

And that’s where I come in - this came up during my on-call shift. As part of the on-call process, the on-call engineer on shift also handles support requests that were escalated to us. For an LBR record, for each domain, we create multiple records, with each record pointing to a load balancer in a different region. For GBR, a record needs to be created mapping each country or continent to a load balancer in the desired target region. So assuming you want to route requests from every continent, you end up creating 8 records - one record per continent plus a default record. Since a record is created for each continent, you might wonder why we need a default record. The Route 53 documentation states that not all IPs carry geolocation info, and if Route 53 receives a DNS request from a location it cannot identify, it returns a “no answer” response. To prevent these errors, a default record is created that maps to our largest region.
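To make those 8 records concrete, here’s a sketch of what the full geolocation set for one domain could look like. The continent-to-load-balancer mapping, the domain, and the use of plain CNAME records are all illustrative assumptions, not our actual setup:

```python
# Route 53 continent codes, each mapped to a (hypothetical) regional load balancer.
CONTINENT_TARGETS = {
    "AF": "lb-eu-west-1.example.com.",   # Africa
    "AN": "lb-us-east-1.example.com.",   # Antarctica
    "AS": "lb-ap-south-1.example.com.",  # Asia
    "EU": "lb-eu-west-1.example.com.",   # Europe
    "NA": "lb-us-east-1.example.com.",   # North America
    "OC": "lb-ap-south-1.example.com.",  # Oceania
    "SA": "lb-us-east-1.example.com.",   # South America
}

def gbr_record_sets(domain, default_target="lb-us-east-1.example.com."):
    """Build the 8 geolocation record sets for a domain: one per continent
    plus a default record for IPs that Route 53 cannot geolocate."""
    record_sets = [
        {
            "Name": domain,
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": f"geo-{code}",
            "GeoLocation": {"ContinentCode": code},
            "ResourceRecords": [{"Value": target}],
        }
        for code, target in CONTINENT_TARGETS.items()
    ]
    # The default record uses CountryCode "*", which matches queries from
    # locations that Route 53 cannot identify.
    record_sets.append(
        {
            "Name": domain,
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": "geo-default",
            "GeoLocation": {"CountryCode": "*"},
            "ResourceRecords": [{"Value": default_target}],
        }
    )
    return record_sets

# Example: 8 record sets for one (hypothetical) domain.
record_sets = gbr_record_sets("api.example.com.")
```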

Now, if creating 8 records for a domain was not painful enough, I had to do this for 10 domains, for dev & stage (and later prod), which meant that if I were to do it manually via the Route 53 console, that’s about 160 records. If you’ve used the new Route 53 console, you’d know the pain - it gets incredibly slow, especially if you have a lot of records.

To support easy creation of Route 53 records, I had written a Python script called “Route 53 mapper”. Route 53 mapper reads a YAML file that contains a mapping of a domain to a target destination and makes a series of API calls using boto to create the records. It was invaluable when we moved from ELBs to NLBs (probably a story for another post). In the current context, I figured extending Route 53 mapper was the best way to create these entries.
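I can’t share the actual script, but the core idea looks roughly like the sketch below. The YAML layout, file name, hosted zone ID, and record details are all invented for illustration, and I’m using CNAME records with the latency policy to keep it short:

```python
# mappings.yaml (hypothetical layout):
#
# hosted_zone_id: Z0000000000000
# records:
#   - domain: api.example.com.
#     targets:
#       us-east-1: lb-ue1.example.com.
#       eu-west-1: lb-ew1.example.com.

import boto3
import yaml

def load_mappings(path="mappings.yaml"):
    """Read the domain-to-target mappings from the YAML config."""
    with open(path) as f:
        return yaml.safe_load(f)

def build_changes(record):
    """Turn one domain-to-targets mapping into a batch of UPSERT changes,
    one latency-based record per target region."""
    return [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record["domain"],
                "Type": "CNAME",
                "TTL": 60,
                "SetIdentifier": region,
                "Region": region,
                "ResourceRecords": [{"Value": target}],
            },
        }
        for region, target in record["targets"].items()
    ]

def apply(path="mappings.yaml"):
    config = load_mappings(path)
    route53 = boto3.client("route53")
    for record in config["records"]:
        route53.change_resource_record_sets(
            HostedZoneId=config["hosted_zone_id"],
            ChangeBatch={"Changes": build_changes(record)},
        )

if __name__ == "__main__":
    apply()
```

Extending it for geolocation then mostly comes down to emitting GeoLocation-based record sets (like the ones sketched above) instead of Region-based ones.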

Lights, Camera, Action

If there’s one philosophy I strongly agree with, it’s that before you can automate something, you need to understand it. This was nicely expressed by Kelsey Hightower on the Changelog podcast:

Automation is the serialization of understanding.

And since I didn’t yet understand how to create and update these records, I tried it out with a test domain first. For a single-record LBR, I could update the LBR record to GBR with no problems.

However, when I tried to update a multi-record LBR to GBR, I wasn’t able to do so - Route 53 threw an error:

(InvalidChangeBatch 400: RRSet with DNS name test.sathyabh.at., type A, SetIdentifier ue1, and routing to [continent = (NA, North America)] cannot be created because a non-GEO RRSet with the same name and type already exists).

It seems it is not possible to have the same domain pointing to both an LBR and a GBR record at the same time. I opened a support case with Amazon to verify this, and while waiting for a response, ran some more tests. I realized that the easiest way to migrate the records would be to delete all the LBR entries and create the corresponding GBR ones. However, this would mean downtime, which wasn’t an option. After a lot of tests, I concluded that the only zero-downtime approach is to delete all but one LBR record, update that LBR record to GBR, and then create the rest of the GBR entries. This would have the side effect of some possibly high-latency calls (since deleting the LBR entries meant all requests would be sent to a single region), inefficient routing, and possibly a spike in traffic for the window when the other LBR entries were not available.

The next day, AWS Support confirmed what I had concluded, with one correction to my original plan: I did not need to convert the single remaining LBR record to a simple routing record first and then to GBR - Route 53 supports updating a single LBR record directly to GBR.
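Put together, the zero-downtime sequence looks roughly like the sketch below. The zone ID, domain, regions, and record values are made up, and expressing the LBR-to-GBR update as a DELETE plus CREATE in a single change batch is my interpretation of how to do it via the API (Route 53 applies a change batch as a single atomic change); in the console this is simply an edit of the record’s routing policy:

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000"  # hypothetical
DOMAIN = "api.example.com."        # hypothetical

def lbr(set_id, region, target):
    """Shorthand for a latency-based record set (illustrative fields only)."""
    return {
        "Name": DOMAIN, "Type": "CNAME", "TTL": 60,
        "SetIdentifier": set_id, "Region": region,
        "ResourceRecords": [{"Value": target}],
    }

def gbr(set_id, continent, target):
    """Shorthand for a geolocation record set."""
    return {
        "Name": DOMAIN, "Type": "CNAME", "TTL": 60,
        "SetIdentifier": set_id, "GeoLocation": {"ContinentCode": continent},
        "ResourceRecords": [{"Value": target}],
    }

# Step 1: delete every LBR record except one. Traffic now goes to the
# surviving region only (this is the short high-latency window).
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "DELETE", "ResourceRecordSet": lbr("ew1", "eu-west-1", "lb-ew1.example.com.")},
    ]},
)

# Step 2: convert the surviving LBR record to GBR. Doing the DELETE and
# CREATE in one change batch keeps the domain resolvable throughout.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "DELETE", "ResourceRecordSet": lbr("ue1", "us-east-1", "lb-ue1.example.com.")},
        {"Action": "CREATE", "ResourceRecordSet": gbr("geo-NA", "NA", "lb-ue1.example.com.")},
    ]},
)

# Step 3: create the remaining continent records plus the default record,
# as in the gbr_record_sets() sketch earlier.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "CREATE", "ResourceRecordSet": gbr("geo-EU", "EU", "lb-ew1.example.com.")},
        # ... and so on for the other continents and the default record.
    ]},
)
```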

Armed with this knowledge, I updated my Route 53 mapper script and tested it out on some of my test domains, and it worked well. Later in the week, I tried this approach on the services’ dev & stage records, and it went through without any problems. I documented the steps and committed the changes to git.

About a month later, the same changes were required for prod, and one of the Customer Success Managers, who wasn’t aware that these steps were required, tried to create the records manually for production. Thankfully, he reached out to me before creating more records. We got on a video call and rolled out the migration, moving about 80 records in a little under an hour (including the time to monitor the changes and prepare the required YAML config), all with no reported errors or downtime.

Hope this helps someone in the future.
