How Egnyte Migrated Its DNS At Scale With No Service Disruptions

Introduction:

Egnyte manages billions of files and petabytes of data for millions of users. The system processes over a million API requests per minute, spanning metadata operations and analytical queries, so balancing throughput while maintaining service quality is paramount. This blog covers the challenges, strategy, and technical details of migrating Egnyte's Domain Name System (DNS) infrastructure at scale with minimal service disruption.

The Role of DNS at Egnyte:

DNS is a linchpin of service quality at Egnyte. Egnyte's DNS infrastructure handles approximately 300 million requests daily (on the order of 8 billion per month) and is continually expanding. The production DNS zone currently holds hundreds of thousands of primary records, and hundreds of new domains are onboarded weekly for trial and paid customers. Any interruption or added latency in DNS operations directly affects overall performance and hinders new customer onboarding.

Motivations for Migration

The decision to migrate the DNS infrastructure was driven by several key factors:

Outages: Recurring disruptions in the API services of our previous DNS provider made it persistently difficult to create and manage DNS records. As a result, we saw a surge in after-hours support inquiries from customers who could not access or manage their domains.

Cost Efficiency: The operational cost associated with the existing DNS provider was prohibitively high.

Compliance: The previous DNS provider did not meet FedRAMP compliance requirements.

Elevated Security and Monitoring: Egnyte aimed to enhance its monitoring and security aspects beyond the capabilities of the existing provider.

Selecting Google Cloud DNS

The choice to migrate to Google Cloud DNS was underpinned by several technical merits:

High Performance: The time it takes to resolve DNS queries can significantly impact the user experience. Google Cloud DNS offers exceptional performance, ensuring efficient and reliable DNS resolution.

Security Features: Google Cloud Platform (GCP) supports DNS Security Extensions (DNSSEC) and provides advanced security tools like Cloud Armor, enhancing DNS security.

DNS64: Google offers DNS64 as a global service through Google Public DNS64, which synthesizes IPv6 addresses using the reserved NAT64 prefix 64:ff9b::/96.

Compliance: Google Cloud DNS aligns with FedRAMP compliance, ensuring conformity with regulatory requirements.

Stability and Resilience: Google Cloud DNS is recognized for its stability and resilience, reducing the risk of service disruptions. This is exemplified by improved performance and availability for Egnyte's internal use cases.

Existing Partnership: An established partnership with Google streamlined the migration process.

Cost Efficiency: Google Cloud DNS offers cost advantages, optimizing expenses.
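
To make the DNS64 point above concrete, here is a minimal sketch of how an IPv4 address is embedded in the reserved NAT64 prefix 64:ff9b::/96. It uses only Python's standard ipaddress module. The function names are illustrative, and the sketch shows the address mapping only, not how any particular resolver implements DNS64.

```python
import ipaddress

# Reserved NAT64 prefix mentioned in the text.
NAT64_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(ipv4: str) -> ipaddress.IPv6Address:
    """Place the 32-bit IPv4 address in the low 32 bits of the /96 prefix."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address(int(NAT64_PREFIX.network_address) | int(v4))


def extract_ipv4(ipv6: str) -> ipaddress.IPv4Address:
    """Recover the embedded IPv4 address from a synthesized address."""
    addr = ipaddress.IPv6Address(ipv6)
    if addr not in NAT64_PREFIX:
        raise ValueError(f"{addr} is not inside {NAT64_PREFIX}")
    return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)


print(synthesize_aaaa("192.0.2.33"))        # 64:ff9b::c000:221
print(extract_ipv4("64:ff9b::c000:221"))    # 192.0.2.33
```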

Migration Challenges

Operating at scale presented unique challenges that needed to be addressed in the migration strategy:

Zero Downtime: Ensuring business operations remain unaffected during the migration was pivotal, guaranteeing a seamless transition for users.

Robust Rollback Procedure: A comprehensive rollback procedure was required as a contingency for unforeseen issues during the migration. Making rollback fast and repeatable meant automating it.


Technical Migration Strategy

The migration strategy was carefully devised to minimize business disruption and incorporate a robust rollback procedure:

Automation: Automation was a cornerstone of the migration strategy. It encompassed updates to the domain registration service to add records in both the existing provider and Cloud DNS, migration scripts, and validation and verification scripts.

Dual Registrations: Enhancements to the registration service allowed simultaneous writing to both existing and new DNS providers behind a feature flag. This approach facilitated thorough validation of the new provider's setup and ensured consistency with the current configuration. Automation pipelines maintained synchronization between the two providers, providing flexibility for rapid switching without user disruption.

Gradual Transition: The migration unfolded in phases, with new registrations initially handled by both providers.

Observation and Validation: A crucial phase involved weeks of observing the dual-provider setup while validation and verification scripts checked data accuracy.

Delegation to Cloud DNS: Once system stability was confirmed, DNS resolution was delegated to Cloud DNS.

Post-Migration: Additional observation and validation were carried out before disabling registrations with the old provider.
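
The dual-registration step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Egnyte's actual registration service: InMemoryProvider, RegistrationService, the dual_write_enabled flag, and the pending_resync list are all hypothetical names, and the in-memory dictionaries stand in for real provider API calls.

```python
class InMemoryProvider:
    """Hypothetical stand-in for a DNS provider API client."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def upsert(self, fqdn, rtype, value):
        self.records[(fqdn, rtype)] = value


class RegistrationService:
    """Writes to the existing provider and, behind a flag, to the new one."""

    def __init__(self, legacy, new, dual_write_enabled=False):
        self.legacy = legacy
        self.new = new
        self.dual_write_enabled = dual_write_enabled
        self.pending_resync = []  # records the new provider still needs

    def register(self, fqdn, rtype, value):
        # The existing provider stays the source of truth until delegation.
        self.legacy.upsert(fqdn, rtype, value)
        if not self.dual_write_enabled:
            return
        try:
            self.new.upsert(fqdn, rtype, value)
        except Exception:
            # Assumption: a failed second write is queued for a sync job
            # instead of failing the customer's registration.
            self.pending_resync.append((fqdn, rtype, value))


legacy = InMemoryProvider("existing")
cloud = InMemoryProvider("cloud-dns")
service = RegistrationService(legacy, cloud, dual_write_enabled=True)
service.register("acme.example.com", "CNAME", "lb.example.com.")
print(legacy.records == cloud.records)  # True
```

Turning the flag off returns the service to single-provider writes, which is what makes this approach convenient as a rollback path.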

Technical Aspects of Migration Rollout:

The structured rollout plan consisted of the following technical components:

Pre-rollout / Readiness:

  1. Creation of DNS Zones in Google Cloud DNS.
  2. Updates to the registration service to accommodate both providers.
  3. Loading existing DNS records into the Cloud DNS zone.
  4. Rigorous validation and verification of records.
  5. Automation scripts for verifying DNS resolution and synchronization between providers.
  6. Reduction of TTL to 5 minutes for all DNS records.
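
The validation and synchronization checks in the readiness steps above could look something like the following sketch. It assumes each provider's zone has already been exported as (name, type, value) tuples. The export step, the function names, and the sample records are hypothetical.

```python
from collections import defaultdict


def normalize(records):
    """Group values by (name, type), ignoring case and a trailing dot in names."""
    zone = defaultdict(set)
    for name, rtype, value in records:
        key = (name.rstrip(".").lower(), rtype.upper())
        zone[key].add(value)
    return dict(zone)


def diff_zones(old_records, new_records):
    """Report record sets that differ between the old and new provider."""
    old, new = normalize(old_records), normalize(new_records)
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "extra_in_new": sorted(new.keys() - old.keys()),
        "mismatched": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Sample data for illustration only.
old_zone = [
    ("Acme.example.com.", "cname", "lb.example.com."),
    ("beta.example.com", "A", "192.0.2.10"),
]
new_zone = [
    ("acme.example.com", "CNAME", "lb.example.com."),
    ("beta.example.com", "A", "192.0.2.11"),
]
print(diff_zones(old_zone, new_zone))
```

An empty report in all three categories is the condition a rollout gate would check before delegation.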

Rollout:

  1. Validation scripts confirm that DNS records and responses from both providers match.
  2. Delegation of assigned name servers at the domain registrar to Google Cloud DNS.
  3. A 72-hour waiting period before removing domain records from the old provider to prevent caching issues.
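
The waiting periods in this plan reduce to simple date arithmetic, sketched below. The 24-hour original TTL and the 72-hour wait come from this post. The dates are made up, and the sketch assumes resolvers honor TTLs, so a record cached just before the TTL was lowered can survive for up to the old TTL.

```python
from datetime import datetime, timedelta

OLD_TTL = timedelta(hours=24)        # TTL before it was lowered to 5 minutes
NS_CACHE_WAIT = timedelta(hours=72)  # wait before cleaning up the old provider


def earliest_cutover(ttl_lowered_at: datetime) -> datetime:
    """Records cached under the old TTL have expired by this time."""
    return ttl_lowered_at + OLD_TTL


def earliest_cleanup(delegated_at: datetime) -> datetime:
    """Old-provider records should stay in place until this time."""
    return delegated_at + NS_CACHE_WAIT


ttl_lowered = datetime(2024, 1, 1, 0, 0)  # hypothetical date
cutover = earliest_cutover(ttl_lowered)
print(cutover)                    # 2024-01-02 00:00:00
print(earliest_cleanup(cutover))  # 2024-01-05 00:00:00
```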

Post-rollout:

  1. Updates to the registration service to cease registering with the old DNS provider.
  2. Exclusive creation of new registrations in Google Cloud DNS.
  3. Deletion of DNS records from the old DNS provider.

Technical Challenges:

Several technical challenges were encountered and resolved during the migration:

DNS Record Caching: To limit the effect of cached DNS records, the TTL was reduced from 24 hours to 5 minutes before the migration and restored afterward.

Name Server Caching: We waited up to 72 hours before performing any cleanup on the old DNS provider, to avoid problems with resolvers that still had the old name servers cached.

Python-Based Migration Scripts: The migration scripts were written in Python, executed through a Jenkins pipeline, and run in multi-process mode. Each process handled a chunk of records, which sped up the loading of existing records. The scripts were also made idempotent, so reruns did not have to repeat work already done, which significantly reduced load time. Finally, we asked Google to raise the Cloud DNS API rate limits to accommodate the number of records.
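
A chunked, idempotent, multi-process loader of the kind described above might be sketched like this. It is not the actual migration script: create_record is a hypothetical placeholder for the provider API call, and the sketch assumes the set of already-loaded records is fetched once before the run.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def chunked(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def create_record(record):
    """Hypothetical placeholder for the API call that creates one record."""
    return record


def load_chunk(chunk, existing):
    """Create only the records that are not already present (idempotent)."""
    created = []
    for record in chunk:
        if record in existing:
            continue  # already loaded by a previous run, skip the API call
        created.append(create_record(record))
    return created


def load_all(records, existing, chunk_size=1000, workers=4):
    """Distribute chunks across worker processes and collect what was created.

    Call this from under an `if __name__ == "__main__":` guard so that worker
    processes can import the module safely.
    """
    worker = partial(load_chunk, existing=frozenset(existing))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(worker, chunked(records, chunk_size))
        return [record for chunk_result in results for record in chunk_result]
```

Because a rerun skips anything already present, a pipeline that fails partway can simply be started again without duplicating records or redoing completed work.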

Incorporation of Technical Features:

Several technical features were integrated into the migration process:

DNS Logging: Cloud DNS logging was enabled to provide insights into DNS performance and to aid the proactive monitoring and debugging of DNS issues and latencies.

Central Validation Database: To maintain consistency in the registration process across multiple regions, a central validation database was introduced. This database ensured the availability of subdomains before registration, reducing registration failures due to DNS propagation latency.

Secondary DNS: Azure was implemented as a secondary DNS service to enhance redundancy and improve reliability. Operating in active-passive mode alongside the primary DNS server, secondary DNS mirrors the primary server's DNS records, ensuring a reliable backup in case of primary server issues.
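
The idea behind the central validation database can be illustrated with a uniqueness constraint: a subdomain is claimed in one authoritative store before any DNS record is created, so availability does not depend on DNS propagation. The sketch below uses an in-memory SQLite database purely as a stand-in. The table name, columns, and function names are assumptions, and a real multi-region setup would use a shared, networked database.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open the (stand-in) central registry and create its table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS subdomains ("
        "name TEXT PRIMARY KEY, region TEXT NOT NULL)"
    )
    return conn


def reserve_subdomain(conn, name, region):
    """Return True if the subdomain was free and is now reserved."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO subdomains (name, region) VALUES (?, ?)",
                (name.lower(), region),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another registration holds the name.
        return False


registry = open_registry()
print(reserve_subdomain(registry, "acme", "us"))  # True
print(reserve_subdomain(registry, "ACME", "eu"))  # False
```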

Conclusion:

Migrating DNS at scale is a complex endeavor, particularly for a high-scale SaaS enterprise. Nevertheless, meticulous planning, phased execution, robust contingencies, and a strong emphasis on automation and validation make it possible to achieve a seamless transition without compromising service quality. Egnyte's move to Google Cloud DNS has yielded significant benefits, allowing for the continued delivery of high-quality and secure services to its customers.

Editor’s Note: Narendra Patel, Ash Patel, and Gireesh Velupoori also contributed to this blog post. 

Author
Manoj Chauhan
