How and Why Egnyte Redesigned Its Core Configuration System

October 13, 2021

Configuration at scale is hard.

At Egnyte, we’d developed a flexible system that was advantageous early on but put increasing stress on our engineers and processes as the company grew. And because Egnyte is a cloud-deployed software product that has to serve all of its customers, we had to come up with a solution that addressed our current challenges and set us up to support our future growth as well.

If our system demands were simpler, we could just roll out a code change for each necessary modification. Need to modify the location of the database? Declare a new value for the constant and hit the rebuild button. Need to make sure some API calls are throttled if they’re overused? Define the throttling criteria and the thresholds and redeploy.

But code-based decisions like this become more difficult as the code grows in complexity and specificity. How many branches can your testing support? How much maintenance is required for customer-specific, or even user-specific, code? Moreover, as software developers, we frown upon this kind of code because it defies our expectations, making it less predictable and more prone to errors.

In this blog, we’ll look closer at the challenges of configuration at scale, and we’ll describe the system we developed to support a near-infinite number of keys, contexts, and values, all while maintaining in-memory speeds.

Growth Led to Configuration Challenges

In most modern software, the system is split into code, configuration, execution context and inputs. The code is mostly the same for all customers or users of the system. It’s a set of predefined rules that, given other inputs, produce outputs or actions.

The configuration determines which parts of the code are used for each customer or user. The execution context determines the identity of the invocation: for example, the customer, user, or remote device executing the action. And input is the data on which we execute, whether it’s user submitted, computer generated, time generated, or even randomly generated.

At Egnyte, the configuration piece had become large enough that we wanted to improve it. By 2016, we had over 200 customer-specific configuration values that determined how the code behaved. This created technical issues, such as database table width and the time to load these values. But more importantly, there was a lack of standardization around reading and updating configuration values. This created process issues, which are more complex and require a shift in thinking.

At a smaller scale, it was a tremendous advantage to have separate configuration flows for each feature. We could quickly develop new configuration-bound flows and easily (read: procedurally) control how they were updated. However, as Egnyte’s codebase and customer base grew, this became a hindrance.

Every configuration flow needed to handle common actions, such as read, update, validate, log, audit, monitor, recover, authorize, and evict from cache. Moreover, everyone from our site reliability engineers to our product managers and even our support engineers lacked visibility into a customer’s configuration. Basically, if the developer who originally developed the feature didn’t think, “This should be exposed to support,” it wasn’t.

We Couldn't Just ‘Break Things’

Almost all of Egnyte’s traffic is from paying customers. As such, we can’t “break things” on a regular basis in an effort to innovate. This flies in the face of common startup mantras like, “experiment with live data” and “move fast,” but we are on the other end of the software development spectrum.

We value our customers’ data over our own inconveniences. That meant we needed to be able to roll out changes gradually, sometimes even with customer engagement through beta or preview programs, while maintaining the ability to revert if an issue occurs.

Moreover, changes to our largest customers needed to be done carefully to avoid downtime, which could have led to business loss on their ends. As a result, a support team member or even a product manager would occasionally have to engage with the customer and set up a timeline for changes and corresponding commitments. We also needed to protect against unintentional changes. As any developer (or DBA or SRE) knows, writing direct SQL statements to the database isn’t the safest way to achieve this. We needed a layer to isolate and validate a change before it’s submitted.

One final issue came not from our production system but from our developers: Adding a new configuration value, i.e., a new database column, was a non-trivial effort. We had already automated part of the database modification process, but it still required quite a bit of effort to write a change in the system that, ultimately, only told the developer whether a customer had access to a certain feature.

We wanted to streamline that process to the point where developers wouldn’t even have to consider whether to add a feature flag or a disable flag. The effort had to take a few minutes at most, less if possible.

To sum up, more customers and a larger codebase led to:

More configuration points needed (experiments, gradual rollouts)
More configuration values (different values per customer group, customer, user, and device)
Higher demands on the supporting infrastructure, which was nearing its limit

How We Solved the Problem

Because of our customer commitments and our growing scale, it became obvious we needed a revolution, not an evolution. We had multiple, roundabout ways to achieve the same requirement, but none of them was sufficient for our needs. So we created a new dedicated configuration system, dubbed “Settings Service” internally, to help carry us into the future.

After analyzing our requirements, we managed to distill them down to a few crucial points. We dubbed these our design principles.

The main design principles of Settings Service are:

A value is a function of key and execution context.
Execution contexts are hierarchical.
Extremely low latency for critical flows, regardless of load.

And some secondary principles are:

Developers must love the new system.
Emergency overrides should be possible.
The system should aid in reducing overall complexity throughout the application lifecycle.

We Defined Keys, (Execution) Contexts and Values

In the context of our configuration system, a key is the top-level container of values. A key can be “ui.theme.color”, “sftp.upload.thresholds.window_size”, or even “dao.read_only_mode”. A key name is globally unique (within Egnyte) and identifies the configuration item.

However, we already know that different entities—customers, users, devices—have different values for the same configuration items. For example, one customer may like a pink polka-dot theme while another might prefer a more traditional color theme.

So it doesn’t usually make sense to talk about the value of a key. Instead, we always ask for the value of a key-context pair. Today, we have more than 1,200 queries, or keys, within the system, such as, “Which file types are considered video for customer X?” (for the purpose of transcoding). On rare occasions we ask a question for the whole system, such as, “Which databases are excluded from writing?” but even that value differs between environments.

We Added Hierarchical Structures

A simple question such as “What’s the theme for the current user?” is actually tricky to answer. Normally you’d check for specific customization for the user, then for the whole customer. But maybe marketing just issued a new holiday theme, so you better check that. Or maybe we rolled out a new theme with the new code. However, we wanted this new configuration system to be simple. Let the developer ask a single question and we’ll figure out the rest.

We solved this issue by introducing hierarchy into contexts. We have a defined order in our system: user → customer → global → code-default. When you ask Settings Service a question in a specific context, you’ll get an answer from either that one or (recursively) its parent. So the code needs only to ask “What’s the theme for the current user?” and resolution to parent contexts (customer, global and code-default) is done automatically, without the developer having to worry about it.
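The resolution order can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual Settings Service code: the `Context` class, the `resolve` function, the `STORE` and `CODE_DEFAULTS` dictionaries, and all of the sample values are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

# Hypothetical stand-ins for stored overrides and the defaults shipped with the code.
STORE = {
    ("ui.theme.color", "global"): "holiday",
    ("ui.theme.color", "customer=1234"): "classic",
}
CODE_DEFAULTS = {"ui.theme.color": "blue", "dao.read_only_mode": "false"}


@dataclass(frozen=True)
class Context:
    customer: Optional[str] = None
    user: Optional[str] = None

    def chain(self) -> Iterator[str]:
        """Yield context names from most specific to least specific."""
        if self.customer and self.user:
            yield f"customer={self.customer},user={self.user}"
        if self.customer:
            yield f"customer={self.customer}"
        yield "global"


def resolve(key: str, ctx: Context) -> str:
    """Return the first value found while walking up the hierarchy."""
    for level in ctx.chain():
        if (key, level) in STORE:
            return STORE[(key, level)]
    return CODE_DEFAULTS[key]  # code-default is the last resort


# The caller asks one question; the walk up to parent contexts is hidden.
theme_for_user = resolve("ui.theme.color", Context(customer="1234", user="42"))
theme_elsewhere = resolve("ui.theme.color", Context(customer="9999", user="7"))
```

In this sketch the first call falls through from the user to the customer-level override, while the second finds nothing for that customer and lands on the global value.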

This means developers are actually encouraged to ask for configuration in more specific contexts, even if they eventually plan to change the configuration only in a broader context. A general principle at Egnyte is to always add at least the customer, but try to add the user and device if possible. This effectively implements a customer-level configuration that we can later use in emergencies or unforeseen circumstances to avoid downtime until a patch is released.

Moreover, this principle enables us to roll a feature forward, then roll it back for an individual customer if they prefer the previous version.

We Addressed Millions of Requests Per Minute

It’s relatively straightforward to make feature-rich systems that serve a few requests per minute. It’s harder to make a configuration system that has to serve 4 million requests a minute, which ours does today.

One of the requirements for the new configuration system was “don’t make me think.” We wanted developers to be comfortable adding checks against flags and other configuration values. Inevitably, this led to configuration calls being made inside loops, which sometimes were inside even more loops.

Admittedly, the histogram of the number of calls per key looks like an exponential decay function (reverse hockey stick), with very few keys accounting for over 80% of the calls. However, even if we trimmed those ultra-frequent calls, we still had slightly less than 1 million requests a minute. Any service that needs to satisfy that many requests needs to be clever about performance.

We Implemented a Client Library

Instead of using a pure SOA design, we created a client library (in Java, and later a limited one in Python) because it delivered faster performance and convenience. In addition to encapsulating the main service protocol in easy-to-use objects, it also provides caching logic.

So, instead of having a single protocol between the client and the service, e.g. “get key value for {key=login.lockout_timeout, context=customer_id=1234},” we also created a secondary protocol through our cache system, Memcached. The client would make requests to Memcached to try to satisfy the original request, and it would only contact Settings Service if there was a failure.
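A minimal sketch of that cache-first flow follows. It is an illustration only: `CacheFirstClient`, `fetch_from_service`, and the sample value are hypothetical, a plain dictionary stands in for a Memcached client, and having the client write the fetched value back into the cache is an assumption rather than something the design above specifies.

```python
from typing import Callable, Dict, Optional


class CacheFirstClient:
    """Try the shared cache first; call the service only when the cache cannot answer."""

    def __init__(self, cache: Dict[str, str], fetch_from_service: Callable[[str], str]):
        self.cache = cache  # stand-in for a Memcached client
        self.fetch_from_service = fetch_from_service
        self.service_calls = 0  # counts how often the service is actually contacted

    def get(self, cache_key: str) -> str:
        cached: Optional[str] = self.cache.get(cache_key)
        if cached is not None:
            return cached  # answered without touching the service
        self.service_calls += 1
        value = self.fetch_from_service(cache_key)
        self.cache[cache_key] = value  # assumption: populate so the next caller hits the cache
        return value


client = CacheFirstClient(cache={}, fetch_from_service=lambda key: "15")
cache_key = "key.login.lockout_timeout.context:global,customer=1234"
first = client.get(cache_key)   # cache is empty, so the service is called
second = client.get(cache_key)  # served from the cache
```

Only the first lookup reaches the service in this sketch; repeated lookups for the same key-context stay in the cache layer.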

That brought down service calls to a modest 300 calls per minute, which is a light load. Even though the problem was shifted to our Memcached servers, they were much better suited to bear the load than the application servers.

However, we had to give up functionality in the process. Memcached is an in-memory key-value store, so we had to make sure that almost all requests could be answered by a few plain key-value lookups. No secondary indexes, no maps, no interesting data structures.

Moreover, we had to make sure updates were done in a reasonable amount of time. If the design ended up with any key update having to flush significant portions of the cache, the system would have ground to a halt multiple times a minute.

It boiled down to storing key-context-value—including, “We know there is no value for this key-context”—for every level of the context hierarchy. For example, requesting “What’s the theme for the current user?” would fetch four keys from Memcached:

key.theme.context:global,customer=X,user=Y
key.theme.context:global,customer=X
key.theme.context:global
key.theme.context:default_value (this last one is what the key rolls out with)
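The key layout above, including the cached "no value" answer, can be sketched as follows. This is an illustrative sketch under assumptions: the `NO_VALUE` marker, the function names, and the sample theme values are hypothetical, and a dictionary lookup stands in for a single batched Memcached read.

```python
from typing import Dict, List, Optional

NO_VALUE = "__no_value__"  # hypothetical marker meaning "known to be unset at this level"


def cache_keys(key: str, customer: str, user: str) -> List[str]:
    """Build one cache key per hierarchy level, most specific first."""
    prefix = f"key.{key}.context:"
    return [
        f"{prefix}global,customer={customer},user={user}",
        f"{prefix}global,customer={customer}",
        f"{prefix}global",
        f"{prefix}default_value",
    ]


def resolve_from_cache(cache: Dict[str, str], keys: List[str]) -> Optional[str]:
    """Return the most specific real value, or None if the cache cannot decide."""
    found = {k: cache[k] for k in keys if k in cache}  # stand-in for one batched read
    for k in keys:
        if k not in found:
            return None  # unknown level: the caller would fall back to the service
        if found[k] != NO_VALUE:
            return found[k]
    return None


keys = cache_keys("theme", customer="X", user="Y")
cache = {
    keys[0]: NO_VALUE,      # the user has no override, and the cache knows it
    keys[1]: "polka-dot",   # customer-level override
    keys[2]: NO_VALUE,
    keys[3]: "classic",
}
theme = resolve_from_cache(cache, keys)
```

Because each level lives under its own key in this layout, changing a value at one level only touches that single entry rather than flushing a large part of the cache.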

At first we were skeptical it would work. Common wisdom dictates denormalizing for heavy read flows, and we just went with a fully normalized structure. However, it turned out that Memcached servers were multiple orders of magnitude faster than heavyweight application servers.

Still, that wasn't enough. As fast as our cache servers are, the additional out-of-process latency was detrimental to overall request throughput.

We Cached Values in the Requesting Process’s Memory

In addition to storing the values on a centralized cluster of cache servers, we also store them in the memory of the requesting process. This time we only store the value for the key-context we actually requested, so “{key.theme.context:global,customer=X,user=Y}” from the above example. This made sense as we hypothesized that the requests from a single application server tend to repeat. And they did—to the tune of 90% of the requests being stopped at this in-memory layer just by caching values for one minute.
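An in-process layer like this can be sketched as a small map with a one-minute expiry. The code below is a minimal sketch, not the client library itself: `LocalTtlCache`, its `loader` callback (standing in for the Memcached and service lookup), and the injectable `clock` are hypothetical names introduced for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class LocalTtlCache:
    """Keep resolved values in process memory for a short, fixed time."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.loader = loader  # called only on a miss or after expiry
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get(self, cache_key: str) -> str:
        now = self.clock()
        entry = self.entries.get(cache_key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no call leaves the process
        value = self.loader(cache_key)
        self.entries[cache_key] = (now + self.ttl, value)
        return value


# Usage with a fake clock so the expiry can be shown without waiting a minute.
loads: List[str] = []
fake_now = [0.0]


def load(cache_key: str) -> str:
    loads.append(cache_key)
    return "polka-dot"


local = LocalTtlCache(loader=load, ttl_seconds=60.0, clock=lambda: fake_now[0])
requested = "key.theme.context:global,customer=X,user=Y"
local.get(requested)
local.get(requested)   # repeated call within the minute: served locally
fake_now[0] = 61.0
local.get(requested)   # entry expired: loaded again
```

Only the requested key-context is stored, matching the description above; the parent levels stay in the shared cache.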

The final breakdown was:

3.6 million calls per minute served by the small in-memory map. Most of these are repeated calls made during the lifetime of the same request.
0.35 million calls per minute (5,800 per second) served from our Memcached server. Most of these are grouped so a single actual network roundtrip can serve multiple calls.
Only 300 calls per minute (five per second) actually hit our processing servers.

A relatively modest cluster could serve all of that load easily; we barely make a dent on the database load charts.
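As a quick sanity check, the arithmetic behind the breakdown can be reproduced directly. The variable names are ours, and the inputs are the rounded figures quoted above, so the tiers only approximately add up to the 4 million total.

```python
total_per_minute = 4_000_000      # overall request rate quoted earlier
in_memory_per_minute = 3_600_000  # stopped at the in-process map
memcached_per_minute = 350_000    # served from Memcached
service_per_minute = 300          # reached the processing servers

in_memory_share = in_memory_per_minute / total_per_minute   # share stopped in memory
memcached_per_second = memcached_per_minute / 60            # roughly 5,800 per second
service_per_second = service_per_minute / 60                # five per second
avoided_share = 1 - service_per_minute / total_per_minute   # share that never hits the service

print(f"in-memory share: {in_memory_share:.0%}")
print(f"Memcached calls per second: {memcached_per_second:.0f}")
print(f"service calls per second: {service_per_second:.0f}")
print(f"calls kept away from the service: {avoided_share:.4%}")
```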

Now some might scoff at this approach and say we could have easily cut off most of these requests by being smarter at the code level. And that’s correct. But it would’ve forced developers to design specifically for our configuration system.

When developers aren’t comfortable using something, they often find a way to work around it, which, in this case, would have led to statements like, “It’s too much of an effort to introduce the feature flag here,” which invariably leads to, “Oh, I wish I had a feature flag there because the production system is burning and customers are screaming.” We didn’t want a scenario where shortcuts lead to production crises.

Better System, Better Results

To recap, we built a configuration system that:

Serves values for specific keys in the current execution context.
Seamlessly applies hierarchy rules, so requests can be made in the most specific context available and changes can later be applied at any level.
Has a cache-first design, with both an in-memory cache and an external central cache, that eliminates over 99.99% of the calls to the service.

As of today, our new configuration system answers over 4 million configuration questions per minute across more than 1,200 distinct queries (keys), with a history of about 50,000 total changes (context-value). On average, we add 36 new configuration keys per month.

Ultimately, this system solved a major obstacle for us, helping us scale to meet the growing demands of more than 16,000 organizations, while improving the overall experience for our developers.