Lessons Learned from 10 Years of Egnyte Object Store

January 19, 2022

While we at Egnyte don’t think of ourselves as a storage company, the very act of storing files, billions and billions of them, is fundamental to what we do. Our customers need to secure, access, and share files, so storage is something we have to get right.

Today we hardly think of what it takes to store and secure billions of these files, often taking this process for granted. It has become like the act of breathing—fundamental to existence yet rarely given a second thought. This act of storing files, occurring today with stunning regularity and rhythm, actually took years to develop and perfect. It was born out of sheer necessity during a turbulent time in our history, and without it we would not be here today.

Early Trial and Error

Like many early-stage startups, we believed in the mantra of “Do things that don’t scale,” so back in 2009, we built our collaboration platform on NFS. Now, I know what you’re thinking, but at the time, the word “cloud” was used predominantly in reference to the weather, so our options were limited.

Early customers really took to the platform, but our storage management was rudimentary as we tried to scale traditional NFS storage. Each application server was hardwired to one or more NFS storage filers and thus limited by their capacity. Each NFS mount was 16 TB, so each customer could only store that amount! Clients wanted to pump more data, and the solution was for our Ops team to physically copy data from one storage filer to another.

The alternative solutions were not pretty. We could either procure fancy storage management for millions of dollars and hope it fixed the problem, or use Amazon S3 (the only viable public cloud back then) at $0.25/GB plus high bandwidth costs. Both options were economic death knells for an early-stage company that had raised just $6 million in capital at the time. But there was a third option: build something in-house to tackle this monstrous challenge.

This last path would not be a quick fix and would lead us into the unknown. It wasn't an easy decision, but we had to commit to an in-house solution.

We decided to build an object store in-house. We called it EOS, an acronym for Egnyte Object Store, which also happens to be the name of the goddess of dawn in Greek mythology. In hindsight, it was a good decision. We could build a custom object store better suited to our needs and quickly develop the features that we wanted. EOS completed 10 years in production this past December.

The accompanying infographic shows some of our key statistics. The most important ones for us have been the availability of 99.996% and the durability of 99.999%.

Lessons Learned from 10 Years of EOS

Over the course of a decade, we learned many things while building and managing our object store and petabytes of data. Here’s a quick recap of some of the most important takeaways.

In the earlier days of building our object store, the focus was on scale and how to scale fast. As our scale increased over time, we realized that analytics and reconciliation were becoming an issue. Even generating simple reports was taking quite a bit of time. To overcome this classic speed vs. accuracy problem, the only way out was to build a resumable offline job framework operating on read-only metadata copies instead of live data.
Most scale-related problems have to do with processing a large number of concurrent requests. An object store adds a new dimension: durability. Imagine you need to process billions of requests for billions of objects while still making sure that, for each object, at least one copy is reachable at all times. As your object store grows and ages, you need to figure out how to keep moving data from older hardware to newer hardware and keep validating your data at rest. This is a continuous process, like the human body’s immune system, at work at all times.
Always keep a remote DR copy. DR copies are a must and will save your day. One can use other public object stores for it if required. Don't build everything yourself.
Cache everything you can but enforce business rules and permissions in real time. You can’t serve billions of requests if you are hitting a database or a storage-at-rest component for every request. You have to cache your metadata in an in-memory key-value store (like Memcached or Redis), while the active objects need to be on fast storage like SSDs. The key here is to plug in your object store caches after your business rules and permissions are applied. This eliminates complex cache synchronization, and you can rely on a simple LRU strategy to maintain the cache. It also makes it much easier to scale and to eliminate single points of failure.
Range requests can kill your object store in no time. Either serve them from non-blocking I/O-based services backed by SSDs in the initial layers of your object store or simply delay the requests by rate limiting. A design that lets them flow deep into your system is dead on arrival.
Only the paranoid survive. Durability is a beast—you need to build tons of jobs, processes, and monitors to make sure you always hold onto your data at rest over long periods of time.
Cleanup of failed operations, such as failed uploads, is very important. At our scale, even a small percentage of leaks could pile up quickly.
Moving petabytes of data at rest is cumbersome. Whether you’re rebalancing your data, replicating it to create more copies, or moving to newer storage devices, be prepared to allocate significant resources—both human and machine—to see it through.
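
To make the caching lesson above concrete, here is a minimal Python sketch of a cache that sits after the permission check. It is an illustration under stated assumptions, not how EOS is implemented: the names LRUCache and ObjectService, the permissions set, and the in-memory dict standing in for storage at rest are all hypothetical, and objects are assumed to be immutable once written, so cached bytes never go stale.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny in-process LRU cache (hypothetical stand-in for Redis or an SSD tier)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


class ObjectService:
    """Checks permissions live on every read, then consults the cache."""

    def __init__(self, permissions, backend, cache):
        self.permissions = permissions  # set of (user, object_id) pairs (assumption)
        self.backend = backend          # dict of object_id -> bytes (assumption)
        self.cache = cache
        self.backend_reads = 0          # counts reads that reached storage at rest

    def read(self, user, object_id):
        # Business rules run first and are never cached.
        if (user, object_id) not in self.permissions:
            raise PermissionError(f"{user} may not read {object_id}")
        # The cache is keyed by object ID only, so it holds no per-user state.
        data = self.cache.get(object_id)
        if data is None:
            data = self.backend[object_id]
            self.backend_reads += 1
            self.cache.put(object_id, data)
        return data


if __name__ == "__main__":
    perms = {("alice", "obj1"), ("bob", "obj1")}
    service = ObjectService(perms, {"obj1": b"hello"}, LRUCache(capacity=2))
    service.read("alice", "obj1")  # cache miss: goes to the backend
    service.read("bob", "obj1")    # cache hit: served from memory
    perms.discard(("bob", "obj1")) # revocation takes effect immediately
    print(service.backend_reads)
```

Because the cache never stores anything that depends on who is asking, revoking a permission needs no cache invalidation, which is why a plain LRU policy is enough in this sketch.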

Looking Ahead

As we now enter the next decade of our object store in production, we are looking at another lesson: how do we rewrite, refactor, or rearchitect a critical and heavily used piece of infrastructure in our stack and prepare for our next phases of growth?

This is not a new problem, or one that’s unique to Egnyte. It exists with almost every popular product out there, from Unix to Google Search to Amazon S3. Stay tuned. We will tell you about what we’re doing in this space soon.