
Kris Lahiri, Egnyte VP of Operations and Co-Founder

We’re working on a new series to learn how IT teams provide service to their customers. For this interview we caught up with Kris Lahiri, VP of Operations and co-founder at Egnyte. Egnyte’s online storage service supports thousands of customers around the world from three data centers (West Coast, East Coast, and Europe). His team is responsible for keeping the infrastructure behind that service up and running.

Thousands of companies rely on you to back up their data. How do you back up the backup?

There is no easy way to back up data at our scale. Plus, all of our data is always changing, so a backup is already obsolete by the time you’re done taking it. Instead, we make sure we keep multiple copies of your data in completely redundant and different systems. This means two copies of a customer’s data are never on the same rack, the same server, or in the same location. Additionally, for some customers we offer a DR service: we explicitly maintain yet another copy, replicated to a geographically different location.
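
The placement rule described here can be pictured as a simple check over failure domains. The following is only a minimal sketch, not Egnyte's actual system: the `Replica` fields and the function names are hypothetical, and it assumes rack and server names are only comparable within one location.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Replica:
    """Where one copy of a customer's data lives (hypothetical fields)."""
    location: str
    rack: str
    server: str


def shared_failure_domains(a, b):
    """List the failure domains that two replicas have in common."""
    shared = []
    if a.location == b.location:
        shared.append("location")
        # Rack and server names are only compared inside the same location.
        if a.rack == b.rack:
            shared.append("rack")
            if a.server == b.server:
                shared.append("server")
    return shared


def placement_is_valid(replicas):
    """True if there are at least two copies and no pair shares a domain."""
    if len(replicas) < 2:
        return False
    return all(not shared_failure_domains(a, b)
               for a, b in combinations(replicas, 2))
```

With a check like this, adding the optional DR copy is just one more `Replica` in the list, and it has to clear the same rule as the others.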

All the data is encrypted so it’s meaningless without the meta-data that goes around it. We back up the database and all the meta-data associated with the user data in the normal sorts of ways.

Getting a new customer’s data set up must be quite the task. Any interesting stories to share?

About 4 years back, we were trying to get 260GB of one customer’s data (the largest data set at that time) uploaded, and we were running into all kinds of stalls and network problems. On a whim, we asked the customer to copy their data onto a USB drive and overnight it to our office. We then took this drive into the data center and ran an rsync (yes, we needed to preserve original timestamps) over regular USB2.
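
For readers who have not used rsync this way, here is a rough sketch of what such a copy might look like when driven from Python. The exact command Egnyte ran is not given in the interview; the paths and function names below are hypothetical. The `-a` (archive) flag is a standard rsync option that, among other things, preserves modification times.

```python
import subprocess


def build_rsync_command(source_dir, dest_dir):
    """Build an rsync invocation that keeps original timestamps."""
    # -a (archive) implies -t, so modification times survive the copy.
    # The trailing slash on the source copies the directory's contents
    # rather than nesting the directory itself under the destination.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-a", "--progress", source, dest_dir]


def run_migration(source_dir, dest_dir):
    """Run the copy and raise if rsync exits with a non-zero status."""
    subprocess.run(build_rsync_command(source_dir, dest_dir), check=True)


# Example call with a hypothetical USB mount point and destination:
# run_migration("/mnt/usb_drive/customer_data", "/srv/storage/customer_data")
```

Because rsync skips files that already match at the destination, a copy like this can be re-run safely if it is interrupted.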

This gave birth to a more formal “data migration” process that we offer customers who try to on-board large data sets (anything more than 500GB) in a very short time. Lesson learned: never underestimate old-world sneaker net techniques.

Managing three data centers can be a challenge! How do you manage them? Do you have a NOC?

We have a virtual NOC model. There isn’t one physical location where all our Operations Engineers work. We have 3 levels of Operations Engineers. Ops engineer IIIs are looking at dashboards and alerts. They have playbooks and scripts to figure out what the problem is and escalate to Ops II or Ops I as needed. They also work with our support folks who field calls from users 24/7.

How do your NOC folks collaborate?

One thing we did was to create chat rooms in OpenFire. Ops Engineers can ask questions and get answers from others in the chat room. We also have a chat room where Ops Engineers log information about actions they are taking. This is a light-weight way to keep track of what is going on (without tickets, etc). We can then review the chat log to see the patterns that are developing and to find the root cause.

What tools do you use to monitor the system?

We use a mix of Cacti, Nagios, Graphite, and Zenoss for monitoring our systems. Alerts are important to get notifications about problems and to fix them. But it’s very important to keep track of trends of information as well. It is invaluable when it is time to make critical decisions.

I haven’t heard of Graphite before. What do you use it for?

Graphite is an open source tool to keep track of trend information (similar to rrd2). We use it to trend all sorts of information. Where we used to have a gut feeling about something now we know for sure. For example, if the trend shows that all the servers in rack4 are running a few degrees hotter, then we can check the vent or cooling systems.
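
As an illustration of how a reading like rack temperature might get into Graphite, here is a minimal sketch using Graphite's plaintext protocol, where each line is `metric.path value timestamp` sent to the Carbon listener (port 2003 by default). The metric name, host name, and function names are assumptions made for the example, not details from Egnyte's setup.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Format one data point as a Graphite plaintext protocol line."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="graphite.example.com", port=2003):
    """Send one formatted line to a Carbon plaintext listener.

    The host is a placeholder; 2003 is Carbon's default plaintext port.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Example with a hypothetical metric name for one rack's temperature:
# send_metric(format_metric("dc.west.rack4.temp_c", 27.5))
```

Dotted metric paths like the one above make it easy to graph every server in a rack side by side, which is the kind of comparison the rack4 example relies on.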

Have you found and solved issues because of Graphite?

Absolutely! We had released a new version of our sync client and noticed that there was a spike in the number of rsync requests. Now the users never saw any change in functionality but because we saw the change in the trend we were able to track down the bug and fix it before it became a full-blown issue.
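
One simple way to catch a change like this automatically is to compare the newest value against a recent baseline. The sketch below is a generic technique and not how Egnyte detected the issue (the interview only says they saw it in the trend); the function name, the threshold, and the sample numbers are all made up for illustration.

```python
from statistics import mean, stdev


def is_spike(history, latest, threshold=3.0):
    """Flag `latest` if it sits far above the recent baseline.

    `history` holds recent per-interval counts (for example, requests
    per minute); `threshold` is the number of standard deviations above
    the mean that counts as a spike.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest > baseline
    return (latest - baseline) / spread > threshold


# Made-up request counts before and after a hypothetical client release.
recent_counts = [100, 104, 98, 101, 97, 103]
print(is_spike(recent_counts, 102))  # ordinary fluctuation
print(is_spike(recent_counts, 160))  # well above the baseline
```

A check like this complements the graphs rather than replacing them: it flags when something moved, while the trend history helps explain why.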

Kris's team in the Egnyte Datacenter

Tell us about how you hire for positions on your team. What kinds of folks are you looking for?

Everyone who interviews for an ops position interviews with an Ops Engineer, a support person, and 1-2 people in the engineering team. Folks on my team interact with all of them. I’m not looking for super-star specialists. I’m interested in people who are familiar with a whole variety of technologies especially from large-scale environments.

Is there a question you ask on every interview?

My personal top question is “What do you do out of work?” Ops is pretty much a 24/7 job. Sure, you may be at the office only at certain times but your availability is very important. You should be able to get on the Internet at the drop of a hat. The reason why I ask the question is people do different things to recharge. So if someone likes to go hiking for days outside of cell phone range they might not be a good candidate!

Thanks for all the great info! Would you recommend some resources to continue learning?

Highscalability is a site we read to learn about how other large scale architectures are maintained. You’ll read about the Facebooks and Googles of the world and how they solve their problems. Some of the lessons there apply equally to you whether you run 5 or 5,000 servers.

May 11, 2012, Spiceworks Community

