How Egnyte Built Snapshot Recovery to Mitigate Ransomware Attacks
As companies accumulate and store large amounts of business data in the cloud, data security and governance become major concerns.
More than 16,000 companies use Egnyte to manage, secure, and govern their content. These businesses rely on the unified platform to keep their business running smoothly, because data loss due to ransomware attacks or accidental file deletion could have profound impacts on their bottom lines.
The Egnyte platform has many advanced features for permissions and data management, including point-in-time recovery. As we developed this capability internally, we knew optimum, near-real-time data loss protection meant enabling customers to restore or recover their data as fast as possible. We evaluated many possible solutions, and two ideas looked feasible, considering the scale of our stack and user base:
- Event-based restore. Restore to a specific point in time by replaying events in reverse order until that point is reached.
- Snapshot-based restore. Restore to a point in time before the incident.
Replaying each event in reverse order could be very complex given the variety of file system actions and their associated metadata. The replay would also have to restore metadata such as permissions, links shared for ad hoc collaboration, comments, workflows, and extended properties. Because of this complexity, we chose snapshot-based restore.
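To make the snapshot-based option concrete, here is a minimal Python sketch of choosing a restore point: the most recent snapshot taken strictly before the incident. The function name `latest_snapshot_before` and the example timestamps are hypothetical and used only for illustration; this is not Egnyte's implementation.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_snapshot_before(
    snapshot_times: Sequence[datetime], incident_time: datetime
) -> Optional[datetime]:
    """Return the newest snapshot time strictly before incident_time, or None."""
    ordered = sorted(snapshot_times)
    # bisect_left gives the index of the first snapshot at or after the incident,
    # so the entry just before it is the last snapshot taken before the incident.
    idx = bisect_left(ordered, incident_time)
    return ordered[idx - 1] if idx > 0 else None


# Hypothetical example: snapshots every four hours, incident at 09:00.
start = datetime(2024, 1, 1)
snapshots = [start + timedelta(hours=4 * i) for i in range(6)]
restore_point = latest_snapshot_before(snapshots, start + timedelta(hours=9))
print(restore_point)  # 2024-01-01 08:00:00
```

A snapshot taken at exactly the incident time is skipped on purpose, since it may already contain the damage.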
Snapshot Restore Architecture and Components
After a few weeks of discussion with different teams, we came up with the following design.
There are four main components to Egnyte’s snapshot restore architecture:
Snapshot Scheduler. We’ve built a module to take application-consistent snapshots of the disk.
- Application-consistent snapshots capture the state of application data at the time of the snapshot with all application transactions completed and all pending writes flushed to the disk.
- We’ve configured the snapshot module on each instance.
- The snapshot module has custom jobs that run before and after the snapshot is captured.
- We created automated procedures to deploy the snapshot module on all metadata DR nodes.
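The pre- and post-snapshot jobs described above can be sketched as a small wrapper around the capture step. This is an illustrative Python sketch only: `run_snapshot`, the job callables, and the snapshot id `snap-001` are hypothetical names, and the real module's jobs are not described in detail here.

```python
from typing import Callable, List


def run_snapshot(
    pre_jobs: List[Callable[[], None]],
    capture: Callable[[], str],
    post_jobs: List[Callable[[str], None]],
) -> str:
    """Run pre-snapshot jobs, capture the snapshot, then run post-snapshot jobs."""
    for job in pre_jobs:
        job()  # e.g. finish transactions and flush pending writes
    snapshot_id = capture()
    for job in post_jobs:
        job(snapshot_id)  # e.g. report the new snapshot to a listener
    return snapshot_id


# Hypothetical jobs that only record the order in which they ran.
log: List[str] = []
snapshot_id = run_snapshot(
    pre_jobs=[lambda: log.append("flush-writes")],
    capture=lambda: "snap-001",
    post_jobs=[lambda sid: log.append(f"notify:{sid}")],
)
print(log)  # ['flush-writes', 'notify:snap-001']
```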
Metadata snapshot listener service. The listener service receives the pre- and post-snapshot calls. It first makes sure that the snapshot is in a ready state and queries the central database for the required metadata. It then calls the Metadata snapshot service to add or delete snapshots in the inventory.
Metadata snapshot service. For the feature to run consistently, we needed information on each snapshot being created. The listener service informs the Metadata snapshot service about every snapshot being created, and the Metadata snapshot service maintains an inventory of snapshots. Each snapshot has a lifecycle of 15 days. The main responsibilities of the service are to:
- Create and delete VM instances to mount the snapshot.
- Mount and unmount snapshots asynchronously when requested by the customer admin.
Our tests show that we can mount 700 GB snapshots in about four minutes.
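As a rough illustration of an inventory with a 15-day lifecycle, the following Python sketch keeps snapshot records in memory and purges the ones that have aged out. The class and method names (`SnapshotInventory`, `purge_expired`) are hypothetical, and the real service presumably persists its inventory rather than holding it in a dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

RETENTION = timedelta(days=15)  # snapshot lifecycle stated in the article


@dataclass(frozen=True)
class SnapshotRecord:
    snapshot_id: str
    created_at: datetime


class SnapshotInventory:
    """In-memory stand-in for the snapshot inventory (hypothetical)."""

    def __init__(self) -> None:
        self._records: Dict[str, SnapshotRecord] = {}

    def add(self, record: SnapshotRecord) -> None:
        self._records[record.snapshot_id] = record

    def delete(self, snapshot_id: str) -> None:
        self._records.pop(snapshot_id, None)

    def purge_expired(self, now: datetime) -> List[str]:
        """Delete snapshots that are at least RETENTION old; return their ids."""
        expired = [
            r.snapshot_id
            for r in self._records.values()
            if now - r.created_at >= RETENTION
        ]
        for snapshot_id in expired:
            self.delete(snapshot_id)
        return expired

    def ids(self) -> List[str]:
        return sorted(self._records)


inventory = SnapshotInventory()
now = datetime(2024, 1, 20)
inventory.add(SnapshotRecord("old", now - timedelta(days=16)))
inventory.add(SnapshotRecord("fresh", now - timedelta(days=1)))
print(inventory.purge_expired(now))  # ['old']
print(inventory.ids())  # ['fresh']
```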
Snapshot restore service. This is a Java service, and we were able to reuse Java components written for the file system metadata.
Snapshot restore service features include the ability to:
- Navigate through the file system
- Preview files
- Download files
- Restore files and folders
- Sync with Egnyte Object Store
Now that we’ve described the system architecture, let’s look at some of the features we developed to manage the system internally and some functionality that’s presented to users.
Restore Purged Data
Customers can select a snapshot from within a 15-day window. During that period, users may have purged files and folders from Egnyte. The snapshot restore service calls Egnyte Object Store services to restore purged objects from the trash, so users can recover purged files as well.
Egnyte administrators can restore files to their original location or move the encrypted files to a different location. Here is how each option works.
Restore files in the same location:
- For existing files, the service restores a safe version from the snapshot and the encrypted version gets pushed to the version history.
- Restored files and folders keep their existing permissions, shared links, and metadata.
- For files that are deleted, it restores the latest version of the file from the snapshot.
Move all the encrypted files to a different location:
- Potentially encrypted folders are renamed, with the rename time appended as a suffix to the folder name.
- A new folder with the original name is created, and all files are restored into it.
- Permissions cannot be automatically restored from the snapshot but can be manually applied by referring to the "Folder permissions report" for the renamed folder.
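The "move encrypted files" option above can be sketched as an ordered plan: rename the suspect folder with a time suffix, recreate the original folder, then restore into it. The Python below is a hedged illustration; the function names, the `_` separator, and the timestamp format are assumptions, since the article does not specify the exact suffix format.

```python
from datetime import datetime


def renamed_folder_name(folder_name: str, renamed_at: datetime) -> str:
    """Append the rename time to the folder name as a suffix."""
    # The separator and timestamp format are assumptions for illustration.
    return f"{folder_name}_{renamed_at:%Y%m%d-%H%M%S}"


def plan_restore_to_new_folder(folder_name: str, renamed_at: datetime) -> list:
    """List the steps for the 'move encrypted files' option, in order."""
    return [
        ("rename", folder_name, renamed_folder_name(folder_name, renamed_at)),
        ("create", folder_name),
        ("restore_from_snapshot", folder_name),
    ]


steps = plan_restore_to_new_folder("Finance", datetime(2024, 3, 5, 14, 30, 0))
print(steps[0])  # ('rename', 'Finance', 'Finance_20240305-143000')
```

Because the suspect folder is renamed rather than deleted, its contents stay available for inspection after the clean copy is restored.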
Snapshot Restore Audits
Snapshot-based restoration can change a large number of files and rename folders. Customers therefore need to know who initiated each snapshot mount and restoration job, which matters even more when different admins initiate different restoration jobs.
We extended snapshot metadata and the snapshot restore service to audit the following events:
- Snapshot preview request
- Snapshot unmount preview request
- Snapshot restore initiated request
- Snapshot restore completed event
- Download files from snapshot preview
- Preview files from snapshot preview
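One way to model the audited events listed above is an enumeration plus a record that captures who triggered each event. This Python sketch is illustrative only: the enum values, the `AuditRecord` fields, and `restores_by_admin` are hypothetical and do not reflect Egnyte's actual audit schema.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, List


class AuditEvent(Enum):
    SNAPSHOT_PREVIEW_REQUEST = "snapshot_preview_request"
    SNAPSHOT_UNMOUNT_PREVIEW_REQUEST = "snapshot_unmount_preview_request"
    SNAPSHOT_RESTORE_INITIATED = "snapshot_restore_initiated"
    SNAPSHOT_RESTORE_COMPLETED = "snapshot_restore_completed"
    DOWNLOAD_FROM_PREVIEW = "download_from_preview"
    PREVIEW_FROM_PREVIEW = "preview_from_preview"


@dataclass(frozen=True)
class AuditRecord:
    event: AuditEvent
    admin: str
    snapshot_id: str
    at: datetime


def restores_by_admin(records: List[AuditRecord]) -> Dict[str, List[str]]:
    """Map each admin to the snapshot ids they initiated a restore from."""
    result: Dict[str, List[str]] = {}
    for rec in records:
        if rec.event is AuditEvent.SNAPSHOT_RESTORE_INITIATED:
            result.setdefault(rec.admin, []).append(rec.snapshot_id)
    return result


# Hypothetical audit trail with two admins.
trail = [
    AuditRecord(AuditEvent.SNAPSHOT_PREVIEW_REQUEST, "alice", "snap-1", datetime(2024, 1, 2, 9)),
    AuditRecord(AuditEvent.SNAPSHOT_RESTORE_INITIATED, "alice", "snap-1", datetime(2024, 1, 2, 10)),
    AuditRecord(AuditEvent.SNAPSHOT_RESTORE_INITIATED, "bob", "snap-2", datetime(2024, 1, 2, 11)),
]
print(restores_by_admin(trail))  # {'alice': ['snap-1'], 'bob': ['snap-2']}
```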
Admins can open the reports section and find the new reports under Audit reports, where they can create, delete, and schedule reports, and view or export the report data.
The system collects many helpful metrics we use to track its performance, including metrics on:
- Snapshot added/deleted from inventory
- Time taken for mount/unmount requests (in seconds)
- Number of preview requests from snapshots
- Time taken for restore of files and folders
Finally, here are some additional stats about how the system operates, including figures on scalability and performance:
- 106 metadata snapshots are taken every four hours across data centers
- Mounting a 1 TB snapshot takes around five minutes
- Unmounting a snapshot finishes in a few seconds
- A sequential metadata restore of 10 TB (339k files) took around two hours
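For a sense of scale, the figures above can be turned into a quick back-of-the-envelope calculation. The Python below uses only numbers quoted in this article; the steady-state inventory size assumes every snapshot is kept for the full 15-day lifecycle, and "around two hours" is treated as exactly two hours.

```python
SNAPSHOTS_PER_RUN = 106   # snapshots taken per run, from the article
RUN_INTERVAL_HOURS = 4    # one run every four hours
RETENTION_DAYS = 15       # snapshot lifecycle

runs_per_day = 24 // RUN_INTERVAL_HOURS
snapshots_per_day = SNAPSHOTS_PER_RUN * runs_per_day
# Assumes every snapshot is retained for the full 15 days.
steady_state_inventory = snapshots_per_day * RETENTION_DAYS

FILES_RESTORED = 339_000       # "339k files"
RESTORE_SECONDS = 2 * 60 * 60  # "around two hours", taken as exactly two
files_per_second = FILES_RESTORED / RESTORE_SECONDS

print(snapshots_per_day)           # 636
print(steady_state_inventory)      # 9540
print(round(files_per_second, 1))  # 47.1
```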
Ultimately, we have ensured that customers at risk of losing their data have a simple-to-use, fast, and robust system for recovering it.