How Egnyte Built Snapshot Recovery to Mitigate Ransomware Attacks

June 17, 2022

As companies accumulate and store large amounts of business data in the cloud, data security and governance become a major concern.

More than 16,000 companies use Egnyte to manage, secure, and govern their content. These businesses rely on the unified platform to keep their business running smoothly, because data loss due to ransomware attacks or accidental file deletion could have profound impacts on their bottom lines.

The Egnyte platform has many advanced features for permissions and data management, including point-in-time recovery. As we developed this capability internally, we knew optimum, near-real-time data loss protection meant enabling customers to restore or recover their data as fast as possible. We evaluated many possible solutions, and two ideas looked feasible, considering the scale of our stack and user base:

Event-based restore. Replay file system events in reverse order until the target point in time is reached.
Snapshot-based restore. Restore from a snapshot taken at a point in time before the incident.

Replaying each event in reverse order could be very complex given the variety of file system actions and their associated metadata. A reverse replay would also have to restore metadata such as permissions, links shared for ad hoc collaboration, comments, workflows, and extended properties. Because of this, we decided to use snapshot-based restore.

Snapshot Restore Architecture and Components

After a few weeks of discussion with different teams, we came up with the following design.

There are four main components to Egnyte’s snapshot restore architecture:

Snapshot Scheduler. We’ve built a module to take application-consistent snapshots of the disk.

Application-consistent snapshots capture the state of application data at the time of the snapshot with all application transactions completed and all pending writes flushed to the disk.
We’ve configured the snapshot module on each instance.
The snapshot module has custom jobs that run before and after the snapshot is captured.
We created automated procedures to deploy the snapshot module on all metadata DR nodes.
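
The custom jobs that run before and after a capture can be pictured as hooks wrapped around the capture step. The following is a minimal Python sketch of that idea, not Egnyte's implementation: `SnapshotJob`, its hook lists, and the snapshot id `snap-001` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SnapshotJob:
    """Hypothetical wrapper that runs custom jobs around a snapshot capture."""
    capture: Callable[[], str]  # assumed to return a snapshot id
    pre_hooks: List[Callable[[], None]] = field(default_factory=list)
    post_hooks: List[Callable[[str], None]] = field(default_factory=list)

    def run(self) -> str:
        for hook in self.pre_hooks:   # e.g. flush pending writes to disk
            hook()
        snapshot_id = self.capture()
        for hook in self.post_hooks:  # e.g. notify a listener service
            hook(snapshot_id)
        return snapshot_id


# Stand-in hooks that only record the order in which they ran.
log: List[str] = []
job = SnapshotJob(
    capture=lambda: log.append("capture") or "snap-001",
    pre_hooks=[lambda: log.append("pre: flush writes")],
    post_hooks=[lambda sid: log.append(f"post: notify {sid}")],
)
snapshot_id = job.run()
```

In this sketch the post-snapshot hooks run only when the capture succeeds; a real module would also need to decide what to do when the capture fails.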

Metadata snapshot listener service. The listener service consumes the calls made before and after each snapshot. It first makes sure that the snapshot is in a ready state and queries the central database for the required metadata. It then calls the Metadata snapshot service to add the snapshot to, or delete it from, the inventory.
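
A rough sketch of that listener flow might look like the following. Everything here is an assumption made for illustration: the `SnapshotEvent` type, the phase and state names, and the plain dictionaries standing in for the central database and the snapshot inventory.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SnapshotEvent:
    snapshot_id: str
    phase: str  # assumed values: "post_create" or "post_delete"
    state: str  # assumed values: "READY", "CREATING", ...


def handle_event(event: SnapshotEvent,
                 central_db: Dict[str, dict],
                 inventory: Dict[str, dict]) -> bool:
    """Apply one snapshot event; return True if the inventory changed."""
    if event.phase == "post_delete":
        return inventory.pop(event.snapshot_id, None) is not None
    if event.state != "READY":
        return False  # snapshot not ready yet, so do not register it
    # Look up the required metadata, then register the snapshot.
    inventory[event.snapshot_id] = central_db.get(event.snapshot_id, {})
    return True


central_db = {"snap-001": {"node": "placeholder-node"}}  # placeholder metadata
inventory: Dict[str, dict] = {}
not_ready = handle_event(SnapshotEvent("snap-001", "post_create", "CREATING"),
                         central_db, inventory)
added = handle_event(SnapshotEvent("snap-001", "post_create", "READY"),
                     central_db, inventory)
```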

Metadata snapshot service. For the feature to run consistently, we needed information on each snapshot being created. The listener service informs the Metadata snapshot service about every snapshot that is created, and the Metadata snapshot service maintains an inventory of snapshots. Each snapshot has a lifecycle of 15 days. The main responsibilities of the service are to:

Create and delete VM instances to mount the snapshot.
Mount and unmount snapshots asynchronously when requested by the customer admin.

Our tests show that we can mount 700 GB snapshots in about four minutes.
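
To make the 15-day lifecycle concrete, here is a small sketch of an inventory that expires old snapshots. The `SnapshotInventory` class and its method names are hypothetical, and the in-memory dictionary stands in for whatever storage the real service uses.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

RETENTION = timedelta(days=15)  # snapshot lifecycle stated in the article


class SnapshotInventory:
    """Hypothetical in-memory inventory keyed by snapshot id."""

    def __init__(self) -> None:
        self._created: Dict[str, datetime] = {}

    def add(self, snapshot_id: str, created_at: datetime) -> None:
        self._created[snapshot_id] = created_at

    def expire(self, now: datetime) -> List[str]:
        """Remove and return the ids of snapshots older than RETENTION."""
        expired = sorted(sid for sid, ts in self._created.items()
                         if now - ts > RETENTION)
        for sid in expired:
            del self._created[sid]
        return expired

    def available(self) -> List[str]:
        return sorted(self._created)


now = datetime(2022, 6, 17, tzinfo=timezone.utc)
inv = SnapshotInventory()
inv.add("snap-old", now - timedelta(days=16))
inv.add("snap-new", now - timedelta(days=1))
expired = inv.expire(now)
```

After the `expire` call, only `snap-new` remains selectable, which mirrors the window from which a customer admin could request a mount.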

Snapshot restore service. The snapshot restore service is written in Java, and we were able to reuse Java components originally written for the file system metadata.

Snapshot restore service features include the ability to:

Navigate through the file system
Preview files
Download files
Restore files and folders
Sync with Egnyte Object Store

System Features

Now that we’ve described the system architecture, let’s look at some of the features we developed to manage the system internally and some functionality that’s presented to users.

Restore Purged Data

Customers can select snapshots from within a 15-day window. During that period, users may have purged files and folders from Egnyte. To recover them, the snapshot restore service calls Egnyte Object Store services to restore the purged objects from the trash.

Restore Policies/Categories

Administrators who use Egnyte can restore files to their original location or to a different one. Here are the details of each option.

Restore files in the same location:

For existing files, the service restores a safe version from the snapshot and the encrypted version gets pushed to the version history.
Restored files and folders keep their existing permissions, shared links, and metadata.
For files that are deleted, it restores the latest version of the file from the snapshot.
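
The same-location rules can be sketched as a small function. This is an illustrative model only: the function name `restore_in_place`, the representation of a version history as a list (oldest first, last entry current), and the version labels are all assumptions.

```python
from typing import Dict, List, Optional


def restore_in_place(path: str,
                     live: Dict[str, List[str]],
                     snapshot: Dict[str, str]) -> Optional[str]:
    """Restore `path` from the snapshot into the live file system model.

    `live` maps a path to its version history; a deleted file has no entry.
    `snapshot` maps a path to the latest version captured in the snapshot.
    """
    safe_version = snapshot.get(path)
    if safe_version is None:
        return None  # the snapshot has nothing to restore for this path
    history = live.setdefault(path, [])  # deleted file: start a new history
    # The safe version becomes current; any encrypted version that was
    # current before now sits one step back in the version history.
    history.append(safe_version)
    return safe_version


live = {"report.docx": ["v1", "v2-encrypted"]}
snapshot = {"report.docx": "v1", "deleted.xlsx": "v7"}
restore_in_place("report.docx", live, snapshot)
restore_in_place("deleted.xlsx", live, snapshot)
```

Permissions, shared links, and other metadata are left out of the model because, per the rules above, an in-place restore leaves them untouched.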

Move all the encrypted files to a different location:

Potentially encrypted folders get renamed, with the time of the rename added as a suffix to the folder name.
A new folder with the same name gets created and all files are restored in that.
Permissions cannot be automatically restored from the snapshot but can be manually applied by referring to the "Folder permissions report" for the renamed folder.
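
The rename-and-recreate flow could be sketched as follows. The function name, the dictionary used as a folder tree, and especially the suffix format are assumptions; the article only says that the rename time is added as a suffix.

```python
from datetime import datetime, timezone
from typing import Dict, List


def quarantine_and_restore(folder: str,
                           tree: Dict[str, List[str]],
                           snapshot_files: List[str],
                           renamed_at: datetime) -> str:
    """Rename the suspect folder, then recreate it from the snapshot.

    `tree` maps a folder name to its files. `renamed_at` is assumed to be
    a UTC timestamp. Returns the new name of the renamed folder.
    """
    suffix = renamed_at.strftime("%Y%m%dT%H%M%SZ")  # assumed suffix format
    renamed = f"{folder}_{suffix}"
    tree[renamed] = tree.pop(folder)     # keep the encrypted files aside
    tree[folder] = list(snapshot_files)  # fresh folder with restored files
    return renamed


tree = {"Finance": ["q1.xlsx.locked", "q2.xlsx.locked"]}
renamed = quarantine_and_restore(
    "Finance", tree, ["q1.xlsx", "q2.xlsx"],
    datetime(2022, 6, 17, 12, 0, 0, tzinfo=timezone.utc),
)
```

Because the restored folder is new, it has no permissions of its own in this model, which is consistent with permissions needing to be reapplied manually.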

Snapshot Restore Audits

Snapshot-based restoration can change a lot of files, rename folders, and so on. Customers should therefore be able to see who initiated each snapshot mount and restoration job, which matters even more when different admins initiate different restoration jobs.

We extended the snapshot metadata and snapshot restore services to audit the following events:

Snapshot preview request
Snapshot unmount preview request
Snapshot restore initiated request
Snapshot restore completed event
Download files from snapshot preview
Preview files from snapshot preview
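
One way to model these audited events is an enumeration plus an append-only record list. The sketch below is hypothetical: the member names, their string values, the `AuditRecord` fields, and the admin names are illustrative and not taken from Egnyte's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import List


class AuditEvent(Enum):
    # One member per audited event listed above; string values are assumed.
    PREVIEW_REQUESTED = "snapshot_preview_request"
    UNMOUNT_PREVIEW_REQUESTED = "snapshot_unmount_preview_request"
    RESTORE_INITIATED = "snapshot_restore_initiated"
    RESTORE_COMPLETED = "snapshot_restore_completed"
    FILE_DOWNLOADED = "snapshot_file_download"
    FILE_PREVIEWED = "snapshot_file_preview"


@dataclass(frozen=True)
class AuditRecord:
    event: AuditEvent
    admin: str        # who initiated the action
    snapshot_id: str
    at: datetime


audit_log: List[AuditRecord] = []


def audit(event: AuditEvent, admin: str, snapshot_id: str, at: datetime) -> None:
    audit_log.append(AuditRecord(event, admin, snapshot_id, at))


now = datetime(2022, 6, 17, tzinfo=timezone.utc)
audit(AuditEvent.PREVIEW_REQUESTED, "alice", "snap-001", now)
audit(AuditEvent.RESTORE_INITIATED, "bob", "snap-001", now)
restores_by_bob = [r for r in audit_log
                   if r.admin == "bob" and r.event is AuditEvent.RESTORE_INITIATED]
```

Recording the initiating admin on every record is what makes it possible to tell apart restoration jobs started by different admins.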

Admins can navigate to the reports section and find the new reports under Audit reports, where they can create, delete, and schedule reports, and view or export the report data.

Production Readiness

The system collects many helpful metrics we use to track its performance, including metrics on:

Snapshot added/deleted from inventory
Time taken for mount/unmount requests (in seconds)
Number of preview requests from snapshots
Time taken for restore of files and folders
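
Metrics like these are typically a mix of counters and timers. The following is a generic sketch using only the Python standard library; the metric names and the `timed` helper are made up for illustration and do not reflect the monitoring stack Egnyte uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List

counters: DefaultDict[str, int] = defaultdict(int)
timings: DefaultDict[str, List[float]] = defaultdict(list)


@contextmanager
def timed(metric: str) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[metric].append(time.perf_counter() - start)


# Hypothetical metric names; the workload is a stand-in for a real mount.
with timed("snapshot.mount.seconds"):
    sum(range(1000))
counters["snapshot.inventory.added"] += 1
counters["snapshot.preview.requests"] += 1
```

The `finally` clause means a duration is recorded even when the wrapped operation raises, so failed mounts would still show up in the timing data.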

Finally, here are some additional stats about how the system operates, including figures on scalability and performance:

106 metadata snapshots are taken every four hours across data centers
Mounting a 1 TB snapshot takes around five minutes.
Unmounting a snapshot finishes in a few seconds.
A sequential metadata restore of 10 TB (339k files) took around two hours.
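
As a back-of-the-envelope reading of these figures, the short calculation below derives two rough numbers. It assumes that every snapshot is kept for the full 15-day lifecycle, that "339k" means exactly 339,000 files, and that "around two hours" means exactly 7,200 seconds, so the results are estimates only.

```python
SNAPSHOTS_PER_CYCLE = 106  # snapshots taken in each cycle
CYCLE_HOURS = 4            # one cycle every four hours
RETENTION_DAYS = 15        # snapshot lifecycle

snapshots_per_day = SNAPSHOTS_PER_CYCLE * (24 // CYCLE_HOURS)  # 106 * 6 = 636
retained_snapshots = snapshots_per_day * RETENTION_DAYS        # 636 * 15 = 9540

files_restored = 339_000
restore_seconds = 2 * 60 * 60
files_per_second = files_restored / restore_seconds            # roughly 47
```

Under those assumptions, the inventory would hold on the order of 9,500 snapshots at any time, and the sequential restore processed roughly 47 files per second.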

Ultimately, we have ensured that if our customers are at risk of losing their data, they have a simple-to-use, fast, and robust system for recovering it.
