What is Data Anonymization?
Data anonymization is the permanent and complete alteration or removal of sensitive or personally identifiable information (PII) from a dataset. PII includes, but is not limited to, any information that may lead to the identification of an individual, such as an address, telephone number, image, date of birth, or social security number.
Data Anonymization at a Glance
Data anonymization is:
- Process of removing private or confidential information from raw data
- Results in anonymous data that cannot be associated with anyone or anything, such as individuals, organizations, or projects
Why perform data anonymization?
- Private activities
- Sensitive information
How to perform data anonymization:
- Custom anonymization
- Directory replacement
- Masking out
- Nulling out
- Number and date variance
- Scrambling or Shuffling
Interest in data anonymization continues to increase as privacy concerns grow. Because data anonymization allows for de-identifying original confidential data, it helps organizations meet strict privacy mandates from private, industry, and government rules regarding data protection. In addition, if data is anonymized and an organization suffers a breach, there is no requirement to report it, because no sensitive information has been leaked.
Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymized, the anonymization must be irreversible.General Data Protection Regulation’s directive on data anonymization
The process of data anonymization converts sensitive information into a format that can no longer be associated with the source data. In most cases, once this data is stripped of sensitive or personally identifying elements, these elements can never be re-associated with the original data.
Data Anonymization Techniques
The process of data anonymization uses various de-identification techniques to make sure that sensitive information is no longer identifiable.
Data is stored in aggregate for summarization. For example, individuals’ dates of birth are not included in a dataset, but the number of people of each age group is known. Aggregation is often used for data anonymization when information is being sold.
This data anonymization method involves creating and implementing a unique technique or a combination of several techniques using scripts or an application.
Makes changes to the sensitive data but maintaining consistent relations between other values. The data is pseudonymized in this way. To anonymize, the separately stored information that identifies the sensitive information must be deleted.
Encrypting data makes it unreadable to anyone without the decryption key, rather than removing sensitive data.
Data is grouped to obfuscate it. Examples of generalization are using only states rather than cities or area codes rather than the full numbers.
With hashing, data points are replaced with a hash (i.e., an alphanumeric string).
Sensitive data is removed and deleted from the data set, with those data elements becoming null values.
Data elements are altered slightly. Therefore, this data aggregation technique only can be used if accurate data is not needed.
Substitutes the identifying data with a placeholder, but the data can be re-identified with the right information.
Involves noise addition (i.e., adding or multiplying randomized numbers) to express sensitive data elements imprecisely.
Scrambling / Shuffling
With scrambling or shuffling, the letters or digits in the sensitive data are mixed.
Disadvantages of Data Anonymization
Allowing for Re-Identification
Performing data anonymization in a way that can be reversed is inherently risky. Having a way to reverse the de-identification leaves open the possibility of an unauthorized user being able to reveal sensitive information that was thought to be protected.
Differing Criteria and Processes for Data Anonymization
Varying techniques for data anonymization yield differing results when de-identifying sensitive information in datasets. In addition, oversight of the processes used to perform these methods varies, which can also impact the efficacy of the techniques.
Lack of Regulation for Data Anonymization
Performing data anonymization on some information can cause problems related to compliance and legal requirements. For example, some may have rules about producing certain records that could be impossible if they anonymized, such as:
- Data protection laws—local, state, federal, and international
- Export control regulations
- Industry compliance regulations
- Internal data security rules
Loss of Data Quality
Some data anonymization techniques devalue data by diluting it so much that it is difficult for the data to be used.
Data Anonymization and GDPR
When considering General Data Protection Regulation (GDPR) compliance, note that anonymized data is out of the scope of personal data. According to Recital 26 GDPR, “…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
Based on this, it could be argued that either data anonymization or data pseudonymization would meet the requirements for protecting PII. Data anonymization or data pseudonymization decisions made for GDPR compliance come down to an acceptable level of risk relative to effort.
Data pseudonymization techniques are more straightforward than those for data anonymization and can achieve data protection set forth by GDPR. But, there are risks that the data could be de-identified. Data anonymization is more complicated, but virtually eliminates the risk of PII data being compromised.
Data Anonymization vs. De-Identification
Using PII or electronic medical records (EMF), for example, the differences between data anonymization and de-identification are:
|Any links between a person and PII data have been irreversibly broken, making it virtually impossible to re-establish the identity of a person in the original record.||Personal identifiers in a record have been extracted, and it would be very difficult to re-establish the identity of a person in the original record.|
The biggest difference between de-identification and data anonymization is that there is no re-identification of anonymized records, because the links back to the subjects are irreversibly broken. However, with de-identification, the people and associated records can be put back together.
There are legal implications for choosing data anonymization or de-identification. An example is with the use of medical records and human tissues in biomedical research, which is controlled by both The Common Rule (Title 45 Code of Federal Regulations, Part 46, Protection of Human Subjects) and the Standards for Privacy of Individually Identifiable Health Information, Final Rule (part of HIPAA).
In this scenario, HIPPA only requires de-identification, but the Department of Health and Human Services requires data anonymization. This is just one of many examples of the challenges and considerations with data anonymization vs. de-identification.
Data Anonymization Best Practices
Data anonymization best practices start with the designation of a person or team to manage and evaluate requests as well as develop related forms, processes, and documentation. This role’s responsibilities should include working with requesters to:
- Understand the reason for the request.
- Analyze the data and identify what data elements need to be anonymized.
- Evaluate the challenges and risks.
- Gather information related to required resources and budget.
- Ensure that approved projects are successfully completed.
Balance Data Anonymization Risks and Benefits
The use case should drive decisions about the best way to handle data anonymization. This will direct the right level of data anonymization (i.e., if data pseudonymization is a viable option or not).
A balance needs to be struck between the degree of risk involved in re-identification and how the data will be used. This is because the more data is anonymized, the less value it has for analysis. When choosing the best data anonymization technique, always consider laws and regulations, the nature of the information, and its intended use.
Egnyte has experts ready to answer your questions. For more than a decade, Egnyte has helped more than 16,000 customers with millions of customers worldwide.
Last Updated: 25th August, 2021