The Real Science Behind Data Quality

November 6, 2018

We’re constantly reminded of just how much data impacts our lives, from influencing elections around the world to monitoring employees at work. But there are many more situations we rarely hear about, like the competitors in the Esport championships using data to win a multi-million dollar prize. As technology writer Andrew Wooden puts it, “data has become, quite literally, a game changer.”

For data to be a game changer, it needs to be high quality. Data quality reflects how well a set of values serves a purpose in a specific context. At minimum, quality data should be:

Relevant: directly related to the outcomes required of an analysis
Accurate: consistent with what is being measured and free of mistakes
Complete: reflective of all the pieces of the problem being solved

However, assessing data quality is not always simple. There are numerous variables to consider. To make it easier to fine-tune these considerations for all business requirements, I’ve broken them down to the main four:

Gathering data

The first step of any successful project is determining the right data to use and figuring out where to find it. Since the attributes in a single row of data can come from multiple systems of record, it’s important to use a data dictionary to identify the attributes that are available, along with their descriptions and uses. If a data dictionary doesn’t exist, then creating one should be the first step.

Gaining insight with data analytics

When sourcing data for a dataset, we want to optimize the set of extracted attributes, as they’ll need to be cleaned, enriched, and monitored for changes if used in a model. For example, attributes should be analyzed for “information density,” filtering out anything that doesn’t add valuable information. One way to measure an attribute is to evaluate its information entropy, which will be lower for attributes like “Job Status” (a handful of possible values) and higher for attributes like “Job IDs” (nearly every value unique), possibly indicating lower importance for sourcing “Job Status.”
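As a rough illustration of the entropy comparison described above, here is a minimal Python sketch. The function name `shannon_entropy` and the sample values are assumptions made up for this example, not data from a real system.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of a sequence of attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical sample data, invented for illustration only.
job_status = ["done", "done", "done", "failed", "done", "done", "done", "done"]
job_ids = ["j1", "j2", "j3", "j4", "j5", "j6", "j7", "j8"]

status_entropy = shannon_entropy(job_status)  # few distinct values, skewed: low
id_entropy = shannon_entropy(job_ids)         # every value unique: log2(8) bits

print(f"Job Status: {status_entropy:.3f} bits")
print(f"Job IDs:    {id_entropy:.3f} bits")
```

Entropy alone is only a screening signal: it describes how spread out an attribute's values are, not whether the attribute is useful for a particular model.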

To further optimize extracted attributes, the relationships between them should be analyzed for coupling and interdependence. Conditional information entropy can be used to measure interdependence between attributes, so that focus can be placed on the important attributes and dependent attributes can be eliminated.
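One way to sketch this in Python is to use the identity H(Y | X) = H(X, Y) - H(X). A conditional entropy near zero suggests that Y is almost fully determined by X and may be a candidate for elimination. The attribute names and values below are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conditional_entropy(target, given):
    """H(target | given) = H(target, given) - H(given), in bits."""
    joint = list(zip(target, given))
    return entropy(joint) - entropy(given)


# Hypothetical attributes, invented for illustration only.
country = ["US", "US", "DE", "DE", "FR", "FR"]
currency = ["USD", "USD", "EUR", "EUR", "EUR", "EUR"]
plan = ["free", "paid", "free", "paid", "free", "paid"]

# Currency is fully determined by country, so knowing country leaves no uncertainty.
currency_given_country = conditional_entropy(currency, country)
# Plan is independent of country here, so a full bit of uncertainty remains.
plan_given_country = conditional_entropy(plan, country)

print(f"H(currency | country) = {currency_given_country:.3f} bits")
print(f"H(plan | country)     = {plan_given_country:.3f} bits")
```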

Entropy measures can also be used when comparing attribute values across data slices (say, by customer) to better understand available data. This is valuable when data is complex or when complex post-processing is used to enrich data, which can result in incorrect mappings that apply default values. For example, for a given group we may expect interactions to be evenly distributed across a set of users, but a few users may dominate all interaction within that group. This can be valuable information when building a good model.
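A possible way to check for this kind of skew is to compute a normalized entropy per slice, where 1.0 means interactions are spread perfectly evenly and lower values mean a few users dominate. The sketch below assumes events arrive as hypothetical `(group, user)` pairs; the helper name and data are illustrative only.

```python
import math
from collections import Counter, defaultdict


def normalized_entropy(values):
    """Entropy divided by its maximum (log2 of the distinct count); range 0.0 to 1.0."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0  # a single distinct value carries no information
    total = len(values)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))


# Hypothetical (group, user) interaction events, invented for illustration only.
events = (
    [("g1", u) for u in ["a", "b", "c", "d"]]
    + [("g2", "a")] * 6
    + [("g2", "b"), ("g2", "c")]
)

users_by_group = defaultdict(list)
for group, user in events:
    users_by_group[group].append(user)

evenness = {g: normalized_entropy(users) for g, users in users_by_group.items()}
for group, score in sorted(evenness.items()):
    print(f"{group}: evenness {score:.2f}")
```

Here `g1` scores 1.0 because its four users interact equally, while `g2` scores noticeably lower because user `a` accounts for most of its events.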

Managing and manipulating data

Along with taking measures to protect and manage data, you may be required to manipulate data in order to increase its quality. When doing so, certain data attributes may need to be anonymized and enriched. In these situations, consider using these techniques:

Generalizing numeric values: Taking values representing age and converting them into ranges such as [0,20), [20,40), or [40,60).
Generalizing categories: Taking employment types such as municipal, local, state, and federal, and defining them in broader terms such as government.
Randomizing values: Replacing identifiers such as social security numbers (SSNs) and file and folder names with random values.
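The three techniques above could be sketched in Python as follows. All function names, the bucket width, and the category mapping are assumptions chosen for illustration. The age buckets are written as half-open intervals so that a boundary value such as 20 falls into exactly one range.

```python
import secrets


def generalize_age(age, width=20):
    """Map an exact age to a half-open range label such as '[20,40)'."""
    low = (age // width) * width
    return f"[{low},{low + width})"


# Hypothetical mapping from specific employment types to a broader category.
EMPLOYER_CATEGORY = {
    "municipal": "government",
    "local": "government",
    "state": "government",
    "federal": "government",
}


def generalize_employer(kind):
    """Return the broader category, or the original value if none is defined."""
    return EMPLOYER_CATEGORY.get(kind, kind)


def make_randomizer():
    """Return a function that maps each distinct value to a stable random token."""
    mapping = {}

    def randomize(value):
        if value not in mapping:
            mapping[value] = secrets.token_hex(8)  # 16 hex characters
        return mapping[value]

    return randomize


randomize_ssn = make_randomizer()
record = {"age": 37, "employer": "state", "ssn": "000-00-0000"}  # made-up record
anonymized = {
    "age": generalize_age(record["age"]),
    "employer": generalize_employer(record["employer"]),
    "ssn": randomize_ssn(record["ssn"]),
}
print(anonymized)
```

Keeping the token mapping in memory makes the same input always produce the same token, which preserves joins across records. It also means that whoever holds the mapping can reverse the anonymization, so the mapping should be protected or discarded depending on the use case.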

Along with anonymization, it may also be necessary to enrich extracted attributes, especially because it’s rare to get all required attributes from source systems when building a good model. Enriching attributes may be as easy as a simple geo-lookup for IP addresses, or as complex as annotating images for subsequent classification.

There are outsourcing services that can help simplify the process of data enrichment, including Amazon Mechanical Turk, but these can introduce privacy issues if personal information is shared with anonymous annotators. Another option is to use internal, controlled annotation workflows.

Monitoring a model

Once a model with satisfactory performance has been created, the number of attributes that need to be sourced can be reduced by determining the importance of inputs (features), which is most straightforward for linear models. This process may not be easy (or even possible) for all types of models, as the impact of individual features can depend on other features in non-linear ways.

After a model goes live, it’s important to continuously monitor its inputs to ensure that the “shape” of the attributes does not deviate significantly from what was seen during the training phase. Various statistical and entropy-based measures can be used to track this automatically. Without monitoring, a deployed model can drift away from its original goals and may eventually start returning incorrect results.
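As one example of an entropy-based check, the sketch below compares the distribution of a categorical input at training time with its live distribution using the Jensen-Shannon divergence. This is only one of several possible measures, and the function names and the 0.1 threshold are assumptions for illustration; a real threshold would need to be tuned per attribute. Both inputs are assumed to be non-empty.

```python
import math
from collections import Counter


def distribution(values, categories):
    """Relative frequency of each category, in the given category order."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def has_drifted(training_values, live_values, threshold=0.1):
    """Flag an attribute whose live distribution diverges from training.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    categories = sorted(set(training_values) | set(live_values))
    p = distribution(training_values, categories)
    q = distribution(live_values, categories)
    return js_divergence(p, q) > threshold


# Hypothetical attribute values, invented for illustration only.
training = ["pdf"] * 50 + ["docx"] * 50
live_same = ["pdf"] * 48 + ["docx"] * 52
live_shifted = ["pdf"] * 5 + ["docx"] * 15 + ["zip"] * 80

print(has_drifted(training, live_same))
print(has_drifted(training, live_shifted))
```

The Jensen-Shannon divergence is used here instead of plain Kullback-Leibler divergence because it stays finite when a category appears in only one of the two samples, which is common when new values show up in production.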

There are huge opportunities in data, but only for those who take advantage. Don’t let data overwhelm or prevent you from taking action, because there are only four parts to getting quality data.

For information on how to analyze files within a content repository to get better quality datasets, take a look at our blog on how machine learning techniques impact file analysis.
