Data quality is the measure of the condition of data and its readiness to serve an intended purpose. The measurement is based on data quality dimensions—accuracy, completeness, consistency, integrity or validity, timeliness, and uniqueness or deduplication.
What Is Data Quality?
Data quality encompasses the planning, implementation, and control of protocols that apply quality management techniques to data. These data governance systems and processes are used to keep data quality high and to optimize the data's usability throughout organizations. By comparison, poor-quality data is inaccurate, incomplete, or inconsistent.
Data quality is of increasing importance as it has become the mainstay of data analytics used to help organizations make informed decisions. Measuring data quality levels helps organizations make ongoing improvements by identifying errors requiring resolution and assessing how and if data serves its intended purpose. Systems can be used to automate error resolution or alerts sent when manual intervention is required.
Uses of data quality include the following:
- Enhance organizational efficiency and productivity.
- Ensure that proof of compliance is readily available.
- Increase the value of an organization’s data and the opportunities to use it.
- Protect the organization’s reputation.
- Reduce the risks and expenses associated with poor quality data.
“Data quality is directly linked to the quality of decision making. Good quality data provides better leads, better understanding of customers, and better customer relationships. Data quality is a competitive advantage that D&A leaders need to improve upon continuously.”
Melody Chien, Senior Director Analyst
Data Quality Dimensions
Data quality dimensions categorize different types of measurements. Providing an agreed-upon set of metrics for assessing the level of data quality in different operational or analytic contexts helps organizations:
- Define rules that represent the validity expectations
- Determine minimum thresholds for acceptability
- Measure data quality against acceptability thresholds
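The three activities above can be sketched in a few lines of Python. This is a minimal illustration rather than a production framework; the record layout, rule names, and the 95% acceptability threshold are all assumptions:

```python
# Minimal sketch: define validity rules, measure a dataset against them,
# and compare each pass rate to an assumed acceptability threshold.

records = [
    {"email": "a@example.com", "age": 34},
    {"email": "not-an-email", "age": 29},
    {"email": "b@example.com", "age": -5},
]

# Rules that represent the validity expectations (illustrative).
rules = {
    "email_has_at": lambda r: "@" in r["email"],
    "age_in_range": lambda r: 0 <= r["age"] <= 120,
}

def pass_rate(records, rule):
    # Fraction of records that satisfy the rule.
    return sum(rule(r) for r in records) / len(records)

THRESHOLD = 0.95  # minimum acceptable pass rate (assumed)
for name, rule in rules.items():
    rate = pass_rate(records, rule)
    print(f"{name}: {rate:.0%} {'OK' if rate >= THRESHOLD else 'BELOW THRESHOLD'}")
```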
Data quality dimensions include the following six core quantitative and qualitative properties:
1. Accuracy—the degree of agreement with an identified source of correct information
2. Completeness—confirmation that all data that should be present is present and the level of data that is missing or unusable
3. Consistency—the level to which values and records are represented in the same way within and across datasets
4. Integrity or validity—the level of data matching a reference and the degree of data corruption
5. Timeliness—the degree to which data reflects the correct point in time
6. Uniqueness or deduplication—the level of duplicates and non-duplicates
Accuracy
Accuracy refers to the number of errors in the data and measures the extent to which recorded data represents the real-world object. In many cases, accuracy is measured by how well values agree with an identified source of correct information.
The sources of correct information can include a database of record, a similar corroborative set of data values from another table, dynamically computed values, or a manual process of contacting sources of truth to validate accuracy. Data provenance (i.e., where it originated) and data lineage (i.e., how it has been used) are often part of the accuracy data dimension as certain sources are deemed more credible than others.
Accuracy is challenging to measure and monitor, because it requires a secondary source of corroboration, and the real-world information can change over time (e.g., age, address, color, size). Once an acceptable level of data quality is reached, data governance can be used to maintain it and ensure that it does not degrade.
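As a rough illustration, accuracy can be computed by comparing recorded values against a corroborating source of record. The customer IDs and city values below are invented:

```python
# Sketch: measure accuracy as agreement with a source of record.
recorded = {"cust-1": "Berlin", "cust-2": "Paris", "cust-3": "Rome"}
source_of_record = {"cust-1": "Berlin", "cust-2": "Lyon", "cust-3": "Rome"}

# Count values that agree with the corroborating source.
matches = sum(recorded[k] == source_of_record.get(k) for k in recorded)
accuracy = matches / len(recorded)
print(f"accuracy: {accuracy:.0%}")  # 2 of 3 values agree
```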
Completeness
Completeness means that all data elements have valid values (i.e., not null) assigned to them and have not been left blank. There are three ways that completeness is characterized:
1. Asserting mandatory value assignment—the data element must have a value.
2. Expressing value optionality, where the data element is required to have a value only under specific conditions.
3. Indicating that a value is inapplicable for the data element, such as a waist size for a pair of shoes.
The measurement of completeness determines if there are gaps in data, and if so, where. It also assesses the importance of the missing object in the processes where the data will be used. In some cases, gaps are acceptable, such as for fields that are optional or if one or another field is acceptable (e.g., email or phone number, home or mobile number).
Data completeness can also be measured for whole records (e.g., complete means a record for every customer) as well as for attributes (e.g., complete means all mandatory data is in the record, but not all optional data).
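A minimal sketch of measuring completeness at the record level, assuming a schema that distinguishes mandatory from optional fields (the field names are illustrative):

```python
# Sketch: a record is complete if every mandatory field has a non-null
# value; optional fields may be null without affecting completeness.

MANDATORY = {"name", "email"}
OPTIONAL = {"phone"}

records = [
    {"name": "Ana", "email": "ana@example.com", "phone": None},  # complete
    {"name": "Bo", "email": None, "phone": "555-0100"},          # gap in email
]

def is_complete(record):
    return all(record.get(f) is not None for f in MANDATORY)

complete_records = sum(is_complete(r) for r in records)
print(f"{complete_records}/{len(records)} records complete")
```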
Consistency
Consistency is based on a set of predefined constraints, established as rules, and is measured by how well different systems storing information about the same object stay in sync. The rules specify consistency relationships between attribute values across a record, or for all values of a specific attribute. Consistency protocols address the many ways that process errors are replicated across different platforms, which can produce data values that are consistent even though they are not correct.
Consistency also refers to data values in one dataset being consistent with values in another dataset. A positive measure of consistency would result when two data values drawn from separate datasets are in sync with each other. It is important to remember that consistency does not necessarily imply accuracy.
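A simple cross-system consistency check might look like the following sketch. The two systems, keys, and status values are assumptions, and note that agreement alone says nothing about accuracy; both copies could be wrong:

```python
# Sketch: flag objects whose status disagrees between two systems
# that store information about the same customers.

crm = {"cust-1": {"status": "active"}, "cust-2": {"status": "closed"}}
billing = {"cust-1": {"status": "active"}, "cust-2": {"status": "active"}}

inconsistent = [
    k for k in crm
    if k in billing and crm[k]["status"] != billing[k]["status"]
]
print("out of sync:", inconsistent)
```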
Integrity or Validity
Data integrity, also referred to as validity, determines to what extent the data conforms to various requirements placed on attribute values (e.g., format, type, range). It is measured as a comparison between the data and the metadata or documentation for the data item.
Validity checks verify that data conforms to the requirements for a particular attribute and measure the extent of conformity. Integrity, or validity, ensures that the data is formatted to be accepted and processed by systems.
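A sketch of validity checks driven by metadata describing expected type, format, and range; the metadata layout and field specifications here are assumptions:

```python
# Sketch: validate values against metadata (type, format, range).
import re

metadata = {
    "sku":   {"type": str, "pattern": r"^[A-Z]{3}-\d{4}$"},
    "price": {"type": float, "min": 0.0},
}

def is_valid(field, value):
    spec = metadata[field]
    if not isinstance(value, spec["type"]):
        return False                      # wrong type
    if "pattern" in spec and not re.match(spec["pattern"], value):
        return False                      # wrong format
    if "min" in spec and value < spec["min"]:
        return False                      # out of range
    return True

print(is_valid("sku", "ABC-1234"))  # conforms to the format
print(is_valid("price", -3.5))      # out of range
```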
Timeliness
Timeliness reflects the degree to which data represents reality at the required point in time and is affected by how often data is refreshed. It is measured in relation to a set schedule or to the occurrence of an event, such as new information entering a CRM in real time or through scheduled, manual imports. Timely data provides users with the most relevant information for their intended uses.
Timeliness also refers to the time expectation for accessibility and availability of information. In this case, timeliness is measured as when information is expected and when it is readily available for use. It focuses on the synchronization of data updates to application data with the centralized data source.
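One way to sketch a timeliness check is as a freshness comparison against a required refresh interval; the 24-hour limit below is an assumed requirement:

```python
# Sketch: data is timely if the lag since its last refresh is within
# an assumed freshness requirement.
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(hours=24)  # assumed requirement

def is_timely(last_refreshed, now=None):
    now = now or datetime.now(timezone.utc)
    return now - last_refreshed <= FRESHNESS_LIMIT

now = datetime(2022, 1, 25, 12, 0, tzinfo=timezone.utc)
print(is_timely(datetime(2022, 1, 25, 3, 0, tzinfo=timezone.utc), now))  # fresh
print(is_timely(datetime(2022, 1, 20, 3, 0, tzinfo=timezone.utc), now))  # stale
```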
Uniqueness or Deduplication
The dimension of uniqueness means that no object exists more than once within a dataset or record. Uniqueness, or deduplication, is measured by how much duplicate data exists.
Objects in data sources should be captured, represented, and referenced uniquely within the relevant application architectures. When there is an expectation of uniqueness, data instances should not be created if that entity has an existing record.
The uniqueness dimension can be monitored in two ways:
1. Static assessment—analyzing the dataset or record to determine if duplicate objects exist
2. Ongoing monitoring—providing an object matching and resolution service
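The static assessment can be sketched as a scan for duplicate keys; the choice of email as the matching key is an assumption:

```python
# Sketch of the static assessment: count key occurrences and report
# duplicates plus an overall uniqueness level.
from collections import Counter

records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},  # duplicate of id 1 by email
]

counts = Counter(r["email"] for r in records)
duplicates = {k for k, n in counts.items() if n > 1}
# Uniqueness: fraction of records remaining after removing extra copies.
uniqueness = (len(records) - sum(n - 1 for n in counts.values() if n > 1)) / len(records)
print("duplicate keys:", duplicates)
print(f"uniqueness: {uniqueness:.0%}")
```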
Improving Data Quality
As a first step toward improving data quality, organizations often perform data asset inventories to assess data according to dimension measurements. Another common step is to establish data quality rules that specify required quality levels in data sets.
Ten additional steps to ensure that data quality standards are maintained are:
1. Assign accountability for data quality
2. Consider how data will be integrated and distributed, accounting for the connections between business processes, key performance indicators (KPIs), and data
3. Create a plan for data governance, with a focus on identifying and correcting errors
4. Define minimum standards for data quality
5. Develop data collection objectives and implement a plan that includes continuous improvement
6. Instill in everyone in the organization the importance of maintaining and improving data quality
7. Monitor critical data assets, such as master data, and correct issues in a timely manner
8. Prioritize datasets for quality improvement efforts
9. Set data quality standards
10. Use data profiling to describe data using metadata
Tactical ways to maintain data quality include creating, implementing, and enforcing the following steps:
- Identify and eliminate duplicates, providing a means to merge records as needed
- Maintain data entry standards that are based on an agreed-upon set of guidelines
- Make fields that capture essential data mandatory
- Prevent duplicates using tools to detect potential duplicates with exact and fuzzy matching capabilities
- Take advantage of address management tools to ensure consistency in address format and validity of data
- Use options sets or fields where there is a defined list of values
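For example, duplicate prevention with exact and fuzzy matching can be approximated with the standard library's difflib; real deduplication tools use far more robust matching, and the 0.85 similarity threshold here is an assumption:

```python
# Sketch: flag pairs of names that match exactly or nearly so
# (typos, casing, punctuation).
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    # Exact match, or names that differ only slightly.
    return a == b or SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

names = ["Acme Corp", "ACME Corp.", "Globex Inc"]
pairs = [
    (x, y)
    for i, x in enumerate(names)
    for y in names[i + 1:]
    if similar(x, y)
]
print("potential duplicates:", pairs)
```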
Data Quality vs. Data Integrity
Data integrity is a subset of data quality. Data quality refers to the characteristics that determine the reliability of information.
In contrast, data integrity refers to the characteristics that determine the reliability of information in terms of its validity and accuracy. There are two varieties of data integrity: physical and logical.
Physical integrity is the protection of the completeness and accuracy of the data as it is stored and retrieved. There are four types of logical integrity: entity integrity, referential integrity, domain integrity, and user-defined integrity.
Data Quality Assurance vs. Data Quality Control
Data quality assurance and data quality control are often incorrectly used interchangeably. The difference boils down to prevention and detection.
Data quality assurance prevents poor-quality data from being entered into a system or created through integration with compromised data. It is usually applied before data acquisition. Often, data quality assurance also refers to the systems and processes that prevent poor data quality.
Data quality control refers to the detection of errors or inconsistencies in datasets—usually after being added to a data store. It includes the methods or processes that are used to determine whether data meet overall quality goals and defined data quality criteria.
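The prevention-versus-detection distinction can be sketched as two functions applying the same rule at different points; the rule itself (an email must contain "@") is an illustrative assumption:

```python
# Sketch: quality assurance rejects bad records at ingest (prevention);
# quality control audits data already in the store (detection).

def valid(record):
    return "@" in record.get("email", "")

store = []

def ingest(record):
    # Quality assurance: prevent poor-quality data from entering.
    if not valid(record):
        raise ValueError(f"rejected at ingest: {record}")
    store.append(record)

def audit(store):
    # Quality control: detect errors after data has been added.
    return [r for r in store if not valid(r)]

ingest({"email": "a@example.com"})
store.append({"email": "oops"})  # bypassed assurance, e.g. a legacy import
print("audit findings:", audit(store))
```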
Why Data Quality Is Important
There are many reasons why data quality is important. Having high standards for data quality along with the systems and processes to support it fundamentally improves all operational areas of an organization, leading to better decisions and resulting outcomes. Data quality drives better analytics, which benefits organizations in a number of ways, including:
- Decreasing risk
- Developing products or messages that appeal to the right people
- Finding potential new customers
- Gaining a competitive advantage
- Improving relationships with customers
- Increasing profitability
- Making more informed decisions
- Targeting audiences more effectively
- Using data more easily and effectively
Conversely, poor data quality can have significant negative consequences for organizations. Poor data quality is well-known to lead to operational tangles, inaccurate analytics, and negative outcomes.
Poor data quality is also expensive. According to Thomas C. Redman from “Seizing Opportunity in Data Quality” in the MIT Sloan Management Review, “the cost of bad data is an astonishing 15% to 25% of revenue for most companies.”
Examples that Illustrate Why Data Quality Is Important
The more quality data a machine learning algorithm has, the faster it can produce results, and the better those results will be.
Data quality impacts an organization’s ability to demonstrate compliance. If data is disorganized or poorly maintained, it is difficult to prepare the necessary compliance reports. On the other hand, organized, high-quality data makes compliance reporting not only easier, but also more reliable.
For Decision Making
If there is a lack of trust in data, management will be reluctant to use it to make decisions. However, high data quality standards provide a level of trust that data can be a valuable decision-making tool.
Data Quality Is Worth the Investment
Developing and enforcing data quality standards is no easy feat. It takes time, effort, and money to avoid poor data quality.
Poor data quality should not be taken lightly, as it poses significant risks and costs. It leads to negative consequences that range from wasted time to bad decisions.
Time and again, the value of data quality has been proven. Without question, it is worth the investment. Investing in data quality ensures that the right information is accessible by the right people and systems when needed, yielding positive outcomes.
Egnyte has experts ready to answer your questions. For more than a decade, Egnyte has helped more than 16,000 customers, serving millions of users worldwide.
Last Updated: 25th January, 2022