Data Standardization: How It’s Done & Why It’s Important
What Is Data Standardization?
Data standardization is the process of creating standards and transforming data taken from different sources into a consistent format that adheres to the standards. These data standards cover everything from capitalization (e.g., all caps), punctuation (e.g., Ave or Ave.), acronyms (e.g., VP Marketing, V.P, of Marketing), non-alphanumeric characters (e.g., phone numbers—123-456-7890 vs. 1234567890), and values in incorrect fields.
Data standardization is critical to facilitating and improving the use of data, especially as related to data portability (i.e., the ability to transfer data without affecting its content) and interoperability (i.e., the ability to integrate two or more datasets). Three key impediments to data portability and interoperability are:
- Metadata uncertainties
- Data transfer obstacles
- Missing data
Data cleansing and data standardization are often, inaccurately, used interchangeably.
- Data cleansing is the process of improving the quality of data by discarding or correcting incomplete, irrelevant, or inaccurate data.
- Data standardization is focused on transforming the data to a standard format to correct and harmonize it as well as to remove any duplicates.
Why Is Data Standardization Important?
Data standardization is an important function, because it provides a structure for creating and maintaining data quality by:
- Defining how data should be formatted
- Eliminating extraneous data
- Identifying data errors
Data standardization helps reduce challenges related to poor data quality, including increased operational costs, unreachable customers, and misinformed decisions. Taking steps to implement and maintain data standardization provides a number of benefits, including that it:
- Creates a single view
- Enables cross-functional analytics to drive improved insights
- Enhances productivity
- Ensures a clean, trusted master database that can be governed
- Harmonizes data across systems
- Makes data shareable and leverageable across an organization
- Provides data transparency
- Reduces costs
How to Standardize Data
The following data standardization steps can be performed using Excel or third-party, purpose-built tools. After data has been audited and cleansed, processes and systems should be established to maintain data quality and enforce data standardization rules. Third-party tools are helpful not just for the initial data standardization, but also for maintaining data quality.
Audit and Evaluate Data Sources
Consider standardization issues and other data-related issues to determine where fixes are needed to ensure that data is:
- Correctly formatted (i.e., standardized)
As part of the data standardization audit, work with internal stakeholders to identify data that needs work. The audit should cover:
- Data sources—internal and external
- Frequency of data generation from each data source
- Teams that manage and use each data source
- When changes are made to the data
Declutter Data Sources
Create a set of standardized criteria for data that is stored to remove and prevent the population of clutter and duplicate records. It is often easier to standardize the data and then remove extraneous data, as clean data is easier to evaluate. Common data issues that should be addressed as part of data standardization are:
- Duplicate data
- Irrelevant data
- Redundant data
- Inaccurate data
- Low-quality data
Assess Data Collection Methods
A key to data quality and data standardization is understanding how low-quality data gets into the database and determining where data collection methods can be fixed or optimized. Considerations when assessing data collection methods include:
- Internal and external data input forms—be sure that form fields have validation in place
- Third-party data imports
Define Data Standards
Data standardization requires that standards be defined according to a set of rules for each field. Examples of commonly used standards for fields include:
- First names should be capitalized and contain no spaces, numbers, or extra characters.
- Phone number formatting should be consistent, following a single format (e.g., 1-123-456-7890 or 1-(123)456-7890.
- States should be entered using a consistent format (e.g., Louisiana or LA).
- Emails should use the standard email format of “firstname.lastname@example.org.”
- Website URLs should include the full address, i.e. https://www.domain.com, and not domain.com or www.domain.com.
- Job titles should either be abbreviated or spelled out (e.g., CEO or Chief Executive Officer, V.P. of Marketing or Vice President of Marketing).
When data standards have been established, data standardization issues should be addressed throughout the relevant databases. There are two main categories for data standardization:
- 1. Source-to-target mapping—to specify data elements that are used in applications
- 2. Reconciliation—to compare different sets of data to confirm that they aligned
Standardizing Data in Excel
The Excel STANDARDIZE function is available under Excel Statistical functions. It is used to return a normalized value, also known as a z-score.
The z-score is based on the mean and standard deviation. A z-score, or standard score, provides a way to standardize scores on the same scale by dividing a score’s deviation by the standard deviation in a dataset. The resulting z-score measures the number of standard deviations that a given data point is from the mean.
The average of every z-score for a dataset is zero. A negative z-score indicates a value less than the mean, and a positive z-score indicates a value greater than the mean.
=STANDARDIZE(x, mean, standard_dev)
The STANDARDIZE function uses the following required arguments (i.e., the values that functions use to perform calculations).
X is the value that we want to normalize.
Mean is the arithmetic mean of the distribution.
Standard_dev is the standard deviation of the distribution.
A few notes about the STANDARDIZE function:
- #NUM! error occurs if the given standard_dev argument is less than 0.
- #VALUE! error occurs if any of the given arguments are non-numeric.
Excel AVERAGE Function
The AVERAGE function can be used to calculate the arithmetic mean or average of a dataset. In this case AVERAGE(number1,[number2],…).
Users have two main options for calculating the standard deviation of a dataset based on the size of the dataset:
- STDEV.P: STDEV.P(number1,[number2],...) //Use if you are using an entire dataset
The Excel STDEV.P function calculates the standard deviation for a sample set of data. STDEV.P calculates standard deviation using the “n” method, ignoring logical values and text.noring logical values and text.
- STDEV.S: STDEV.S(number1,[number2],...) //Use if you are using a sample of data
The Excel STDEV.S function calculates the standard deviation for a sample set of data. STDEV.S replaces the older STDEV function with the same behavior.
Excel SMALL Function
The Excel SMALL function returns a numeric value based on its position in a list when sorted by value in ascending order. SMALL can return the “nth smallest” value (1st smallest value, 2nd smallest value, 3rd smallest, etc.
Excel LARGE Function
The Excel LARGE function returns a numeric value based on its position in a list when sorted by value in descending order. LARGE can retrieve the “nth largest” value – 1st largest value, 2nd largest value, 3rd largest, etc.
Excel RANK Function
The Excel RANK function returns the rank of a numeric value when compared to a list of other numeric values. RANK can rank values from largest to smallest (e.g., top sales) as well as smallest to largest (e.g., fastest time).
Excel RANK.AVG Function
The Excel RANK.AVG function returns the rank of a number against a list of other numeric values. When values contain duplicates, the RANK.AVG function will assign an average rank to each set of duplicates.
The OMOP Common Data Model
The Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) addresses the issues that arise when healthcare systems and providers use different names for a particular data field, which makes evaluating and comparing data across systems almost impossible. Created by the Observational Health Data Sciences and Informatics (OHDSI), a collaborative, multi-stakeholder, interdisciplinary effort, the OMOP CDM brings out the value of observational health data by enabling large-scale analytics.
OHDSI maintains an open-source library of analytical tools for research and performance measurement using the OMOP CDM. Standardized structured query language (SQL) queries are shared in a common open-source repository, and detailed data documentation is freely available online.
The concept behind this approach is to transform data contained in disparate databases into a common format (i.e., data model) and to use a common representation (i.e., terminologies, vocabularies, coding schemes). This allows researchers to perform systematic analyses using a library of standard analytic routines that have been written based on the common format.
Data Standardization—Eliminate Debris and Enjoy the Data
Consistency and cleanliness are the hallmarks of data standardization. Extraneous, incorrect, and inconsistent data is transformed into clean, quality data that can be easily shared and analyzed. Data standardization and subsequent enforcement of data standards ensure that organizations are able to use data effectively and that storage capacity is optimized by eliminating debris.
Egnyte has experts ready to answer your questions. For more than a decade, Egnyte has helped more than 16,000 customers with millions of customers worldwide.
Last Updated: 7th January, 2022