What is Unstructured Data?
Unstructured data is data of many different types that is stored in the file format in which it was created rather than organized into a well-defined schema. Usually text-heavy, it does not fit neatly into cells or a tabular file structure, such as a CSV (comma-separated values) or tab-delimited text file.
Though unstructured data has a native, internal structure that is based on the application that created it, there is no data model that organizes it in a predefined way; it is stored in its native format.
Gartner defines unstructured data as content that does not conform to a specific, pre-defined data model. It tends to be human-generated and people-oriented content that does not fit neatly into database tables.
Unstructured Data vs Structured Data
Amidst the information explosion, data falls into two distinct categories—structured and unstructured. Each has distinct attributes which drive usage.
Unstructured data characteristics
- Various formats stored in native file format
- Stored in data lakes or non-relational (i.e., NoSQL) databases
- Difficult for people to search; requires processing before algorithms can interpret it
- Requires more storage space than structured data
- Requires data science expertise
Benefits of unstructured data
Data is stored in native format, which provides access to a wider variety of more adaptable data.
Data accumulation rates are faster, because anything can be collected without the limitation of predefining the data.
Option to store data in cloud data lakes that offer massive storage.
Structured data characteristics
- Predefined, searchable formats in rows and columns
- Stored in data warehouses or relational databases (RDBMS)
- Easy for people and algorithms to search and understand
- Requires less storage space than unstructured data
- Self-service access
Benefits of structured data
Easily consumed by machine-learning algorithms, because it is highly organized.
Easy to manipulate and query data.
Accessible to business users without requiring an in-depth understanding of data types and relations.
More analysis tools have been successfully used and tested, giving data managers more options.
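The contrast between the two categories can be sketched in a few lines of Python. This is a minimal illustration, not a recommended architecture: the `orders` table, its columns, and the free-text notes are all hypothetical. The structured rows answer a question with one query, while the same kind of fact in free text needs processing before it can be used.

```python
import sqlite3

# Hypothetical order data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "vase", 24.99), (2, "bowl", 12.50), (3, "vase", 24.99)],
)

# Structured: a single declarative query answers "how much did vases earn?"
vase_total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE product = ?", ("vase",)
).fetchone()[0]
conn.close()

# Unstructured: similar facts buried in free text must be found and parsed first.
notes = [
    "Customer called about the vase she bought last week, paid 24.99.",
    "Bowl order shipped late; refunded shipping only.",
]
vase_mentions = [note for note in notes if "vase" in note.lower()]
```

The keyword filter only locates candidate notes; extracting the amount or the customer from them would take further text processing.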
Unstructured Data Examples
An umbrella for all data that is not structured, unstructured data includes a vast array of file types that are generated by both people and machines.
Text and documents, including:
- Records (e.g., medical, school)
Server, website, and application logs, including:
- Lists of activities performed
- Information about website visitors—who visits, where they came from, and what they did on the site
- Records of events that occurred within an application
- Chat logs
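Log entries such as these become searchable once they are parsed into fields. The following sketch assumes a web server that writes lines in the widely used Common Log Format; the pattern, the function name `parse_log_line`, and the sample line are illustrative assumptions, and real logs often need a more forgiving parser.

```python
import re

# Assumed layout: Common Log Format, e.g.
# 203.0.113.7 - - [13/Aug/2021:10:15:32 +0000] "GET /pricing HTTP/1.1" 200 5123
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])  # convert for numeric filtering
    return record

sample = '203.0.113.7 - - [13/Aug/2021:10:15:32 +0000] "GET /pricing HTTP/1.1" 200 5123'
record = parse_log_line(sample)
```

Lines that do not fit the assumed layout return `None` rather than raising, so a batch job can count and inspect them separately.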
Sensor data from devices, including:
- Temperature sensors
- Proximity sensors
- Water quality sensors
- Chemical sensors
- Gas sensors
- Smoke sensors
- Level sensors
- Image sensors
Image files, including:
- JPEG (or JPG)—Joint Photographic Experts Group
- PNG—Portable Network Graphics
- GIF—Graphics Interchange Format
- TIFF—Tagged Image File Format
- PSD—Photoshop Document
- PDF—Portable Document Format
- EPS—Encapsulated PostScript
- AI—Adobe® Illustrator Document
Video files, from a variety of sources, including:
- Surveillance cameras
- Zoom meetings
Audio files, including:
- Uncompressed audio formats—e.g., WAV, AIFF, AU, or raw headerless PCM
- Lossless compression formats—e.g., FLAC, Monkey's Audio (filename extension .ape), WavPack (filename extension .wv), TTA, ATRAC Advanced Lossless, ALAC (filename extension .m4a), MPEG-4 SLS, MPEG-4 ALS, MPEG-4 DST, Windows Media Audio Lossless (WMA Lossless), and Shorten (SHN)
- Lossy compression formats—e.g., Opus, MP3, Vorbis, Musepack, AAC, ATRAC, and Windows Media Audio Lossy (WMA lossy)
Email files, including:
- EML format that contains raw message content
- MBOX files, which are "mailbox" files that store the contents of mail folders
- MSG format, which is the Microsoft® Outlook format for saving individual messages, used instead of EML
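As a rough illustration of the internal structure an email file carries, the sketch below parses a small, made-up EML message with Python's standard `email` package. The addresses and message text are invented for the example; MBOX and MSG files need different handling that is not shown here.

```python
from email import message_from_string
from email.policy import default

# A made-up raw message in EML form: headers, a blank line, then the body.
raw_eml = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the report is attached.\n"
)

message = message_from_string(raw_eml, policy=default)

# The headers are already field-like; the body remains free text.
headers = {
    "from": str(message["From"]),
    "to": str(message["To"]),
    "subject": str(message["Subject"]),
}
body = message.get_content().strip()
```

The headers map cleanly onto columns, but the body, which usually holds the valuable content, still has to be analyzed as text.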
Social data created via:
- Social channels—e.g., Facebook, TikTok, Instagram, LinkedIn, WeChat, YouTube
- Microblogs—e.g., Twitter, Tumblr, Pinterest, Reddit, Sina Weibo
- Review sites
The primary challenges of unstructured data are related to the five Vs described by IBM.
- 1. Volume
The amount that is created.
- 2. Velocity
The speed at which it is produced and processed.
- 3. Variety
The heterogeneous sources of unstructured data.
- 4. Variability
Correct interpretation of data depends on context.
- 5. Value
The insights and related actions that can be derived from unstructured data, or not, if the data is of no value.
Importance of Unstructured Data
While lacking the disciplined, orderly nature of structured data, unstructured data has proven to be both powerful and valuable.
Often viewed as a problem to solve, unstructured data offers unlimited opportunities to make improvements, gain a competitive advantage, increase efficiencies, and drive change—with data-driven insights that deliver results.
The following use cases demonstrate the value that can be extracted from unstructured data.
Unstructured Data from a Customer Review
Unstructured data from a customer review reveals the details behind a rating. For instance, a reviewer gives a product two stars, but the comments say that the product was broken when it was delivered, although the call with customer service was first rate. The comment also notes that the store’s hours are inconvenient, which is another reason why the reviewer gave two stars.
From the two-star product review, it would appear that the product was inferior. However, the two-star rating was related to the condition of the product when delivered and the physical store’s hours.
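The review scenario can be sketched with a simple keyword-based pass that tags each sentence with an aspect and a polarity. The keyword lists and function names below are hypothetical and deliberately tiny; a production system would more likely rely on a trained natural language processing model.

```python
import re

# Hypothetical keyword lists, chosen only to fit the example review.
ASPECT_KEYWORDS = {
    "product condition": ("broken", "damaged", "defective"),
    "customer service": ("customer service", "support"),
    "store hours": ("hours",),
}
POSITIVE_TERMS = ("first rate", "excellent", "helpful")
NEGATIVE_TERMS = ("broken", "damaged", "defective", "inconvenient")

def polarity(sentence):
    """Classify a lowercased sentence by the first matching term list."""
    if any(term in sentence for term in POSITIVE_TERMS):
        return "positive"
    if any(term in sentence for term in NEGATIVE_TERMS):
        return "negative"
    return "neutral"

def analyze_review(text):
    """Map each aspect mentioned in the review to a polarity."""
    findings = {}
    for sentence in re.split(r"[.!?]+", text.lower()):
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if any(keyword in sentence for keyword in keywords):
                findings[aspect] = polarity(sentence)
    return findings

review = (
    "The product was broken when it was delivered. "
    "The call with customer service was first rate. "
    "The store's hours are inconvenient."
)
findings = analyze_review(review)
```

Even this crude pass separates the delivery and store-hours complaints from the positive service experience, which the star rating alone cannot do. A sentence containing both positive and negative terms would be labeled positive here, one of several limits of keyword matching.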
Unstructured Data in Images
Unstructured data in images can be coupled with structured data to provide more robust information. An example of this is in an online catalog.
Data from photos and product data are often disconnected. Leveraging unstructured data, elements like the color of a product shown in a photo can be connected with other specifications. In this case, someone searching for a red vase will find it immediately, rather than having to click through pages of different colors.
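One way to picture the catalog example: estimate a color name from a photo's pixels and attach it to the product record. The sketch below assumes the image has already been decoded into RGB tuples by some imaging library (not shown). The color table, the pixel values, and the product fields are all made up for illustration.

```python
# A small, hypothetical table of reference colors (RGB).
NAMED_COLORS = {
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "white": (255, 255, 255),
    "black": (0, 0, 0),
}

def dominant_color(pixels):
    """Return the reference color closest to the average pixel value."""
    count = len(pixels)
    average = tuple(sum(pixel[i] for pixel in pixels) / count for i in range(3))

    def squared_distance(name):
        return sum((a - c) ** 2 for a, c in zip(average, NAMED_COLORS[name]))

    return min(NAMED_COLORS, key=squared_distance)

# Assumed to come from a decoded product photo.
vase_pixels = [(200, 30, 40), (220, 20, 35), (180, 25, 30), (210, 40, 50)]

# Structured catalog record, enriched with a tag derived from the image.
product = {"sku": "VASE-001", "name": "Ceramic vase", "price": 24.99}
product["color"] = dominant_color(vase_pixels)
```

Averaging every pixel is a simplification; a real pipeline would usually isolate the product from its background before estimating color.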
Unstructured Data from Documents
Unstructured data from documents, such as invoices and sales receipts, is usually poorly tagged and includes only the items sold, the date sold, and the amount of the sale. The same is true of returns.
Because this information is locked in documents and often not correlated, it is nearly impossible to find out which items are actually performing well. By extracting and analyzing this unstructured data, a true determination of performance is possible.
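A minimal sketch of that extraction step follows. It assumes receipts and returns arrive as plain text with lines of the form `Item: <name> x<qty> @ <price>` and that return documents begin with the word "Return". Both conventions, along with the sample documents, are invented for the example; real documents vary widely and often need OCR first.

```python
import re
from collections import Counter

# Assumed line format: "Item: Red vase x2 @ 24.99"
ITEM_PATTERN = re.compile(
    r"Item:\s*(?P<name>.+?)\s+x(?P<qty>\d+)\s+@\s+(?P<price>\d+\.\d{2})"
)

def net_units_sold(documents):
    """Sum units per item across documents, subtracting returned units."""
    totals = Counter()
    for text in documents:
        # Assumption: a return document starts with the word "Return".
        sign = -1 if text.lstrip().lower().startswith("return") else 1
        for match in ITEM_PATTERN.finditer(text):
            totals[match["name"].lower()] += sign * int(match["qty"])
    return dict(totals)

documents = [
    "Receipt 1001\nDate: 2021-08-01\nItem: Red vase x2 @ 24.99\nItem: Blue bowl x1 @ 12.50\n",
    "Receipt 1002\nDate: 2021-08-02\nItem: Blue bowl x3 @ 12.50\n",
    "Return 2001\nDate: 2021-08-03\nItem: Red vase x2 @ 24.99\n",
]
net = net_units_sold(documents)
```

Counting sales alone would make the red vase look like a strong seller. Correlating the receipts with the return shows that every unit sold came back.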
Unstructured Data Analytics
The key to unlocking the value of unstructured data lies in advanced analytics, creating opportunities to extract value by including previously unwieldy data. By adding unstructured data to advanced analyses, trends and connections are revealed and correlated to show relationships.
Storing Unstructured Data
Unstructured data is commonly stored in NoSQL (non-relational) databases and data lakes.
Popular platforms for storing unstructured data include the Hadoop Distributed File System (HDFS) and NoSQL databases such as MongoDB, CouchDB, Cassandra, HBase, Redis, Riak, and Neo4j. These purpose-built platforms allow large volumes of unstructured data to be processed, stored, and managed without the common data model and single database schema used for structured data.
Data lakes store unstructured data in its native or raw format. Unstructured data stored in data lakes includes output from systems, sensors, applications, and social media.
Data lakes are often housed in cloud storage, such as Amazon Simple Storage Service (S3), Microsoft Azure Data Lake Storage (ADLS), and Google Cloud Storage. Whereas data warehouses store processed data within the confines of a rigid structure, data lakes store raw, unprocessed data.
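To illustrate what storing data without a common schema looks like, the sketch below keeps differently shaped records side by side in one collection. It is an in-memory stand-in written in plain Python, not the API of any particular NoSQL product; the records and the `find` helper are hypothetical.

```python
import json

# Stand-in for a document collection: each record keeps its own shape.
collection = [
    {"type": "email", "from": "alice@example.com", "subject": "Invoice", "body": "See attached."},
    {"type": "sensor", "device_id": "pump-7", "temperature_c": 81.4},
    {"type": "review", "stars": 2, "comment": "Arrived broken."},
]

def find(documents, **criteria):
    """Return documents whose fields match every given key/value pair."""
    return [
        doc for doc in documents
        if all(doc.get(key) == value for key, value in criteria.items())
    ]

sensor_docs = find(collection, type="sensor")

# Each record serializes independently; no table definition is required.
serialized = [json.dumps(doc) for doc in collection]
```

Because no field is mandatory, new kinds of records can be added without a migration. The trade-off is that queries must tolerate missing fields, as `doc.get` does here.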
Analyzing Unstructured Data with Advanced Analytics
Once unstructured data has been aggregated and stored, it can be analyzed with advanced analytics. The most commonly used types of advanced analytics are:
- 1. Descriptive analytics—What has happened?
Descriptive analytics analyzes both real-time and historical data. Its primary objective is to summarize what has happened; identifying the reasons behind it is the role of diagnostic analytics.
- 2. Predictive analytics—What could happen in the future?
Predictive analytics uses previous trends and patterns to identify the likelihood of something happening in the future. This includes:
a. What will happen next, if ___
b. Root-cause analysis
c. Data mining
f. Potential impacts
- 3. Prescriptive analytics—What should the response be?
Prescriptive analytics determines the best options for what to do next in a given scenario. The focus is on action: achieving the best outcomes and identifying uncertainties to inform better decisions. Prescriptive analytics is based on descriptive and predictive data sources.
- 4. Diagnostic analytics—Why did something happen?
Diagnostic analytics uses techniques such as drill-down, data discovery, data mining, and correlations to identify anomalies and relationships in data. It takes descriptive analytics further to find the “why” behind whatever has happened.
Advanced analytics tools and techniques for unstructured data include:
- Machine learning
- Artificial intelligence
- Data and text mining
- Complex event processing
- Semantic and graph analysis
- Pattern matching
- Data visualization
- Sentiment analysis
- Network and cluster analysis
- Multivariate statistics
- Neural networks
- Natural language processing
Unstructured Data Analytics Use Cases
Unstructured data from social media is analyzed to:
- Improve customer relationship management processes
- Enable more targeted marketing
- Gauge customer satisfaction with sentiment analysis
Predictive maintenance analytics
Unstructured data from IoT sensors is analyzed to:
- Detect equipment failures before they occur
- Monitor equipment in remote, lights-out facilities
- Highlight usage trends and identify capacity limitations
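A very simple form of this kind of monitoring flags readings that deviate sharply from the recent baseline. The sketch below assumes sensor output has already been parsed into a list of numeric readings. The window size, threshold, and temperature values are illustrative choices, and real predictive maintenance models are considerably more involved.

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings far outside the preceding window's range."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Flag a reading more than `threshold` standard deviations from the mean.
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Made-up temperature readings with one spike at index 7.
temperatures = [70.1, 70.3, 69.9, 70.2, 70.0, 70.1, 70.2, 78.5, 70.0, 70.3]
suspect_indexes = flag_anomalies(temperatures)
```

A flagged index is only a prompt for inspection. Once a spike enters the baseline window it inflates the standard deviation, so readings immediately after it are judged more leniently.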
Security for Unstructured Data
Unstructured data is not only desirable to organizations; it is also the target of cybercriminals and malicious insiders.
Since valuable, and often sensitive, data is commonly extracted from structured data sources and saved in unstructured files, data lakes and NoSQL databases must be secured with the same diligence as traditional data stores.
Unstructured Data Protection Challenges
Many important sources of unstructured data reside in email and documents that are saved to network shared drives. As hard as unstructured data is to manage, it is equally difficult to secure.
The challenges related to unstructured data protection include:
- Maintaining compliance with corporate and regulatory governance
- Defending against insider threats
- Identifying where unstructured data is located
- Gaining access to the information locked in unstructured data
- Protecting intellectual property
- Understanding what applications contain sensitive data
- Controlling how unstructured data is shared—internally and externally
- Developing effective discovery processes to identify and track unstructured data that contains valuable or sensitive information
Unstructured Data Protection Considerations
Unstructured data security should consider the following questions so the answers can be incorporated into security policies.
- Who views or collects the unstructured data?
- Who can modify the files in the unstructured data store?
- What access controls are in place?
- What is the disaster recovery plan?
- What content should be encrypted?
- Is the network used to transfer unstructured data secure?
- What training programs are in place related to the production, use, sharing, and storage of unstructured data that contains valuable or sensitive information?
- Does the existing security policy effectively balance usability with data protection?
Securing unstructured data must be a key part of every organization’s security framework, as it is susceptible to data breaches and other forms of cyberattacks; being “too big to handle” is not an excuse for lax security.
An “Ounce of Prevention” for Unstructured Data
An ounce of prevention is worth a pound of cure – especially with unstructured data. The data security required to effectively protect it at the same level as other digital assets is significant. But there are simple ways to support the security of unstructured data, including:
- Know what unstructured data is being stored in NoSQL databases or data lakes.
- Partition storage and implement granular access controls.
- Schedule regular searches through storage to check for valuable and sensitive information.
- Remove and destroy unstructured data that is no longer needed. Although much of the value of unstructured data that is stored has a limited lifespan, it is rarely deleted from shared drives and devices. This clutter not only increases overhead expenses, but increases the challenges related to security.
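The scheduled search described above can be approximated with pattern matching over file contents. The patterns below are simplified assumptions (a basic email shape, a US Social Security number shape, and a 16-digit card number shape) and will produce both false positives and misses; dedicated data discovery tools go much further.

```python
import re

# Simplified, illustrative patterns; not exhaustive or validated.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def scan_text(text):
    """Return the matches found in the text, grouped by pattern label."""
    hits = {}
    for label, pattern in SENSITIVE_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits

sample = "Contact jane.doe@example.com. SSN on file: 123-45-6789."
hits = scan_text(sample)
```

Run on a schedule over a shared drive or data lake, a scan like this yields a list of files to review, restrict, encrypt, or delete.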
Last Updated: 13th August, 2021