Today, most organizational information is stored in relational databases, multi-dimensional databases or data warehouses. These stores hold structured data in a pre-defined schema, and the value is extracted by running queries and analytics on this data. However, these systems miss out on the information within large volumes of unstructured data, for example, image, voice, GPS, system logs and textual data.
Of all forms of unstructured data, text is important for two reasons. First, the vast majority of human communication occurs in natural language and is widely captured as text data. Second, it is the most common form of unstructured data in organizations. Examples include text contained in e-mails, chat messages and legal documents. Because this data is not stored in a predefined schema, and there are no restrictions on its content, type, or length, the information contained in such sources is not readily comprehended by machines. Therefore, extracting information, and consequently value, from such sources is challenging. The practice of extracting high-quality and/or actionable insights from text data is called text analytics. In this article, I will broadly discuss the applications of text analytics.
"Sentiment analysis is important because studies have shown that accurately measured sentiment correlates with future sales. A related application is in understanding customer predispositions"
What can we do with text analytics?
Generally speaking, we can exploit patterns in language, understand the author, and understand the content of text using text analytics. Often, the outcomes of these analyses are further used in generating actionable insights from the data.
Well-known applications that rely on “patterns in natural language” include virtual assistants with speech recognition capabilities, such as Google Assistant; search phrase recommendations implemented in search engines; and spelling and grammar correction systems. Lesser-known applications include data standardization systems used in organizations that offer analytical products based on data sourced from multiple furnishers. When unstandardized, free-form text data is collected from multiple sources, the first step is to standardize it.
For example, at Equifax, we procure vehicle registration data from a vendor to develop products that are used in performing lost sales and market share analyses. The vendor sources the data from Departments of Motor Vehicles (DMVs) across the country, where multiple clerks at each DMV manually enter the data into the system. According to this data, there are more than half a million lenders in the U.S. As a credit bureau, we know that there are only about 30,000 lenders in the U.S. The discrepancy arises because the raw data contains thousands of variations for each lender (for example, Wells Fargo appears in more than 3,000 variants, such as Wels Fargo and Wells Faargo). Standardizing this text field enables our customers to perform accurate analyses.
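To make the lender example concrete, here is a minimal sketch of name standardization by fuzzy string matching. It is not the method used at Equifax; the reference list `CANONICAL_LENDERS`, the function name `standardize_lender`, and the 0.8 similarity cutoff are all assumptions chosen for illustration, and the matching relies only on Python's standard `difflib` module.

```python
from difflib import get_close_matches

# Hypothetical reference list of canonical lender names (illustration only).
CANONICAL_LENDERS = ["Wells Fargo", "Bank of America", "Chase"]

# Compare in lowercase so that casing differences do not affect similarity.
_LOOKUP = {name.lower(): name for name in CANONICAL_LENDERS}


def standardize_lender(raw_name, cutoff=0.8):
    """Map a free-form lender name to its closest canonical name.

    Returns None when no canonical name is at least `cutoff` similar,
    so that unmatched names can be sent for manual review.
    """
    # Collapse repeated whitespace and lowercase the raw input.
    cleaned = " ".join(raw_name.lower().split())
    matches = get_close_matches(cleaned, list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[matches[0]] if matches else None


if __name__ == "__main__":
    for raw in ["Wels Fargo", "Wells Faargo", "WELLS  FARGO", "Zzzz Qqqq"]:
        print(repr(raw), "->", standardize_lender(raw))
```

A real system would likely need more than edit-distance-style similarity (abbreviations, legal suffixes, and subsidiaries, for instance), but the sketch shows the basic idea of collapsing many raw variants onto one standard value.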
As a second example, when the job titles we received from members of The Work Number program were standardized, we were able to get accurate salary distributions for a given job title. An example pattern learned from this data is: “tech” appearing before “information” generally means “technology,” whereas “tech” appearing after “senior” generally means “technician,” and “tech” appearing before “support” refers to “technical.” The raw data contained more than 4,000 variations of the job title “Store Manager,” making it extremely challenging to answer simple questions such as “how much money does a store manager make in a particular city” unless the job title was standardized.
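The “tech” pattern described above can be sketched as a few context rules. In the work described, such patterns were learned from the data; the version below simply hand-codes the three stated rules to show how neighboring words disambiguate an abbreviation. The function name `expand_tech` and the rule ordering (the following word is checked before the preceding word) are assumptions made for this illustration.

```python
def expand_tech(title):
    """Expand the abbreviation 'tech' in a job title using neighboring words.

    Hand-coded version of the three patterns from the text:
      'tech' before 'information' -> 'technology'
      'tech' before 'support'     -> 'technical'
      'tech' after 'senior'       -> 'technician'
    Any other 'tech' is left unchanged.
    """
    tokens = title.lower().split()
    expanded = []
    for i, token in enumerate(tokens):
        if token == "tech":
            prev_tok = tokens[i - 1] if i > 0 else None
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
            # Assumption: the following word takes priority over the preceding one.
            if next_tok == "information":
                token = "technology"
            elif next_tok == "support":
                token = "technical"
            elif prev_tok == "senior":
                token = "technician"
        expanded.append(token)
    return " ".join(expanded)


if __name__ == "__main__":
    for raw in ["Senior Tech", "Tech Support Analyst", "Tech Information Manager"]:
        print(raw, "->", expand_tech(raw))
```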
Perhaps the best-known application of “understanding the customer” through text analytics is sentiment analysis applied to the voice of the customer. Sentiment analysis is important because studies have shown that accurately measured sentiment correlates with future sales. A related application is in understanding customer predispositions. For example, insights that might be gained from analyzing social media data, such as “mid-range hotel customers are price sensitive and care about service more than amenities,” help to guide strategic decisions on future investments. Understanding customers can be performed at either an individual or a collective level, and it often focuses on the emotional or subjective aspects of the customers.
Conversely, applications for “understanding the content” of the text focus on factual rather than emotional aspects. For example, automatic routing of customer service e-mails to the appropriate department is a well-known application of document classification. Another popular application is in the legal field, where closely related case files need to be retrieved and analyzed. A tempting application is to generate sales leads based on social media messages. For example, when people state that they are getting married or moving to another town, it is likely they will be looking for a place to live and will need realtor and mortgage services. However, my experience from analyzing social media content has shown that only a very small fraction of social data is commercially useful, mainly because relatively few people generate social media content (even though many people consume it), and very few social identities can be linked to real identities. The small number of leads might still be useful if the customer lifetime value is high (for example, mortgage) or if the business operates in a saturated market (for example, mobile phone service).
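As a rough illustration of the e-mail routing application mentioned above, the sketch below scores a message against a keyword set per department and routes it to the highest-scoring one. This is a deliberately simple stand-in for a trained document classifier; the department names, the keyword lists in `DEPARTMENT_KEYWORDS`, and the function name `route_email` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical departments and keywords (illustration only).
DEPARTMENT_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "payment"},
    "technical_support": {"error", "login", "password", "crash"},
    "sales": {"pricing", "quote", "upgrade", "demo"},
}


def route_email(body, default="general"):
    """Return the department whose keywords occur most often in the e-mail.

    Falls back to `default` when no keyword from any department appears.
    """
    # Count lowercase word occurrences in the message body.
    words = Counter(re.findall(r"[a-z']+", body.lower()))
    scores = {
        dept: sum(words[w] for w in keywords)
        for dept, keywords in DEPARTMENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(route_email("I need a refund for the duplicate invoice."))
    print(route_email("I cannot login; my password gives an error."))
    print(route_email("Hello there"))
```

In practice, a classifier trained on labeled historical e-mails would usually replace the fixed keyword lists, but the input and output of the routing step would look much the same.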
In addition to the above-mentioned use cases, text analytics can be used to generate actionable insights. For example, my analyses of social media messages directed at a large retailer have shown that customers felt ignored when they walked into the stores and that the restrooms were not clean. Based on these insights, the retailer can set better expectations for its store associates and janitorial services vendor.
In summary, unstructured data in organizations is being underutilized and extracting value from it is not a trivial process. Text data is the most common form of unstructured data in organizations, and expertise in text analytics will help in extracting high-quality, actionable insights from the data by exploiting the patterns in the language, understanding the author and, finally, by automatically understanding the content.