Unigrams is a qualitative data analysis platform designed to help researchers and analysts quickly understand the demands of customers, the concerns of staff, and the culture of their organization. While many of our users already have a strong grasp of how NLP works, we want to make the process as accessible and transparent as possible. This guide walks through the steps our analysts take in Unigrams so that it is clear how the platform works, rather than hiding our methods from our users behind a "black box."
Creating a vocabulary of important words
For any analysis of qualitative data, the first step is to identify which words in the text are important, also known as creating a vocabulary. A vocabulary, in the context of Natural Language Processing, is a set of unique terms, which can be single words or groups of words. For example, if we have a document that contains the phrases "He is a good guy" and "She is a good girl," the vocabulary would be "he," "is," "a," "good," "guy," "she," and "girl." To identify a vocabulary:
1. First, we extract all the words from the corpus, or body, of the documents.
2. We remove any stopwords such as "is," "was," "the," etc. that add little to no value in understanding the broader content of the document.
3. We filter out any words that occur in fewer than 50 documents. If a word is too infrequent, it likely won't have a big impact on the documents' overall meaning.
4. We filter out any words that occur in more than 90% of the documents. While this may seem counterintuitive, if a word is too frequent it loses its impact on the meaning of the document, much like stopwords, which are also very frequent.
5. The words that remain after this filtering become our vocabulary: the important words that represent the corpus well.
6. For each word in the vocabulary, we count how many times it occurs in each document and record the results.
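The steps above can be sketched in a few lines of Python. The corpus, the stopword list, and the `min_docs` threshold of 2 are toy stand-ins for illustration only; Unigrams itself uses a 50-document floor and the 90% ceiling:

```python
from collections import Counter

# Toy corpus and stopword list; Unigrams keeps words that appear in at
# least 50 documents, but min_docs=2 keeps this illustration small.
documents = [
    "he is a good guy",
    "she is a good girl",
    "the guy and the girl are good",
]
stopwords = {"he", "she", "is", "a", "the", "and", "are"}
min_docs = 2
max_doc_share = 0.9  # drop words in more than 90% of documents

tokenized = [[w for w in doc.split() if w not in stopwords]
             for doc in documents]

# Document frequency: how many documents does each word appear in?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

vocabulary = {
    word for word, df in doc_freq.items()
    if min_docs <= df <= max_doc_share * len(documents)
}
# "good" appears in every document (above the 90% ceiling), so only
# "guy" and "girl" survive the filters.

# Per-document counts for the vocabulary words.
counts = [Counter(w for w in tokens if w in vocabulary)
          for tokens in tokenized]
```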
Now that we have our vocabulary, we need to start representing the relationships those words have with each other. The first step in this process is embedding the vocabulary. A vector is an array, or series, of one or more numbers such as [1, 2, 3] or [4.5, 2.3]. Dimensionality in statistics refers to the number of attributes of data; e.g. the vector [0.12, 0.45, 0.56] has 3 dimensions. An embedding is a low dimensional learned vector that looks something like [0.12, 0.45, 0.56] and is a compact representation of elements such as words and categories. In summary, embedding the vocabulary transforms it into a series of ordered numbers that preserves the relationships the vocabulary's words have with each other.
We use the `universal_sentence_encoder` model to generate embeddings of length 512 for all the words in the vocabulary. This embedding represents the vocabulary's semantics, or how the words interact with each other, which helps us understand their meaning. The Universal Sentence Encoder is a pre-trained model that generates embeddings from text, which can then be used for different tasks like text classification and semantic similarity. For example, if we pass along the sentence "The teacher made some really good points," the model will generate an embedding like [0.2, 0.3, 0.4, ...] that captures the semantics of the sentence. A pre-trained model is a machine learning model that someone else has already trained on data from scratch and that can be used out of the box to solve similar problems. One huge application of pre-trained models is generating text embeddings.
Now that we have embeddings for the vocabulary, we will make a similar embedding for each individual document. The process is very similar to creating the embedding for the vocabulary: we use the same `universal_sentence_encoder` model again to generate embeddings of length 512 for every document. This gives us a greater understanding of the relationship between each word in the vocabulary by contextualizing it within each document.
Reduce dimension of document embeddings
Now we have our embeddings, but 512 dimensions are too many for a clear analysis, so the next step is to reduce the dimensions of the embeddings. Dimensionality reduction is the transformation of data from a high dimensional space into a low dimensional space such that the low dimensional representation retains most of the important properties of the original data. Basically, it simplifies the data by reducing it to its most important elements. This also speeds up intensive computations, which take less time on lower dimensional vectors than on higher dimensional ones.
For the 512-dimension embeddings, we use an algorithm called Uniform Manifold Approximation and Projection (UMAP) to reduce the dimensions from 512 to 5. UMAP searches for a low dimensional projection of the data with the closest possible structural similarity to the original: data points that were close to each other in the high dimensional embedding should still be close to each other in the low dimensional embedding, and vice versa. In our pipeline, we use cosine similarity as the distance metric for UMAP. Cosine similarity is a measure of similarity between two vectors, calculated as the dot product of the two vectors divided by the product of their lengths.
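The cosine similarity formula mentioned above is easy to state in NumPy; values near 1 mean the two vectors point in nearly the same direction:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of
    # their lengths (Euclidean norms).
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([2.0, 0.0])

cosine_similarity(a, c)  # 1.0: same direction, maximally similar
cosine_similarity(a, b)  # 0.0: perpendicular, no similarity
```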
We now have embeddings with fewer dimensions. Next, we run an algorithm called Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) on the lower dimensional embeddings of all the documents to generate clusters. Clustering groups similar sets of data together; each group is a cluster.
HDBSCAN first transforms the space based on the density or sparsity of the data, then builds a minimum spanning tree of the distance-weighted graph. From that tree it constructs a cluster hierarchy of connected components, condenses (merges) the clusters based on a minimum cluster size, and finally extracts the stable clusters from the condensed tree.
We use `euclidean distance` as the distance metric in HDBSCAN because we have already used `cosine similarity` to generate the lower dimensional data. Euclidean distance measures the distance between two real-valued vectors, calculated as the square root of the sum of the squared differences between them. In two dimensions, this is the length of the hypotenuse: the longest edge of a right-angled triangle.
We pass the reduced document embeddings into HDBSCAN, which tells us which cluster each sentence or document belongs to.
Calculate Topic Vectors
Now that we have all of our embeddings simplified and grouped into clusters, we can calculate the topic vectors. Topic vectors give us the overall picture of the main messages of the text, pointing towards the end result of our analysis.
Calculating the topic vectors is ultimately another simplifying step. For every cluster, we select all the documents in that cluster along with their embeddings, then take the average of those embeddings to create a new embedding that represents the entire cluster. We call this embedding the topic vector.
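With hypothetical reduced embeddings and their cluster labels, the averaging step is a one-liner per cluster. The 2-dimensional vectors below are invented for readability; the real ones have 5 dimensions:

```python
import numpy as np

# Hypothetical reduced document embeddings and their cluster labels;
# a label of -1 would mark noise and is excluded from topics.
embeddings = np.array([[1.0, 1.0],
                       [3.0, 3.0],
                       [0.0, 2.0],
                       [2.0, 0.0],
                       [1.0, 4.0]])
labels = np.array([0, 0, 1, 1, 1])

topic_vectors = {
    label: embeddings[labels == label].mean(axis=0)
    for label in set(labels.tolist()) if label != -1
}
# Topic 0's vector is the average of the first two rows: [2.0, 2.0]
```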
Calculate Topic Words
For each topic vector, we calculate the cosine similarity between it and every vocabulary embedding, then rank the words by that score. The word whose embedding is most similar to the topic vector is the most important topic word for that topic. Once again, this has a significant impact on our analysis of the text's meaning.
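Ranking the vocabulary against a topic vector is a cosine-similarity sort. The three words and their 2-dimensional embeddings below are invented for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product divided by the product of the vectors' lengths.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

vocab = ["blue", "red", "cat"]
vocab_embeddings = np.array([[1.0, 0.0],   # "blue"
                             [0.9, 0.1],   # "red"
                             [0.0, 1.0]])  # "cat"
topic_vector = np.array([1.0, 0.1])

scores = np.array([cosine_similarity(topic_vector, e)
                   for e in vocab_embeddings])
ranked = [vocab[i] for i in np.argsort(scores)[::-1]]
# ranked[0] is the most important word for this topic
```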
Intertopic Distance Map
The Intertopic Distance Map is a visual representation of the topic clusters in a 2 dimensional space. The size of each topic circle is directly proportional to how many words belong to that topic cluster, so the more words a topic contains, the larger its circle will be. The circles are plotted using a scaling algorithm so that topic clusters that are closer together have more words in common.
The bar chart shows the most salient terms, indicating the total frequency (count) of each term across the entire corpus. Saliency can be thought of as a metric for identifying the most informative or useful words for distinguishing topics across the entire collection of texts; higher saliency values indicate that a word is more useful for identifying a specific topic. When a specific topic cluster is selected, the bar chart displays the most important words in that cluster, showing the importance of each word within the topic over the word's overall importance in the entire corpus. For example, if the topic is "colours," the word "blue" will be displayed in relation to its importance to the topic rather than its importance to the entire document it came from.
About the Authors:
Nilan Saha is a Machine Learning Engineer. He has extensive experience building ML- and NLP-driven products for different companies in the space of social media, education, and healthcare. He has a Master's in Data Science with a specialization in Computational Linguistics from the University of British Columbia. He is also a Kaggle Kernels and Discussion Expert.
Lorina MacLeod is the digital copywriter at Kai Analytics. She is a soon-to-be graduate of Thompson Rivers University, pursuing a Bachelor of Arts with a major in English literature and a minor in modern language and global studies.
To learn more about Unigrams and try it out for free, please visit: www.unigrams.com