How to Use Text Analysis Techniques to Bring Qualitative Data to Life
Updated: Apr 30
Qualitative data. What is it? Is it useful? How can I use it? As data is used more and more as the driving force behind decision making, many leaders and analysts wonder how they can move beyond the quantitative data from their websites, financial software, and departmental reports to learn about the thoughts, feelings, and personality of their stakeholders. This is where qualitative data shines. While quantitative data deals with the “what” and “how many” of a situation, qualitative data describes the situation’s qualities and characteristics. So, if quantitative data tells leaders and analysts what they should be doing, qualitative data tells them how they should do it. Or, as that classic dating advice goes: it’s not about what you say, it’s all about how you say it.
What is it?: Qualitative data describes qualities or characteristics and is usually collected through surveys, interviews, and/or observations.
Is it useful?: Yes. Using qualitative data gives leaders’ decisions context and nuance.
How can I use it?: Read on.
Most of the qualitative data used by businesses, governments, researchers, and educational institutions is text data (surveys, reviews, interview transcripts, etc.). So, it is fitting that the style of analysis used to handle this kind of data is called text analysis. Since modern data sets can be massive, analysts logically use computers for support, and the field of computer science that supports text analysis is called Natural Language Processing (NLP) or computational linguistics. NLP is an umbrella term for text analysis methods and can be described as the process of breaking down human language into a form a computer can understand.
In the era of AI and automation, the beautiful thing about language is that it’s inherently human in nature, and so analysts can balance art and science to find the subtleties in unstructured data.
This article will walk through a basic NLP data pipeline and describe some of the preprocessing elements unique to NLP, break down some of the techniques that we use at Kai Analytics to process the survey data of higher education institutions, and address some of the areas where NLP can be improved. The article will conclude by describing some of the other applications of NLP not covered here.
Note: If you want to learn a bit more about some hot topics in NLP, you can check out this article of ours: 5 Text Analysis (NLP) Buzzwords for Market Research (kaianalytics.com)
Note: If you think these techniques might be useful and want a simple way to add them to your own analysis toolkit, check out Unigrams. It's our very own qualitative analysis tool designed specifically for Higher Education. Click the logo to find out more.
On this page:
Key Benefits of Text Analysis, or NLP
Once an analyst has decided to use qualitative data to inform their decisions, the next question they often ask is, “Do I really need the computer? After all, I can just read these comments myself.” Or, more simply, “Why should I use NLP?” First, take a look at what is at stake.
The cost to recruit a single student in the US can be as high as $2,357 USD, only for an estimated 30% of students to drop out after only one year. This costs universities and colleges approximately $37 Billion US Dollars a year and means that 30% of first-year students will drop out with a year’s worth of debt and no degree to show for it. The central issue when tackling student attrition is understanding why those students chose to drop out, because that may reveal a common theme the school can address. Traditional methods for retaining students focus heavily on quantitative measures such as average GPA, attendance record, or level of satisfaction, but often the results from these findings are prescriptive in nature and do not address the underlying concerns and challenges a student may be facing. Qualitative data provides a more descriptive understanding, and if an analyst decides to use qualitative data to tackle this issue, they’ll want to make sure they do it right.
NLP can sometimes get a bad rap for being nothing more than word clouds. In reality, word clouds are only a small part of the picture; NLP can also be used to analyze sentiment and understand themes, whilst breaking them down to understand issues in important population segments. All these methods can be represented graphically to make interpretation intuitive and communicating results straightforward. This gives the analyst a more personal, nuanced understanding of their target population. Kai Analytics used some of these more advanced techniques at Bastyr University, where giving students and faculty the feeling they were being listened to increased approval rates to 91%, and reduced attrition by 4%, saving an estimated $259,000 USD in revenue compared to the year before.
An important benefit of leveraging NLP is its ability to help reduce personal bias. Anytime a person reads something it is filtered through their own experiences and opinions before they understand it, and so naturally, some sort of bias will occur. Some institutions will try to address this problem by assigning two analysts to the same task, but this takes twice as long and costs twice as much. Computers can be another way to reduce personal bias, as long as care is taken to train the model well. This issue is discussed in more detail below; but, by using NLP to minimize personal bias, analysts can save on costs and secure buy-in for their informed recommendations.
At many universities and colleges, course evaluations are a primary way of measuring the success of courses and are used as an aid to learning assessment. But a separation needs to be made on the quality of the course/program and not personal attacks on the instructor. NLP helps to tackle this problem by automatically masking names, sensitive numbers, and protected groups or organizations. In another case, Kai Analytics was tasked with helping to solve a serious campus climate problem, one that touched on issues of diversity, equity, bias, and discrimination. Addressing this issue required genuine input from people who felt unable to voice their opinion. By using NLP techniques Kai Analytics was able to filter personal information out of survey responses and protect the privacy of those who participated. As a result, the client was able to launch a new campus climate strategy that was widely supported by those who lived, worked, and learnt on campus.
In light of growing concerns around campus safety, NLP can also be used to quickly identify and flag concerns around Title IX (e.g., discrimination, sexual violence, harassment, retaliation and hostility) violations on campus. This can enable staff to quickly respond to serious problems.
NLP can also increase operational efficiency, which does more than just save on costs. Analysts know that cleaning data is half the battle and that attempting to read, categorize, and analyze thousands of student responses can feel very daunting. In a Kai Analytics survey of institutional research and assessment professionals, respondents reported spending up to 2 weeks analyzing student comments with a team of 3-4 analysts. In comparison, an NLP program can do this preprocessing in a matter of seconds, giving analysts more time and emotional bandwidth to interpret results.
The techniques discussed below help analysts identify unique student personas among the population by grouping response themes and segmenting the survey population. This can help analysts hear the voices of students who may normally be drowned out by their louder peers and discover that more students share those concerns than they thought. This kind of data can help analysts make recommendations that solve problems affecting key groups in the student population, including problems that students didn’t know they cared about. At Thompson Rivers University this revealed key differences in the resources needed by on-campus and off-campus students, even though they had responded to the same survey. These results showed the University how it could support these different groups to improve its overall alumni success rate.
So, NLP can be used for more than word clouds. For instance, it can be used to reduce personal bias, improve operational efficiency, reduce employee fatigue, and improve the quality of recommendations.
What is an NLP Pipeline?
All data science is performed around a data processing pipeline, and text analysis is no different. Since Natural Language Processing (NLP) is the name given to the field of text analysis used here, this is called an NLP Pipeline. There are many ways to wield this tool. This article goes through each step of the NLP pipeline, breaking it down into its different components and the techniques that can be used to perform different analyses. It describes the journey of a qualitative dataset in the higher education space as it travels through the pipeline and is used to create recommendations that guide effective policy.
Working with Survey Data
Survey data is one form of qualitative data, and a specialty of Kai Analytics. However, this pipeline could be used to analyze any sort of text data, from tweets and reviews to focus group notes and interview transcripts.
When an analyst is working with survey data, the go-to format is .csv. While modern survey platforms offer export formats like Excel, .csv, or even SPSS, .csv is the standard. This is because the plain-text nature of .csv typically handles written responses better. For example, survey respondents will often make lists, which they mark by putting a dash in front of each statement.
-The content was interesting.
-The marking was fair.
If the analyst were to use an Excel format, Excel would interpret “-” as a negative sign and expect a numeric value. This would cause Excel to generate a #NAME? error. The analyst could, in theory, manually place an apostrophe in front of each dash, but it’s exactly that kind of manual work that they are trying to avoid by using NLP.
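To see the difference in practice, here is a minimal Python sketch (the sample responses are invented) showing how a plain .csv reader preserves dash-prefixed list items as ordinary text:

```python
import csv
import io

# A single survey cell containing dash-prefixed list items,
# as it would appear in an exported .csv file.
raw = '"-The content was interesting.\n-The marking was fair."\n'

# The csv module treats the cell as plain text, so the leading
# dashes survive intact instead of triggering a formula error.
reader = csv.reader(io.StringIO(raw))
row = next(reader)
print(row[0])
```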
Now that the data has been collected and exported, preferably in a .csv format, it is time to do some pre-processing. These steps prepare the qualitative survey data for the analysis techniques used later. Pre-processing is all about turning messy human thought into a format that a computer can understand and analyze. This is done by cleaning up the data, breaking sentences down into parts, removing stop words (like “me”, “I”, “you”, and “the”), tagging grammatical structures to give the program context (“I like to bike” vs. “road biking is like running a marathon”), and simplifying the data to make analysis easier.
Step 1: Basic Data Cleaning
In this first step, the analyst will perform some basic data cleaning steps, such as accounting for blank responses, removing duplicates, and spell checking. The idea is to reduce noise in the data set so that analysis is more efficient. The problem with passionate survey responses is that, usually, they are not passionately edited. So that needs to be taken care of before any meaningful analysis can begin. This can be done in a couple of ways. The most obvious is to use the spell check function in a word processor and go through the errors one by one. This works for small datasets but isn’t scalable. To deal with a larger dataset, an analyst can use either a rule-based or deep learning approach. These work by establishing a benchmark for what is correct, and then having the program automatically correct other mistakes. These approaches are covered in more detail below under “Sentiment Analysis”, but ultimately this is just another way to do a spell-check.
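As a rough illustration of the rule-based idea, a benchmark vocabulary plus a fuzzy matcher from Python’s standard library can act as a scalable spell check. The tiny vocabulary below is invented; a real analysis would use a full dictionary plus campus-specific terms:

```python
import difflib

# A small benchmark vocabulary of "correct" words (invented here
# for illustration; real pipelines use a full dictionary).
VOCAB = ["course", "lecture", "professor", "interesting", "material"]

def correct(word, vocabulary=VOCAB):
    """Return the closest vocabulary word, or the word unchanged."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("lecure"))  # → lecture
```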
Note: If you are reading this article before launching your survey instrument, you can check out this article we wrote to get some tips on designing your survey questions for culturally diverse populations. You should also read our article on how to automate your survey by using Python, Qualtrics API, and Windows Task Manager. It could save you a ton of time.
Step 2: Tokenization
In this step, the analyst breaks sentences down into individual words (or tokens!).
So, this list of responses:
Would be broken down into this:
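A minimal tokenization sketch in Python (the responses are invented for illustration):

```python
import re

responses = [
    "The content was interesting.",
    "The marking was fair.",
]

# Lower-case each response and split it into word tokens,
# dropping punctuation along the way.
tokens = [re.findall(r"[a-z']+", r.lower()) for r in responses]
print(tokens[0])  # → ['the', 'content', 'was', 'interesting']
```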
Step 3: Removing Stopwords
When humans write sentences, we write them in such a way that they sound good and make sense when we say them. We use words like “the” to give our sentences structure, but these words are not really necessary to derive the subject, theme, or sentiment of the sentence, and on a macro scale they will clutter up data. Analysts call these Stop Words, and they remove them.
Not all stop words clutter the data, and this is where context comes into play. Depending on the situation, “not” and other negation words are crucial for sentiment analysis, even though they would clutter other results, like a word cloud. So, analysts must review the list of stop words and add to or subtract from it as necessary.
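A tiny sketch of stop word removal, with an invented stop word list that deliberately keeps “not” for later sentiment analysis:

```python
# An invented stop word list; note that "not" is deliberately
# excluded because negation matters for sentiment analysis.
STOPWORDS = {"the", "was", "a", "i", "me", "you", "to"}

tokens = ["the", "marking", "was", "not", "fair"]
filtered = [t for t in tokens if t not in STOPWORDS]
print(filtered)  # → ['marking', 'not', 'fair']
```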
Step 4: Tagging Parts of Speech
This step is crucial to the preprocessing phase as it allows the computer to understand the context of sentences despite the absence of stop words. Part of Speech tagging is where the analyst assigns “tags” to each word in the sentence, like in the example below. These tags are crucial in the next step of the preprocessing phase, Lemmatization.
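A toy lookup-based tagger using Penn Treebank tags can illustrate the idea. Real pipelines use a trained tagger such as nltk.pos_tag or spaCy; the tiny lexicon and the single disambiguation rule here are invented for illustration:

```python
# Invented mini-lexicon of Penn Treebank tags.
LEXICON = {"he": "PRP", "the": "DT", "bike": "NN", "biking": "VBG"}

def tag(tokens):
    """Assign a part-of-speech tag to each token."""
    tagged = []
    prev = ""
    for tok in tokens:
        if tok == "bikes":
            # Disambiguate with the preceding word: a pronoun signals a
            # verb ("he bikes"), a determiner a plural noun ("the bikes").
            tagged.append((tok, "VBZ" if prev == "he" else "NNS"))
        else:
            tagged.append((tok, LEXICON.get(tok, "NN")))
        prev = tok
    return tagged

print(tag(["he", "bikes"]))   # → [('he', 'PRP'), ('bikes', 'VBZ')]
print(tag(["the", "bikes"]))  # → [('the', 'DT'), ('bikes', 'NNS')]
```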
Step 5: Lemmatization
Lemmatization is the process of reducing each word to its dictionary root. For example, biking becomes bike, [he] bikes becomes bike, and so does [the] bikes. So bike, bike, bike. Hold on, that doesn’t make sense; now they all look the same. This is where the earlier steps come in. “He bikes” and “The bikes” obviously mean two different things, communicated by the words that precede them: he and the. But since those words were removed in Step 3, the meaning was lost. So, the analyst restored the meaning by adding Part of Speech tags in Step 4, and the verb in “He bikes” was tagged as “bikes (VBZ)”. Now, when these words are lemmatized in Step 5, the part of speech tag stays with each word while it is reduced to its lemma. The example becomes: biking, [he] bikes, and [the] bikes turn into bike (VBG), bike (VBZ), and bike (NNS). This allows the program to analyze all these words as a single item while retaining the context needed for techniques like sentiment analysis.
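A toy suffix-stripping lemmatizer that keeps the POS tag attached, as described above. The stripping rules are deliberately crude; real pipelines use dictionary-backed lemmatizers such as WordNet’s:

```python
def lemmatize(token, tag):
    """Reduce a token to a rough lemma, keeping its POS tag attached."""
    if token.endswith("ing"):
        lemma = token[:-3] + "e"  # biking -> bike (very rough rule)
    elif token.endswith("s"):
        lemma = token[:-1]        # bikes -> bike
    else:
        lemma = token
    # The tag travels with the lemma so context is not lost.
    return (lemma, tag)

print(lemmatize("biking", "VBG"))  # → ('bike', 'VBG')
print(lemmatize("bikes", "VBZ"))   # → ('bike', 'VBZ')
print(lemmatize("bikes", "NNS"))   # → ('bike', 'NNS')
```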
Text Analysis Techniques
Now that the qualitative data has been cleaned and preprocessed it’s ready to tell the analyst some important things about the people who created it. The story of the data is told by the techniques used to analyze it. What follows are some useful NLP techniques for analyzing qualitative survey data, each with a description of the theory, an explanation of the results, and the strengths and weaknesses of each one.
N-Grams form the core of NLP and are the foundation of many of the other techniques in this article. While most techniques build on the N-Gram in some way, N-grams themselves can still tell an analyst some useful surface-level statistics about the qualitative data they are analyzing.
An N-gram is simply a way of breaking text data down into manageable pieces, where N equals the number of words in the gram.
How to Interpret Results
The results of N-gram analysis can be used for many things, but here is a simple application that uses the N-gram to build a network graph, and a better word cloud. If you are interested in the python code used to perform this analysis, as well as all the preprocessing steps above, check out this video of our CEO Kai Chang presenting this concept.
In this application, the analyst will use a Bigram. Bigrams are the “Goldilocks zone” for most analyses, as they offer more context than a single word on its own but are not as noisy as a Trigram.
Break the data apart into Bigrams.
Count the number of times each Bigram occurs to show how often the idea is repeated throughout the data.
These results can be displayed with a network graph, like the one shown below. In this style of chart, words are displayed on their own, with the lines that join them showing the Bigram relationship. The thickness of the line shows the “strength” of the relationship, or how often that Bigram was repeated throughout the dataset. This chart also shows multiple viewpoints on the same topic. The course has links to great, good, and excellent, so it could be interpreted that this course is having a positive impact on respondents.
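The bigram counting described above can be sketched in a few lines of Python (the tokenized responses are invented); the resulting counts are exactly the edge weights a network graph would display:

```python
from collections import Counter

# Invented tokenized responses after preprocessing.
responses = [
    ["great", "course", "content"],
    ["great", "course", "marking"],
    ["good", "course"],
]

# Slide a two-token window over each response and tally the pairs.
bigrams = Counter()
for tokens in responses:
    bigrams.update(zip(tokens, tokens[1:]))

print(bigrams.most_common(1))  # → [(('great', 'course'), 2)]
```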
An easy technique to understand and implement
Easily built on to perform more complicated analyses
If the response data covers a wide range of topics, N-grams won’t provide enough clarity to interpret and identify major themes. Solution: Segment your dataset.
Topic modelling, sometimes called thematic extraction, is an NLP technique that statistically uncovers themes or topics from a large body of text. Topic modelling is very effective for understanding what the main ideas are in a body of text, and how those ideas are related to one another.
Like most NLP techniques, Topic modelling starts by using N-grams to break down the text and then converting the words (tokens) into numbers so that the computer can understand and store the data. This process of converting words to numbers is called vectorization, pictured in the graph on the left. Vectorization can be compared to giving each word a position in 3D space, called a vector. In a large document, or over a large dataset, words with more similar meanings (semantics) will appear closer together in this 3D space. By measuring the distance between words, analysts can create a matrix of words with a similarity score for each pair of words, otherwise known as word embeddings. In the graph on the right, Australia appears closer to Canberra, and Peru closer to Lima. The principle of vectorization is also important to understanding bias in Machine Learning, which is discussed later.
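As a toy illustration of the geometry, cosine similarity between vectors shows Australia sitting nearer to Canberra than to Peru. The three-dimensional vectors below are invented purely for illustration; real embeddings have hundreds of dimensions learned from data:

```python
import math

# Invented 3-D "embeddings"; real models learn these from text.
vectors = {
    "australia": (0.9, 0.8, 0.1),
    "canberra":  (0.85, 0.75, 0.15),
    "peru":      (0.1, 0.2, 0.9),
    "lima":      (0.15, 0.25, 0.85),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# The capital sits nearer its own country in embedding space.
print(cosine(vectors["australia"], vectors["canberra"])
      > cosine(vectors["australia"], vectors["peru"]))  # → True
```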
Vectorization is flexible enough to be applied on larger scales. Remember that N-grams can hold any number of words (n = number of words in the gram), so sentences can be vectorized as well. Or whole articles, if these articles were assigned semantic meaning. In this way, an analyst could analyze all Wikipedia entries or New York Times articles to see which topics are covered most often, which topics are talked about together the most, and how the publication talked about various issues. This could provide valuable insight into how our society perceives different issues, or how brands position themselves. But how does a computer go about learning the distance between words?
Take the example of a student, Sarah, writing a course evaluation. When Sarah is asked about the course, she might write about both final exams and course grading. This prompts the model to associate Sarah’s evaluation with the topics “exam” and “grading”. The model repeats this process with every student in the class and sees which topics are most often associated with each other.
To use an analogy, picture the analyst as a dating coach tasked with leading a speed dating event. Their goal is to match the highest number of couples with shared interests. As the coach, they dictate how many rounds, or iterations, will take place. Each round randomly pairs up two singles to talk. Now for the sake of this example, imagine that all these conversations are about course evaluations and that shared interests are found by grouping words into topics. If a topic makes sense, then that couple would successfully match.
In the first round of conversation, the words that match up may be professionals, material, course (Topic 1); work, articles (Topic 2); and love, textbook (Topic 3). Clearly, these topics don’t make sense because the words inside them don’t share similar semantic meanings. So, the dating coach continues to shuffle singles around for rounds 2, 3, 4, 5...all the way up to n.
After the nth round of speed dating, or after every word has been compared to every other word in a giant matrix, words that have the closest semantic meaning will end up grouped together like a single person finding someone who shares their interests. In the successful round, the words that match up may be course, relevant, useful (Topic 1); material, textbook, articles (Topic 2); and work, professionals (Topic 3). These topics finally make sense, and the singles can leave the event as couples with shared interests.
When a dating coach meets a couple, they may instinctively know if and why a couple works or not. This is called domain knowledge, and the analyst will use theirs to understand that Topic 1 is probably talking about courses that students found relevant and useful, Topic 2 is about course materials, and that Topic 3 may have to do with jobs and alumni success.
It’s important to note that topic modelling uses a probabilistic distribution, so some words might choose to spend weekdays with one Topic and weekends with another!
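For the curious, the speed dating rounds can be sketched as a toy collapsed Gibbs sampler in the spirit of Latent Dirichlet Allocation. The corpus, the number of topics, and the hyperparameters below are all invented, and production work would use a library such as gensim:

```python
import random
from collections import Counter

# Invented tokenized course evaluations.
docs = [
    ["course", "relevant", "useful"],
    ["material", "textbook", "articles"],
    ["work", "professionals"],
    ["course", "useful", "relevant"],
    ["textbook", "material", "articles"],
]
K, ALPHA, BETA, ROUNDS = 3, 0.1, 0.1, 200  # invented hyperparameters
random.seed(0)
vocab = sorted({w for d in docs for w in d})

# Round 0: randomly seat every word occurrence at a topic table.
assign = [[random.randrange(K) for _ in d] for d in docs]
doc_topic = [Counter(z) for z in assign]
topic_word = [Counter() for _ in range(K)]
for d, z in zip(docs, assign):
    for w, k in zip(d, z):
        topic_word[k][w] += 1
topic_total = [sum(tw.values()) for tw in topic_word]

for _ in range(ROUNDS):                  # the "speed dating" rounds
    for i, (d, z) in enumerate(zip(docs, assign)):
        for j, w in enumerate(d):
            k = z[j]                     # pull the word out of its topic
            doc_topic[i][k] -= 1
            topic_word[k][w] -= 1
            topic_total[k] -= 1
            # Re-seat it where both the word and its document fit best.
            weights = [
                (doc_topic[i][t] + ALPHA)
                * (topic_word[t][w] + BETA)
                / (topic_total[t] + BETA * len(vocab))
                for t in range(K)
            ]
            k = random.choices(range(K), weights=weights)[0]
            z[j] = k
            doc_topic[i][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

for t in range(K):
    words = [w for w, c in topic_word[t].most_common() if c > 0]
    print("Topic", t, words)
```

Note that each reassignment is probabilistic, which is exactly why a word can "spend weekdays with one Topic and weekends with another."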
How to Interpret Results
One method of visualizing topic modelling results is to display the distance between topics on a 2D plot. This is called an Inter-Topic Distance map, a form of the Latent Dirichlet Allocation visualization. The size of each circle represents how often the topic comes up in the text, and the distance between circles shows how related the topics are. If two circles overlap, the topics are closely related. In the graph above, topics 2 and 4 share quite a bit of overlap, so comments in topic 2 will likely share some similarities with topic 4. Perhaps topic 2 is about great courses and topic 4 is about textbooks, so an analyst could surmise that the quality of the textbook has a significant impact on students’ perception of the course. The analyst could then dig deeper into the topics to see what about textbooks was closely related to great courses, and then recommend that those practices be implemented to improve course ratings.
A quick way to understand major topics or themes in a large amount of data
Still requires domain knowledge to understand what the topics are about.
The number of topics needs to be estimated using a coherence score, statistical test or best practice.
Sentiment Analysis, or opinion mining, is a technique used to determine how respondents feel about a subject. At its core, sentiment analysis shows whether data is positive, negative, or neutral. In more advanced forms it can be used to recognize basic feelings and emotions (happy, grateful, sad, angry), the urgency of a statement (urgent or not urgent), and the intention of a respondent (enrolling or dropping out). There are 3 basic approaches to sentiment analysis used in NLP that rely on different levels of machine learning. These 3 approaches apply to many more NLP techniques, as the approaches are really just different ways of leveraging machine learning to achieve text analysis results.
In this approach, rules are established by the analyst to tell the computer the meaning behind grammatical structures. These rules take the form of the techniques used in preprocessing, such as stemming, tokenization, and part of speech tags.
The analyst creates a large dictionary of polarized words, maybe a list of positive words and a list of negative ones, or words that describe anger or joy.
The program counts the number of words and tallies up their polarity.
The aggregated score shows the overall sentiment of the response. Other methods can help create weighted averages that account for the length of the comment or the use of strong words.
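These steps can be sketched as follows. The polarity dictionaries are invented and far smaller than a real lexicon; the simple negation flip also hints at why negation words must be kept out of the stop word list:

```python
# Invented polarity dictionaries; real lexicons hold thousands of words.
POSITIVE = {"great", "good", "excellent", "fair", "interesting"}
NEGATIVE = {"bad", "boring", "unfair", "confusing"}

def sentiment(tokens):
    """Tally word polarity; negation flips the next word's sign."""
    score, flip = 0, False
    for t in tokens:
        if t in ("not", "never"):
            flip = True
            continue
        s = 1 if t in POSITIVE else -1 if t in NEGATIVE else 0
        score += -s if flip else s
        flip = False
    return score

print(sentiment(["the", "content", "was", "interesting"]))  # → 1
print(sentiment(["the", "marking", "was", "not", "fair"]))  # → -1
```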
The obvious positive of the rule-based approach is that it is simple to understand and implement, but be warned: as rules are added, models get more complex, and systems require fine-tuning and maintenance to operate, an inefficiency often expressed in dollars.
Machine Learning Approach
This approach leverages machine learning algorithms to counteract some of the long-term maintenance costs and complexity of a rule-based approach, as well as to give the writer of this article something to talk about at dinner parties. This article won’t delve into the science behind machine learning models, but the core concept is simple.
A machine learning model is fed training data, which in this case would consist of comments alongside an overall score, kind of like an online review and its star rating. These “correct answers” allow the model to be trained to categorize words and to recognize what different grammatical structures look like (concepts like irony and sarcasm are the holy grail for these models). The model is then fed raw data, and it uses what it has learned to categorize words and ultimately analyze sentiment.
The obvious strength of this approach is its ability to operate independently; however, this learning takes time, and machine learning algorithms can have spotty results, especially at the start. They are also vulnerable to bias in the training data, online trolling, and review bombing. Picture a naive toddler.
This approach combines rule-based and machine learning approaches to get the best of both worlds, reducing the maintenance costs of a complex rule-based approach while still giving analysts enough control to increase the accuracy of machine learning approaches. An example of a program like this is our tool Unigrams (bear with me). Unigrams is built on the rule-based systems developed by our analysts to analyze higher education data and utilizes machine learning algorithms to take these systems a step further and build its own domain knowledge of word categories and grammatical structures. In the case of Unigrams, this domain knowledge is specific to the words that are important to professionals in Higher Education. For example, at the University of Victoria "SUB" refers to the Student Union Building and not a popular sandwich from the cafeteria. An analyst could tell Unigrams to store this rule in its domain knowledge to make analysis more efficient, but could just as easily remove it without breaking other rules.
So, analysts can use machine learning models to immortalize the things they’ve learned in a computer that can do monotonous tasks, like preprocessing, in seconds. The analyst can then use the machine learning model in new situations they encounter or give it to team members who are solving similar problems. By doing this the analyst can save the time they would usually spend setting up data and use that time to try new methods, solve new problems, or go try that new SUB they keep reading about.
How to Interpret Results
The results of sentiment analysis are easy to read. Since sentiment analysis ranks phrases on a polar scale, bar graphs like the one below can be used to show the distribution of responses, with the average highlighted to show overall sentiment. However, it is important to remember that these are the feelings of the group overall and may not show the thoughts and feelings of various subpopulations. To do that the analyst must segment the data, discussed below.
Gives analysts context for the feelings around a topic.
A quick way to assess the overall feelings of respondents toward a particular topic.
Models struggle to handle implied meaning such as sarcasm.
Works best if the questions are narrowly focused. For example, “tell us what you liked or disliked about this course,” works better if it is split into two questions.
Segmentation in this context refers to the technique in statistical analytics of separating data according to the segment of the survey population it came from. While segmentation itself isn’t an NLP technique, it is very useful in the field of survey analytics since it can be used to break up survey respondents into different subpopulations. The techniques above can then be applied to these subpopulations to discover the thoughts and feelings of different groups on campus, making recommendations more in line with strategic goals.
How to Interpret Results
The method used to interpret results will vary from technique to technique. In the graph below results are split up into different personas, based on demographic information, to see which groups said what. The larger the bar, the greater the number of people from that group whose responses fell into that category.
Segmentation cannot exist in a vacuum. To segment respondents, the survey needs to ask some demographic questions (age, sex, gender, race, sexual orientation, or more specific questions like housing: On-campus or off-campus?). These responses will then be used to match text responses to the groups that gave them. This is one of the reasons why privacy is so important in NLP. If an analyst expects respondents to fill in personal information to improve the quality of analysis, then that analyst must be able to guarantee to the respondent that their information will remain private and will only be used to aggregate responses.
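At its simplest, segmentation is a grouping operation: match each text response to the demographic group that gave it, then run the techniques above per group. A minimal sketch (the survey rows and themes are invented):

```python
from collections import defaultdict

# Invented survey rows pairing a demographic field with a response theme.
rows = [
    {"housing": "on-campus",  "theme": "dining options"},
    {"housing": "off-campus", "theme": "parking"},
    {"housing": "on-campus",  "theme": "dining options"},
    {"housing": "off-campus", "theme": "transit passes"},
]

# Group response themes by the housing segment they came from.
segments = defaultdict(list)
for r in rows:
    segments[r["housing"]].append(r["theme"])

for group, themes in sorted(segments.items()):
    print(group, themes)
```

Each resulting segment can then be fed back through N-gram, topic modelling, or sentiment analysis on its own.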
Gives the analysis context
Gives analysts greater insight into the needs and desires of different sub-populations.
Limited by the socio-demographic questions asked to respondents.
Not appropriate for small populations where respondents risk being identified.
Bias in Natural Language Processing and Computational Linguistics
While NLP is very good for reducing the personal bias of the analyst reading the comments, the field faces some challenges with biases in the machine learning models. These biases are not maliciously programmed by the computer scientists who write these models, they are learned by the model itself from the training data it is fed.
Most of the time, the training data given to a machine learning model takes the form of some text written by human beings for human beings. As such, it contains the bias of the people who wrote it. The model then learns these biases as correct. Whether it is beautiful or ugly, the bias in our machine learning models is a reflection of ourselves, and the society that shapes us. By seeking to understand the bias in our models, we can learn where and how we can improve to make our society a fairer and more equitable place for all peoples.
For example, GPT-3, a model developed by OpenAI, developed inherent biases related to race, gender, and religion. Some of the gender bias was related to occupation. While words like “king” and “queen” are associated with specific genders, words like “computer programmer” and “homemaker” are not. However, as the model was fed training data, it learned that:
Man is to Woman as King is to Queen
Man is to Woman as Computer Programmer is to Homemaker
While the first statement is correct, the second is not, and is problematic if the model is applied to real-world problems, such as hiring.
The principle at work here was first mentioned in the discussion of Topic Modelling above. That discussion included an explanation of how models store words with similar meanings closer together in a process called vectorization. In the case of bias, words that should not have similar meanings are stored close together in error. The graph on the right displays proper storage for the gendered examples of Man is to Woman as King is to Queen. The graph on the left shows the improper storage of the non-gendered words computer programmer and homemaker.
There are some solutions. Since the issue stems from the training data, there are before and after approaches to solving the problem.
De-Bias the data set. This includes removing problematic data sets from use but also compensating for bias. For instance, if a data set included the statement “He was a great computer programmer” then the statement “She was a great computer programmer” would be added to balance the data set. This way, the model learns equal weight for each word.
De-Bias the model. In this method separate algorithms are written that can identify and modify biased statements in the model's memory. Algorithms like the Double-Hard Debias algorithm are showing promising results and are favoured as they allow already running machine learning models to be De-Biased.
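The data set augmentation idea can be sketched like this. The pronoun swap list and the example sentence are minimal illustrations; a real pipeline would cover many more gendered terms and handle casing:

```python
# Counterfactual augmentation: for every sentence containing a gendered
# pronoun, add a copy with the pronouns swapped, so the model sees both
# genders paired with each occupation equally often.
SWAP = {"he": "she", "she": "he", "his": "her", "her": "his"}

def augment(sentences):
    out = list(sentences)
    for s in sentences:
        words = s.lower().split()
        if any(w in SWAP for w in words):
            out.append(" ".join(SWAP.get(w, w) for w in words))
    return out

print(augment(["he was a great computer programmer"]))
# → ['he was a great computer programmer',
#    'she was a great computer programmer']
```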
So, while bias in machine learning models poses a formidable threat to their widespread adoption, steps are already being taken to address and correct these problems. However, there is still a long way to go before a machine learning model can be reliably used for life-or-death situations, like predicting crime.
NLP in the World
This article described a process we use at Kai Analytics, employing NLP techniques to analyze a large amount of qualitative data from survey responses in order to derive insight about what stakeholders care about and use that insight to inform recommendations. But many of these core NLP techniques can be layered with other technologies, like ChatBots, AI, social media, or Word Processors to create other exciting tools. After all, Natural Language Processing (NLP) is nothing more than the process of translating human thought into something a computer can comprehend, communicated through the medium of natural language.
Great developers can also sprinkle NLP concepts into their code to vastly improve the user experience, like the way Grammarly slowly learns to recognize how the writer wants a piece of writing to sound and then makes recommendations so the writer can better achieve their goal.
To the left is the window that pops up when starting to edit a new document in Grammarly. Now, Kai Analytics is in no way associated with Grammarly so we can’t say for sure what is going on here. But it isn’t hard to spot the similarities between these “goals” and the sentiment analysis discussed above. In fact, those “experimental” disclaimers sure do look like someone collecting training data for a machine learning model.
Models like the one used by Grammarly are difficult to program and take a great deal of time and depth of understanding to achieve. But that doesn’t mean the average developer or analyst cannot start to apply these concepts in their own code. To learn how the NLP pipeline discussed above looks in a Python use case, check out this video of our CEO Kai Chang presenting on Topic Modelling and walking participants through some of his own Python code.