Google Sheets Tutorial: Parsing Languages with Google Translate
Updated: Feb 11, 2022
Researchers who design modern global survey studies with open-ended questions will often face the challenge of processing response data that is cross-cultural or multilingual in nature. In this tutorial, we show an easy qualitative data pipeline that uses free methods in Google Sheets.
Not all of us are bilingual, much less polyglots, so when our open-ended response data comes in multiple languages it can be difficult to parse and clean. This situation is typically addressed by sorting data according to language, and then passing it on to language experts who will translate and code it appropriately. This still leaves the problem of sorting thousands of rows of data into languages that you don’t speak yourself.
Google has several tools that can help with this. Everyone knows Google Translate, which has improved recently due to the massive amount of data available for training; however, you may not be familiar with using Google Translate directly in Google Sheets.
In this tutorial, we’ll leverage Google’s NLP model through two formulas in Google Sheets. Learning to quickly parse languages in Sheets will significantly reduce pre-processing time and allow you to focus on analysis and reporting.
In our example, we will use over 1,500 Amazon reviews of the Apple Airpods. You can access the full sheet here: https://docs.google.com/spreadsheets/d/1lT6TZJND1-sGbRpAxOTYXlUt097d2o190zThWy6e9_c/edit?usp=sharing
Using =DETECTLANGUAGE to Parse Languages
The first step is to create a column beside your comments data and type the formula:
Where cell is the reference cell, for example, A28.
The result returns a 2 letter language code known as the ISO-639-1 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)
From Wikipedia, the ISO-639 is a standardized nomenclature used to classify languages. Each language is assigned a two-letter (639-1) and three-letter (639-2 and 639-3) lowercase abbreviation. So “en” is English, “fr” is French, “es” is Spanish, and so forth. The Wikipedia link has a list of all languages.
We can see that most of the reviews are in English, but there are a handful in Spanish; and one of Bulgarian, Catalan, French, Portuguese, and Norwegian. In larger datasets, this kind of information also gives you some insights on your customers and where they are from.
Let’s look at some examples of the results:
Google Translate can Help with Data Analysis
As you can see, Google Translate can detected the language pretty well. In the next step, we will also translate the comments to English: Above, you can already see the results on the right hand side in column D.
To translate the text, we use the Google Translate function in Google Sheets. It is simply:
= googletranslate(text, original language, target language)
In our example, it looks like =googletranslate(A28, B28, C28)
Now it is important to note there are some big limitations to this process. The language detection function tends to work better with text comments that are longer than just one or a few words. This is because the machine translation needs some context, as well as the fact that certain words in English that have the same spelling in French or Spanish, usually all coming from a Latin root. These words are called cognates and miscategorizing them could skew data if there is nuance between meanings.
Furthermore, as you know from coursework in university, using Google Translate is a sure fire away to score 0 on your French assignment. Provided that the resources are available, it is best practice to use these functions to first detect the language to sort the comments and then send the results to a native speaker for parsing and tagging. However, if you do not have that option and the results were pretty clean because the grammar structure was not overly complex, it is possible to use the translation software as part of your text analysis pipeline for tagging or sentiment analysis.