Pre-Trained NLP Models for Asian Languages
Updated: May 18, 2022
May is Asian Heritage Month in Canada (Asian American and Pacific Islander Heritage Month in the USA), a month when we celebrate the contributions Asian-Canadians have made and continue to make. While many NLP researchers focus on English, fantastic tools built for other languages also deserve recognition. That includes tools for Asian languages, which often have beautifully complex scripts and unique linguistic features. In the following list, we share some of these tools with you, just scratching the surface of the amazing resources that exist for free online.
fastHan is an open-source toolkit for NLP in Chinese. It includes Chinese word segmentation (CWS), part-of-speech (POS) tagging, named entity recognition (NER), and dependency parsing. fastHan is based on BERT (Bidirectional Encoder Representations from Transformers), has strong transferability, and lets users fine-tune the model on their own labeled data.
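To give a feel for what Chinese word segmentation involves, here is a minimal sketch in Python. Chinese is written without spaces between words, so a segmenter has to decide where each word ends. The sketch uses forward maximum matching against a tiny hand-made lexicon, which is a classic dictionary baseline and not how fastHan works (fastHan uses a BERT-based model). The function name `forward_max_match` and the `LEXICON` contents are made up for illustration.

```python
def forward_max_match(text, lexicon):
    """Greedy dictionary segmentation: at each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would use a far larger one.
LEXICON = {"喜欢", "自然", "语言", "处理", "自然语言处理"}

if __name__ == "__main__":
    # "I like natural language processing"
    print(forward_max_match("我喜欢自然语言处理", LEXICON))
```

A greedy baseline like this breaks down on ambiguous strings and on words missing from the lexicon, which is part of why learned models such as the one behind fastHan are used in practice.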
While the majority of NLP packages focus on a single language, many South Asian NLP resources, including IndicNLP, cover multiple languages spoken in and around India, Sri Lanka, and Pakistan. IndicNLP is an NLP ecosystem with multilingual tools in over ten languages, including Hindi, Punjabi, and Tamil.
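One practical wrinkle with multilingual toolkits is that the languages involved are often written in different scripts: Hindi usually appears in Devanagari, Punjabi (in India) in Gurmukhi, and Tamil in the Tamil script. The sketch below uses only Python's standard `unicodedata` module to guess the dominant script of a string from Unicode character names. It is an illustration, not part of IndicNLP, and the helper name `dominant_script` is made up. Keep in mind that a script is not the same as a language, since Devanagari, for example, is shared by several languages.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script in `text`, taken from the first word
    of each character's Unicode name (e.g. 'DEVANAGARI LETTER NA')."""
    counts = Counter()
    for ch in text:
        # Count letters and combining marks (vowel signs, virama, etc.).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    for sample in ["नमस्ते", "ਪੰਜਾਬੀ", "தமிழ்"]:
        print(sample, dominant_script(sample))
```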
Created by Megagon Labs and the National Institute for Japanese Language and Linguistics (NINJAL), GiNZA is an open-source Japanese NLP library. GiNZA is built on two crucial technologies: spaCy, an NLP framework with machine learning capabilities, and SudachiPy, a morphological analyzer that performs tokenization. GiNZA is released under the MIT license, which allows anyone to use it, and it is easy to download.
KoNLPy is an excellent Python NLP tool for the Korean language that builds on the strengths of many earlier Korean NLP tools created by numerous researchers. It is designed to be simple and easy to use, and is open-source under a GPL v3 or later license. Thanks to its collaborative community, this tool will likely improve even more in the years to come.
Malaya is another NLP toolkit hosted on GitHub, this time for the Malaysian language, Bahasa Malaysia, also known as Malay. Malaya features augmentation, constituency parsing, part-of-speech tagging, and other capabilities necessary for sentiment analysis.
PyThaiNLP is a natural language processing project for the Thai language. It’s also a Python package for text processing and linguistic analysis. It has many features, including character and word classes, unit segmentation/tokenization, and part-of-speech tagging.
VnCoreNLP is a Vietnamese natural language processing toolkit on GitHub. It's an annotation pipeline that includes word segmentation, POS tagging, named entity recognition (NER), and dependency parsing. It's built to be accurate, fast, and easy to use.
This list does not cover all Asian languages (there are over 2,300) or all the NLP tools available for them. We invite you to explore the options and learn more about the many ways that NLP can be used in different grammars and scripts. Hopefully, tools like these will lead to an increase in NLP research in a wider variety of languages, fostering more diverse analyses.