Tactics for Text Mining non-Roman Scripts
Authors: Hilary Faxon, Ph.D. & Win Moe
With the help of neural networks and natural language processing, interest in data science approaches to textual analysis has exploded in recent years. But much of this research is done in English, or at least in languages written with Roman scripts. While English may be the world’s most common language, Mandarin and Hindi are close behind. For computational social scientists, urgent global issues demand new sorts of literacy. Researchers curious about political attitudes in Russia or Hong Kong will have to tackle challenges of tokenization, encoding, and fonts.
Here, we draw on our experience assembling and text mining a Burmese-language corpus of Facebook posts to discuss approaches and challenges for working with low-resource languages written in non-Roman scripts. This work was part of a larger project exploring rural markets and politics on Myanmar Facebook. As part of it, Hilary’s team used CrowdTangle, a tool for monitoring public content from Facebook, Instagram, and Reddit, to identify popular posts from Burmese-language Facebook groups and pages related to farming and agriculture over 13 weeks between December 2020 and April 2021. After assembling the 2,005-post corpus, Hilary, a former D-Lab Data Science Fellow and primarily a qualitative researcher, teamed up with Berkeley undergraduate computer science major Win Moe to explore the material through text mining.
Myanmar’s historical isolation during decades of dictatorship led to the development of a separate character encoding, Zawgyi, which is incompatible with the Unicode standard. While Unicode has grown in popularity, many Facebook users continue to type in Zawgyi. An initial step was therefore to standardize the encoding of all text by converting it to Unicode with an online converter.
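For researchers who want to script this step rather than rely on a web converter, below is a minimal sketch of one alternative. It assumes Google’s open-source myanmar-tools package for estimating whether a post is encoded in Zawgyi and PyICU’s Zawgyi-my transliterator for converting it; the 0.5 probability threshold is purely illustrative, not a cutoff we used.

    # Sketch: detect likely-Zawgyi text and convert it to Unicode.
    from myanmartools import ZawgyiDetector   # pip install myanmartools
    from icu import Transliterator            # pip install PyICU

    detector = ZawgyiDetector()
    to_unicode = Transliterator.createInstance('Zawgyi-my')

    def normalize(text, threshold=0.5):
        """Convert text to Unicode if it looks like Zawgyi; otherwise return it unchanged."""
        if detector.get_zawgyi_probability(text) > threshold:
            return to_unicode.transliterate(text)
        return text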
Additional challenges for text mining arose due to the underlying structure of Burmese words and sentences. Burmese, a tonal language spoken by over 40 million people, predominantly in the country now known as Myanmar, is part of the Sino-Tibetan language family. Like many languages used across South and Southeast Asia, its script is an abugida, meaning that consonant-vowel sequences are written as units, rather than as separate letters in an alphabet.
For example, the word “Myanmar” combines the sounds:
“m” (မ) + “y” (မြ) + “an” (မြန်)
and then “m” (မ) + “ahh” (မာ)
to produce “Myanmar” (မြန်မာ)
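One way to see this stacking is to inspect the underlying Unicode code points. The short snippet below (an illustration, not part of our pipeline) prints each code point in မြန်မာ, showing that the rendered syllables are built from base consonants plus separate combining signs:

    import unicodedata

    # Each rendered syllable in မြန်မာ combines a base consonant with
    # combining signs (medials, the asat "killer" mark, vowel signs).
    for ch in "မြန်မာ":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")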
In Burmese, small additional strokes can change the entire meaning of a word, and there are no clear word-spacing conventions when forming sentences, so spacing varied widely across public posts. It was therefore important to first delete all spaces in the text. Next, we needed a word segmenter that could run accurately and efficiently; most tools we considered either would not install or did not scale to our texts. We chose a basic word-matching method based on a dictionary text file, which was easily modified to run through the textual database while tagging each word with its date, its source, and whether it came from a page or a group. A sketch of this kind of segmenter appears below.
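The sketch below shows the general shape of dictionary-based word matching; the dictionary file name and the greedy longest-match strategy are illustrative assumptions, and our actual script also tagged each token with its post’s date and source.

    def load_dictionary(path="burmese_words.txt"):
        """Load one Burmese word per line; also return the longest entry's length."""
        with open(path, encoding="utf-8") as f:
            words = {line.strip() for line in f if line.strip()}
        return words, max((len(w) for w in words), default=1)

    def segment(text, words, max_len):
        """Greedy longest-match segmentation over space-stripped text."""
        text = text.replace(" ", "")          # spacing is inconsistent, so drop it
        tokens, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in words:        # prefer the longest dictionary match
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                tokens.append(text[i])        # unknown character: emit it as-is
                i += 1
        return tokens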
One shortcoming of this approach was its inconsistency in breaking compound words into smaller words that carry their own meaning. An example is the Burmese words “mountain” (တောင်) and “person” (သူ): while there are cases in which separating these would be useful, doing so caused problems for us, since their compound (တောင်သူ) is one of several words that mean “farmer”.
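With a longest-match approach like the sketch above, one workaround is to include the compound itself in the dictionary, since the longer entry then wins; for example, using the hypothetical segment function:

    words = {"တောင်", "သူ", "တောင်သူ"}         # include the compound "farmer"
    max_len = max(len(w) for w in words)
    print(segment("တောင်သူ", words, max_len))   # ['တောင်သူ'], kept whole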
We mined a total of 258,000 Burmese words, 13,000 of them unique. We put them into Google Sheets, recording each word’s date and whether it came from a group or a page. Using PivotTables, we grouped words by count, with the top words appearing around 500 times each. Afterwards, we used a word cloud generator website to visualize the results, with each word sized by its count.
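We did this grouping in spreadsheets, but the same counts can be produced in a few lines of code. The sketch below assumes a list of (date, word) pairs produced by a segmenter and splits the counts around February 1, 2021:

    from collections import Counter
    from datetime import date

    def count_by_period(dated_words, cutoff=date(2021, 2, 1)):
        """Count words separately for posts before and on/after the cutoff date."""
        before, after = Counter(), Counter()
        for post_date, word in dated_words:
            (before if post_date < cutoff else after)[word] += 1
        return before, after

    # e.g. before.most_common(50) lists the top words for the pre-coup word cloud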
Figure 1: Words used before February 2021
Figure 2: Words used after February 2021
Caption to Figures: During segmentation, we recorded each word’s date, source, and length. We captured this raw data in Google Sheets and grouped it in a PivotTable.
For part of this research, we were interested in how online discourse shifted after Myanmar’s February 1, 2021 military coup. Word clouds from before and after the coup show the persistence of words related to agriculture, but also the arrival of new, political words such as ‘red’ and ‘peacock’ (the color and symbol of the National League for Democracy, the target of the military coup) and ‘green’ (the color of military uniforms, referring to the military). These findings converged with earlier qualitative coding, which found a spike in political discourse immediately after the coup. While our qualitative analysis was able to pull out and contextualize key examples, text mining provided additional evidence of a widespread shift.
Mixing these methods worked smoothly, in part because each word carried only a few attributes. Our experience points to several needs. First, conventions for segmenting Burmese words should be standardized, and grouping of hyponyms should be considered, so that language models such as ChatGPT can handle the language more easily. Second, tools are needed to help prepare Burmese textual data to be machine-readable, for example by enforcing spacing conventions, so that text mining works more accurately. Lastly, n-gram methods would benefit from better interfaces and integration with data preparation software, enabling more advanced and accurate language processing. New NLP tools for the Ukrainian language offer key examples of efforts to resource data scientists working on pressing questions of global geopolitics that unfold in low-resource languages with non-Roman scripts.