One of the persistent issues we deal with at HumanGeo is determining the language of a block of text. Multiple language detection (LD) libraries are available claiming high accuracy, and since they were built by experts in computational linguistics, building our own wasn’t necessary. Out of the many choices, we were interested in determining the accuracy and performance of the different libraries for detecting the language of tweets. In the past, we have used the following libraries in various projects:

Recently, we needed to perform LD on text in Java, so we focused our efforts on two Java libraries, LangID and LanguageDetection. We ran tests on both to determine their accuracy and performance. Below are the highlights of the process and a discussion of the results.

Process

Testing a classification problem like this requires a prelabeled data set, so we used Twitter data gathered from the Twitter Streaming API. Every message from Twitter contains a language field (lang) holding a two-letter ISO code that Twitter assigns using its own detection process. We understand this classification is not perfect, but we will overlook that for the moment because the large quantity of labeled data available outweighs the issue; we won’t ignore it completely, and plan to study Twitter’s labels at a later time. With adequate data collected (millions of tweets), we take the text of every tweet and remove #hashtags, @mentions, and URLs. These elements consist primarily of English characters, even in tweets written in other languages, and we don’t want them to throw off the language detection. The filtered text is passed to the two language detectors, each of which returns a list of candidate languages ranked from most likely to least likely. For convenience, both libraries expose a “threshold” value that filters the returned list to languages whose score/probability exceeds it.
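As a rough illustration of that thresholding step, here is a minimal sketch using the LanguageDetection result type (the cutoff value passed in is arbitrary for illustration, not a library default):

import com.optimaize.langdetect.DetectedLanguage;
import java.util.List;
import java.util.stream.Collectors;

// Keep only candidate languages whose probability clears the threshold.
static List<DetectedLanguage> filterByThreshold(List<DetectedLanguage> candidates,
                                                double threshold) {
    return candidates.stream()
            .filter(candidate -> candidate.getProbability() > threshold)
            .collect(Collectors.toList());
}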

A quick note on probabilities

Many people request clarification on what the “probabilities” mean. Each language has a probability associated with it. For example, a detection could return the following list of languages: (en: .6, es: .3, fr: .1). These numbers mean English (en) is twice as likely as Spanish (es) to be the language of the text, and six times as likely as French (fr). Similarly, Spanish is three times as likely as French. The numbers are derived from true probabilities, which end up being very small; they are scaled so that they maintain their relative proportions and sum to 1.
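A minimal sketch of that scaling, for illustration only (the libraries perform this normalization internally):

// Scale raw scores so they keep their relative proportions and sum to 1.
static double[] normalize(double[] raw) {
    double sum = 0;
    for (double score : raw) {
        sum += score;
    }
    double[] scaled = new double[raw.length];
    for (int i = 0; i < raw.length; i++) {
        scaled[i] = raw[i] / sum;
    }
    return scaled;
}

Feeding in raw values of (.0006, .0003, .0001) yields (.6, .3, .1), the example above.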

Comparison Code

The following is the high-level portion of code used to perform the main detection and store the results. This is where the actual detection happens. These blocks were derived from code found in the tutorials and examples of the respective libraries.

The first portion just handles retrieving and stripping text from a Twitter message. As stated, hashtags, mentions, and URLs are removed. The lang variable is the “true” language according to Twitter, and detLang holds the detected language from each library. (massageMessage and getLanguage are our own helpers; a sketch of a comparable stripping step follows the snippet.)

String text = massageMessage(message), 
        lang = getLanguage(message),
        detLang;
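A hypothetical version of the stripping step might look like the following (the real massageMessage works on the full message object; this sketch assumes the raw tweet text has already been extracted):

// Remove #hashtags, @mentions, and URLs so their mostly-English characters
// don't skew detection, then collapse any leftover whitespace.
static String stripTweetText(String text) {
    return text.replaceAll("https?://\\S+", " ")  // URLs
               .replaceAll("[@#]\\S+", " ")       // @mentions and #hashtags
               .replaceAll("\\s+", " ")
               .trim();
}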

The following block of code retrieves the top language from LanguageDetection’s detection. If no language meets its default threshold, the code assigns the value und. The code then updates the confusion matrix with the true and detected languages (a sketch of updateMatrix follows the snippet).

List<DetectedLanguage> languageOpt = 
        languageDetector.getProbabilities(textObjectFactory.forText(text));

detLang = languageOpt.isEmpty() ? "und" : languageOpt.get(0).getLocale().getLanguage();

updateMatrix(detectorMatrix, lang, detLang);
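updateMatrix is our own helper. A minimal sketch, assuming the confusion matrix is a nested map of counts keyed by true language and then detected language:

import java.util.HashMap;
import java.util.Map;

// Increment the count for this (true language, detected language) pair.
static void updateMatrix(Map<String, Map<String, Long>> matrix,
                         String trueLang, String detectedLang) {
    matrix.computeIfAbsent(trueLang, key -> new HashMap<>())
          .merge(detectedLang, 1L, Long::sum);
}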

The following block of code uses LangID to perform language detection, similar to the block above. One difference here is that a bit of wrangling/sorting is needed to get the top detected language.

langID.classify(text, true);

List<DetectedLanguage> results = new ArrayList<>(langID.rank(true));
Collections.sort(results, (o1, o2) -> Float.compare(o2.confidence, o1.confidence));

List<String> detectedLangs = results.stream()
        .map(DetectedLanguage::getLangCode)
        .collect(Collectors.toList());

detLang = results.isEmpty() ? "und" : detectedLangs.get(0);
updateMatrix(idMatrix, lang, detLang);

Results - Precision, Recall, F1 Scores - Confusion Matrices

Below we present the F1, recall, and precision scores along with the total number of messages for each language. The most frequent language is English, with 1.4 million messages; Spanish, Portuguese, Japanese, Arabic, Indonesian, Turkish, and Russian round out the top languages, each with at least 100k messages.

There are a few differences in the result statistics between the two libraries. LanguageDetection returns more und values than LangID. Having more undetermined identifications lowers a category’s recall, which shows in several of LanguageDetection’s recall scores, in particular English at under .5. In turn, by declining to guess on ambiguous text, a classifier may improve its precision: LanguageDetection has precision scores above .9 for English and Spanish, among others.

LangID, on the other hand, was willing to sacrifice some precision, but typically had reasonable recall.
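For reference, per-language recall, precision, and F1 can be read straight off the confusion matrix. Here is a sketch, reusing the nested-map layout assumed above (rows are true languages, columns are detected languages):

import java.util.Collections;
import java.util.Map;

// recall    = correct detections / all messages truly in the language (row total)
// precision = correct detections / all messages detected as the language (column total)
// F1        = harmonic mean of precision and recall
static double[] scores(Map<String, Map<String, Long>> matrix, String lang) {
    Map<String, Long> row = matrix.getOrDefault(lang, Collections.emptyMap());
    long truePositives = row.getOrDefault(lang, 0L);
    long rowTotal = row.values().stream().mapToLong(Long::longValue).sum();
    long colTotal = matrix.values().stream()
                          .mapToLong(r -> r.getOrDefault(lang, 0L)).sum();
    double recall = rowTotal == 0 ? 0 : (double) truePositives / rowTotal;
    double precision = colTotal == 0 ? 0 : (double) truePositives / colTotal;
    double f1 = precision + recall == 0 ? 0
              : 2 * precision * recall / (precision + recall);
    return new double[] { recall, precision, f1 };
}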

Overall, which library someone would prefer “out of the box” depends on which metric is more important. Another way to see this: the classifier can either be

  • very certain about its decision, while balking at any ambiguous text, or
  • OK with fielding a guess even though it may be wrong, with a focus on doing better at capturing certain languages.

LanguageDetection

Code Language Recall Precision F1 Total
am Amharic 0 0 0 6
ar Arabic 0.632 0.994 0.772703567 128462
bg Bulgarian 0.464 0.06 0.106259542 1010
bn Bengali 0.718 0.99 0.83234192 1274
ckb Central Kurdish 0 0 0 22
cs Czech 0.189 0.14 0.160851064 2386
cy Welsh 0.427 0.025 0.047234513 1808
da Danish 0.231 0.06 0.095257732 3577
de German 0.422 0.147 0.218045694 16680
el Greek, Modern (1453-) 0.545 0.914 0.68283756 5124
en English 0.467 0.942 0.624434351 1402991
es Spanish; Castilian 0.37 0.957 0.533669932 482062
et Estonian 0.122 0.064 0.083956989 9448
eu Basque 0.353 0.091 0.144698198 2668
fa Persian 0.483 0.122 0.194796694 2288
fi Finnish 0.27 0.057 0.09412844 3148
fr French 0.456 0.679 0.545592952 82573
gu Gujarati 0.703 0.922 0.797742769 101
he Hebrew 0.725 0.986 0.83559322 1539
hi Hindi 0.139 0.908 0.241092646 7076
ht Haitian; Haitian Creole 0.16 0.022 0.038681319 6180
hu Hungarian 0.305 0.09 0.138987342 1624
hy Armenian 0 0 0 22
id Indonesian 0.258 0.7 0.377035491 126543
is Icelandic 0.272 0.138 0.183102439 1259
it Italian 0.458 0.326 0.380887755 31264
ja Japanese 0.89 0.999 0.941355214 213352
ka Georgian 0 0 0 14
km Central Khmer 0.96 1 0.979591837 25
kn Kannada 0.273 0.6 0.375257732 55
ko Korean 0.619 0.437 0.512316288 6776
lo Lao 0 0 0 5
lt Lithuanian 0.296 0.081 0.127193634 1477
lv Latvian 0.323 0.282 0.301110744 3248
ml Malayalam 0.441 1 0.612074948 111
mr Marathi 0.338 0.349 0.343411936 198
my Burmese 0 0 0 7
ne Nepali 0.399 0.848 0.542665597 1118
nl Dutch; Flemish 0.352 0.223 0.273029565 16497
no Norwegian 0.244 0.041 0.070203509 2838
or Oriya 0 0 0 12
pa Panjabi; Punjabi 0.2 1 0.333333333 25
pl Polish 0.424 0.445 0.43424626 11232
ps Pushto; Pashto 0 0 0 61
pt Portuguese 0.494 0.869 0.629913426 327847
ro Romanian; Moldavian; Moldovan 0.166 0.035 0.057810945 2640
ru Russian 0.483 0.962 0.643108651 104784
sd Sindhi 0 0 0 6
si Sinhala; Sinhalese 0 0 0 108
sl Slovenian 0.388 0.03 0.05569378 1305
sr Serbian 0.454 0.046 0.083536 679
sv Swedish 0.333 0.188 0.240322457 7545
ta Tamil 0.569 0.989 0.72238896 1405
te Telugu 0.565 0.929 0.702657296 23
th Thai 0.82 0.976 0.891224944 65742
tl Tagalog 0.518 0.574 0.544564103 72644
tr Turkish 0.491 0.937 0.644351541 131587
uk Ukrainian 0.251 0.481 0.32986612 10654
und Undetermined 0.818 0.147 0.249214508 269358
ur Urdu 0.38 0.371 0.375446072 2323
vi Vietnamese 0.078 0.046 0.057870968 2767
zh Chinese 0.387 0.54 0.450873786 3964


LangID

Code Language Recall Precision F1 Total
am Amharic 1 0.001 0.001998002 6
ar Arabic 0.837 0.952 0.890803801 127625
bg Bulgarian 0.535 0.067 0.119086379 949
bn Bengali 0.998 0.231 0.375163548 1262
ckb Central Kurdish 0 0 0 20
cs Czech 0.391 0.156 0.22302011 2532
cy Welsh 0.098 0.063 0.076695652 2202
da Danish 0.31 0.098 0.148921569 3717
de German 0.783 0.303 0.436922652 17057
el Greek, Modern (1453-) 1 0.782 0.877665544 5119
en English 0.899 0.792 0.842114725 1472971
es Spanish; Castilian 0.853 0.907 0.879171591 548801
et Estonian 0.066 0.065 0.065496183 10684
eu Basque 0.132 0.066 0.088 3364
fa Persian 0.892 0.14 0.242015504 2259
fi Finnish 0.599 0.119 0.198554318 3250
fr French 0.832 0.696 0.757947644 86063
gu Gujarati 1 0.352 0.520710059 101
he Hebrew 0.992 0.145 0.253016711 1609
hi Hindi 0.518 0.695 0.59358615 7340
ht Haitian; Haitian Creole 0.015 0.127 0.026830986 9366
hu Hungarian 0.413 0.118 0.183555556 1882
hy Armenian 0.826 0.024 0.046644706 23
id Indonesian 0.49 0.777 0.600994475 125555
is Icelandic 0.499 0.294 0.370002522 1461
it Italian 0.741 0.394 0.514456388 33668
ja Japanese 0.942 0.915 0.928303716 217173
ka Georgian 1 0.004 0.007968127 14
km Central Khmer 0.96 0.002 0.003991684 25
kn Kannada 1 0.188 0.316498316 55
ko Korean 0.964 0.598 0.738120359 6959
lo Lao 1 0.005 0.009950249 5
lt Lithuanian 0.187 0.042 0.068593886 1848
lv Latvian 0.505 0.385 0.436910112 3653
ml Malayalam 1 0.227 0.37000815 111
mr Marathi 0.813 0.144 0.244664577 198
my Burmese 0 0 0 7
ne Nepali 0.743 0.465 0.572011589 1118
nl Dutch; Flemish 0.758 0.49 0.595224359 18726
no Norwegian 0.34 0.138 0.196317992 2666
or Oriya 1 0.05 0.095238095 12
pa Panjabi; Punjabi 1 0.174 0.296422487 25
pl Polish 0.758 0.51 0.609747634 12075
ps Pushto; Pashto 0.59 0.02 0.038688525 61
pt Portuguese 0.774 0.921 0.841125664 354859
ro Romanian; Moldavian; Moldovan 0.124 0.048 0.069209302 2622
ru Russian 0.834 0.945 0.886037099 101405
sd Sindhi 0 0 0 5
si Sinhala; Sinhalese 1 0.076 0.141263941 122
sl Slovenian 0.475 0.074 0.128051002 1247
sr Serbian 0.766 0.051 0.095632803 552
sv Swedish 0.703 0.355 0.471767486 7750
ta Tamil 0.996 0.647 0.784433354 1405
te Telugu 1 0.113 0.203054807 23
th Thai 0.999 0.827 0.904899233 65745
tl Tagalog 0.398 0.731 0.515390611 82385
tr Turkish 0.883 0.971 0.924911543 124692
uk Ukrainian 0.541 0.486 0.512027264 10490
und Undetermined 0 0 0 292757
ur Urdu 0.92 0.174 0.292650823 2325
vi Vietnamese 0.099 0.073 0.084034884 3096
zh Chinese 0.986 0.074 0.137667925 4191


Confusion Matrices

Below are the visual confusion matrices representing the categorization totals for each language. The rows are labeled with the “true” language, while the columns are labeled with the detected language. Each cell is color- and transparency-coded to represent its weight relative to the other cells in its row. The mapping is not linear (e.g., a value of .1 to a transparency of .9) but log-scaled, to bring out low counts and make any potential clusters visible.
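The exact mapping is a presentation detail, but the idea is roughly the following (a sketch; the row-max normalization is an assumption, not the exact code behind the figures):

// Map a cell count to an opacity in [0, 1] on a log scale, relative to the
// largest count in its row, so that small counts remain visible.
static double opacity(long count, long rowMax) {
    if (count <= 0 || rowMax <= 0) {
        return 0.0;
    }
    return Math.log1p(count) / Math.log1p(rowMax);
}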

The diagonals are rendered in blue because, with logarithmic scaling, it would not be appropriate to compare the values of correct detections against the incorrect detections in this view; that type of analysis is better done using the recall/precision figures shown above. The matrix is meant to show which other languages are being detected in place of the true language.

The confusion matrices are available for download as CSV files: LangID, LangDetection

The frequency ranking in the dropdown sorts according to the number of “true” texts in each row. As stated before, the top rows are English, Spanish, Portuguese, Japanese, Arabic, Indonesian, Turkish, and Russian. Additionally, there is an und row for the “true” language of messages that Twitter has not identified.

Some interesting false-detection patterns:

  • When Arabic messages get detected incorrectly, they are usually tagged as Pashto, Urdu, or Farsi
  • For Japanese messages, Chinese is the culprit
  • For English: French, German, Spanish, Italian, and Chinese (??)
  • For Russian: Bulgarian, Serbian, Ukrainian
  • For Korean: Japanese, Thai

LanguageDetection Confusion Matrix

LangID Confusion Matrix

Concluding Remarks

This study was an overview of two Java-based language detection libraries. Though there is no clear indication of a “better” library, our preference is to use LangID “out of the box”, because it has reasonable recall scores for many languages.

We will delve into other issues, such as the true accuracy of Twitter’s language detection, in the future. This will help us create a proper gold-standard test (and maybe training) set for future studies. This study is meant to reiterate the fact that no machine learning classifier is perfect. It is also a helpful push to anyone interested in this problem and looking to contribute, since the source code for these libraries is on github.com and available for modification.

Additional Notes:

Twitter is still tagging messages as in for Indonesian and iw for Hebrew; these are deprecated ISO codes (the current codes are id and he). This has been reported to Twitter.