ULI SHARED TASK

We were accepting submissions until the end of the evaluation phase of the VarDial Evaluation Campaign 2021 on February 2, 2021. All participants who submitted results are invited to submit a system description paper to appear in the proceedings of VarDial 2021.

The results list may be updated again after the workshop, which will be held in April (on the 19th or 20th, as of current knowledge).

Current Top Results

ULI-RLE

Rank Team name Link to paper Method Relevant macro F1
1 NRC VarDial2021 Probabilistic classifier (similar to Naive Bayes) using character 5-grams 0.8138
2 Phlyers VarDial2021 Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5 0.8085
3 SUKI baseline HeLI 0.8004
4 Phlyers VarDial2021 Naive Bayes classifier trained on character 5-grams 0.7977
5 LAST VarDial2021 Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25 0.7758
6 LAST VarDial2021 Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25 0.7755
7 Phlyers VarDial2021 SVM binary classifier (char n-grams 3-4) followed by Naive Bayes classifier (char n-grams 3-5) 0.7740
8 LAST VarDial2021 Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25 0.7727
9 Phlyers VarDial2021 Naive Bayes classifier trained on character 3-grams and 4-grams 0.7584
10 NRC VarDial2021 BERT-style deep neural network with early stopping 0.7430
11 NRC VarDial2021 BERT-style deep neural network 0.6866
12 Phlyers VarDial2021 SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) 0.6783
13 CRF-LI22 CRF with many different types of features 0.6140
14 NRC VarDial2020 Deep neural network with adaptation to the test set 0.2996
15 NRC VarDial2020 Ensemble of 6 deep neural networks 0.2872
16 NRC VarDial2020 Deep neural network 0.2514

ULI-RSS

Rank Team name Link to paper Method Relevant micro F1
1 NRC VarDial2021 Probabilistic classifier (similar to Naive Bayes) using character 5-grams 0.9668
2 SUKI baseline HeLI 0.9632
3 NRC VarDial2021 BERT-style deep neural network with early stopping 0.9530
4 LAST VarDial2021 Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25 0.9496
5 LAST VarDial2021 Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25 0.9492
6 LAST VarDial2021 Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25 0.9484
7 CRF-LI22 CRF with many different types of features 0.8693
8 Phlyers VarDial2021 SVM binary classifier (char n-grams 3-4) followed by Naive Bayes classifier (char n-grams 3-5) 0.8389
9 NRC VarDial2021 BERT-style deep neural network 0.8177
10 Phlyers VarDial2021 SVM binary classifier (char n-grams 5-7) followed by Naive Bayes classifier (char n-grams 3-5) 0.7595
11 Phlyers VarDial2021 Naive Bayes classifier trained on character 5-grams 0.5934
12 Phlyers VarDial2021 Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5 0.5932
13 NRC VarDial2020 Ensemble of 6 deep neural networks 0.2596
14 NRC VarDial2020 Deep neural network with adaptation to the test set 0.1547
15 NRC VarDial2020 Deep neural network 0.1359

ULI-178

Rank Team name Link to paper Method Macro F1
1 SUKI baseline HeLI 0.9252
2 LAST VarDial2021 Logistic Regression classifier trained on word internal char n-grams 1-4 weighted with BM25 0.9164
3 LAST VarDial2021 Majority vote ensemble of three Logistic Regression classifiers trained on char n-grams 1-3 weighted with BM25 0.9131
4 LAST VarDial2021 Logistic Regression classifier trained on char n-grams 1-3 weighted with BM25 0.9125
5 NRC VarDial2021 Probabilistic classifier (similar to Naive Bayes) using character 5-grams 0.9079
6 NRC VarDial2021 BERT-style deep neural network with early stopping 0.9039
7 Phlyers VarDial2021 Ensemble of SVM and Naive Bayes classifiers using character n-grams 3-5 0.8847
8 Phlyers VarDial2021 Naive Bayes classifier trained on character 5-grams 0.8831
9 Phlyers VarDial2021 Naive Bayes classifier trained on character 3-grams and 4-grams 0.8753
10 CRF-LI22 CRF with many different types of features 0.8644
11 NRC VarDial2021 BERT-style deep neural network 0.8366
12 NRC VarDial2020 Deep neural network with adaptation to the test set 0.6751
13 NRC VarDial2020 Deep neural network 0.6628
14 NRC VarDial2020 Ensemble of 6 deep neural networks 0.6356

Training and testing

Read the task descriptions below. You can download the training data from here and the test data from here. If you have any questions, or when you wish to have your results evaluated, contact the first author of this article.

Task description

As training data for the relevant languages, we use the Wanca 2016 corpus. In total, the corpus contains 646,043 unique sentences, ranging from 19 sentences of Kemi Sami to 214,225 sentences of Northern Sami. The source version of the corpus can be downloaded from urn:nbn:fi:lb-2020022902. The test data includes new sentences from the as yet unpublished Wanca 2017 corpus and will be provided to the participants by the task organizers at the beginning of the evaluation period. Not all of the 29 relevant languages in the training set are attested in the test set: the distribution of languages in the test set is close to the actual distribution of new sentences in the forthcoming Wanca 2017 corpus.

In addition to the relevant languages, the test set includes sentences in 149 other languages. The three largest Uralic languages have been included in this category. The download links for the training data for these non-relevant languages are distributed by the task organizers only to participating teams. In total, the training data for this task consists of 63,772,445 sentences in non-relevant languages and 646,043 sentences in relevant languages, i.e. 64,418,488 sentences altogether.

The training data for both the relevant and the non-relevant languages must be considered noisy; for example, some sentences will be incorrectly labeled (not intentionally, though). The Wanca 2016 corpus includes an HTTP address for each sentence, and the form of these addresses themselves can be used in the task as well. For example, our current pipeline allows only one of two close languages to be found on the same page, and this kind of information can be used to clean the corpora if deemed helpful by the participants.
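
As a purely illustrative example of such cleaning, the sketch below groups sentences by their source page and flags pages whose sentences mix two close languages. The record layout (sentence, language code, URL) and the pair of confusable codes are assumptions made for this example, not part of the official data description.

from collections import defaultdict

# Hypothetical pair of close languages, used only for illustration.
CONFUSABLE_PAIRS = [frozenset({"olo", "lud"})]

def flag_mixed_pages(rows):
    # rows: iterable of (sentence, language_code, url) tuples (assumed layout).
    # Returns the URLs whose sentences mix two languages from a confusable
    # pair; such pages could be inspected or relabeled before training.
    labels_per_page = defaultdict(set)
    for _sentence, label, url in rows:
        labels_per_page[url].add(label)
    return {url for url, labels in labels_per_page.items()
            if any(pair <= labels for pair in CONFUSABLE_PAIRS)}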

The shared task is divided into three tracks. All of the tracks are closed, so no data or models other than the 64,418,488 sentences in the training set can be used for training. All the tracks use the same training data.

Track 1: ULI-RLE (Relevant languages as equals)

The first track of the shared task considers all the relevant languages equal in value, and the aim is to maximize their average F-score. This is important when one is interested in finding even the very rare languages included in the set of relevant languages. The F-score is calculated as a macro-F1 score over the relevant languages in the training set. For example, if you predict a relevant language that does not appear in the test set at all, your precision, and thus your F1-score, for that language is zero. The result is the average of the F1-scores of all 29 relevant languages.
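
As a rough illustration of this scoring (not the official evaluation script), the macro-F1 over the relevant languages can be computed with scikit-learn as sketched below, assuming gold and pred are per-sentence lists of ISO 639-3 codes and relevant is the set of 29 relevant codes.

from sklearn.metrics import f1_score

def uli_rle_score(gold, pred, relevant):
    # Macro-F1 averaged over the relevant languages only. A relevant language
    # that is predicted but never occurs in the gold labels has precision 0,
    # and with zero_division=0 its per-language F1 counts as 0 in the average.
    return f1_score(gold, pred, labels=sorted(relevant),
                    average="macro", zero_division=0)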

Track 2: ULI-RSS (Relevant sentences as equals)

The second track considers each test-set sentence that is written in, or is predicted to be in, a relevant language as equal. Compared to the first track, this track gives less importance to the very rare languages, as their per-language precision has less effect on the resulting F-score. The resulting F-score is calculated as a micro-F1 over the test-set sentences that are in relevant languages, as well as those that you have predicted to be in relevant languages.
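
Under the same assumptions as above (an illustrative sketch, not the official evaluation script), the micro-F1 restricted to the relevant languages could be computed as follows.

from sklearn.metrics import f1_score

def uli_rss_score(gold, pred, relevant):
    # Micro-F1 with the label set restricted to the relevant languages: true
    # positives, false positives and false negatives are pooled over the
    # relevant codes only, so sentences that are neither in nor predicted to
    # be in a relevant language do not affect the score.
    return f1_score(gold, pred, labels=sorted(relevant),
                    average="micro", zero_division=0)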

Track 3: ULI-178 (All 178 languages as equals)

In the first two tracks, the non-relevant languages are not distinguished from each other when the F1-scores are calculated. The third track, however, does not concentrate especially on the 29 relevant languages; instead, the target is to maximize the average F-score over all 178 languages present in the training set. This makes it the language identification shared task with the largest number of languages to date (ALTW 2010 included 74 languages). The F-score is calculated as a macro-F1 score over all the languages in the training set.
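
Again as an illustrative sketch under the same assumptions (not the official script), the ULI-178 score is a plain macro-F1 over all 178 training labels.

from sklearn.metrics import f1_score

def uli_178_score(gold, pred, all_languages):
    # Macro-F1 averaged over every language in the training set, so the 149
    # non-relevant languages are weighted exactly like the 29 relevant ones.
    return f1_score(gold, pred, labels=sorted(all_languages),
                    average="macro", zero_division=0)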

Languages

The training set contains sentences in the 178 languages below.

The 29 relevant languages are:

The 149 non-relevant languages are:

Afrikaans (afr), Tosk Albanian (als), Amharic (amh), Arabic (ara), Assamese (asm), North Azerbaijani (azj), Bashkir (bak), Bavarian (bar), Central Bikol (bcl), Belarusian (bel), Bengali (ben), Bosnian (bos), Bishnupriya (bpy), Breton (bre), Bulgarian (bul), Catalan (cat), Cebuano (ceb), Czech (ces), Chechen (che), Chuvash (chv), Mandarin Chinese (cmn), Corsican (cos), Welsh (cym), Danish (dan), German (deu), Dimli (diq), Dhivehi (div), Standard Estonian (ekk), Modern Greek (ell), English (eng), Esperanto (epo), Basque (eus), Extremaduran (ext), Faroese (fao), Finnish (fin), French (fra), Western Frisian (fry), Irish (gle), Galician (glg), Manx (glv), Goan Konkani (gom), Guarani (grn), Swiss German (gsw), Gujarati (guj), Haitian (hat), Hebrew (heb), Fiji Hindi (hif), Hindi (hin), Croatian (hrv), Upper Sorbian (hsb), Hungarian (hun), Ido (ido), Iloko (ilo), Interlingua (ina), Indonesian (ind), Icelandic (isl), Italian (ita), Javanese (jav), Japanese (jpn), Kalaallisut (kal), Kannada (kan), Georgian (kat), Kazakh (kaz), Kirghiz (kir), Korean (kor), Karachay-Balkar (krc), Kölsch (ksh), Latin (lat), Latvian (lav), Limburgan (lim), Lithuanian (lit), Lombard (lmo), Luxembourgish (ltz), Ganda (lug), Lushai (lus), Malayalam (mal), Marathi (mar), Minangkabau (min), Macedonian (mkd), Malagasy (mlg), Maltese (mlt), Mongolian (mon), Maori (mri), Mirandese (mwl), Mazanderani (mzn), Low German (nds), Nepali (nep), Newari (new), Dutch (nld), Norwegian Nynorsk (nno), Norwegian Bokmål (nob), Pedi (nso), Occitan (oci), Oriya (ori), Ossetian (oss), Pampanga (pam), Panjabi (pan), Iranian Persian (pes), Pfaelzisch (pfl), Piemontese (pms), Western Panjabi (pnb), Polish (pol), Portuguese (por), Pushto (pus), Quechua (que), Romansh (roh), Romanian (ron), Russian (rus), Yakut (sah), Sicilian (scn), Scots (sco), Samogitian (sgs), Sinhala (sin), Slovak (slk), Slovenian (slv), Shona (sna), Somali (som), Southern Sotho (sot), Spanish (spa), Sardinian (srd), Serbian (srp), Sundanese (sun), Swahili (swa), Swedish (swe), Tamil (tam), Tatar (tat), Telugu (tel), Tajik (tgk), Tagalog (tgl), Thai (tha), Tsonga (tso), Turkmen (tuk), Turkish (tur), Uighur (uig), Ukrainian (ukr), Urdu (urd), Northern Uzbek (uzn), Venetian (vec), Vietnamese (vie), Vlaams (vls), Volapük (vol), Walloon (wln), Wu Chinese (wuu), Xhosa (xho), Mingrelian (xmf), Yiddish (yid), Zeeuws (zea), Standard Malay (zsm), Zulu (zul).