The TweetLID shared task consists of identifying the language or languages in which tweets are written. Centered on events and news in the Iberian Peninsula, the task focuses on identifying tweets written in the top five languages of the Peninsula (Basque, Catalan, Galician, Spanish, and Portuguese), as well as English.
We will provide the participants of the task with a training corpus of approximately 15,000 tweets manually annotated with their language(s). The participants will have a month to develop and tune their language identification systems on this training corpus. They will then have to apply their system to the test set and submit its output, which will be evaluated and compared with the other participants’ systems.
It is worth noting that some tweets are written in more than one language (e.g., partly in Portuguese and partly in Galician), and that the language cannot be determined in some cases (e.g., “jajaja”). The corpus takes these specific cases into account, providing annotations such as “ca+es” (written in Catalan and Spanish), “ca/es” (it can be either Catalan or Spanish; it does not make a difference in this case), “other” (it is written in a language that is not considered in the task), or “und” (the language cannot be determined).
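The annotation scheme above can be read mechanically. The following is a minimal sketch, not part of the task's tooling: the names `Annotation` and `parse_annotation` are hypothetical, and it assumes that labels are plain strings using only the “+” and “/” separators and the “other” and “und” categories described above.

```python
from typing import FrozenSet, NamedTuple


class Annotation(NamedTuple):
    # kind is one of: "single", "multi", "ambiguous", "out_of_scope"
    kind: str
    languages: FrozenSet[str]


def parse_annotation(label: str) -> Annotation:
    """Split a corpus label into its kind and its set of language codes."""
    label = label.strip().lower()
    if label in {"other", "und"}:
        # Outside the scope of the task, or impossible to determine.
        return Annotation("out_of_scope", frozenset({label}))
    if "+" in label:
        # Written in several languages at once, e.g. "ca+es".
        return Annotation("multi", frozenset(label.split("+")))
    if "/" in label:
        # Could be any one of several languages, e.g. "ca/es".
        return Annotation("ambiguous", frozenset(label.split("/")))
    return Annotation("single", frozenset({label}))
```

For example, `parse_annotation("ca+es")` yields the kind `"multi"` with the languages `ca` and `es`, while `parse_annotation("ca/es")` yields the kind `"ambiguous"` with the same two languages.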
The evaluation will take into account that not all languages are equally popular. To deal with this imbalance, we will compute the precision, recall, and F1 score for each language, and then compute a global average over all languages. This is intended to give higher scores to systems that perform well for many languages, rather than to those that perform very well only in the most popular languages, such as Spanish and Portuguese.
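To illustrate why a per-language average rewards broad coverage, here is a minimal sketch of macro-averaged precision, recall, and F1. It is not the official evaluation script: the function name `macro_scores` is hypothetical, each tweet is assumed to carry exactly one gold label and one predicted label, and the code “eu” for Basque is an assumption about the label set.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def macro_scores(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average per-language precision, recall and F1 over the gold languages."""
    tp, fp, fn = Counter(), Counter(), Counter()
    gold_languages = set()
    for gold, predicted in pairs:
        gold_languages.add(gold)
        if gold == predicted:
            tp[gold] += 1
        else:
            fp[predicted] += 1  # wrongly claimed for the predicted language
            fn[gold] += 1       # missed for the gold language

    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for lang in gold_languages:
        # Undefined ratios (no predictions or no gold tweets) count as 0.
        p = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        r = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        totals["precision"] += p
        totals["recall"] += r
        totals["f1"] += f

    n = len(gold_languages) or 1
    return {name: value / n for name, value in totals.items()}


# A system that answers "es" for every tweet: 8 Spanish and 2 Basque tweets.
pairs = [("es", "es")] * 8 + [("eu", "es")] * 2
scores = macro_scores(pairs)
```

In this toy case the system labels 80% of the tweets correctly, yet its macro-averaged precision is 0.4 and its macro-averaged recall is 0.5, because it scores zero on Basque.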
Precision will then measure the proportion of a system’s guesses that are correct. To determine whether a system’s output for a tweet is correct, we will compare it with the manually annotated ground truth when the latter is a single language. When the ground truth includes more than one language or belongs to another category, we will rely on the following criteria:
- For tweets in more than one language, we will consider the number of languages identified by the system, e.g., for a tweet annotated as “ca+es”, a system that outputs just “ca” will get a precision value of 0.5 for that tweet, while a system that identifies both languages will get a precision of 1.
- For ambiguous tweets that could have been written in any of a set of languages, any of the responses will be deemed correct, e.g., for a tweet annotated as “ca/es”, both “ca” and “es” will be deemed correct.
- The categories “other” and “und” include tweets beyond the scope of this task, although it is inevitable that some of them end up in the data collection. Therefore, we are providing two different annotations for those, “other” and “und”, in the training corpus, for whoever wants to make use of these annotations separately. However, we will not differentiate these two categories in the evaluation, and “other” and “und” will be equivalent for evaluation purposes.
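The three criteria above can be sketched as a per-tweet scoring function. This is only an illustration under stated assumptions, not the official scorer: the name `tweet_score` is hypothetical, a system is assumed to report several languages joined by “+”, and extra languages in the system output are not penalized, since the criteria above do not say how to treat them.

```python
OUT_OF_SCOPE = {"other", "und"}


def tweet_score(gold: str, predicted: str) -> float:
    """Score one system output against one gold annotation, in [0, 1]."""
    gold = gold.strip().lower()
    predicted_langs = set(predicted.strip().lower().split("+"))

    if gold in OUT_OF_SCOPE:
        # "other" and "und" are equivalent for evaluation purposes.
        return 1.0 if predicted_langs <= OUT_OF_SCOPE else 0.0

    if "/" in gold:
        # Ambiguous tweet: any single one of the listed languages is correct.
        options = set(gold.split("/"))
        is_single_option = len(predicted_langs) == 1 and predicted_langs <= options
        return 1.0 if is_single_option else 0.0

    # Single-language or multilingual tweet: fraction of the gold languages
    # that the system identified (assumption: extra languages are ignored).
    gold_langs = set(gold.split("+"))
    return len(gold_langs & predicted_langs) / len(gold_langs)
```

With this sketch, `tweet_score("ca+es", "ca")` gives 0.5, `tweet_score("ca/es", "es")` gives 1.0, and `tweet_score("other", "und")` gives 1.0, matching the examples in the criteria.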