A corpus has been generated specifically for the TweetLID workshop. An initial collection of tweets was gathered during March 2014, including tweets from four regions in the Iberian Peninsula where two official languages coexist. Those four regions are:
- Province of Gipuzkoa: Basque and Spanish tweets collected.
- Province of Lugo: Galician and Spanish tweets collected.
- Province of Girona: Catalan and Spanish tweets collected.
- Portugal: Portuguese and Spanish tweets collected.
Two data sets have been generated from the initial corpus: a development set composed of 15000 tweets and a test set composed of 15000 tweets.
Due to restrictions in the Twitter API Terms of Service, it is forbidden to redistribute a corpus that includes text contents or information about users. However, redistribution is permitted if those fields are removed and IDs (including tweet IDs and user IDs) are provided instead. The actual message content can be easily obtained by querying the Twitter API with the tweet ID. The script for downloading tweets, available in the Downloads section, provides this functionality.
Once the participation period comes to an end, we will check which tweets are still publicly available at that moment, so that we can generate the final subset of tweets that will be used as a reference for evaluation purposes. The reference subset will vary slightly from the initial set, given that some tweets tend to become unavailable for different reasons.
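The construction of the reference subset described above amounts to keeping only the annotated tweets whose IDs can still be retrieved. The following is a minimal sketch of that filtering step, not the organisers' actual procedure: the function name `build_reference_subset`, the dictionary layout, and the IDs and labels in the example are all assumptions made for illustration.

```python
def build_reference_subset(annotations, available_ids):
    """Keep only the annotated tweets whose IDs are still retrievable.

    annotations:   dict mapping tweet ID -> language label (assumed layout)
    available_ids: iterable of tweet IDs that could still be downloaded
    """
    available = set(available_ids)
    return {
        tweet_id: label
        for tweet_id, label in annotations.items()
        if tweet_id in available
    }


# Made-up IDs and labels, for illustration only.
annotations = {"1001": "es", "1002": "es+eu", "1003": "und"}
still_available = ["1001", "1003"]

reference = build_reference_subset(annotations, still_available)
print(reference)
```

Tweets that have become unavailable (here the hypothetical ID "1002") simply drop out of the reference, which is why the final subset can be slightly smaller than the initial set.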
- Along with the tweet IDs and user names, the file also provides the manually annotated language(s) of the tweet. The annotation uses the following names for languages:
- eu: Basque.
- ca: Catalan.
- gl: Galician.
- es: Spanish.
- en: English.
- pt: Portuguese.
- other: A different language from those listed above (e.g., French).
- und: Undeterminable, which means that the text of the tweet includes words that are widely used in any of the languages considered in the task, which makes it impossible to determine the language being used in that specific case.
- Some tweets are annotated with more than one language, as follows:
- es/gl/pt: when a tweet is annotated with two or more languages separated by slashes, it means that the text of the tweet may have been written in any of those languages. For evaluation purposes, any of the languages will be deemed correct.
Final True Detective. Pssss ca/en/es/gl/pt
- es+eu: when a tweet is annotated with two or more languages separated by plus signs, it means that the text of the tweet contains parts in each of those languages. For evaluation purposes, the more of the annotated languages a system finds, the higher the precision value will be (finding only “es” for an “es+eu” tweet will be scored with 0.5 precision for that tweet).
Qeeeee matadaaa da Biyar laneaaaa.... es+eu
Acho que vi a Ramona hoje but im not sure pt+en
- Note that the corpus includes only tweets with at least one word (i.e., a string made up entirely of a-z characters), and that #hashtags and @user mentions have not been considered in the annotation of a tweet.
- Note that Named Entities are not considered for language identification; we assume a NER system should be able to identify NEs in their original language.
Para los que hallan visto los ultimos cap de 'the walking dead' ... cagate lorito es
- Finally, note that we will not differentiate “other” and “und” for evaluation purposes, as we will consider both to be categories including tweets that cannot be assigned one or more of the considered languages.
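The annotation conventions above can be read mechanically: plus signs separate languages that are all present in the tweet, and slashes separate languages that are equally acceptable. The sketch below parses a label under that reading and computes the fraction of annotated parts that a system output covers. It is not the official evaluation script; the function names are hypothetical, treating "+" as binding more loosely than "/" in a mixed label is an assumption, and merging "und" into "other" is just one way to avoid differentiating the two.

```python
from __future__ import annotations

# "other" and "und" are not differentiated; mapping both to "other" is an assumption.
NON_TARGET = {"other", "und"}


def normalise(code: str) -> str:
    """Lower-case a language code and merge 'und' into 'other'."""
    code = code.strip().lower()
    return "other" if code in NON_TARGET else code


def parse_annotation(label: str) -> list[set[str]]:
    """Split a gold label into required parts, each a set of alternatives.

    "es+eu"    -> [{"es"}, {"eu"}]      (both languages are present)
    "es/gl/pt" -> [{"es", "gl", "pt"}]  (any one of them is correct)
    """
    parts = []
    for part in label.strip().split("+"):
        parts.append({normalise(code) for code in part.split("/")})
    return parts


def matched_fraction(gold_label: str, predicted) -> float:
    """Fraction of annotated parts covered by at least one predicted language."""
    parts = parse_annotation(gold_label)
    predicted_codes = {normalise(code) for code in predicted}
    hits = sum(1 for alternatives in parts if alternatives & predicted_codes)
    return hits / len(parts)


print(matched_fraction("es+eu", ["es"]))           # only one of two parts found
print(matched_fraction("ca/en/es/gl/pt", ["en"]))  # any alternative is accepted
print(matched_fraction("und", ["other"]))          # und and other are merged
```

Under these assumptions, predicting only "es" for an "es+eu" tweet covers half of the annotated parts, which matches the 0.5 figure quoted in the example above, while any single language from a slash-separated label counts as a full match.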
- TweetLID_corpusV2: 35K tweets. This release includes all annotated data sets and those used during the shared task, as well as the evaluation script. It includes not only tweet IDs but also the actual tweets. (New: 2015/03/17)
- TweetLID_corpusV1: 35K tweets. This release includes all annotated data sets and those used during the shared task, as well as the scripts below. (2014/10/01)
- Python script for downloading tweets. (Last update: 2014/06/06)
- Evaluation script: written in Perl. (Updated: 2014/06/30)