2016/02/17: Updated version of the TweetLID corpus. Minor correction of some labels.
2015/03/17: New version of the TweetLID corpus is available in the Resources section! The new release includes the tweet messages!
2014/10/01: All TweetLID data-sets are publicly available in the Resources section.
2014/09/08: The proceedings of the workshop are online: http://ceur-ws.org/Vol-1228/
2014/07/11: Results are out!! http://komunitatea.elhuyar.org/tweetlid/participation/#Results
2014/06/30: A bug has been discovered in the evaluation script. Please download the corrected version here.
2014/06/24: We have released an evaluation script. You can download it from the Resources section.
2014/06/04: The training corpus has been released. It is available in the Resources section.
2014/05/28: New registration deadline: 2014/06/06!
Twitter Language Identification
TweetLID is a workshop and shared task on the automatic identification of the language in which tweets are written. It will take place on the 16th of September, 2014, in Girona, co-located with SEPLN 2014. The objective of the task is to bring together researchers interested in the topic, as well as to join forces to experiment with and compare different approaches for identification of tweet languages.
The identification of tweet language is attracting increasing interest in the scientific community (Carter et al., 2013). Identifying the language in which a tweet is written is crucial if we intend to subsequently apply NLP techniques to the tweet, e.g., machine translation, sentiment analysis, or information extraction. Accurately identifying the language will facilitate the application of resources suitable to the language in question.
However, despite the increasing volume of research on the identification of major languages such as English, French, or Spanish, the application of these techniques to other languages with a smaller presence on Twitter has not been studied in detail. The task will focus on the five main languages of the Iberian Peninsula (Spanish, Portuguese, Catalan, Basque, and Galician), in addition to English. These languages are likely to co-occur in tweets about news and events relevant to the Iberian Peninsula, and thus accurate identification of the language is key to making sure that we use the appropriate resources for the linguistic processing.
The workshop aims to be a forum where researchers will have a chance to compare their algorithms, systems, and results. The organizing committee will release an annotated development corpus that will enable participants to train their systems. The final evaluation will be conducted on a separate, unannotated test corpus; participants will have to run their systems on it and submit their results within a short period of time.
Call for Participation
Interested participants need to register for the task and workshop by sending an email to firstname.lastname@example.org on or before May 30th.
Paper submissions will open in July, once the evaluation of the shared task is completed. Submissions must not exceed 4 pages and must be formatted following the SEPLN journal style.
The proceedings of the workshop will be published using the ceur-ws.org repository, and will be indexed by DBLP.
- June 6th: Registration deadline
- June 2nd: Release of the development-set
- July 1st: Release of the test-set
- July 3rd: Result submission deadline
- July 12th: Result publication
- July 25th: Short paper submission deadline
- August 31st: Camera-ready version of papers
- September 16th: Workshop