Twitter Language Identification

TweetLID is a workshop and shared task on the automatic identification of the language in which tweets are written. It will take place on the 16th of September, 2014, in Girona, co-located with SEPLN 2014. The objective of the task is to bring together researchers interested in the topic and to join forces in experimenting with and comparing different approaches to tweet language identification.

The identification of tweet language is attracting increasing interest in the scientific community (Carter et al., 2013). Identifying the language in which a tweet is written is crucial if we intend to subsequently apply NLP techniques to the tweet, e.g., machine translation, sentiment analysis, or information extraction. Accurate language identification makes it possible to apply resources suited to the language in question.

However, despite the increasing volume of research on the identification of major languages such as English, French, or Spanish, the application of these techniques to languages with a smaller presence on Twitter has not been studied in detail. The task will focus on the five top languages of the Iberian Peninsula (Spanish, Portuguese, Catalan, Basque, and Galician), in addition to English. These languages are likely to co-occur in tweets about news and events relevant to the Iberian Peninsula, so accurate identification of the language is key to ensuring that the appropriate resources are used for linguistic processing.

The workshop aims to be a forum where researchers will have a chance to compare their algorithms, systems, and results. The organizing committee will release an annotated development corpus that will enable participants to train their systems. The final evaluation will be conducted on a separate, unannotated corpus; participants will have to run their systems on it and submit their results within a short period of time.

Call for Participation


Interested participants need to register for the task and workshop by sending an email to tweetlid@elhuyar.com on or before May 30th.

Paper submission

Paper submission will open in July, once the evaluation of the shared task is completed. Submissions must not exceed the maximum length of 4 pages and must be formatted following the SEPLN journal styles.

The proceedings of the workshop will be published using the ceur-ws.org repository, and will be indexed by DBLP.

Important dates

  • June 2nd: Release of the development set
  • June 6th: Registration deadline
  • July 1st: Release of the test set
  • July 3rd: Result submission deadline
  • July 12th: Publication of results
  • July 25th: Short paper submission deadline
  • August 31st: Camera-ready version of papers
  • September 16th: Workshop