2019/01/07: The Tweet-Norm_es corpus V3 is the latest version available in the Resources section. This version includes the original tweets for the test datasets, which were not included in V2.
2015/03/17: The Tweet-Norm_es corpus V2 is publicly available for download in the Resources section! This version includes not only the annotations but also the original tweets!
2013/11/19: The proceedings of the workshop are available in the Proceedings section!
2013/11/12: The Tweet-Norm_es corpus is publicly available for download in the Resources section!
2013/09/16: Workshop program is out!
2013/08/20: Paper reviews have been sent to authors. Final versions of the papers are due on August 31st.
2013/07/24: The evaluation results have been published.
2013/07/05: The development corpus has been updated with some corrections and is available in the Resources section. In addition, the evaluation process has been specified.
2013/06/18: The script for downloading tweets has been updated, and an evaluation script has been released. Both are available in the Resources section.
2013/06/10: A collection of 227,255 tweet IDs is available for participants, in case anyone wants to use them for development. The development corpus was extracted from this collection.
2013/06/05: The development corpus has been released. It is available in the Resources section.
Tweet normalization (e.g., https://www.aclweb.org/anthology/P/P11/P11-1038.pdf) is attracting a great deal of interest within the research community, as it is paramount for accurately performing subsequent tasks such as machine translation and sentiment analysis, among others. Even though recent research has attempted to normalize tweets and phone texts in English, little is known about the normalization of messages written in Spanish.
Given the dearth of research in the field, several research groups associated with a number of projects joined forces to organize a workshop/task on LEXICAL NORMALIZATION OF TWEETS IN SPANISH, which will be co-located with the SEPLN 2013 conference to be held in Madrid. This workshop can be considered a follow-up and complement to the TASS task, which took place in 2012 and will run for the second time in 2013: http://www.daedalus.es/TASS .
We believe that this task presents an important research challenge, and that the competition and cooperation among research groups will produce a benchmark. Likewise, it will make it possible to apply novel techniques and algorithms, with the aim of exploring how they can be improved and adapted to the task. Research groups taking part in the competition will have a chance to test their methods, algorithms, and linguistic resources in this novel task.
For the purposes of the first edition of this task, the lexical normalization task will focus on normalizing single words (abbreviations, unnormalized spellings, repeated letters,…), leaving aside other issues such as syntactic variations, style, etc. One of the main challenges will be the detection of out-of-vocabulary words (OOV), unnormalized spellings of unseen words, or named entities.
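To make the scope concrete, the following is a minimal Python sketch of single-word normalization. It is not the task's reference method: the tiny `VOCABULARY` lexicon, the `ABBREVIATIONS` table, and all function names are hypothetical and chosen only for illustration. It flags words missing from the lexicon as OOV, expands known abbreviations, and shortens runs of repeated letters to one or two letters until a lexicon word is found. OOV words that cannot be resolved (for instance named entities) are left untouched.

```python
from itertools import groupby, product

# Hypothetical toy lexicon; a real system would load a full Spanish dictionary.
VOCABULARY = {"hola", "que", "tal", "porque", "llamar", "gracias", "muy"}

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"q": "que", "xq": "porque"}


def is_oov(word):
    """Return True if the lowercased word is not in the lexicon."""
    return word.lower() not in VOCABULARY


def collapse_candidates(word):
    """Generate variants where each run of a repeated letter is cut to 1 or 2.

    Keeping the two-letter option preserves legitimate Spanish double
    letters such as "ll" or "rr".
    """
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    options = [(ch,) if n == 1 else (ch, ch * 2) for ch, n in runs]
    return ["".join(parts) for parts in product(*options)]


def normalize_word(word):
    """Normalize a single word; leave it unchanged when no fix is found."""
    lower = word.lower()
    if not is_oov(lower):
        return word  # in-vocabulary words are not modified
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    for candidate in collapse_candidates(lower):
        if candidate in VOCABULARY:
            return candidate
    return word  # unresolved OOV word, possibly a named entity


def normalize_tweet(text):
    """Normalize a tweet word by word, splitting on whitespace."""
    return " ".join(normalize_word(token) for token in text.split())


print(normalize_tweet("holaaaa xq llamaaaar"))  # prints "hola porque llamar"
```

The candidate generator grows exponentially with the number of repeated-letter runs, which is acceptable for tweet-length words but would need pruning in a larger system.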
The workshop aims to be a forum enabling comparison of algorithms, methods, and results from participants. The size of the training and test corpora to be used will be defined depending on the resources we can afford to use.
Program and Call for Participation
Tweet Normalization Workshop at SEPLN 2013
20 September, 2013
Call for participation
TWEET-NORM 2013, which will be held as part of the 29th edition of the Annual Conference
of the Spanish Society for Natural Language Processing (SEPLN2013) in Madrid
(Spain), invites everyone interested in systems, methods and algorithms for lexical normalization of tweets
in Spanish to attend the workshop on the 20th of September.
Registration is free, and can be done by email to “tweet-norm at elhuyar dot com” or by filling in the form on this website.
Although it is not mandatory, we encourage you to attend the SEPLN conference.
Here is the workshop program. The workshop is held together with the TASS workshop on polarity detection in tweets:
Friday, 20th of September:
- 14:30-15:30: Tweet-norm: Overview and 3 oral presentations
- 15:30-16:30: TASS-2013: Overview and 3 oral presentations
- 16:30-17:00: Discussion and future workshops
- 17:00-18:00: Poster session
We hope to see you in Madrid!
Call For Papers
==========================================================================
TWEET-NORM 2013
Tweet Normalization Workshop at SEPLN 2013
Madrid, Spain
15-20 September, 2013
http://komunitatea.elhuyar.org/tweet-norm/
==========================================================================
Call for papers
==========================================================================

TWEET-NORM 2013, which will be held as part of the 29th edition of the Annual Conference of the Spanish Society for Natural Language Processing (SEPLN2013) in Madrid (Spain), invites researchers to submit articles or unpublished recent studies relating to systems, methods and algorithms for lexical normalization of tweets in Spanish and to participate in the proposed shared task.

Introduction
------------

One of the most important challenges facing us today is how to process and analyze the large amount of information on the Internet, especially on social networking sites like Twitter, where millions of people express ideas and opinions daily on any topic of interest. These texts, called tweets, are short (at most 140 characters), much shorter than texts in traditional genres. Consequently, users of these networks have developed a new form of expression that includes SMS-style abbreviations, lexical variants, letter repetitions, use of emoticons, etc. As a result, current NLP tools can have problems processing and understanding these short and noisy texts unless they are normalized first.

The TWEET-NORM lexical normalization task proposes the automatic "cleansing" of a set of tweets by identifying and normalizing abbreviations, words with repeated letters, and generally any out-of-vocabulary (OOV) words, regardless of syntactic or stylistic variants. While there has been some progress in this field for English tweets, there are very few studies and resources available to date for Spanish.

Thus, the aim of the workshop is to provide a forum for discussion and communication where researchers can test approaches, algorithms and resources in order to promote the application of techniques and algorithms in this area. To this end, we propose a shared task in which participants will have to normalize a set of tweets. An annotated corpus will be provided to the participants in order to develop and test the proposed solutions.

Corpus
------

The corpus is composed of tweets gathered between the 1st and 2nd of April 2013 covering the geographic area of the Iberian peninsula, but ignoring those regions that have co-official languages. A large portion of these messages contain serious normalization problems. From this initial corpus two subsets are generated: a development set consisting of 500 tweets, and a test set consisting of 2000 tweets. The corpora will be available on the web page of the workshop at http://komunitatea.elhuyar.org/tweet-norm/resources/

Registration
------------

Participants are required to register for the task in order to obtain the corpus by sending an email before May 31 to firstname.lastname@example.org

Submitting articles
-------------------

Submitted papers will have a maximum length of 4 pages, should follow the format established by the SEPLN (http://nil.fdi.ucm.es/sepln2013/callen.html) and will be submitted via the web.

Important Dates
---------------

May 30: Registration deadline for participants and publication of the development set.
July 5: Publication of the test set.
July 15: Result submission deadline.
July 25: Publication of results.
July 31: Article submission deadline.
September 20: Workshop at SEPLN 2013 in Madrid.