A corpus has been generated specifically for the Tweet-norm workshop. An initial collection of tweets posted on April 1st and 2nd, 2013 has been created, including tweets from the Iberian Peninsula, but only from those regions with no official language other than Spanish. Most of the tweets in this collection have serious normalization problems.
Two datasets have been generated from the initial corpus: a development set of 500 tweets and a test set of 600 tweets.
Due to restrictions in the Twitter API Terms of Service, it is forbidden to redistribute a corpus that includes text content or information about users. However, redistribution is permitted if those fields are removed and IDs (tweet IDs and user IDs) are provided instead. The actual message content can be easily obtained by querying the Twitter API with the tweet ID. The script for downloading tweets, available in the Downloads section, provides this functionality.
- Once the participation period comes to an end, we will check which tweets are still publicly available at that moment, so that we can generate the final subset of tweets to be used as the reference for evaluation. The reference subset will differ slightly from the initial set, since some tweets become unavailable for various reasons.
- The corpora have been annotated with the Brat tool (http://brat.nlplab.org). Each tweet has been annotated following these guidelines:
- 0-variation: The correct standard form is included as a note
Example tweet: Con lo felizzz que estaba yo... y Donosti me ha recibido lloviendo. Ahhhhhhhh.
Annotations: felizzz 0 feliz; Donosti 1; Ahhhhhhhh 2
- Correct word corresponding to a named entity (e.g., Zaragoza) or a loanword (e.g., twitter): mark as correct
- Word with emphatic or dialectal variation, a misspelling, or a missing or misused acute accent: mark as variation and provide the standard spelling (muuuuuuucho -> mucho, kasa -> casa, cafe -> café)
- words written as one with no space separation: mark as variation and provide the standard spelling.
- a single word split into smaller strings: mark all as variation and provide the standard spelling.
- unintelligible or foreign word, or others, e.g., XD: mark as NoES
Note that only out-of-vocabulary (OOV) words will be considered, and real-word errors will be ignored (e.g., a word that should be spelled with an acute accent but also exists without it).
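To make the annotation layout concrete, the following minimal Python sketch parses entries shaped like those in the example above ("felizzz 0 feliz", "Donosti 1", "Ahhhhhhhh 2"). It assumes one annotation per line with whitespace-separated fields, where the first field is the word, the second is a numeric code, and anything after that is the standard form. That layout is inferred from the example and may not match the distributed files. The names `Annotation` and `parse_annotation` are hypothetical, and because only code 0 (variation) is defined in the guidelines here, the other codes are kept as plain integers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    word: str                # OOV token as it appears in the tweet
    code: int                # annotation code (0 = variation, per the guidelines)
    standard: Optional[str]  # standard form, present only when one is given


def parse_annotation(line: str) -> Annotation:
    """Parse one 'word code [standard form]' line (assumed layout)."""
    # maxsplit=2 keeps a multi-word standard form (e.g. joined words) in one piece.
    parts = line.split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"expected 'word code [standard]', got: {line!r}")
    word, code = parts[0], int(parts[1])
    standard = parts[2].strip() if len(parts) == 3 else None
    return Annotation(word, code, standard)


# The three annotations from the example tweet above.
example = ["felizzz 0 feliz", "Donosti 1", "Ahhhhhhhh 2"]
annotations = [parse_annotation(line) for line in example]

for ann in annotations:
    print(ann.word, ann.code, ann.standard)
```

This sketch treats the word itself as a single token, so a single word split into several strings would need a different layout than the one assumed here.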
- tweet-norm_es V2: It contains all the sets of annotated tweets used in the Tweet-Norm 2013 shared task, and also the actual Twitter messages. (New: 2015/03/17).
- tweet-norm_es V1: It contains all the sets of annotated tweets used in the Tweet-Norm 2013 shared task. (2013/11/12).
- Annotation manual: Guidelines used by the annotators. Includes instructions for pre-processing the tweets.
- Annotated sample. Includes instructions for downloading the tweets. (Last update: 2013/07/08).
- Development corpus: 500 tweets. (Last updated with corrections: 2013/07/08).
- 227,255 TweetId collection. Initial collection of tweets gathered on April 1st and 2nd, 2013, including tweets from the Iberian Peninsula, but only from those regions with no official language other than Spanish (test corpus tweet IDs are not included). (2013/06/10)
- Script for downloading tweets. (Last update: 2013/06/18).
- Evaluation script: It currently computes the accuracy of the corrections with respect to a given reference. (Last update: 2013/07/04).
- Test corpus: 564 tweets. (2013/07/24).
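
As an illustration of the evaluation described above, the following Python sketch computes accuracy as the fraction of reference corrections that a system reproduces, after restricting the reference to tweets that are still available. This is not the official evaluation script: exact string matching, keying entries by (tweet ID, word), and the function names `restrict_to_available` and `accuracy` are all assumptions made for the example, and the tweet IDs are made up.

```python
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (tweet ID, OOV word); a hypothetical keying scheme


def restrict_to_available(reference: Dict[Key, str],
                          available_ids: Set[str]) -> Dict[Key, str]:
    """Keep only reference entries whose tweet is still publicly available."""
    return {key: gold for key, gold in reference.items() if key[0] in available_ids}


def accuracy(reference: Dict[Key, str], proposed: Dict[Key, str]) -> float:
    """Fraction of reference entries whose proposed correction matches exactly."""
    if not reference:
        return 0.0
    correct = sum(1 for key, gold in reference.items() if proposed.get(key) == gold)
    return correct / len(reference)


# Made-up tweet IDs; the words come from the annotation guidelines.
reference = {
    ("1001", "muuuuuuucho"): "mucho",
    ("1001", "kasa"): "casa",
    ("1002", "cafe"): "café",
}
proposed = {
    ("1001", "muuuuuuucho"): "mucho",
    ("1001", "kasa"): "kasa",  # left uncorrected, so counted as wrong
    ("1002", "cafe"): "café",
}

# Suppose tweet 1002 is no longer available when the reference is finalized.
final_reference = restrict_to_available(reference, available_ids={"1001"})
print(accuracy(final_reference, proposed))
```

Words missing from the system output count as errors in this sketch, since `proposed.get(key)` returns `None` for them.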