Resources

Corpus

A corpus has been built specifically for the TweetMT Workshop. An initial collection of tweets was gathered by crawling a set of accounts previously identified as posting parallel tweets in several languages.

Two datasets have been generated from the initial corpus: a development set of 4,000 parallel tweets for each language pair and a test set of 3,000 parallel tweets for each language pair.

Due to the corpus-building strategy, it was only possible to automatically gather corpora for Basque-Spanish (Eu-Es) and Catalan-Spanish (Ca-Es). Thus, development corpora are provided only for those language pairs.

For the Spanish-Galician (Es-Gl) and Spanish-Portuguese (Es-Pt) language pairs, test sets were generated manually by crowdsourcing translations of test corpora in other languages through the CrowdFlower platform (acquired by Appen in 2019).

Alignment

  • For Basque-Spanish and Catalan-Spanish language pairs:
      • The development corpus was automatically aligned.
      • The test corpus was automatically aligned and manually corrected by native speakers of the respective languages.

Downloads

  • TweetMT corpus v2: dev and test datasets. Unlike v1, this version includes the gold standards of the test sets used in the shared task. (New: 2021/09/08)

Free resources for MT