Resources

Corpus

A corpus has been generated for the specific purposes of the TweetMT Workshop. An initial collection of tweets was gathered by crawling a series of accounts that were previously identified as entities that post multilingual and parallel tweets in several languages.

Two data-sets have been generated from the initial corpus: one development-set composed of 4,000 parallel tweets for each language pair and one test-set composed of 3,000 parallel tweets for each language pair.
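As a rough illustration only, the sketch below shows one way to carve a disjoint development-set and test-set of the sizes given above out of a list of aligned tweet pairs. The function name `split_parallel_corpus`, the random shuffling, and the `(source, target)` tuple representation are all assumptions made for this example; the selection procedure actually used for the TweetMT corpus is not described here.

```python
import random


def split_parallel_corpus(pairs, dev_size=4000, test_size=3000, seed=0):
    """Return (dev, test) as disjoint lists of (source, target) tweet pairs.

    Hypothetical helper: random selection is an assumption of this sketch,
    not the documented TweetMT procedure.
    """
    if len(pairs) < dev_size + test_size:
        raise ValueError("not enough parallel tweets for the requested split")
    shuffled = list(pairs)                     # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    return dev, test


if __name__ == "__main__":
    # Synthetic placeholder pairs; real data would be aligned tweets.
    toy_pairs = [(f"source tweet {i}", f"target tweet {i}") for i in range(7500)]
    dev, test = split_parallel_corpus(toy_pairs)
    print(len(dev), len(test))
```

Slicing a single shuffled copy keeps the two sets disjoint, so no tweet pair seen during development also appears in the test-set.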

Due to the corpus building strategy, it was only possible to automatically gather corpora for Basque-Spanish (Eu-Es) and Catalan-Spanish (Ca-Es). Thus, development corpora are provided only for those language pairs.

For the Spanish-Galician (Es-Gl) and Spanish-Portuguese (Es-Pt) language pairs, test-sets were generated manually by crowdsourcing translations of test corpora in other languages on the CrowdFlower platform.

Alignment

  • For Basque-Spanish and Catalan-Spanish language pairs:
      • The development corpus was automatically aligned.
      • The test corpus was automatically aligned and manually corrected by native speakers of the respective languages.

Downloads

  • TweetMT corpus v1: dev and test datasets used in the shared task. (New: 2015/11/11)
  • Test corpus: test datasets for es-eu, es-ca, es-gl and es-pt language pairs. (2015/05/26)

Free resources for MT