A corpus was generated specifically for the TweetMT Workshop. An initial collection of tweets was gathered by crawling a set of accounts previously identified as posting parallel tweets in several languages.
Two datasets were generated from the initial corpus: a development set of 4,000 parallel tweets for each language pair and a test set of 3,000 parallel tweets for each language pair.
Due to the corpus-building strategy, it was only possible to automatically gather corpora for Basque-Spanish (Eu-Es) and Catalan-Spanish (Ca-Es). Thus, development corpora are provided only for those language pairs.
For the Spanish-Galician (Es-Gl) and Spanish-Portuguese (Es-Pt) language pairs, test sets were created manually by crowdsourcing the translation of the test corpora in other languages through the CrowdFlower platform.
- For Basque-Spanish and Catalan-Spanish language pairs:
- The development corpus was automatically aligned.
- The test corpus was automatically aligned and manually corrected by native speakers of the respective languages.
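As a rough illustration of working with an aligned corpus like the ones described above, the following sketch loads tweet pairs from two files. The on-disk layout is an assumption made for this example only: two UTF-8 plain-text files with one tweet per line, where line N of one file is the translation of line N of the other. The function name `read_parallel` and the file names in the usage note are hypothetical, and the actual distribution format may differ.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Return a list of (source, target) tweet pairs.

    Assumes (hypothetically) that both files are line-aligned:
    line N of src_path corresponds to line N of tgt_path.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    # A length mismatch means the alignment assumption does not hold.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"files are not line-aligned: {len(src_lines)} vs {len(tgt_lines)} lines"
        )
    return list(zip(src_lines, tgt_lines))
```

With hypothetical files `dev.es` and `dev.eu`, calling `read_parallel("dev.es", "dev.eu")` would return one tuple per tweet pair, and the length check gives an early warning if the two sides have drifted out of alignment.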
- TweetMT corpus v1: dev and test datasets used in the shared task. (New: 2015/11/11)
- Test corpus: test datasets for es-eu, es-ca, es-gl and es-pt language pairs. (2015/05/26)
- Development corpus: 4K tweet pairs for es-eu and es-ca language pairs. (2015/04/23)
Free resources for MT
- The open parallel corpus (Including Europarl)
- API to Apertium RBMT systems
- Monolingual wikipedia dumps
- CA-ES: RBMT-Apertium(es-ca), Monolingual-Corpus-CaWaC(ca)
- EU-ES: TMX1, TMX2, RBMT-Matxin-source-code(es-eu), RBMT-Apertium-source-code(eu-es), RBMT-Matxin-API(es-eu)
- GL-ES: RBMT-Apertium(es-gl)
- PT-ES: TMX, RBMT-Apertium(es-pt)