Resources | TweetMT

Corpus

Page Contents

1 Corpus
- 1.1 Alignment
2 Downloads
3 Free resources for MT

A corpus has been generated for the specific purposes of the TweetMT Workshop. An initial collection of tweets was gathered by crawling a series of accounts previously identified as posting parallel tweets in several languages.

Two data-sets have been generated from the initial corpus: one development-set composed of 4,000 parallel tweets for each language pair and one test-set composed of 3,000 parallel tweets for each language pair.

Due to the corpus-building strategy, it was only possible to gather corpora automatically for Basque-Spanish (Eu-Es) and Catalan-Spanish (Ca-Es). Thus, development corpora are provided only for those language pairs.

For the Spanish-Galician (Es-Gl) and Spanish-Portuguese (Es-Pt) language pairs, test sets were generated manually by crowdsourcing translations of test corpora from other languages on the CrowdFlower platform (acquired by Appen in 2019).

Alignment

For Basque-Spanish and Catalan-Spanish language pairs:

Downloads

TweetMT corpus v2: dev and test datasets. Unlike v1, this version includes the gold standards of the test sets used in the shared task. (New: 2021/09/08)

TweetMT corpus v1: dev and test datasets used in the shared task.

Test corpus: test datasets for es-eu, es-ca, es-gl and es-pt language pairs. (2015/05/26)

Development corpus: 4K tweet pairs for es-eu and es-ca language pairs. (2015/04/23)

Free resources for MT

The open parallel corpus (including Europarl)
API to Apertium RBMT systems
Monolingual Wikipedia dumps
Dictionaries
CA-ES: RBMT-Apertium(es-ca), Monolingual-Corpus-CaWaC(ca)
EU-ES: TMX1, TMX2, RBMT-Matxin-source-code(es-eu), RBMT-Apertium-source-code(eu-es), RBMT-Matxin-API(es-eu)
GL-ES: RBMT-Apertium(es-gl)
PT-ES: TMX, RBMT-Apertium(es-pt)
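
Several of the resources above are distributed as TMX (Translation Memory eXchange) files. As a rough illustration of how such a file might be turned into sentence pairs for MT training, the following Python sketch reads translation units with the standard library only. The function name read_tmx_pairs and the embedded sample are hypothetical and not part of the TweetMT distribution; the sketch also assumes that each translation unit carries one segment per language and that language codes can be compared on their primary subtag (so "es-ES" is treated as "es").

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample (hypothetical data, simplified header).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="eu"><seg>Kaixo mundua</seg></tuv>
      <tuv xml:lang="es-ES"><seg>Hola mundo</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="eu"><seg>Egun on</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) tuples from a TMX file.

    `source` is a file path or a binary file object. Translation units
    that lack either language are skipped.
    """
    root = ET.parse(source).getroot()
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; some older files use a plain "lang".
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            primary = lang.lower().split("-")[0]
            seg = tuv.find("seg")
            if primary and seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[primary] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


if __name__ == "__main__":
    sample = io.BytesIO(SAMPLE_TMX.encode("utf-8"))
    for eu, es in read_tmx_pairs(sample, "eu", "es"):
        print(eu, "|||", es)
```

The second translation unit in the sample has no Spanish segment, so it is dropped rather than emitted as a half-empty pair. Real TMX files may need further handling (several segments per language, inline tags, very large files) that this sketch does not attempt.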