MulTed Corpus

Full Official Name: A multilingual aligned and tagged parallel corpus
Submission date: Feb. 26, 2018, 2:24 p.m.

MulTed is a multilingual, aligned, and tagged parallel corpus: it is multilingual and part-of-speech (PoS) tagged, but its sentence alignment is bilingual, with English as the pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size are influential, such as statistical machine translation, language recognition, and bilingual dictionary generation. It is a collection of subtitles extracted from TEDx talks. It currently covers 1100 talks, with subtitles available in over 30 languages. The subtitles are also classified by topic, such as Business, Education, and Sport. PoS tagging is performed with TreeTagger, a language-independent PoS tagger. To make the PoS tagging maximally useful, the language-specific tags are then mapped to a common universal tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP, information retrieval, and corpus linguistics, especially for under-resourced languages.
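As an illustration of the tagset-mapping step, the following minimal Python sketch converts language-specific tags to coarse universal tags. It is not the corpus's actual mapping: the table `TAG_MAP`, the function `to_universal`, and the fallback tag `X` are hypothetical names chosen for this example, and the handful of Penn-style English tags is only an assumed excerpt of what a full per-language table would contain.

```python
# Hypothetical excerpt of a tag-mapping table (assumption, not the MulTed table):
# Penn-style English tags on the left, coarse universal tags on the right.
TAG_MAP = {
    "DT": "DET",
    "NN": "NOUN",
    "NNS": "NOUN",
    "JJ": "ADJ",
    "RB": "ADV",
    "IN": "ADP",
    "VBP": "VERB",
    "VBZ": "VERB",
    "VBD": "VERB",
}


def to_universal(tagged_tokens, tag_map=TAG_MAP, unknown="X"):
    """Map (token, language_specific_tag) pairs to (token, universal_tag) pairs.

    Tags missing from the table fall back to `unknown` so that gaps in the
    mapping stay visible instead of being silently dropped.
    """
    return [(token, tag_map.get(tag, unknown)) for token, tag in tagged_tokens]


# Example input: one tagged English subtitle fragment.
sentence = [("The", "DT"), ("talks", "NNS"), ("inspire", "VBP"), ("people", "NNS")]
mapped = to_universal(sentence)
print(mapped)
```

With one such table per language, the same function can be applied to both sides of an English-pivot sentence pair, so that tags in different languages become directly comparable.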

Creator(s)
Distributor(s)
Right Holder(s)