Resource: MulTed Corpus

Reference A multilingual aligned and tagged parallel corpus
Date of Submission Feb. 26, 2018, 2:24 p.m.
Status accepted
ISLRN 367-302-230-252-1
Resource Type Primary Text
Media Type Text
Source
Language Arabic, Bulgarian, Chinese, Croatian, Czech, Dutch, English, French, Greek, Modern (1453-), Indonesian, Italian, Japanese, Korean, Romanian, Russian, Serbian, Slovak, Thai, Turkish, Vietnamese
Format/MIME Type text/xml
Size 46+ million tokens
Description

MulTed is a multilingual, aligned, and tagged parallel corpus: it is multilingual and part-of-speech (PoS) tagged, while the sentence alignment is bilingual, with English as the pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size matter, such as statistical machine translation, language recognition, and bilingual dictionary generation. It is a collection of subtitles extracted from TEDx talks. It currently contains subtitles covering 1100 talks, available in over 30 languages. The subtitles are also classified by topic, such as Business, Education, and Sport. For PoS tagging, TreeTagger, a language-independent PoS tagger, is used. Moreover, to make the PoS tagging maximally useful, the tags are mapped to a common universal tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP, information retrieval, and corpus linguistics, especially for under-resourced languages.
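
Because the alignment is bilingual with English as the pivot, a pair of non-English languages can in principle be aligned indirectly by joining on the shared English sentence. The following is a minimal sketch of that idea, not the corpus's own tooling: the function name `pivot_align`, the dictionary layout, and the toy sentences are all hypothetical, and the actual XML layout of the corpus is not assumed here.

```python
def pivot_align(en_to_a, en_to_b):
    """Pair sentences of languages A and B that share the same English sentence.

    Both arguments are hypothetical dicts mapping an English pivot sentence
    to its aligned translation in one other language.
    """
    return [(a, en_to_b[en]) for en, a in en_to_a.items() if en in en_to_b]


# Toy data for illustration only (not taken from the corpus).
en_fr = {"Thank you.": "Merci.", "Good morning.": "Bonjour."}
en_it = {"Thank you.": "Grazie.", "See you soon.": "A presto."}

# Only English sentences present on both sides yield a French-Italian pair.
pairs = pivot_align(en_fr, en_it)
```

Sentences whose English counterpart appears on only one side are dropped, so the indirectly aligned set can be smaller than either bilingual set.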

Version First
Creator Imad Zeroual - Mohamed First University
Distributor Imad Zeroual - Mohamed First University
Rights Holder Imad Zeroual - Mohamed First University