ISLRN

MulTed Corpus

Full Official Name: A multilingual aligned and tagged parallel corpus

Submission date: Feb. 26, 2018, 2:24 p.m.

MulTed is a multilingual, aligned, and tagged parallel corpus: it covers multiple languages and is Part-of-Speech (PoS) tagged, but its sentence alignment is bilingual, with English as the pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size matter, such as statistical machine translation, language recognition, and bilingual dictionary generation. It is a collection of subtitles extracted from TEDx talks and currently covers 1100 talks available in over 30 languages. The subtitles are also classified by topic, such as Business, Education, and Sport. PoS tagging is performed with TreeTagger, a language-independent PoS tagger. To make the PoS tags maximally useful, they are then mapped to a common universal tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP, Information Retrieval, and Corpus Linguistics, especially for under-resourced languages.
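
As a rough illustration of the two ideas in the description, the sketch below shows how sentences in two non-English languages might be paired through a shared English pivot, and how language-specific PoS tags might be mapped to a universal tagset. It is only a sketch under stated assumptions: the corpus file format is not described here, so the data is held in plain Python lists, and the names `pivot_align`, `to_universal`, and `TAG_MAP`, the sample sentences, and the tag table are all hypothetical and are not taken from the MulTed distribution.

```python
from collections import defaultdict

# Hypothetical fragment of a tag mapping (Penn-style tags to universal tags).
# The mapping tables actually used for the corpus are not reproduced here.
TAG_MAP = {
    "NN": "NOUN",
    "NNS": "NOUN",
    "VBZ": "VERB",
    "JJ": "ADJ",
    "DT": "DET",
}


def to_universal(tagged, tag_map=TAG_MAP):
    """Map (token, tag) pairs to universal tags; unknown tags become 'X'."""
    return [(token, tag_map.get(tag, "X")) for token, tag in tagged]


def pivot_align(en_to_a, en_to_b):
    """Pair sentences of languages A and B that share an English sentence.

    Both arguments are lists of (english_sentence, translation) pairs,
    i.e. two bilingual alignments that use English as the pivot.
    """
    by_english = defaultdict(list)
    for english, sentence_a in en_to_a:
        by_english[english].append(sentence_a)

    pairs = []
    for english, sentence_b in en_to_b:
        # Only English sentences present in both alignments yield a pair.
        for sentence_a in by_english.get(english, []):
            pairs.append((sentence_a, sentence_b))
    return pairs


# Made-up sample data, for illustration only.
en_fr = [("Thank you.", "Merci."), ("Good morning.", "Bonjour.")]
en_it = [("Thank you.", "Grazie."), ("See you soon.", "A presto.")]

fr_it = pivot_align(en_fr, en_it)
universal = to_universal([("the", "DT"), ("talks", "NNS"), ("oh", "UH")])
```

Matching on the exact English string is the simplest possible pivot key; with real data, a talk identifier plus a sentence index would probably be a more robust key, assuming such identifiers are available.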

Creator(s)

Mohamed First University - Imad Zeroual

Distributor(s)

Mohamed First University - Imad Zeroual

Right Holder(s)

Mohamed First University - Imad Zeroual

Status : Accepted

ISLRN :

367-302-230-252-1

Version

First

Source

http://oujda-nlp-team.net/en/corpora/multed-corpus/

Resource Type

Primary Text

Media Type

Text

Language(s)

Arabic

Bulgarian

Chinese

Croatian

Czech

Dutch

English

French

Indonesian

Italian

Japanese

Korean

Modern

Romanian

Russian

Serbian

Slovak

Thai

Turkish

Vietnamese

Access Medium