Resource: NUM 5M Mongolian written corpus

Reference NUM 5M Mongolian written corpus
Date of Submission July 12, 2017, 11:06 a.m.
Status accepted
ISLRN 492-817-146-504-9
Resource Type Primary Text
Media Type Text
Source
Language Mongolian
Format/MIME Type Plain text
Access Medium Downloadable
Description

This is a corpus of written Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws.

The collected raw text was reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:
- laws: 144 texts;
- literature: 278 stories, 8 novelettes, and 4 novels;
- newspapers: 597 news items, 505 interviews, 302 reports, 578 essays, 469 stories, and 1,258 editorials.
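
The per-category counts listed above can be tallied by domain to see how the corpus is distributed. The following is a minimal sketch in Python: the figures are copied from the list, while the dictionary layout and the names `counts`, `per_domain`, and `total` are illustrative assumptions, not part of the resource itself. It counts texts, not words, since per-category word counts are not given here.

```python
# Text counts copied from the list above.
# The grouping keys ("laws", "literature", "newspaper") are illustrative labels.
counts = {
    "laws": {"texts": 144},
    "literature": {"stories": 278, "novelettes": 8, "novels": 4},
    "newspaper": {
        "news": 597,
        "interviews": 505,
        "reports": 302,
        "essays": 578,
        "stories": 469,
        "editorials": 1258,
    },
}

# Number of texts per domain, and across the whole cleaned corpus.
per_domain = {domain: sum(kinds.values()) for domain, kinds in counts.items()}
total = sum(per_domain.values())

for domain, n in per_domain.items():
    print(f"{domain}: {n} texts ({n / total:.1%} of all texts)")
print(f"total: {total} texts")
```

By this count the cleaned corpus holds 4,143 texts, most of them newspaper material.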

Part of this corpus, about 2,800 sentences totaling 100,000 words, has been manually POS-tagged and stored in TEI format.

Version 1.0
Distributor ELRA