ISLRN

ROMBAC - Romanian balanced corpus

Full Official Name: ROMBAC - Romanian balanced corpus

Submission date: Jan. 19, 2016, 5:09 p.m.

ROMBAC is a Romanian corpus containing equal shares of text from 5 genres: journalism, legalese, fiction, medicine, and biographical data on Romanian literary personalities. For each genre, texts totalling around 7,000,000 words were selected, so that the entire corpus counts around 41,000,000 words, including punctuation. The corpus is annotated at the paragraph, sentence, constituent group, and word levels. It provides morpho-syntactic descriptions (MSDs), assigned automatically with the high-accuracy TTL tagger (at least 98% accuracy), which implements the tiered tagging methodology. About 20% of the MSDs have been manually checked, validated and, where necessary, corrected. The MSDs follow the Multext-East specifications. For Romanian there are 614 different MSDs; the tagset has been slightly modified by adding new tags for named entities. The corpus is XML-encoded.
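As a rough illustration of how word-level MSD annotation in an XML-encoded corpus might be read, the sketch below counts part-of-speech categories in a small sample. The element and attribute names (`p`, `s`, `w`, `c`, `lemma`, `ana`) and the sample sentence are assumptions made for this example, not the documented ROMBAC schema. The only property relied on is that the first character of a Multext-East MSD encodes the part-of-speech category (for example `N` for noun, `V` for verb, `A` for adjective).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: paragraph > sentence > word, with the MSD stored in
# an assumed "ana" attribute. The real ROMBAC markup may differ.
SAMPLE = """<text>
  <p>
    <s>
      <w lemma="carte" ana="Ncfsry">cartea</w>
      <w lemma="fi" ana="Vmip3s">este</w>
      <w lemma="nou" ana="Afpfsrn">nouă</w>
      <c>.</c>
    </s>
  </p>
</text>"""


def msd_category_counts(xml_text):
    """Count words per part-of-speech category (first character of the MSD)."""
    root = ET.fromstring(xml_text)
    return Counter(
        w.get("ana")[0]          # first MSD character = category
        for w in root.iter("w")  # word elements only; punctuation is skipped
        if w.get("ana")          # ignore words without an MSD
    )


counts = msd_category_counts(SAMPLE)
print(counts)
```

For the full corpus, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be preferable to loading a whole file into memory.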

Creator(s)

Distributor(s)

ELRA

Right Holder(s)

Status : Accepted

ISLRN :

162-192-982-061-0

Version

1.0

Source

http://catalog.elra.info/product_info.php?products_id=1253

Resource Type

Primary Text

Media Type

Text

Language(s)

Romanian

Access Medium

DVD