Full Official Name: The Serbian Paraphrase Corpus
Submission date: Sept. 13, 2017, 11 a.m.

The Serbian Paraphrase Corpus consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary score that indicates whether the sentences are semantically similar enough to be considered close paraphrases. The sentences are written in the Serbian Latin script.

Corpus construction
The corpus was created from news stories found through a Serbian news aggregator, in the manner described in the reference papers. News stories from 2010 and the first seven months of 2011 were used in this process.

Corpus annotation
The entire corpus was annotated by a single annotator. To increase annotation consistency, a set of scoring criteria was established. These criteria are described in the Decision Support Systems paper listed under Reference papers. A second annotator later scored 30% of the corpus in order to measure inter-annotator agreement. The agreement measured on this portion of the corpus is 78.27%.

Corpus statistics
The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number) and 641 semantically diverse pairs (53.69% of the total number). The given training set has 386 semantically equivalent pairs (46.23%) and 449 semantically diverse pairs (53.77%). The given test set has 167 semantically equivalent pairs (46.52%) and 192 semantically diverse pairs (53.48%).

Reference papers
* A software system for determining the semantic similarity of short texts in Serbian, Vuk Batanović, Bojan Furlan, Boško Nikolić, in Proceedings of the 19th Telecommunications Forum (TELFOR 2011), pp. 1249-1252, Belgrade, Serbia (2011). (Paper in Serbian)
* Semantic similarity of short texts in languages with a deficient natural language processing support, Bojan Furlan, Vuk Batanović, Boško Nikolić, Decision Support Systems, vol. 55, no. 3, pp. 710-719 (2013).
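The class shares reported in the corpus statistics can be reproduced from the raw pair counts. The following is a minimal sketch in Python; the names `COUNTS`, `share`, and `summarize` are illustrative assumptions and are not part of the corpus distribution, and only the counts themselves come from the description above.

```python
# Pair counts as reported in the corpus description.
# The dictionary layout and all names here are hypothetical.
COUNTS = {
    "train": {"equivalent": 386, "diverse": 449},
    "test": {"equivalent": 167, "diverse": 192},
}


def share(part, total):
    """Return part as a percentage of total, rounded to two decimals."""
    return round(100.0 * part / total, 2)


def summarize(counts):
    """Compute per-split and overall sizes and class percentages."""
    summary = {}
    total_eq = total_div = 0
    for split, c in counts.items():
        n = c["equivalent"] + c["diverse"]
        summary[split] = {
            "pairs": n,
            "equivalent_pct": share(c["equivalent"], n),
            "diverse_pct": share(c["diverse"], n),
        }
        total_eq += c["equivalent"]
        total_div += c["diverse"]
    n_total = total_eq + total_div
    summary["total"] = {
        "pairs": n_total,
        "equivalent_pct": share(total_eq, n_total),
        "diverse_pct": share(total_div, n_total),
    }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(COUNTS).items():
        print(name, stats)
```

The training and test counts sum to the 1194 pairs stated for the whole corpus, and the rounded percentages agree with the figures given in the description.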

Right Holder(s)