ISLRN

A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Full Official Name: A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Submission date: Feb. 10, 2017, 5:28 p.m.

This dataset is made available for the evaluation of cross-lingual similarity detection algorithms. Its characteristics are the following:
• it is multilingual: French, English and Spanish;
• it provides cross-language alignment information at different granularities: document level, sentence level and chunk level;
• it is based on both parallel and comparable corpora;
• it contains both human-translated and machine-translated text;
• part of it has been altered (to make cross-language similarity detection more difficult), while the rest remains without noise;
• documents were written by multiple types of authors, from average writers to professionals.

Creator(s)

Distributor(s)

Right Holder(s)

Status : Accepted

ISLRN :

723-785-513-738-2

Version

1.0

Source

https://github.com/FerreroJeremy/Cross-Language-Dataset

Resource Type

Primary Text

Media Type

Text

Language(s)

English

French

Spanish

Access Medium