A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection

Full Official Name: A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection
Submission date: Feb. 10, 2017, 5:28 p.m.

This dataset is made available for the evaluation of cross-lingual similarity detection algorithms. Its characteristics are the following:
• it is multilingual: French, English and Spanish;
• it provides cross-language alignment information at different granularities: document level, sentence level and chunk level;
• it is based on both parallel and comparable corpora;
• it contains both human-translated and machine-translated text;
• part of it has been altered (to make cross-language similarity detection more difficult), while the rest remains without noise;
• its documents were written by multiple types of authors, from average writers to professionals.

Creator(s)
Distributor(s)
Right Holder(s)