Multilingual summary evaluation data

Full Official Name: Multilingual summary evaluation data
Submission date: Oct. 20, 2014, 12:02 p.m.

This is a manually annotated collection of document clusters of parallel texts in seven languages (Arabic, Czech, English, French, German, Russian and Spanish) that can be used to evaluate multi-document, or even single-document, summarisation software. The accompanying publication by M. Turchi, J. Steinberger, M. Kabadjov and R. Steinberger (2010): 'Using parallel corpora for multilingual (multi-document) Summarisation Evaluation' (Proceedings of CLEF'2010, Springer LNCS series) suggests that annotation time can be saved by projecting the monolingual sentence selection annotation across languages, using the sentence alignment information in this parallel corpus. The paper also proposes several ways to exploit the varying degree of overlap among the manual annotations produced by four different annotators. The downloadable zip file contains the full text of all documents in seven languages, sentence-split full texts, sentence alignment information for all language pairs involving English, and the annotations of the English documents. Important background information about the XML structure of the files can be found in the Readme file. The four document clusters each consist of five selected high-level commentaries on topics that can roughly be described as malaria, the Israel-Palestine conflict, genetics, and science and society.
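
The projection idea described above can be illustrated with a minimal Python sketch. It does not read the files in the zip archive, whose XML structure is documented only in the Readme. Instead it assumes the annotations and alignments have already been loaded into plain Python structures. The function names, the toy sentence ids and the `alignment_en_fr` mapping are all hypothetical, and the vote threshold is just one possible way of using annotator overlap, not necessarily the one used in the paper.

```python
from collections import Counter


def sentences_by_agreement(annotations, min_votes):
    """Return the sentence ids selected by at least `min_votes` annotators.

    `annotations` is a list with one entry per annotator, each entry being
    the list of English sentence ids that annotator selected (assumed layout).
    """
    votes = Counter(sid for selected in annotations for sid in set(selected))
    return sorted(sid for sid, count in votes.items() if count >= min_votes)


def project_selection(selected_en, alignment):
    """Project selected English sentence ids onto a target language.

    `alignment` maps an English sentence id to the list of aligned target
    sentence ids (assumed layout). Ids without an alignment entry are skipped.
    """
    projected = []
    for en_id in selected_en:
        for target_id in alignment.get(en_id, []):
            if target_id not in projected:
                projected.append(target_id)
    return projected


# Hypothetical toy data: four annotators and an English-French alignment.
annotations = [[1, 3, 4], [1, 4], [1, 2, 4], [3, 4]]
alignment_en_fr = {1: [1], 2: [2, 3], 3: [4], 4: [5]}

agreed = sentences_by_agreement(annotations, min_votes=3)
projected_fr = project_selection(agreed, alignment_en_fr)
print(agreed, projected_fr)
```

Raising or lowering `min_votes` trades a stricter reference selection against a more inclusive one, and the same agreed selection can be projected onto any language for which an alignment with English is available.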

Right Holder(s)