Resource: Multilingual summary evaluation data

Reference Multilingual summary evaluation data
Date of Submission Oct. 20, 2014, 12:02 p.m.
Status accepted
ISLRN 762-292-165-648-8
Resource Type Primary Text
Media Type Text
Source
Language Arabic, Czech, English, French, German, Russian, Spanish
Format/MIME Type text/xml
Size 4 news commentary clusters with 5 documents each, in 7 languages (28 language-specific clusters)
Access Medium files for download
Description

This is a manually annotated collection of document clusters of parallel texts in seven languages (Arabic, Czech, English, French, German, Russian and Spanish). It can be used to evaluate multi-document, and also single-document, summarisation software.

The accompanying publication by M. Turchi, J. Steinberger, M. Kabadjov and R. Steinberger (2010): 'Using parallel corpora for multilingual (multi-document) Summarisation Evaluation' (Proceedings of CLEF'2010, Springer LNCS series) suggests that annotation time can be saved by projecting the monolingual sentence selection annotation across languages, using the sentence alignment information in this parallel corpus. It also proposes various ways to make use of the varying degree of overlap between the manual annotations of the four annotators.

The downloadable zip file contains the full text of all documents in seven languages, sentence-split full texts, sentence alignment information for all language pairs involving English, and the annotations of the English documents. Important background information about the XML structure of the files can be found in the Readme file. The four document clusters each consist of five high-level commentaries selected from www.project-syndicate.org. Their topics can roughly be described as malaria, the Israel-Palestine conflict, genetics, and science and society.
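The projection idea described above can be illustrated with a minimal Python sketch. It is not the method or code from the publication, and it does not read the corpus files: the dictionaries, the function names `select_by_agreement` and `project_selection`, and the toy sentence ids are all assumptions made for illustration. The real XML structure is documented in the Readme file.

```python
from collections import Counter


def select_by_agreement(annotations, min_votes):
    """Return the English sentence ids chosen by at least `min_votes` annotators.

    `annotations` maps an annotator name to the list of sentence ids
    that annotator selected (hypothetical in-memory representation).
    """
    votes = Counter(
        sid for selected in annotations.values() for sid in set(selected)
    )
    return {sid for sid, count in votes.items() if count >= min_votes}


def project_selection(selected_en, alignment):
    """Project selected English sentence ids onto a target language.

    `alignment` maps an English sentence id to a list of aligned
    target-language sentence ids (an empty list if there is none).
    """
    projected = set()
    for sid in selected_en:
        projected.update(alignment.get(sid, []))
    return projected


# Hypothetical toy data: four annotators and an English-to-French alignment.
annotations = {
    "annotator_1": [1, 2, 5],
    "annotator_2": [2, 5],
    "annotator_3": [2, 3],
    "annotator_4": [2, 5, 6],
}
en_to_fr = {1: [1], 2: [2, 3], 3: [4], 5: [6], 6: []}

# Keep sentences that at least three of the four annotators selected,
# then map them to the aligned French sentences.
selected = select_by_agreement(annotations, min_votes=3)
projected = project_selection(selected, en_to_fr)
```

Varying `min_votes` from 1 to 4 is one simple way to exploit the degree of overlap between annotators: a low threshold yields a larger, more permissive reference set, while a high threshold keeps only the sentences that most annotators agree on.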

Version 1.0 (September 2010)
Creator European Commission - Joint Research Centre (JRC)
Distributor Ralf Steinberger - European Commission - Joint Research Centre (JRC)
Rights Holder European Union (EU)