ISLRN

Multilingual summary evaluation data

Full Official Name: Multilingual summary evaluation data

Submission date: Oct. 20, 2014, 12:02 p.m.

This is a manually annotated collection of document clusters of parallel texts in seven languages (Arabic, Czech, English, French, German, Russian and Spanish) that can be used to evaluate multi-document, or even single-document, summarisation software. The accompanying publication by M. Turchi, J. Steinberger, M. Kabadjov and R. Steinberger (2010), 'Using parallel corpora for multilingual (multi-document) Summarisation Evaluation' (Proceedings of CLEF'2010, Springer LNCS series), suggests that precious annotation time can be saved by using the sentence alignment information in this parallel corpus to project the monolingual sentence selection annotation across languages. Various ways are proposed to make use of the varying degree of overlap between the manual annotations of the four annotators. The downloadable zip file contains the full text of all documents in seven languages, sentence-split full texts, sentence alignment information for all language pairs involving English, and the annotations of the English documents. Important background information about the XML structure of the files can be found in the Readme file. The four document clusters each consist of five high-level commentaries selected from www.project-syndicate.org; their topics can roughly be described as malaria, the Israel-Palestine conflict, genetics, and science and society.
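The idea of projecting sentence selections across languages can be illustrated with a minimal sketch. It does not read the resource's actual XML files (their structure is described in the Readme); the function name `project_selection`, the in-memory alignment format and the toy data are all assumptions made for illustration only.

```python
def project_selection(selected_en, alignment):
    """Project selected English sentence indices onto a target language.

    selected_en: set of English sentence indices chosen by an annotator.
    alignment:   list of (english_indices, target_indices) pairs, where each
                 pair links a group of English sentences to the group of
                 target-language sentences aligned with it (hypothetical format).
    """
    projected = set()
    for en_group, tgt_group in alignment:
        # Carry the annotation over if any English sentence in the group was selected.
        if selected_en.intersection(en_group):
            projected.update(tgt_group)
    return sorted(projected)


# Toy alignment, invented for illustration: English sentences 1 and 2 map to
# one target sentence, and English sentence 3 maps to two target sentences.
alignment_en_fr = [([0], [0]), ([1, 2], [1]), ([3], [2, 3]), ([4], [4])]
selected_by_annotator = {1, 4}

print(project_selection(selected_by_annotator, alignment_en_fr))
```

In this sketch a target sentence counts as selected as soon as one of its aligned English sentences was selected; other policies for many-to-one alignments are equally conceivable.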

Creator(s)

European Commission - Joint Research Centre (JRC)

Distributor(s)

European Commission - Joint Research Centre (JRC) - Ralf Steinberger

Right Holder(s)

European Union (EU)

Status : Accepted

ISLRN :

762-292-165-648-8

Version

1.0 (September 2010)

Source

https://ec.europa.eu/jrc/en/language-technologies

Resource Type

Primary Text

Media Type

Text

Language(s)

Arabic

Czech

English

French

German

Russian

Spanish

Access Medium

Files For Download