Resource: Multilingual summary evaluation data
|Reference||Multilingual summary evaluation data|
|Date of Submission||Oct. 20, 2014, 12:02 p.m.|
|Resource Type||Primary Text|
|Language||Arabic, Czech, English, French, German, Russian, Spanish|
|Size||28 news commentary clusters with 5 documents each|
|Access Medium||files for download|
This is a manually annotated collection of document clusters of parallel texts in seven languages (Arabic, Czech, English, French, German, Russian and Spanish). It can be used to evaluate multi-document, or even single-document, summarisation software.
The accompanying publication by M. Turchi, J. Steinberger, M. Kabadjov and R. Steinberger (2010), 'Using parallel corpora for multilingual (multi-document) Summarisation Evaluation' (Proceedings of CLEF'2010, Springer LNCS series), suggests that precious annotation time can be saved by projecting the monolingual sentence selection annotation across languages, using the sentence alignment information in this parallel corpus. It also proposes various ways to make use of the varying degree of overlap between the manual annotations produced by four different annotators.
The downloadable zip file contains the full text of all documents in seven languages, sentence-split full texts, sentence alignment information for all language pairs involving English, and the annotations of the English documents. Important background information about the XML structure of the files can be found in the Readme file.
The four document clusters each consist of five high-level commentaries selected from www.project-syndicate.org. Their topics can roughly be described as malaria, the Israel-Palestine conflict, genetics, and science and society.
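The projection idea described above can be illustrated with a minimal Python sketch. It is not the method or code of the publication: the function names, the in-memory data layout (lists of selected sentence ids per annotator, and a dictionary mapping each English sentence id to its aligned target-language sentence ids) and the toy values are all assumptions made for illustration. The actual XML layout of the files is documented in the Readme and is not reproduced here.

```python
from collections import Counter


def select_by_agreement(annotations, min_votes):
    """Return sentence ids selected by at least `min_votes` annotators.

    `annotations` is a list with one entry per annotator; each entry is a
    list of selected English sentence ids (hypothetical layout).
    """
    counts = Counter()
    for selected in annotations:
        counts.update(set(selected))  # one vote per annotator per sentence
    return sorted(sid for sid, votes in counts.items() if votes >= min_votes)


def project_selection(selected_en, alignment):
    """Project selected English sentence ids onto a target language.

    `alignment` maps an English sentence id to a list of aligned target
    sentence ids (hypothetical layout); unaligned sentences are skipped.
    """
    projected = []
    for en_id in selected_en:
        for tgt_id in alignment.get(en_id, []):
            if tgt_id not in projected:
                projected.append(tgt_id)
    return projected


# Toy data, invented for illustration only.
annotations = [[1, 3, 5], [1, 3], [3, 4], [1, 3, 5]]  # four annotators
alignment_en_fr = {1: [1], 3: [3, 4], 5: []}           # English -> French

agreed = select_by_agreement(annotations, min_votes=2)
projected_fr = project_selection(agreed, alignment_en_fr)
```

Raising `min_votes` keeps only the sentences on which more annotators agree, which is one simple way to exploit the overlap between annotators; the publication may weight or combine the annotations differently.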
|Version||1.0 (September 2010)|
|Creator||European Commission - Joint Research Centre (JRC)|
|Distributor||Ralf Steinberger - European Commission - Joint Research Centre (JRC)|
|Rights Holder||European Union (EU)|