Full Official Name: The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
Submission date: Feb. 19, 2020, 12:25 p.m.

*Introduction*
The STEM-ECR v1.0 dataset introduces the task of Scientific Entity Extraction, Classification, and Resolution on scholarly publications in STEM (Science, Technology, Engineering, and Medicine) disciplines. It comprises annotated scholarly abstracts from 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. The annotated data include phrase-based scientific entities and, where applicable, their corresponding disambiguated references in Wikipedia and Wiktionary. The purpose of the dataset is to provide a benchmark for evaluating scientific entity extraction, classification, and resolution in a domain-independent fashion.

*Data*
Source data for the annotations in this corpus comprise scholarly abstracts collected by Elsevier. Each abstract is provided as UTF-8 encoded files, with the abstract text and its corresponding character-span-based annotations in separate files. A summary of the data by domain and number of mentions is below:

Domain              Mentions
Astronomy           791
Agriculture         741
Engineering         741
Earth Science       698
Biology             649
Medicine            600
Material Science    574
Computer Science    553
Chemistry           483
Mathematics         297

*Acknowledgement*
This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.
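*Example*
The Data description above says that annotations are stored as character spans in files separate from the abstract text. The following is a minimal Python sketch of how such spans could be mapped back to their surface strings. The on-disk layout of the annotation files is not specified here, so the `Mention` type, the `resolve_spans` helper, the end-exclusive offset convention, the sample sentence, and the label name are all hypothetical and serve only as illustration.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """Hypothetical in-memory form of one character-span annotation."""
    start: int   # offset of the first character (assumed 0-based)
    end: int     # offset one past the last character (assumed exclusive)
    label: str   # entity type; label names here are illustrative only


def resolve_spans(text: str, mentions: List[Mention]) -> List[Tuple[str, str]]:
    """Return (surface string, label) pairs for each span in the text."""
    resolved = []
    for m in mentions:
        # Reject spans that are empty, reversed, or outside the text.
        if not (0 <= m.start < m.end <= len(text)):
            raise ValueError(f"span {m.start}-{m.end} is outside the text")
        resolved.append((text[m.start:m.end], m.label))
    return resolved


# Invented sample sentence, not taken from the dataset.
abstract = "We study graphene oxide membranes."
mentions = [Mention(9, 23, "Material")]

for surface, label in resolve_spans(abstract, mentions):
    print(f"{label}\t{surface}")
```

Read the abstract files as UTF-8 before applying offsets, so that the spans index characters rather than bytes.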

Right Holder(s)