ISLRN

News Sub-domain Named Entity Recognition

Full Official Name: News Sub-domain Named Entity Recognition

Submission date: Nov. 16, 2023, 9:40 p.m.

News Sub-domain Named Entity Recognition (LDC2023T12) was developed at the University of Pennsylvania and contains over 20,000 English news sentences annotated with named entities and categorized into sub-domains.

The sentences were extracted from The New York Times Annotated Corpus (LDC2008T19), which comprises over 1.8 million articles written and published by the New York Times between January 1, 1987, and June 19, 2007, with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. Sentences were selected from different years and topics using that metadata.

Named entity annotation was based on the CoNLL-2003 guidelines and annotation scheme. Sentences were labeled with person (PER), location (LOC) and organization (ORG) tags using phrase matching followed by a manual second pass.

Sub-domains are: Arts (+Weekend/Cultural), Business (+Financial), Classifieds (+Obituary), Editorial, Foreign, Metropolitan, Sports and Others. "Others" includes topics such as Real Estate, New Jersey Weekly, Book Review, Job Market, Science, and Health & Fitness.

Each line in the annotation files, except for document id lines, contains two tab-separated columns: the first column contains the word, and the second column contains the tag. Following CoNLL guidelines, tags are B-TYPE, I-TYPE and O, where TYPE is PER, LOC or ORG. Annotation and source text files are provided as plain-text (.txt) files.
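The two-column annotation format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes that a document id appears on a line of its own without a tab character, which this record does not specify. The function names `read_documents` and `extract_entities`, the id `doc-0001` and the sample tokens are all made up for illustration and do not come from the corpus.

```python
from typing import Dict, Iterable, List, Optional, Tuple

VALID_TYPES = {"PER", "LOC", "ORG"}


def read_documents(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (word, tag) pairs under the most recent document id line.

    Assumption: any non-empty line without a tab is a document id.
    """
    docs: Dict[str, List[Tuple[str, str]]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if "\t" not in line:
            current = line.strip()
            docs.setdefault(current, [])
            continue
        word, tag = line.split("\t", 1)
        if current is None:
            # Token seen before any id line: file it under a placeholder key.
            current = "<unknown>"
            docs.setdefault(current, [])
        docs[current].append((word, tag.strip()))
    return docs


def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse B-TYPE / I-TYPE runs into (entity text, TYPE) tuples.

    An I- tag that does not continue an open entity of the same type
    is ignored, as is any tag outside PER, LOC and ORG.
    """
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    etype: Optional[str] = None
    for word, tag in pairs:
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in VALID_TYPES:
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [word], label
        elif prefix == "I" and words and label == etype:
            words.append(word)
        else:
            if words:
                entities.append((" ".join(words), etype))
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


# Made-up sample in the described layout (not taken from the corpus).
sample = [
    "doc-0001\n",
    "John\tB-PER\n",
    "Smith\tI-PER\n",
    "visited\tO\n",
    "New\tB-LOC\n",
    "York\tI-LOC\n",
    ".\tO\n",
]
docs = read_documents(sample)
entities = extract_entities(docs["doc-0001"])
print(entities)
```

If the actual files mark document ids differently (for example with a prefix or a tab-separated field), only the id check in `read_documents` would need to change.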

Creator(s)