Resource: JRC-Names

Reference JRC-Names
Date of Submission Oct. 3, 2014, 4:37 p.m.
Status accepted
ISLRN 328-863-023-410-2
Resource Type Lexicon
Media Type Text
Source
Language Arabic, Bulgarian, Chinese, Danish, Dutch, English, Estonian, French, Georgian, German, Modern Greek (1453-), Hebrew, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish
Format/MIME Type text/xml
Size 611,000 (as of 1 October 2014)
Access Medium files for download, incl. software
Description

JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.).

The named entity resource file, which lists the spelling variants, is accompanied by demonstrator software implemented in Java. The software (a) produces a list of known spelling variants for any input name, and (b) analyses UTF-8-encoded text files to find known entity mentions, returning the name variant found, the preferred display name for that entity, the unique name identifier, the position of the entity name in the text, and its length in characters.
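
The two functions described above can be illustrated with a minimal Python sketch. This is not the Java demonstrator and does not read the actual resource file; the sample entries, identifiers, and the names `SAMPLE_ENTITIES`, `build_index`, `variants_of`, `find_mentions`, and `Mention` are all hypothetical and chosen only for illustration. The matching strategy (exact string match, longest variant first) is likewise an assumption.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    variant: str       # spelling found in the text
    display_name: str  # preferred display name of the entity
    entity_id: int     # unique identifier of the entity
    position: int      # character offset of the match in the text
    length: int        # length of the match in characters


# Hypothetical sample entries: (entity id, display name, known variants).
# Identifiers and variant lists are invented, not taken from the resource.
SAMPLE_ENTITIES = [
    (1, "Angela Merkel", ["Angela Merkel", "Ангела Меркель", "Angela Merkelová"]),
    (2, "Ban Ki-moon", ["Ban Ki-moon", "Ban Ki Moon", "Пан Ги Мун"]),
]


def build_index(entities):
    """Map every spelling variant to its (entity id, display name) pair."""
    index = {}
    for entity_id, display_name, variants in entities:
        for variant in variants:
            index[variant] = (entity_id, display_name)
    return index


def variants_of(name, entities):
    """Return all known variants of the entity that `name` belongs to."""
    for _, _, variants in entities:
        if name in variants:
            return list(variants)
    return []


def find_mentions(text, index):
    """Find known variants in `text`, trying longer variants first."""
    if not index:
        return []
    # Longest-first alternation, so a longer variant is preferred over
    # a shorter variant that is its prefix.
    alternatives = sorted(index, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in alternatives))
    mentions = []
    for match in pattern.finditer(text):
        found = match.group()
        entity_id, display_name = index[found]
        mentions.append(
            Mention(found, display_name, entity_id, match.start(), len(found))
        )
    return mentions


if __name__ == "__main__":
    index = build_index(SAMPLE_ENTITIES)
    print(variants_of("Ban Ki Moon", SAMPLE_ENTITIES))
    for mention in find_mentions("Ангела Меркель met Ban Ki Moon.", index):
        print(mention)
```

Positions and lengths here are counted in Python string characters (Unicode code points), which matches the "length in characters" wording only under that assumption; the demonstrator may count differently.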

The names were mostly identified in real-life news articles through named entity recognition, and spelling variants were mostly added to the main spelling of each name automatically.

The list of names is updated daily with newly found names and their variants.

Version 1.0
Creator European Commission - Joint Research Centre (JRC)
Distributor Ralf Steinberger - European Commission - Joint Research Centre (JRC)
Rights Holder European Union (EU)