Resource: EnToFrNE - a Parallel English-French Lexicon of Named Entities
|Reference||EnToFrNE - a Parallel English-French Lexicon of Named Entities|
|Date of Submission||Sept. 10, 2019, 2:25 p.m.|
|Format/MIME Type||text/csv, text/xml|
In any text document, some terms refer to specific entities and carry more information than the surrounding text. These terms are known as named entities: they refer to real-world objects such as people, places, and organizations. They are usually denoted by proper names and may be abstract or have a physical existence. Examples of named entities include United States of America, Paris, Google, Mercedes Benz, Microsoft Windows, and anything else that can be named.
The lexicon consists of 1,167,263 parallel named entities in English and French.
Each of the 1,167,263 entries in the lexicon is assigned at least one tag. The distribution of tags is as follows:
The total number of tags, 1,201,866, is slightly higher than the number of entries because some named entities belong to more than one class. For example, Tom Sawyer is tagged as both PRODUCT (the title of the novel) and PERSON (the character from the novel).
To evaluate the tagging, a random sample of 1,000 entries was extracted from the lexicon. The sample entries were tagged manually and then compared with the tags assigned by the algorithm. Precision ranges from 0.94 for ORGANIZATION to 0.99 for PERSON. Recall is slightly lower, ranging from 0.83 for PRODUCT and MISC to 0.97 for PERSON. The higher precision values show that the tagging algorithm was tuned to tag named entities correctly rather than to extract more named entities for the lexicon.
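The per-class comparison described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for the lexicon: the function name `per_class_scores`, the representation of each entry's tags as a set, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {tag: (precision, recall)} for multi-tag entries.

    gold and predicted are parallel lists of tag sets, one set per entry
    (hypothetical representation; the lexicon's own format may differ).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_tags, pred_tags in zip(gold, predicted):
        for tag in pred_tags & gold_tags:   # tag assigned and correct
            tp[tag] += 1
        for tag in pred_tags - gold_tags:   # tag assigned but wrong
            fp[tag] += 1
        for tag in gold_tags - pred_tags:   # tag missed by the algorithm
            fn[tag] += 1

    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        assigned = tp[tag] + fp[tag]
        expected = tp[tag] + fn[tag]
        precision = tp[tag] / assigned if assigned else 0.0
        recall = tp[tag] / expected if expected else 0.0
        scores[tag] = (precision, recall)
    return scores


# Toy sample (invented for illustration): manual tags vs. algorithm tags.
gold = [{"PERSON"}, {"PERSON", "PRODUCT"}, {"ORGANIZATION"}, {"PRODUCT"}]
predicted = [{"PERSON"}, {"PERSON"}, {"ORGANIZATION"}, {"ORGANIZATION"}]
scores = per_class_scores(gold, predicted)
```

In this toy sample the algorithm never assigns PRODUCT, so that class gets zero recall, while every PERSON tag it assigns is correct. A tagger that assigns tags cautiously tends toward this pattern of precision exceeding recall.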
The structure of the XML file is similar: the column names of the CSV file become the names of elements.
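The mapping from CSV columns to XML elements can be illustrated with a short sketch. The column names `english`, `french`, and `tag`, the element names `lexicon` and `entry`, and the sample row are hypothetical placeholders chosen for the example; the actual header and element names of the distributed files may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical header and row; the real lexicon's columns may be named differently.
SAMPLE_CSV = "english,french,tag\nGoogle,Google,ORGANIZATION\n"


def csv_to_xml(csv_text, root_name="lexicon", row_name="entry"):
    """Build an XML tree in which each CSV column name becomes an element name."""
    root = ET.Element(root_name)
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, row_name)
        for column, value in row.items():
            # One child element per column, named after the column header.
            ET.SubElement(entry, column).text = value
    return root


root = csv_to_xml(SAMPLE_CSV)
xml_text = ET.tostring(root, encoding="unicode")
```

This approach only works when every column name is also a valid XML element name, for example a name without spaces.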