Resource: EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages

Reference EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages
Date of Submission April 24, 2019, 5:18 p.m.
Status accepted
ISLRN 690-348-503-270-1
Resource Type Lexicon
Media Type Text
Source
Language Bosnian, Bulgarian, Croatian, English, Macedonian, Serbian, Slovenian
Format/MIME Type text/plain
Size 26155 entries
Access Medium downloadable
Description

This lexicon contains multiword entries that are not strictly named entities but contain a word that is. For example, German shepherd is not a named entity in the strict sense, since many dogs of this breed exist, but the adjective German makes it a named entity in a broader sense, so it is included as an entry. Accordingly, the lexicon contains many multiword units that include ethnonyms. Similarly, the unit Planck's law belongs to this lexicon as well.

Certain natural terms, such as biological species and substances, which are sometimes considered named entities, are not included in the lexicon.

Languages
The lexicon consists of 26,155 parallel named entities in seven languages: English and six South Slavic languages (Bosnian, Bulgarian, Croatian, Macedonian, Serbian, and Slovenian).

Slovenian, Croatian and Bosnian are written in Latin script; Macedonian and Bulgarian are written in Cyrillic. Serbian is a special case, since it can be written in two scripts (Cyrillic and Latin) and has two dialects (ekavica and ijekavica). This lexicon uses the Serbian ekavica variant in Cyrillic script.

Classification
The tags used for named entities are: ORGANIZATION, LOCATION, PERSON, PRODUCT and MISC. Each named entity belongs to one of these classes. The classes comprise:
ORGANIZATION: political organizations, companies, schools, rock bands, sports teams
LOCATION: geographical terms, fictional places, cosmic terms
PERSON: humans, gods, saints, fictional characters
PRODUCT: industrial products, software products, weapons, art works, documents, concepts, standards, formats, anthems, algorithms, journals, coats of arms, platforms, websites
MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, competitions, diseases, breeds, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks.

The lexicon consists of 26,155 entries. A tag is assigned to each one of them. The distribution of classes is as follows:
ORGANIZATION: 1,575 entries
LOCATION: 6,327 entries
PERSON: 8,584 entries
PRODUCT: 1,716 entries
MISC: 7,953 entries
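
The class counts above add up to the stated total of 26,155 entries. A minimal Python sketch that checks the sum and derives each class's share of the lexicon follows; it uses only the figures listed here, and the variable names are illustrative rather than part of the resource.

```python
# Class counts exactly as listed in this record; the names are illustrative.
class_counts = {
    "ORGANIZATION": 1575,
    "LOCATION": 6327,
    "PERSON": 8584,
    "PRODUCT": 1716,
    "MISC": 7953,
}

total = sum(class_counts.values())

# Share of each class as a percentage of the total, rounded to one decimal.
shares = {tag: round(100 * n / total, 1) for tag, n in class_counts.items()}

for tag, n in class_counts.items():
    print(f"{tag}: {n} entries ({shares[tag]}%)")
print(f"Total: {total}")
```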

Formats
The lexicon comes in two formats: CSV and XML.
The first row in the CSV file is a header row, and the tab character is used as the field separator, e.g.:
German Shepherd Nemški ovčar Njemački ovčar Njemački ovčar Немачки овчар Германски овчар Немска овчарка MISC
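
A tab-separated file of this shape could be read with Python's standard csv module, as in the sketch below. The column names in the sample title row are assumptions: this record does not list them, and the relative order of the Croatian and Bosnian columns cannot be inferred from the example, since both forms are identical. The sketch relies only on what the example row shows, namely that the English form comes first and the class tag last.

```python
import csv
import io

# Hypothetical title row: the record does not give the actual column names,
# and the Croatian/Bosnian order cannot be inferred from the example row.
SAMPLE = (
    "English\tSlovenian\tCroatian\tBosnian\tSerbian\tMacedonian\tBulgarian\tClass\n"
    "German Shepherd\tNemški ovčar\tNjemački ovčar\tNjemački ovčar\t"
    "Немачки овчар\tГермански овчар\tНемска овчарка\tMISC\n"
)


def read_lexicon(stream):
    """Return (header, rows) from a tab-separated stream with a title row."""
    # QUOTE_NONE keeps any quotation marks inside names as literal text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows


header, rows = read_lexicon(io.StringIO(SAMPLE))
for row in rows:
    english, tag = row[0], row[-1]
    print(f"{english} -> {tag} ({len(row) - 2} translations)")
```

For the distributed file, the same function could be given a file object opened with `newline=""` and an explicit encoding; UTF-8 is an assumption here, as the record does not state the encoding.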

In the XML file, the tag denoting the class is an attribute and the languages are elements.
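
The record does not show the XML schema, so the following Python sketch is illustrative only. The element names (lexicon, entry, en, sl, sr) and the attribute name (class) are hypothetical; the only property taken from the description is that the class is an attribute and each language is a child element.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element and attribute names are assumptions made for
# this sketch; the actual file may use different names and nesting.
SAMPLE_XML = """<lexicon>
  <entry class="MISC">
    <en>German Shepherd</en>
    <sl>Nemški ovčar</sl>
    <sr>Немачки овчар</sr>
  </entry>
</lexicon>"""

root = ET.fromstring(SAMPLE_XML)

entries = []
for entry in root.iter("entry"):
    tag = entry.get("class")  # class tag stored as an attribute
    # One child element per language; map element name to its text.
    forms = {child.tag: (child.text or "").strip() for child in entry}
    entries.append((tag, forms))

for tag, forms in entries:
    print(tag, forms)
```

Collecting the language elements into a dictionary keeps the code independent of how many languages an entry carries, which seems a reasonable choice when the schema is not known in advance.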

Version 1.0
Distributor ELRA