EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages

Full Official Name: EnToSSLNE - a Lexicon of Parallel Named Entities from English to South Slavic Languages
Submission date: April 24, 2019, 5:18 p.m.

This lexicon contains multiword entries which are not strictly named entities but contain a word which is. For example, German shepherd is an entry in this lexicon: it is not a named entity in the strict sense, since many dogs of this breed exist, but the adjective German makes it one in a broader sense. Accordingly, there are many multiword units in the lexicon which contain ethnonyms. Similarly, the unit Planck's law belongs to this lexicon. Certain natural terms such as biological species and substances, which are sometimes considered named entities, are not included in the lexicon.

Languages
The lexicon consists of 26,155 parallel named entities in seven languages: English and six South Slavic languages (Bosnian, Bulgarian, Croatian, Macedonian, Serbian, and Slovenian). Slovenian, Croatian and Bosnian are written in Latin script, Macedonian and Bulgarian in Cyrillic. Serbian is a special case, since it can be written in two scripts (Cyrillic and Latin) and has two dialects (ekavica and ijekavica). This lexicon uses the ekavica variant of Serbian in Cyrillic script.

Classification
The tags used for named entities are: ORGANIZATION, LOCATION, PERSON, PRODUCT and MISC. Each named entity belongs to exactly one of these classes. The classes comprise:
ORGANIZATION: political organizations, companies, schools, rock bands, sports teams
LOCATION: geographical terms, fictional places, cosmic terms
PERSON: humans, gods, saints, fictional characters
PRODUCT: industrial products, software products, weapons, art works, documents, concepts, standards, formats, anthems, algorithms, journals, coats of arms, platforms, websites
MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, competitions, diseases, breeds, programs, sets of locations, awards, musical genres, missions, artistic movements, sets of organizations, networks

Each of the 26,155 entries in the lexicon is assigned one tag.
The distribution of classes is as follows:
ORGANIZATION: 1,575 entries
LOCATION: 6,327 entries
PERSON: 8,584 entries
PRODUCT: 1,716 entries
MISC: 7,953 entries

Formats
The lexicon comes in two formats: CSV and XML. The first row of the CSV file is a title row, and the tab character is used as the field separator, e.g.:
German Shepherd	Nemški ovčar	Njemački ovčar	Njemački ovčar	Немачки овчар	Германски овчар	Немска овчарка	MISC
In the XML file, the tag denoting the class is an attribute and the languages are elements.
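The tab-separated layout described above could be read along the following lines. This is only a minimal sketch: the header labels in the sample title row are hypothetical (the record does not list them), and the code assumes only that the English form is the first field and the class tag is the last one. The per-class counts are the ones published in this description.

```python
import csv
import io
from collections import Counter

# Sample data: one title row plus the example entry from the description.
# The header labels are placeholders, not the real column names.
SAMPLE_CSV = (
    "col1\tcol2\tcol3\tcol4\tcol5\tcol6\tcol7\tcol8\n"
    "German Shepherd\tNemški ovčar\tNjemački ovčar\tNjemački ovčar\t"
    "Немачки овчар\tГермански овчар\tНемска овчарка\tMISC\n"
)

# Class distribution as stated in the description.
PUBLISHED_COUNTS = {
    "ORGANIZATION": 1575,
    "LOCATION": 6327,
    "PERSON": 8584,
    "PRODUCT": 1716,
    "MISC": 7953,
}


def read_lexicon(handle):
    """Read tab-separated entries, skipping the title row.

    Assumes the first field is the English form and the last field is the tag;
    everything in between is kept as an ordered list of translations.
    """
    # QUOTE_NONE keeps quote characters inside entries as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # title row
    entries = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        entries.append(
            {"english": row[0], "translations": row[1:-1], "tag": row[-1]}
        )
    return entries


entries = read_lexicon(io.StringIO(SAMPLE_CSV))
tag_counts = Counter(entry["tag"] for entry in entries)

# The published per-class counts should add up to the stated lexicon size.
total_published = sum(PUBLISHED_COUNTS.values())

print(entries[0]["english"], entries[0]["tag"])
print(dict(tag_counts))
print(total_published)
```

For the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by a file opened with `encoding="utf-8"` and `newline=""`, and `tag_counts` could then be compared against `PUBLISHED_COUNTS`.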

Right Holder(s)