Resource: Speechtera Pronunciation Dictionary

Reference Speechtera Pronunciation Dictionary
Date of Submission Feb. 10, 2020, 3:36 p.m.
Status accepted
ISLRN 645-563-102-594-8
Resource Type Lexicon
Media Type Text
Source
Language Portuguese
Size 737347 entries
Access Medium downloadable
Description

The SpeechTera Pronunciation Dictionary is a machine-readable pronunciation dictionary for Brazilian Portuguese comprising 737,347 entries. Its entries were designed primarily for speech technologies, such as automatic speech recognition systems and speech synthesizers. However, it may also be used by linguists, speech therapists, lexicographers, students of Brazilian Portuguese as a second language, and anyone interested in the sound structure of Brazilian Portuguese.

Its phonetic transcription is based on 13 linguistic varieties spoken in Brazil: São Paulo (capital city), countryside of São Paulo State, Rio de Janeiro (RJ), Brasília (Federal District), Belo Horizonte (MG), Curitiba (PR), Manaus (AM), Porto Alegre (RS), Salvador (BA), Goiânia (GO), Belém (PA), Vitória (ES) and Cuiabá (MT). The transcription was generated using an in-house grapheme-to-phoneme converter, and its output was then manually revised by Brazilian linguists.

The SpeechTera Pronunciation Dictionary contains the pronunciations of the frequent word forms found in the transcription data of SpeechTera's speech and text database (literary, newspaper, movies, miscellaneous). Each of the thirteen dialects comprises 56,719 entries, including:

- 44,396 entries including common nouns, adjectives, verbs, adverbs, articles, pronouns, numbers, prepositions, conjunctions;
- 8,074 proper nouns (including person names, family names, cities, streets, companies and brand names);
- 1,400 acronyms;
- 1,994 heterophonic homographs;
- 26 unstressed words (clitics);
- 92 prefixes containing the mid vowels "e" and "o";
- 40 common nouns with metaphonic plurals;
- 698 foreign words frequently used in Brazil.

The phone set for each of the 13 varieties of Brazilian Portuguese was derived individually from the literature, following best practices for automatic speech processing. Detailed information about the phone set can be found in the handbook for corpora annotation, written by SpeechTera's team of experts and provided with the dictionary. The dictionary maps words to their pronunciations in the ARPAbet phoneme set; a mapping between ARPAbet, the International Phonetic Alphabet (IPA) and the Speech Assessment Methods Phonetic Alphabet (SAMPA) is also provided to help users understand the phonetic symbols used in the transcriptions. Stressed syllables carry a lexical stress marker, for example, "abacaxi aa bb aa kk aa1 sh iy".
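As an illustration of the entry format quoted above, the following minimal Python sketch splits one entry into its word and phone sequence. It rests on assumptions not confirmed by this record: that fields are separated by whitespace, that the first field is the word, and that a trailing "1" on a phone symbol is the lexical stress marker. The names parse_entry and Phone are hypothetical and are not part of the distributed resource; the handbook supplied with the dictionary remains the authoritative description of the format.

```python
from typing import List, NamedTuple, Tuple


class Phone(NamedTuple):
    symbol: str      # phone symbol without the stress marker
    stressed: bool   # True if the entry marked this phone as stressed


def parse_entry(line: str) -> Tuple[str, List[Phone]]:
    """Split one dictionary line into its word and its phone sequence.

    Assumes whitespace-separated fields with the word first (hypothetical
    layout, inferred only from the single example in the description).
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected a word followed by phones: {line!r}")
    word, raw_phones = fields[0], fields[1:]

    phones = []
    for raw in raw_phones:
        # Assumption: a trailing "1" is the lexical stress marker.
        stressed = raw.endswith("1")
        symbol = raw[:-1] if stressed else raw
        phones.append(Phone(symbol, stressed))
    return word, phones


word, phones = parse_entry("abacaxi aa bb aa kk aa1 sh iy")
stressed_positions = [i for i, p in enumerate(phones) if p.stressed]
```

Keeping the stress flag separate from the symbol makes it straightforward to look the bare symbol up in the ARPAbet, IPA and SAMPA mapping shipped with the dictionary.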

The dictionary was created semi-automatically using an in-house grapheme-to-phoneme converter. In the first step, initial pronunciations were generated automatically for all word forms appearing in the SpeechTera transcriptions. After this automatic creation process, the dictionary was manually cross-checked by linguists who are native speakers, who corrected potential errors introduced by the automatic pronunciation generation.

Version 1.0
Distributor ELRA