Spoken Portuguese Corpus

Full Official Name: Spoken Portuguese Corpus
Submission date: Jan. 24, 2014, 4:31 p.m.

The Spoken Portuguese corpus was collected among sociolinguistically diverse speakers having Portuguese as mother tongue or as second language. In a total of 86 recordings, the texts exemplify the Portuguese spoken in Portugal (30), in Brazil (20), in the African countries with Portuguese as its official language: Angola, Cape Verde, Guinea-Bissau, Mozambique and Sao Tome and Principe (5 each), in Macao (5), in Goa (3) and in East-Timor (3), corresponding to a total of 8h44m of recording. The corpus was recorded in a situation of spontaneous oral communication, on different themes of everyday life, with speakers of different ages and social and professional backgrounds. The recordings cover a period that goes from 1970 to 2001, and approximately 70% of them fall within the nineties. The corpus contains 153,588 tokens. The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text. The plain text files also have automatically assigned POS-tag information. The transcriptions of the corpus are also available in html format. The characters have been encoded in UTF-8.

Creator(s)
Distributor(s)
Right Holder(s)