Resource: NEMLAR Broadcast News Speech Corpus

Reference NEMLAR Broadcast News Speech Corpus
Date of Submission Jan. 24, 2014, 4:30 p.m.
Status accepted
ISLRN 479-507-036-103-9
Resource Type Primary Text
Media Type Audio
Source
Language Arabic
Description

This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four radio stations: Medi1, Radio Orient, RMC (Radio Monte Carlo), and RTM (Radio Television Maroc).

Each broadcast contains between 25 and 30 minutes of news and interviews. The recordings were made during three periods between 30 June 2002 and 18 July 2005. All files were recorded as 16-bit linear PCM sampled at 16 kHz.
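
As a rough check on the audio figures above, the following minimal sketch estimates the uncompressed data volume implied by 16 kHz, 16-bit linear PCM. The channel count is not stated in this record, so mono is assumed here; the function name `pcm_bytes` is purely illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # from the description: 16 kHz
BYTES_PER_SAMPLE = 2      # from the description: 16-bit linear PCM
CHANNELS = 1              # ASSUMPTION: mono; the record does not state this


def pcm_bytes(duration_s, sample_rate=SAMPLE_RATE_HZ,
              bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the size in bytes of raw PCM audio of the given duration."""
    return int(duration_s * sample_rate * bytes_per_sample * channels)


one_broadcast = pcm_bytes(30 * 60)     # a 30-minute broadcast
whole_corpus = pcm_bytes(40 * 3600)    # about 40 hours in total

print(f"30-minute broadcast: {one_broadcast / 1e6:.1f} MB")
print(f"40-hour corpus:      {whole_corpus / 1e9:.2f} GB")
```

Under the mono assumption this gives roughly 57.6 MB per 30-minute broadcast and about 4.6 GB for 40 hours of raw audio, before any file headers or annotation files.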

The transcriptions were produced with Transcriber, using the additional patch for Arabic. They were therefore entered in Arabic characters, and the software automatically generated the transliterations. The following annotation levels are included:
• Orthographic transcription of speech (in news, not in music, commercials, etc.), including Named Entities
• Speakers and speaker turns
• Segment markers (portions of maximum 10 seconds)
• Topic/story boundaries
• Background noises (stationary and instantaneous noise events)
• Change of background
• Music/Noise
• Word boundaries
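
The speaker-turn and segment levels listed above could be read along the following lines. This is a sketch only: it assumes the transcriptions are delivered in Transcriber's XML layout (`Turn` elements with `speaker`, `startTime` and `endTime` attributes, containing `Sync` markers with a `time` attribute), which this record does not confirm. The sample document, its placeholder text and the helper name `iter_segments` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature transcription in a Transcriber-style layout.
SAMPLE = """<Trans>
  <Episode>
    <Section type="report" startTime="0.0" endTime="12.5">
      <Turn speaker="spk1" startTime="0.0" endTime="12.5">
        <Sync time="0.0"/>
        first segment text
        <Sync time="6.2"/>
        second segment text
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def iter_segments(root):
    """Yield one dict per segment: speaker, start, end and text."""
    for turn in root.iter("Turn"):
        turn_end = float(turn.get("endTime"))
        segments = []
        for child in turn:
            if child.tag == "Sync":
                # Each Sync marker opens a new segment.
                segments.append({"speaker": turn.get("speaker"),
                                 "start": float(child.get("time")),
                                 "pieces": []})
            # Text following any child element belongs to the open segment.
            if segments and child.tail:
                segments[-1]["pieces"].append(child.tail)
        for i, seg in enumerate(segments):
            # A segment ends where the next one starts, or at the turn end.
            last = i + 1 == len(segments)
            end = turn_end if last else segments[i + 1]["start"]
            text = " ".join(" ".join(seg["pieces"]).split())
            yield {"speaker": seg["speaker"], "start": seg["start"],
                   "end": end, "text": text}


segments = list(iter_segments(ET.fromstring(SAMPLE)))
for seg in segments:
    print(seg)

# Check the stated limit of 10 seconds per segment on the sample data.
print(all(seg["end"] - seg["start"] <= 10.0 for seg in segments))
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`; any other annotation levels (topics, background noise, named entities) would need additional handling that depends on how they are actually encoded in the distributed files.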

A lexicon of 62,000 words, with transliterations, frequencies and SAMPA transcriptions for Arabic, is also included.

The database is distributed on one ISO 9660 DVD-ROM volume. It has been validated by an external partner, and a validation report is provided.

Version 1.0
Distributor ELRA