AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

Full Official Name: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
Submission date: Jan. 17, 2023, 8:52 p.m.

Introduction: AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and consists of approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news (BN) audio with 1.2 million words of corresponding orthographic transcripts. The broadcast recordings and transcripts were produced to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages. The telephone speech audio recordings were collected to support the NIST 2011 Language Recognition Evaluation (LRE), which focused on pair discrimination for 24 languages/dialects. These recordings are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group (LDC2016S11). The goal of NIST's LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.

Data: The CTS audio data was generated from telephone calls by native Ukrainian speakers to acquaintances in their social network. It was collected using LDC's telephone infrastructure, which consists of three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. All CTS audio files were originally collected as 2-channel u-law and were converted to 8 kHz 16-bit PCM and FLAC-compressed for release. The BN data was taken from 87 news recordings broadcast by various Ukrainian sources. All BN audio files were originally collected as MP3 via web download or as live streaming broadcast captures and were downsampled to either 16 kHz or 22 kHz 16-bit PCM and FLAC-compressed for release.
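For readers unfamiliar with the u-law to 16-bit PCM step mentioned above for the CTS audio, the following is a minimal sketch of standard G.711 u-law expansion in Python. It is an illustration only, not LDC's conversion pipeline: the function names are hypothetical, and the input is assumed to be raw, headerless, sample-interleaved 2-channel u-law bytes. The released files are already 16-bit PCM in FLAC, so this step is not needed to use the corpus.

```python
def ulaw_byte_to_pcm16(code: int) -> int:
    """Decode one 8-bit G.711 u-law code to a signed 16-bit linear sample."""
    u = ~code & 0xFF               # u-law codes are stored bit-inverted
    sign = u & 0x80                # top bit marks a negative sample
    exponent = (u >> 4) & 0x07     # 3-bit segment number
    mantissa = u & 0x0F            # 4-bit step within the segment
    # Rebuild the magnitude: add the 0x84 bias, scale by segment, remove bias.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_stereo_ulaw(raw: bytes) -> tuple[list[int], list[int]]:
    """Decode interleaved 2-channel u-law bytes into two PCM sample lists.

    Assumption: samples alternate channel 0, channel 1, channel 0, ...
    """
    samples = [ulaw_byte_to_pcm16(b) for b in raw]
    return samples[0::2], samples[1::2]


if __name__ == "__main__":
    # Four bytes = two stereo frames of hypothetical input.
    left, right = split_stereo_ulaw(bytes([0xFF, 0x80, 0x00, 0x7F]))
    print(left, right)
```

Each u-law byte expands to one 16-bit sample, so the sampling rate is unchanged by this step; only the sample encoding changes.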
Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process. All transcripts are delivered as tab-delimited *.tsv files that include metadata and statistics.

Sponsorship: This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

Right Holder(s)