2022 NIST Language Recognition Evaluation Test and Development Sets

Full Official Name: 2022 NIST Language Recognition Evaluation Test and Development Sets
Submission date: Feb. 10, 2026, 10:04 p.m.

**Introduction**

2022 NIST Language Recognition Evaluation Test and Development Sets, Linguistic Data Consortium (LDC) catalog number LDC2025S03, was developed by LDC and the National Institute of Standards and Technology (NIST). This release contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source speech data comprises approximately 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa, and Zulu.

The goals of NIST's Language Recognition Evaluation are to advance language recognition technologies, to facilitate technology development, and to measure the performance of current state-of-the-art technology. LRE22 emphasized language recognition for African languages, including low-resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.

**Data**

The test and development segments in this release were drawn from three datasets developed by LDC: the Speech Archive of South African Languages (SASAL) (CTS, BNBS), the Maghrebi Linguistic Information Corpus (MAGLIC) (CTS), and the Low Resource African Languages (LRAL) collection (BNBS).

For the SASAL CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes, and speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show).
Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

MAGLIC consists of conversational telephone speech recordings in three varieties of Maghrebi Arabic (Tunisian, Libyan, and Algerian) and in North African French, collected in accordance with the SASAL CTS protocol. LRAL contains Oromo and Tigrinya narrowband speech from off-the-air broadcasts in Ethiopia and Eritrea, following the parameters used in the SASAL BNBS collection.

NIST extracted the test and development segments from SASAL and MAGLIC CTS callee call sides (and comparatively few claque sides) and from SASAL and LRAL BNBS data. All test and development segments are presented as single-channel, 8-bit A-law SPHERE files sampled at 8 kHz. Metadata for the development partition is provided as a tab-separated file listing the file name, language code, LDC audio identifier, source time offset, and duration for each audio segment.
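
As a rough illustration of working with the formats described above, the following Python sketch reads the text header of a SPHERE file and totals segment durations per language from a tab-separated metadata file. It is not part of the release and rests on several assumptions: the column names in `COLUMNS` are hypothetical (the release documentation defines the real header, if any), durations are assumed to be in seconds, and the header parser relies on the general NIST SPHERE layout (a `NIST_1A` line, a header-size line, then `name type value` fields ending with `end_head`). The specific field values in this release have not been verified here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names, in the order the description lists the fields.
COLUMNS = ["filename", "language", "ldc_audio_id", "source_offset", "duration"]


def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line; not a SPHERE file")
    # The second 8-byte line holds the total header size in bytes.
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[16:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        name, type_flag, value = line.split(None, 2)
        if type_flag == "-i":      # integer field
            value = int(value)
        elif type_flag == "-r":    # real-valued field
            value = float(value)
        fields[name] = value       # "-sN" string fields are kept as text
    return fields


def load_metadata(stream, has_header=True):
    """Read tab-separated segment metadata into a list of dicts."""
    rows = []
    for i, fields in enumerate(csv.reader(stream, delimiter="\t")):
        if (has_header and i == 0) or not fields:
            continue
        row = dict(zip(COLUMNS, fields))
        # Assumption: offsets and durations are decimal seconds.
        row["source_offset"] = float(row["source_offset"])
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


def seconds_per_language(rows):
    """Sum segment durations for each language code."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["language"]] += row["duration"]
    return dict(totals)
```

In practice, `parse_sphere_header` would be given the first bytes of a segment file (for example, `open(path, "rb").read(4096)`) to confirm properties such as `sample_rate` and `channel_count` before decoding, and `load_metadata` would be given the opened metadata file.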
