MATERIAL Kazakh-English Language Pack

Full Official Name: MATERIAL Kazakh-English Language Pack
Submission date: March 31, 2025, 7:33 p.m.

*Introduction*

MATERIAL Kazakh-English Language Pack, Linguistic Data Consortium Catalog Number LDC2025S03, was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 57 hours of Kazakh conversational telephone speech, transcripts, English translations, annotations and queries.

The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

*Data*

The Kazakh speech in this release represents that spoken in the Northern and Southern dialect regions of Kazakhstan. Speakers were 18 years of age or older. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately 17% of the speech data, all of which was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

This release also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms.

Speech data is presented mostly as two-channel wav or single-channel sphere files, both in 8 kHz A-law format. Some wav files are 48 kHz PCM. All text data is UTF-8 encoded.
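Because the wav files come in two encodings (8 kHz A-law and 48 kHz PCM), it can help to check each file's header before processing. The following is a minimal sketch, not part of the release: it reads the `fmt ` chunk of a RIFF/WAVE file directly, since Python's standard `wave` module is, to our knowledge, limited to PCM data. The function names `read_wav_format` and `make_header` are hypothetical, the demo headers are synthetic, and sphere files are not handled.

```python
import io
import struct

# Format tags defined by the RIFF/WAVE specification.
FORMAT_TAGS = {1: "PCM", 6: "A-law", 7: "mu-law"}


def read_wav_format(stream):
    """Return (format_name, channels, sample_rate) from a binary RIFF/WAVE stream."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            data = stream.read(chunk_size)
            if len(data) < 16:
                raise ValueError("truncated fmt chunk")
            # First 8 bytes: format tag, channel count, sample rate.
            tag, channels, rate = struct.unpack("<HHI", data[:8])
            return FORMAT_TAGS.get(tag, f"unknown ({tag})"), channels, rate
        # Skip other chunks; chunk bodies are padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def make_header(tag, channels, rate, bits):
    """Build a synthetic WAVE header (no audio data) for demonstration only."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
    return b"RIFF" + struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    # Synthetic stand-ins for the two encodings described above; the
    # channel count of the PCM example is an assumption.
    for header in (make_header(6, 2, 8000, 8), make_header(1, 1, 48000, 16)):
        print(read_wav_format(io.BytesIO(header)))
```

For a real file, one would pass a handle opened in binary mode, for example `read_wav_format(open(path, "rb"))`, where `path` is a placeholder for a wav file in the release.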

Creator(s)
Distributor(s)
Right Holder(s)