ISLRN

BOLT CTS CallFriend CallHome Mainland Mandarin Chinese Transcripts and Translations

Full Official Name: BOLT CTS CallFriend CallHome Mainland Mandarin Chinese Transcripts and Translations

Submission date: May 1, 2025, 10:51 p.m.

*Introduction*

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations, Linguistic Data Consortium (LDC) Catalog Number LDC2025T05, was developed by LDC and consists of transcripts and their corresponding English translations for 93 hours of conversational telephone speech between native speakers of the Mandarin Chinese dialect spoken in mainland China.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources (discussion forums, conversational telephone speech, text messaging and chat) in Chinese, Egyptian Arabic and English. The telephone data was transcribed, translated and annotated for various tasks including word alignment, treebanking, and co-reference.

*Data*

The source audio recordings consist of 236 telephone conversations taken from LDC's multilingual CALLFRIEND and CALLHOME series, which were developed to support speech recognition and language identification technology.

Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. Further information about the transcription methodology is contained in the transcription guidelines accompanying this release.

The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. Further information about the translation methodology is contained in the translation guidelines accompanying this release.

All speech data was transcribed, and 89% of the transcripts were translated into English. The transcripts are divided into training, development and evaluation partitions as follows:

partition  doc count  su count  src ntoken  src nword  eng nword
train             30     7,490      72,242     48,161     60,303
dev              101    31,429     332,201    221,467    215,714
eval             170   113,149   1,189,415    792,943    880,202
total            301   152,068   1,593,858  1,062,572  1,155,679

Transcripts and translations are presented in XML format, UTF-8 encoded.

*Directory Structure*

Please see file.tbl for a complete file list as well as checksums for this publication. The data in this package is organized by the partitions used in the BOLT program: development data (dev), training data (train) and evaluation data (eval).

data/ - Contains the transcription and translation files.
    dev/ - Contains the development data.
    train/ - Contains the training data.
    eval/ - Contains the evaluation data.
docs/ - Contains additional documentation, guidelines and a file table.

*Acknowledgement*

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
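As a rough illustration of the directory layout described above, the following Python sketch counts the XML files found under each partition directory. It is a minimal sketch under assumptions that the record does not confirm: that the partition directories sit directly under data/, and that transcript and translation files carry a lowercase .xml extension. The function name count_xml_files and the release root path passed to it are hypothetical.

```python
from pathlib import Path

# Partition names as listed in the directory structure of this record.
PARTITIONS = ("train", "dev", "eval")


def count_xml_files(release_root):
    """Count *.xml files under <release_root>/data/<partition>/.

    Assumption: partitions live directly under data/ and the files of
    interest end in ".xml"; neither detail is confirmed by the record.
    """
    counts = {}
    for partition in PARTITIONS:
        partition_dir = Path(release_root) / "data" / partition
        if partition_dir.is_dir():
            # rglob also descends into any subdirectories of the partition.
            counts[partition] = sum(1 for _ in partition_dir.rglob("*.xml"))
        else:
            # A missing partition directory is reported as zero files.
            counts[partition] = 0
    return counts
```

Calling count_xml_files on an unpacked copy of the release returns a dictionary keyed by partition name. The authoritative file list and checksums remain those in file.tbl.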

Creator(s)

Jennifer Tracey

Song Chen

Dana Delgado

Stephanie Strassel

Distributor(s)

LDC

Right Holder(s)

Status : Accepted

ISLRN :

075-534-579-254-4

Version

1.0

Source

https://catalog.ldc.upenn.edu/LDC2025T05

Resource Type

Transcribed Speech Data

Media Type

Text

Language(s)

Mandarin

Access Medium

Web Download