Resource: BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Reference BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
Date of Submission July 19, 2021, 5:13 p.m.
Status accepted
ISLRN 176-795-802-758-5
Resource Type Primary Text
Media Type Text
Source
Language Egyptian Arabic
Format/MIME Type text/plain
Size 13711
Access Medium Web Download
Description

*Introduction*

BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on Egyptian Arabic discussion forum (DF), SMS/Chat and conversational telephone speech (CTS) data.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval technology for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

*Data*

DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Egyptian Arabic CALLHOME and CALLFRIEND telephone collections.

Co-reference annotation aims to identify all of the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on top of the BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs.

Annotation files are presented in UTF-8 encoded XML format.
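
As a rough illustration of working with UTF-8 encoded XML annotation of this kind, the Python sketch below groups mentions into co-reference chains. The element name `mention`, the attribute `chain`, the function `group_mentions` and the sample content are all hypothetical and invented for this example; the actual schema is defined by the documentation shipped with the corpus and may differ substantially.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample for illustration only. The tag "mention" and the
# attribute "chain" are assumptions, not the corpus's real schema.
SAMPLE = (
    '<document id="example">'
    '<mention chain="1">مصر</mention>'
    '<mention chain="2">أحمد</mention>'
    '<mention chain="1">هي</mention>'
    '</document>'
).encode("utf-8")  # simulate the bytes of a UTF-8 encoded file


def group_mentions(xml_bytes):
    """Group mention strings by their (hypothetical) chain identifier."""
    root = ET.fromstring(xml_bytes)
    chains = defaultdict(list)
    for mention in root.iter("mention"):
        # Mentions sharing a chain id are treated as co-referent.
        chains[mention.get("chain")].append(mention.text or "")
    return dict(chains)


chains = group_mentions(SAMPLE)
for chain_id, mentions in chains.items():
    print(chain_id, mentions)
```

Parsing from bytes lets the XML parser handle decoding, so Arabic text is read correctly without a manual decode step. For a real file, `ET.parse(path)` would play the same role, with the tag and attribute names replaced by those in the corpus documentation.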

*Sponsorship*

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Version 1.0
Creator Sameer Pradhan, Nitin Agarwal, Michelle Kappler, Linnea Micciulla, Lance Ramshaw, Michelle Francini
Distributor Linguistic Data Consortium
Rights Holder Portions © 1996-1997, 2002, 2012-2015, 2019, 2021 Trustees of the University of Pennsylvania