Resource: Training and test data for Arabizi detection and transliteration

Reference Training and test data for Arabizi detection and transliteration
Date of Submission June 6, 2018, 4:57 p.m.
Status accepted
ISLRN 986-364-744-303-9
Resource Type Primary Text
Media Type Text
Source
Language Arabic, English
Size Set 1) 5207 tokens, Set 2) 3452 tokens
Access Medium downloadable
Description

The dataset is composed of two distinct resources:
1) A collection of mixed English and Arabizi text intended to train and test a system for automatically detecting code-switching in such text. The training part of the corpus contains 522 tweets comprising 5,207 tokens (3,307 English tokens, 1,203 Arabizi tokens, and 697 other tokens). Tokens are manually labelled as English (“e”), Arabizi (“a”), or other (“o”). The testing part contains 475 tweets comprising 3,533 tokens (803 English tokens, 1,965 Arabizi tokens, and 765 other tokens).
2) A set of 3,452 Arabizi tokens manually transliterated into Arabic, and a set of 127 Arabizi tweets containing 1,385 words, also manually transliterated into Arabic. This dataset was intended to train and test a system that performs Arabizi-to-Arabic transliteration.
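
The per-label token counts given for the first resource can be checked with a short tally. The following is a minimal Python sketch only: this record does not state the file format of the distributed data, so the one-token-per-line, tab-separated layout, the `count_labels` helper, and the sample tokens are all assumptions made for illustration.

```python
from collections import Counter

# ASSUMPTION: one token per line as "token<TAB>label". The real file layout
# of the resource is not described in this record and may differ.
# The sample tokens below are invented for illustration, not taken from the data.
SAMPLE = "hello\te\nkifak\ta\n:)\to\n3amel\ta\n"


def count_labels(text):
    """Tally tokens per label ("e" English, "a" Arabizi, "o" other)."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. possible tweet separators
        _token, label = line.rsplit("\t", 1)
        counts[label] += 1
    return counts


counts = count_labels(SAMPLE)
print(dict(counts))
```

Under the same format assumption, running such a tally over the training part should give 3,307 “e”, 1,203 “a”, and 697 “o” tokens, summing to the 5,207 tokens stated above.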

Version 1.0
Distributor ELRA