Annotated tweet corpus in Arabizi, French and English

Full Official Name: Annotated tweet corpus in Arabizi, French and English
Submission date: April 5, 2022, 11:20 a.m.

The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team) within the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020). The project aimed to study the mechanisms by which information and opinions propagate within social networks: identifying influential leaders and detecting the channels through which information and opinions are disseminated. The corpus, completed in 2020, was built to collect and annotate tweets in 3 languages (Arabizi, French and English) on 3 predefined themes (Hooliganism, Racism, Terrorism).
For the collection, a tool was developed in Python (based on the “GetOldTweets3” library) that took as input information such as the language (EN/FR) and a keyword list. With this tool, a maximum of 10,000 tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up: a vocabulary list in Arabizi was created by selecting the 1,000 most frequent words from a corpus of Arabizi SMS (for Moroccan and Tunisian) and the Training and test data for Arabizi detection and transliteration (available from ELRA under reference ELRA-W0126, ISLRN ID: 986-364-744-303-9), and the tweets containing each word from this vocabulary and keyword list (places = Morocco, Tunisia, Algeria) were then downloaded. Only tweets containing at least 5 words in Arabizi were kept.
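The Arabizi selection step described above can be illustrated with a minimal Python sketch. It is not the tool used for the corpus: the function names, the tokenisation rule and the choice to count word occurrences (rather than distinct words) towards the 5-word threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters and digits, so Arabizi forms
# that mix digits and letters (e.g. "3lik") stay in one piece.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts, size=1000):
    """Return the `size` most frequent tokens found in the reference texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word for word, _ in counts.most_common(size)}


def has_enough_arabizi(tweet_text, vocabulary, min_words=5):
    """Keep a tweet only if at least `min_words` of its tokens are in the vocabulary."""
    hits = sum(1 for tok in tokenize(tweet_text) if tok in vocabulary)
    return hits >= min_words
```

In this sketch, `build_vocabulary` would be fed the reference Arabizi texts, and `has_enough_arabizi` would then be applied to each downloaded tweet.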
For the annotation, a tool running on Django was developed to provide the following annotations for each tweet in a given sequence:
• Theme: 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)
In total, 17,103 sequences were annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these, 4,578 sequences containing at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, selected according to the topic/opinion annotations and the presence of insults, is also provided.
The data are provided in CSV format. Remark: this corpus includes only tweet IDs and the corresponding annotations. The original tweets may be obtained by using the Twitter API.
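The sequence selection described above (at least 20 tweets annotated with one of the 3 predefined themes, with or without an opinion change) can be sketched as follows. This is only an illustration under stated assumptions: the field names `sequence_id`, `theme` and `opinion` are hypothetical and may not match the columns of the distributed CSV files, and an "opinion change" is taken here to mean that two consecutive retained tweets of a sequence carry different opinion labels, which the description does not define.

```python
from collections import defaultdict

TARGET_THEMES = {"Hooliganism", "Racism", "Terrorism"}


def select_sequences(rows, min_tweets=20):
    """Map each retained sequence ID to True if its opinion changes, else False.

    `rows` is an iterable of dicts with the hypothetical keys
    "sequence_id", "theme" and "opinion", in tweet order.
    """
    opinions_by_sequence = defaultdict(list)
    for row in rows:
        # Tweets annotated "Others" or "Incomprehensible" are ignored.
        if row["theme"] in TARGET_THEMES:
            opinions_by_sequence[row["sequence_id"]].append(row["opinion"])

    selected = {}
    for seq_id, opinions in opinions_by_sequence.items():
        if len(opinions) >= min_tweets:
            # Assumed definition: any two consecutive tweets differ in opinion.
            selected[seq_id] = any(a != b for a, b in zip(opinions, opinions[1:]))
    return selected
```

The rows could be read from the CSV files with `csv.DictReader` once the actual column names are known.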

Right Holder(s)