Resource: OSIAN Corpus

Reference The Open Source International Arabic News (OSIAN) corpus
Date of Submission Jan. 8, 2018, 4:48 p.m.
Status accepted
ISLRN 255-977-746-042-1
Resource Type Primary Text
Media Type Text
Source
Language Arabic
Format/MIME Type text/xml
Size 156,534,923 words
Description

The Open Source International Arabic News (OSIAN) corpus was collected from international Arabic news websites, including CNN, DW, RT, and Aljazeera. Using a server-friendly crawling policy, we extracted 1 million web pages. After the necessary cleaning and filtering steps, the OSIAN corpus contains 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML. Each article is annotated with metadata giving its web location and the date of its extraction, and each word is annotated with its lemma and part of speech.
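
As a rough illustration of how such an annotated XML corpus might be read, the sketch below parses a tiny made-up sample with Python's standard library. The element and attribute names (`corpus`, `article`, `s`, `w`, `url`, `extracted`, `lemma`, `pos`) and the sample values are assumptions chosen for illustration; this record does not specify the actual OSIAN schema, so the names would need to be adapted to the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every tag and attribute name here is an assumption,
# not the documented OSIAN schema. Adjust to match the distributed files.
SAMPLE = """<corpus>
  <article url="https://example.org/news/1" extracted="2018-01-01">
    <s>
      <w lemma="كتب" pos="VERB">كتب</w>
      <w lemma="ولد" pos="NOUN">الولد</w>
    </s>
  </article>
</corpus>"""


def read_articles(xml_text):
    """Return one dict per article with its metadata and annotated tokens."""
    root = ET.fromstring(xml_text)
    articles = []
    for art in root.iter("article"):
        # Each token becomes a (surface form, lemma, part of speech) triple.
        tokens = [(w.text, w.get("lemma"), w.get("pos")) for w in art.iter("w")]
        articles.append(
            {
                "url": art.get("url"),              # web location of the article
                "extracted": art.get("extracted"),  # date of extraction
                "tokens": tokens,
            }
        )
    return articles


if __name__ == "__main__":
    for article in read_articles(SAMPLE):
        print(article["url"], article["extracted"], len(article["tokens"]))
```

For a corpus of this size, loading whole files into memory may be impractical; `xml.etree.ElementTree.iterparse` is one standard-library option for streaming through articles one at a time.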

Version new version/release of "OSIAN Corpus"
Creator Imad Zeroual - Faculty of Science, Abdelhak Lakhouaja - Mohammed First University, Dirk Goldhahn - Natural Language Processing Group, University of Leipzig
Distributor Imad Zeroual - Faculty of Science, Abdelhak Lakhouaja - Mohammed First University, Dirk Goldhahn - Natural Language Processing Group, University of Leipzig
Rights Holder Imad Zeroual - Faculty of Science, Abdelhak Lakhouaja - Mohammed First University, Dirk Goldhahn - Natural Language Processing Group, University of Leipzig