Parallel Corpora for 6 Indian Languages

Submission date: Feb. 16, 2022, 1:19 p.m.

The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words, 20,000 parallel sentences), Hindi (1,200,000 words, 37,000 parallel sentences), Malayalam (660,000 words, 29,000 parallel sentences), Tamil (747,000 words, 35,000 parallel sentences), Telugu (951,000 words, 43,000 parallel sentences), and Urdu (1,200,000 words, 33,000 parallel sentences), each translated into English.

Each data set was created by taking around 100 Indian-language Wikipedia pages and obtaining four independent English translations of every sentence in those documents. The translations were produced by non-professional translators recruited through crowdsourcing on Amazon Mechanical Turk. All data sets are provided in plain text format.

For each of the 6 Indian languages, the directory contains:

- A metadata file organized into rows of four columns each. The rows correspond to the original documents that were translated, and the columns give (1) the (internal) segment ID assigned to the document, (2) the document's original title, (3) a translation of the title, and (4) the category manually assigned to the document.
- The data splits, which were constructed by manually assigning each document to one of eight categories (Technology, Sex, Language and Culture, Religion, Places, People, Events, and Things), then selecting about 10% of the documents in each category for each of the dev, devtest, and test sets (roughly 30% of the data in total) and using the remainder as training data.
- Dictionaries created in a separate Mechanical Turk job.
- Votes files containing the results of a separate Mechanical Turk task in which new Turkers were asked to vote on which of the four translations of a given sentence was the best. Votes are available for all languages except Malayalam.
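
The per-category split described above can be illustrated with a short sketch. This is not the procedure used to build the released splits, which ship with the corpus; the description does not say how documents were selected within a category, so random selection is assumed here. The function name `split_by_category`, the `held_out_fraction` and `seed` parameters, and the input mapping of document ID to category are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_category(doc_categories, held_out_fraction=0.10, seed=0):
    """Split documents into train/dev/devtest/test, category by category.

    doc_categories: dict mapping a document ID to its category name
    (hypothetical input; the real corpus stores categories in its
    metadata file). Random selection within a category is an assumption.
    """
    rng = random.Random(seed)

    # Group document IDs by category, in a deterministic order.
    by_category = defaultdict(list)
    for doc_id, category in sorted(doc_categories.items()):
        by_category[category].append(doc_id)

    splits = {"train": [], "dev": [], "devtest": [], "test": []}
    for category in sorted(by_category):
        docs = by_category[category]
        rng.shuffle(docs)
        # About 10% of each category goes to each held-out set.
        n = max(1, round(len(docs) * held_out_fraction))
        splits["dev"].extend(docs[:n])
        splits["devtest"].extend(docs[n:2 * n])
        splits["test"].extend(docs[2 * n:3 * n])
        # Everything left over becomes training data.
        splits["train"].extend(docs[3 * n:])
    return splits


if __name__ == "__main__":
    categories = ["Technology", "Sex", "Language and Culture", "Religion",
                  "Places", "People", "Events", "Things"]
    # Toy input: 160 made-up document IDs spread evenly over the categories.
    toy_docs = {f"doc{i:03d}": categories[i % 8] for i in range(160)}
    for name, ids in split_by_category(toy_docs).items():
        print(name, len(ids))
```

Splitting within each category, rather than over the pooled documents, keeps every category represented in each of the four sets. Very small categories may leave nothing for training under this sketch, so a real pipeline would need a policy for them.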

