Linguatools Webcrawl Parallel Corpus German-English 2015

Full Official Name: Linguatools Webcrawl Parallel Corpus German-English 2015
Submission date: May 26, 2015, 3:28 p.m.

The corpus consists of 10 million German-English parallel sentences crawled from the web between October 2013 and April 2015. The sentences were gathered from more than 112,000 different hosts. To obtain data that is as clean as possible, we applied multi-step quality filtering, including a language identification filter, a machine translation filter, and a grammaticality filter, among others. The corpus contains no duplicate sentence pairs and does not overlap with existing publicly available corpora such as Europarl and DGT-TM. Web pages were automatically categorized by subject area. The corpus is available in TMX and Moses formats, both encoded in UTF-8.
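
For readers who want to load the Moses-format release, the following is a minimal sketch rather than an official loader. It assumes the usual Moses convention of two line-aligned plain-text files, one per language, where line N of the German file is the translation of line N of the English file. The function name and the file names in the usage note are hypothetical; the actual file names in the distribution are not stated in this record.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_moses_pairs(de_path: str, en_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (German, English) sentence pairs from two line-aligned UTF-8 files.

    Assumption: the files follow the Moses convention of one sentence per
    line, with matching line numbers forming a translation pair.
    """
    with open(de_path, encoding="utf-8") as de_file, \
         open(en_path, encoding="utf-8") as en_file:
        # zip_longest pads the shorter file with None, which lets us detect
        # a misaligned pair of files instead of silently truncating.
        for line_no, (de, en) in enumerate(zip_longest(de_file, en_file), start=1):
            if de is None or en is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield de.rstrip("\n"), en.rstrip("\n")
```

A call such as `read_moses_pairs("corpus.de", "corpus.en")` (placeholder file names) returns a generator, so the 10 million pairs can be streamed without loading both files into memory.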

Right Holder(s)