Resource: Linguatools Webcrawl Parallel Corpus German-English 2015

Reference Linguatools Webcrawl Parallel Corpus German-English 2015
Date of Submission May 26, 2015, 3:28 p.m.
Status accepted
ISLRN 800-190-274-236-9
Resource Type Primary Text
Media Type Text
Source
Language English, German
Format/MIME Type text/xml
Size 7.9 gigabytes
Access Medium download or DVD
Description

The corpus consists of 10 million German-English parallel sentences crawled from the web between October 2013 and April 2015. The sentences were gathered from more than 112,000 different hosts. We applied multi-step quality filtering, including a language identification filter, a machine translation filter, and a grammaticality filter, among others, to make the data as clean as possible. The corpus contains no duplicate sentence pairs and does not overlap with existing publicly available corpora such as Europarl and DGT-TM. Web pages were automatically categorized by subject area. The corpus is available in TMX and Moses formats (UTF-8 encoded).
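As an illustration of the two delivery formats, the following is a minimal sketch of how sentence pairs might be read in Python. It is not taken from the corpus documentation. It assumes standard TMX structure (`tu` elements holding `tuv` elements with a language attribute and a `seg` child) and the usual Moses convention of two line-aligned plain-text files. The function names, the language codes `de` and `en`, and the sample data are hypothetical; the actual files may use different codes or carry extra attributes such as the subject-area category.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_pairs(source, src_lang="de", tgt_lang="en"):
    """Yield (source, target) sentence pairs from a TMX file path or binary file object.

    The language codes are assumptions; adjust them to match the actual files.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary subtag, so "en-US" is treated as "en".
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        # Drop the processed translation unit to limit memory use on a large file.
        elem.clear()


def iter_moses_pairs(src_lines, tgt_lines):
    """Yield pairs from two line-aligned iterables (Moses format: line i matches line i)."""
    for src, tgt in zip(src_lines, tgt_lines):
        yield src.rstrip("\n"), tgt.rstrip("\n")


# Tiny in-memory sample standing in for the real corpus files.
SAMPLE_TMX = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<tmx version="1.4"><header/><body>'
    b'<tu><tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>'
    b'<tuv xml:lang="en"><seg>Good morning.</seg></tuv></tu>'
    b"</body></tmx>"
)

if __name__ == "__main__":
    for de, en in iter_tmx_pairs(io.BytesIO(SAMPLE_TMX)):
        print(de, "|||", en)
```

Streaming with `iterparse` rather than loading the whole tree is a deliberate choice here, since the resource is listed at 7.9 gigabytes. For the Moses files, opening both with `encoding="utf-8"` and passing the file objects to `iter_moses_pairs` should be sufficient under the assumptions above.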

Version 1.0
Creator Peter Kolb - Peter Kolb & Prochazkova GbR, Petra Prochazkova - Peter Kolb & Prochazkova GbR
Distributor ELRA, Peter Kolb - Peter Kolb & Prochazkova GbR, Petra Prochazkova - Peter Kolb & Prochazkova GbR
Rights Holder Peter Kolb - Peter Kolb & Prochazkova GbR, Petra Prochazkova - Peter Kolb & Prochazkova GbR