ISLRN

Linguatools Webcrawl Parallel Corpus German-English 2015

Full Official Name: Linguatools Webcrawl Parallel Corpus German-English 2015

Submission date: May 26, 2015, 3:28 p.m.

The corpus consists of 10 million German-English parallel sentences crawled from the web between October 2013 and April 2015. The sentences were gathered from more than 112,000 different hosts. We applied multi-step quality filtering, including a language identification filter, a machine translation filter, and a grammaticality filter, among others, to obtain data that is as clean as possible. There are no duplicate sentence pairs, and there is no overlap with existing publicly available corpora such as Europarl and DGT-TM. Web pages have been automatically categorized by subject area. The corpus is available in TMX and Moses formats (UTF-8 encoding).
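
For readers unfamiliar with the Moses format mentioned above: it is conventionally a pair of plain-text files, one per language, with one sentence per line, where line N in one file is the translation of line N in the other. The following is a minimal Python sketch of how such a pair of files might be read. The function names and the file names in the usage note are hypothetical and are not taken from the corpus distribution.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_lines(de_lines: Iterable[str], en_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (German, English) pairs from two line-aligned sequences.

    Raises ValueError if one side runs out of lines before the other,
    since that means the two sides are not aligned.
    """
    for de_line, en_line in zip_longest(de_lines, en_lines):
        if de_line is None or en_line is None:
            raise ValueError("the two sides have different numbers of lines")
        # Strip only line terminators so sentence-internal spacing is kept.
        yield de_line.rstrip("\r\n"), en_line.rstrip("\r\n")


def read_moses_pairs(de_path: str, en_path: str) -> Iterator[Tuple[str, str]]:
    """Read two UTF-8 files (hypothetical paths) as aligned sentence pairs."""
    with open(de_path, encoding="utf-8") as de_file, \
            open(en_path, encoding="utf-8") as en_file:
        yield from pair_lines(de_file, en_file)
```

With hypothetical file names, usage could look like `for de, en in read_moses_pairs("corpus.de", "corpus.en"): ...`. Reading lazily, line by line, avoids holding all 10 million pairs in memory at once.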

Creator(s)

Peter Kolb & Prochazkova GbR - Petra Prochazkova

Peter Kolb & Prochazkova GbR - Peter Kolb

Distributor(s)

Peter Kolb & Prochazkova GbR - Petra Prochazkova

Peter Kolb & Prochazkova GbR - Peter Kolb

ELRA

Right Holder(s)

Peter Kolb & Prochazkova GbR - Petra Prochazkova

Peter Kolb & Prochazkova GbR - Peter Kolb

Status: Accepted

ISLRN: 800-190-274-236-9

Version

1.0

Source

http://linguatools.org/tools/corpora/webcrawl-parallel-corpus-german-english-2015/

Resource Type

Primary Text

Media Type

Text

Language(s)

English

German

Access Medium

Download or DVD