Bilingual Formal / Informal Address Corpus
This page provides the parallel German-English text corpus used in Faruqui and Padó (2012). It consists of 106 public-domain novels and stories, mostly 19th-century texts. The texts are segmented into paragraphs, sentences, and words, aligned at the sentence level, POS-tagged, and lemmatized.
Corpus sources and licensing
The texts are taken from Project Gutenberg for English and Projekt Gutenberg-DE for German. The English texts can be used freely, including redistribution. The German texts are provided for free by Projekt Gutenberg-DE for personal use (which we assume to include academic fair use).
- List of novels, authors, and original languages
- Training set (74 novels, 57M)
- Development set (19 novels, 17M)
- Test set (13 novels, 13M)
Tools used to construct the corpus
- TreeTagger: POS tagging and lemmatization for English and German
- Gargantua: Unsupervised sentence alignment
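As a rough illustration of working with POS-tagged and lemmatized text, the sketch below parses TreeTagger's usual one-token-per-line output (word, POS tag, and lemma separated by tabs). Whether the corpus files use exactly this layout is an assumption, not something this page states; the names `Token` and `parse_treetagger_lines` and the sample lines are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One tagged token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_treetagger_lines(lines: Iterable[str]) -> List[Token]:
    """Parse tab-separated word/POS/lemma lines into Token tuples.

    Assumes the three-column TreeTagger layout; lines that do not have
    exactly three tab-separated fields (blank lines, markup) are skipped.
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Made-up sample lines (not taken from the corpus), using STTS-style tags.
sample = [
    "Sie\tPPER\tSie",
    "kommen\tVVFIN\tkommen",
    "heute\tADV\theute",
    ".\t$.\t.",
]
tokens = parse_treetagger_lines(sample)
lemmas = [t.lemma for t in tokens]
```

If the distributed files turn out to use a different column order or separator, only the `split` call and the field check need to change.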
Feedback is always welcome at sebastian@nlpado.de.