Bilingual Formal / Informal Address Corpus

This page provides the parallel German-English text corpus used in Faruqui and Padó (2012). It consists of 106 public-domain novels and stories, mostly 19th-century texts. The texts are segmented into paragraphs, sentences and words, aligned at the sentence level, POS-tagged, and lemmatized.

Corpus sources and licensing

The English texts are taken from Project Gutenberg and the German texts from Projekt Gutenberg-DE. The English texts may be used and redistributed freely. The German texts are provided free of charge by Projekt Gutenberg-DE for personal use (which we assume to include academic fair use).


Tools used to construct the corpus

  • TreeTagger: POS tagging and lemmatization for English and German
  • Gargantua: Unsupervised sentence alignment
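
For readers who want to work with the annotations, the following is a minimal sketch of reading TreeTagger-style output in Python. It assumes the common TreeTagger layout of one token per line with three tab-separated fields (word, POS tag, lemma); this page does not state the actual file layout of the corpus, so check the downloaded files first. The names Token and parse_treetagger are hypothetical, and the sample lines are made up for illustration, not taken from the corpus.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One annotated token (hypothetical helper type)."""
    word: str
    pos: str
    lemma: str


def parse_treetagger(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines of the assumed form word<TAB>pos<TAB>lemma.

    Lines that do not have exactly three tab-separated fields
    (blank lines, markup lines such as <s>) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            yield Token(*parts)


# Illustrative input only; not an excerpt from the corpus.
sample = [
    "<s>\n",
    "Das\tART\tdie\n",
    "Haus\tNN\tHaus\n",
    "</s>\n",
]

tokens = list(parse_treetagger(sample))
print([t.lemma for t in tokens])
```

The same function can be passed an open file handle instead of a list, since it only iterates over lines.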


