GERMAN-ENGLISH Parallel Corpus with T/V address annotation
==========================================================

Manaal Faruqui, manaalfar@gmail.com


PARALLEL CORPUS CONSTRUCTION
============================

The parallel corpus provided here is a collection of literary novels,
primarily from the 19th century, sourced from Project Gutenberg (English)
and Projekt Gutenberg-DE (German). These novels were cleaned, tokenized,
POS-tagged, lemmatized and sentence-aligned using the tools mentioned in
Faruqui and Pado 2012. The collection contains 106 novels in total.

Every sentence on the German side of the parallel corpus was then tagged
with T/V address based on the rules described in Faruqui and Pado 2012.
These T/V tags were then projected onto the aligned English sentences.

The collection was then partitioned into training, development and test
sets containing 74, 19 and 13 novels respectively.

Every novel in the collection is represented by three files:

(1) novelName_de.txt: The novel in German
(2) novelName_en.txt: The novel in English
(3) novelName_align.txt: The sentence-wise alignment from German to English


FILE-FORMAT
===========

(1) novelName_de.txt
--------------------

Every sentence in the novel is represented by four lines in the corpus:

Line (1): Meta-information about the sentence: paragraph no., sentence no.,
          Formal (V) and Informal (T) connotation.
Line (2): Tokenized sentence.
Line (3): Corresponding lemma of every word in the sentence.
Line (4): Corresponding POS-tag of every word in the sentence.

For example, the following is a sentence in the novel with Formal (V)
connotation:

    paraNum:1129 sentNum:2348 F:1 I:0
    Ich freue mich , Sie zu sehen .
    ich freuen ich , Sie|sie|sie zu sehen .
    PPER VVFIN PRF $, PPER PTKZU VVINF $.

    F:1 I:0 --> Formal (V)
    F:0 I:1 --> Informal (T)
    F:0 I:0 --> Null
    F:1 I:1 --> Both formal and informal (very rare) --> ignored in analysis

(2) novelName_en.txt
--------------------

The sentences in the English novels are further divided into "direct speech
segments" using the BIO-notation. For details refer to Section 5.3, Faruqui
and Pado 2012. Those who do not need the segment information should simply
treat all segments that share a sentence no. as parts of one sentence.

Every segment of a sentence is represented by four lines in the corpus:

Line (1): Meta-information about the segment: paragraph no., sentence no.,
          segment type, segment no., Formal (V) and Informal (T) connotation.
Line (2): Tokenized segment.
Line (3): Corresponding lemma of every word in the segment.
Line (4): Corresponding POS-tag of every word in the segment.

For example, the following is a segment of a sentence in the novel with
Informal (T) connotation:

    paraNum:66 sentNum:145 segType:I-SP segNum:162 F:0 I:1
    Much of that wouldn 't do for you , Jerry !
    much of that do for you , Jerry !
    RB IN DT NN NN VVP IN PP , NP SENT

(3) novelName_align.txt
-----------------------

Every line in this file has the following format:

    novelName deSentNo(s) enSentNo(s)

For example, the following line states that sentences no. 7246 & 7247 in
the German novel are aligned to sentence no. 7568 in the English novel:

    2staedte.txt 7246 7247 7568


CONTACT
=======

For more information, bug reports, and fixes, contact:

    Manaal Faruqui
    Dept. of Computer Science & Engineering
    Indian Institute of Technology (IIT) Kharagpur
    Kharagpur 721302, West Bengal, INDIA
    manaalfar@gmail.com
    http://cse.iitkgp.ac.in/~manaalf
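
The file formats described under FILE-FORMAT can be read with a few lines of
Python. The following is a minimal sketch, not part of the corpus tools. All
function names (parse_meta, tv_label, read_records, parse_alignment) are
hypothetical. It assumes that records are made of exactly four consecutive
non-empty lines, that meta-information fields are whitespace-separated
key:value pairs, and that the three fields of an alignment line are separated
by a tab while sentence numbers within a field are separated by spaces. Check
these assumptions against the actual files before relying on it.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_meta(line: str) -> Dict[str, str]:
    """Split 'key:value' fields, e.g. 'paraNum:1129 sentNum:2348 F:1 I:0'."""
    return dict(field.split(":", 1) for field in line.split())


def tv_label(meta: Dict[str, str]) -> str:
    """Map the F/I flags of a record to 'V', 'T', 'null' or 'both'."""
    formal = meta.get("F") == "1"
    informal = meta.get("I") == "1"
    if formal and informal:
        return "both"  # very rare; ignored in the original analysis
    if formal:
        return "V"
    if informal:
        return "T"
    return "null"


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per four non-empty lines (assumed layout).

    Works for both _de.txt and _en.txt files, since the extra English
    fields (segType, segNum) are just additional key:value pairs.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    # A trailing incomplete record (fewer than four lines) is dropped.
    for i in range(0, len(rows) - 3, 4):
        meta = parse_meta(rows[i])
        yield {
            "meta": meta,
            "tv": tv_label(meta),
            "tokens": rows[i + 1].split(),
            "lemmas": rows[i + 2].split(),
            "pos": rows[i + 3].split(),
        }


def parse_alignment(line: str, sep: str = "\t") -> Tuple[str, List[int], List[int]]:
    """Parse one alignment line; the tab separator is an assumption."""
    name, de_ids, en_ids = line.rstrip("\n").split(sep)
    return name, [int(x) for x in de_ids.split()], [int(x) for x in en_ids.split()]


# The German example sentence from this README.
SAMPLE_DE = """paraNum:1129 sentNum:2348 F:1 I:0
Ich freue mich , Sie zu sehen .
ich freuen ich , Sie|sie|sie zu sehen .
PPER VVFIN PRF $, PPER PTKZU VVINF $.
"""

if __name__ == "__main__":
    for record in read_records(SAMPLE_DE.splitlines()):
        print(record["meta"]["sentNum"], record["tv"], record["tokens"])
    print(parse_alignment("2staedte.txt\t7246 7247\t7568"))
```

To ignore the segment information on the English side, group the records
returned by read_records on record["meta"]["sentNum"] and concatenate their
token lists.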