GERMAN-ENGLISH Parallel Corpus with T/V address annotation
==========================================================
Manaal Faruqui, manaalfar@gmail.com

PARALLEL CORPUS CONSTRUCTION
============================

The parallel corpus provided here is a collection of literary novels,
primarily from the 19th century, sourced from Project Gutenberg
(English) and Projekt Gutenberg-DE (German). These novels were
cleaned, tokenized, POS-tagged, lemmatized and sentence-aligned using
the tools mentioned in Faruqui and Pado 2012. The collection contains
106 novels in total.

Every sentence on the German side of the parallel corpus was tagged
with T/V address based on the rules described in Faruqui and Pado
2012. These T/V tags were then projected onto the aligned English
sentences. Finally, the collection was partitioned into training,
development and test sets containing 74, 19 and 13 novels
respectively.

Every novel in the collection is represented by three files:
(1) novelName_de.txt: The novel in German
(2) novelName_en.txt: The novel in English
(3) novelName_align.txt: The sentence-wise alignment from German to English

FILE-FORMAT
===========

(1) novelName_de.txt
--------------------

Every sentence in the novel is represented by four lines in the corpus:
Line (1): Meta-information about the sentence: Paragraph no., Sentence no., Formal (V) and Informal (T) flags (F and I).
Line (2): Tokenized sentence.
Line (3): Corresponding lemma of every word in the sentence.
Line (4): Corresponding POS-tag of every word in the sentence.

For example, the following is a sentence in the novel with Formal (V) address:

<S>paraNum:1129 sentNum:2348 F:1 I:0</S>
Ich freue mich , Sie zu sehen . 
ich freuen ich , Sie|sie|sie zu sehen .
PPER VVFIN PRF $, PPER PTKZU VVINF $.

F:1 I:0 --> Formal(V)
F:0 I:1 --> Informal(T)
F:0 I:0 --> Null
F:1 I:1 --> Both formal and informal (very rare) --> ignored in analysis
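
The four-line record and the F/I table above can be read with a few
lines of Python. This is only a minimal sketch: the function names
(parse_meta, address_label, parse_record) are hypothetical, and the
code assumes the caller has already grouped the file into blocks of
four lines, since how records are separated in the files is not
stated here.

```python
import re

# Matches the content between <S> and </S> on the meta line.
META_RE = re.compile(r"<S>(.*?)</S>")


def parse_meta(line):
    """Turn '<S>paraNum:1129 sentNum:2348 F:1 I:0</S>' into a dict of strings."""
    match = META_RE.search(line)
    if match is None:
        raise ValueError("not a meta line: %r" % line)
    # Each field looks like 'key:value'; split on the first colon only.
    return dict(field.split(":", 1) for field in match.group(1).split())


def address_label(meta):
    """Map the F/I flags to a label, following the table above."""
    table = {
        ("1", "0"): "V",     # formal
        ("0", "1"): "T",     # informal
        ("0", "0"): None,    # no T/V address
        ("1", "1"): "both",  # very rare, ignored in the analysis
    }
    return table[(meta["F"], meta["I"])]


def parse_record(lines):
    """Parse one record given as a sequence of exactly four lines."""
    meta_line, tokens, lemmas, tags = lines
    meta = parse_meta(meta_line)
    return {
        "meta": meta,
        "label": address_label(meta),
        "tokens": tokens.split(),
        "lemmas": lemmas.split(),
        "pos": tags.split(),
    }


# The example sentence shown above.
record = parse_record([
    "<S>paraNum:1129 sentNum:2348 F:1 I:0</S>",
    "Ich freue mich , Sie zu sehen . ",
    "ich freuen ich , Sie|sie|sie zu sehen .",
    "PPER VVFIN PRF $, PPER PTKZU VVINF $.",
])
print(record["label"], record["tokens"])
```

Ambiguous lemmas such as Sie|sie|sie are kept as a single string; split
them on "|" if the alternatives are needed separately.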

(2) novelName_en.txt
--------------------

The sentences in the English novels are further divided into "direct
speech segments" using the BIO-notation. For details, refer to Section
5.3 of Faruqui and Pado 2012. If you do not need the segment
information, treat all segments that share the same sentence no. as
parts of one sentence.

Every segment of a sentence in the novel is represented by four lines in the corpus:
Line (1): Meta-information about the segment: Paragraph no., Sentence no., Segment type, Segment no., Formal (V) and Informal (T) flags (F and I).
Line (2): Tokenized segment.
Line (3): Corresponding lemma of every word in the segment.
Line (4): Corresponding POS-tag of every word in the segment.

For example, the following is a segment of a sentence in the novel with Informal (T) address:

<S>paraNum:66 sentNum:145 segType:I-SP segNum:162 F:0 I:1</S>
Much of that wouldn 't do for you , Jerry !
much of that <unknown> <unknown> do for you , Jerry !
RB IN DT NN NN VVP IN PP , NP SENT
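
If the segment information is not needed, segments that share a
sentence no. can be merged back into one sentence. The sketch below
shows one way to do this. The helper names (meta_fields,
merge_segments) are hypothetical, and the second segment in the usage
example is invented for illustration, including its segType value; it
does not come from the corpus.

```python
import re


def meta_fields(line):
    """Return the key:value fields of a '<S>...</S>' meta line as a dict."""
    inner = re.search(r"<S>(.*?)</S>", line).group(1)
    return dict(item.split(":", 1) for item in inner.split())


def merge_segments(segments):
    """Concatenate the tokens of all segments that share a sentNum.

    `segments` is an iterable of (meta_line, token_line) pairs in file
    order. The result maps sentNum (int) to the merged token list.
    """
    sentences = {}
    for meta_line, token_line in segments:
        sent_num = int(meta_fields(meta_line)["sentNum"])
        sentences.setdefault(sent_num, []).extend(token_line.split())
    return sentences


segments = [
    # Real example from above.
    ("<S>paraNum:66 sentNum:145 segType:I-SP segNum:162 F:0 I:1</S>",
     "Much of that wouldn 't do for you , Jerry !"),
    # Invented continuation of the same sentence, for illustration only.
    ("<S>paraNum:66 sentNum:145 segType:O segNum:163 F:0 I:0</S>",
     "said the man ."),
]
merged = merge_segments(segments)
print(" ".join(merged[145]))
```

How the F/I flags of merged segments should be combined depends on the
application and is deliberately left out of this sketch.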

(3) novelName_align.txt
-----------------------

Every line in this file has the following format:

novelName<tab>deSentNo(s)<tab>enSentNo(s)

For example, the following line states that sentences no. 7246 and 7247 in the German novel are aligned to sentence no. 7568 in the English novel:

2staedte.txt	7246 7247	7568
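
An alignment line can be split into its three tab-separated fields as
sketched below. The function name parse_alignment is hypothetical, and
the code assumes that the sentence numbers within one field are
separated by single spaces, as in the example above.

```python
def parse_alignment(line):
    """Split 'novelName<tab>deSentNo(s)<tab>enSentNo(s)' into its parts.

    Returns (novel_name, german_sentence_numbers, english_sentence_numbers).
    """
    novel, de_part, en_part = line.rstrip("\n").split("\t")
    de_nums = [int(n) for n in de_part.split()]
    en_nums = [int(n) for n in en_part.split()]
    return novel, de_nums, en_nums


alignment = parse_alignment("2staedte.txt\t7246 7247\t7568")
print(alignment)
```

The returned sentence numbers correspond to the sentNum fields in the
_de.txt and _en.txt files.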

CONTACT
=======

For more information, bug reports, and fixes, contact:

    Manaal Faruqui
    Dept. of Computer Science & Engineering 
    Indian Institute of Technology (IIT) Kharagpur
    Kharagpur 721302, West Bengal, INDIA
    manaalfar@gmail.com
    http://cse.iitkgp.ac.in/~manaalf