Cross-lingual projection of semantic roles
In two studies (Pado and Lapata, EMNLP 2005 and ACL/COLING 2006), we have proposed the use of annotation projection for the task of creating corpora with role-semantic annotation for new languages. To evaluate this approach, we have annotated a 1000-sentence sample from the English-German EUROPARL bitext, which is now available for download.
Our sample selection procedure was informed by two existing resources, FrameNet for English and SALSA for German. Inter-annotator agreement (on an additional, but comparable, calibration set) was 0.87. Details can be found in Pado and Lapata (EMNLP 2005).
This page makes available:
- Corpora with manual semantic role annotation on automatic syntactic analyses (for German, with Amit Dubey's Sleepy parser, and for English, with Michael Collins' parser) as in the EMNLP and ACL/COLING papers
- Corpora with manual semantic role annotation on hand-corrected syntactic analyses (for German, according to the TIGER guidelines, and for English, according to the Penn Treebank guidelines)
- The GIZA++ word alignment and the manual gold alignment (produced according to the BLINKER guidelines) compared in Pado and Lapata (ACL/COLING 2006).
The corpus and the word alignments are freely available for research purposes. We would be grateful to hear from you if you use them in your research.
- English corpus (automatic syntax, manual semantics, all sentences).
- German corpus (automatic syntax, manual semantics, all sentences).
- English corpus (manual syntax, manual semantics, sentences with matching frames only).
- German corpus (manual syntax, manual semantics, sentences with matching frames only).
- English-German intersective alignments (gzipped, 130 KB).
- English-German manual word alignments (gzipped, 150 KB).
The corpora use the SALSA/TIGER XML format, which can be visualised directly using the SALTO tool. The alignments are stored in a simple text file format, with four lines per sentence: the ID (corresponding to the sentence ID in the XML file), the two sentences as sequences of tokens (one sentence per line), and the word alignment as pairs of token indices.
Tools used to construct the corpus
- SALTO: Graphical annotation tool for role semantics
- GIZA++: Automatic induction of word alignments
- linearb: Manual correction of word alignments
- Sleepy: Parser for German by Amit Dubey (no longer maintained)
- Collins' parser: Parser for English
Feedback is always welcome at sebastian@nlpado.de.