Publications [Google Scholar]

Journal articles [preprints provided for non-open-access articles]

Measuring Historical Emotions and Their Evolution: An Interdisciplinary Endeavour to Investigate The ‘Emotions of Encounter’.
Liinc Em Revista, 15(1):70-84, 2019.
Hanno Ehrlicher, Roman Klinger, Jörg Lehmann and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
The empirical study of emotions in Spanish travelogues and reports requires cultural knowledge as well as the use of linguistic annotation and quantitative methods. We report on an interdisciplinary project in which we perform emotion annotation on a selection of texts spanning several centuries to analyze the differences across different time slices. We show that indeed the emotional connotation changes qualitatively and quantitatively. Next to this evaluation, we sketch strategies for future automation. This scalable reading approach combines quantitative with qualitative insights and identifies developments over time that call for deeper investigation.
Disambiguation of newly derived nominalizations in context: A Distributional Semantics approach.
Word Structure, 11(3):315-350, 2018.
Gabriella Lapesa, Lea Kawaletz, Ingo Plag, Marios Andreou, Max Kisselew and Sebastian Padó.
[doi]  [BibTeX] 
FrameNet's 'Using' Relation As Source of Concept-driven Paraphrases.
Constructions and Frames, 10(1):38-60, 2018. Preprint at https://nlpado.de/~sebastian/pub/papers/cf18_sikos.pdf
Jennifer Sikos and Sebastian Padó.
[doi]  [BibTeX] 
Complement Coercion: The Joint Effects of Type and Typicality.
Frontiers in Psychology, 8:1987, 2017.
Alessandra Zarcone, Ken McRae, Alessandro Lenci and Sebastian Padó.
[doi]  [BibTeX] 
Native Language Identification Across Text Types: How Special Are Scientists?
Italian Journal of Computational Linguistics, 2(1):32-45, 2016.
Sabrina Stehwien and Sebastian Padó.
[doi]  [BibTeX] 
Design and Realization of a Modular Architecture for Textual Entailment.
Journal of Natural Language Engineering, 21(2):167-200, 2015. Preprint at https://nlpado.de/~sebastian/pub/papers/jnle13_pado.pdf
Sebastian Padó, Tae-Gil Noh, Asher Stern, Rui Wang and Roberto Zanoli.
[doi]  [BibTeX] 
On the importance of a rich embodiment in the grounding of concepts: Perspectives from embodied cognitive science and computational linguistics.
Topics in Cognitive Science, 6(3):545-558, 2014.
Serge Thill, Sebastian Padó and Tom Ziemke.
[doi]  [BibTeX] 
Logical metonymy resolution in a words-as-cues framework: Evidence from self-paced reading and probe recognition.
Cognitive Science, 38(5):973-996, 2014.
Alessandra Zarcone, Sebastian Padó and Alessandro Lenci.
[doi]  [BibTeX] 
Crosslingual and Multilingual Construction of Syntax-Based Vector Space Models.
Transactions of the Association for Computational Linguistics, 2:245-258, 2014.
Jason Utt and Sebastian Padó.
[doi]  [BibTeX] 
High-Precision Sentence Alignment by Bootstrapping from Wood Standard Annotations.
Prague Bulletin of Mathematical Linguistics, 99:5-16, 2013.
Éva Mújdricza-Maydt, Huiqin Körkel-Qu, Stefan Riezler and Sebastian Padó.
[doi]  [BibTeX] 
Semantic relations in bilingual vector spaces.
ACM Transactions on Speech and Language Processing, 8(2):3:1-3:21, 2011. Preprint at https://nlpado.de/~sebastian/pub/papers/tslp11_peirsman.pdf
Yves Peirsman and Sebastian Padó.
[doi]  [BibTeX] 
A Flexible, Corpus-driven Model of Regular and Inverse Selectional Preferences.
Computational Linguistics, 36(4):723-763, 2010.
Katrin Erk, Sebastian Padó and Ulrike Padó.
[doi]  [BibTeX] 
Measuring Machine Translation Quality as Semantic Equivalence: A Metric based on Entailment Features.
Machine Translation, 23(2-3):181-193, 2009. Preprint at https://nlpado.de/~sebastian/pub/papers/mt09_pado.pdf
Sebastian Padó, Daniel Cer, Michel Galley, Christopher D. Manning and Daniel Jurafsky.
[doi]  [BibTeX] 
Cross-lingual Annotation Projection of Role-semantic Information.
Journal of Artificial Intelligence Research, 36:307-340, 2009.
Sebastian Padó and Mirella Lapata.
[doi]  [BibTeX] 
Formalising Multi-layer Corpora in OWL DL.
Linguistic Issues in Language Technology, 1(1):1-33, 2008.
Aljoscha Burchardt, Sebastian Padó, Dennis Spohr, Anette Frank and Ulrich Heid.
[doi]  [BibTeX] 
Comparing and Combining Semantic Verb Classifications.
Language Resources and Evaluation, 42(3):161-199, 2008. Preprint at https://nlpado.de/~sebastian/pub/papers/lre08_culo.pdf
Oliver Čulo, Katrin Erk, Sebastian Padó and Sabine Schulte im Walde.
[doi]  [BibTeX] 
Dependency-based Construction of Semantic Spaces.
Computational Linguistics, 33(2):161-199, 2007.
Sebastian Padó and Mirella Lapata.
[doi]  [BibTeX] 

Conference papers

Text-based Joint Prediction of Numeric and Categorical Attributes of Entities in Knowledge Bases.
In: Proceedings of RANLP. Varna, Bulgaria, 2019. To appear
V Thejas, Abhijeet Gupta and Sebastian Padó.
[abstract]  [BibTeX] 
Collaboratively constructed knowledge bases play an important role in information systems, but are essentially always incomplete. Thus, a large number of models has been developed for Knowledge Base Completion, the task of predicting new attributes of entities given partial descriptions of these entities. Virtually all of these models either concentrate on numeric attributes (Italy,GDP,2TE) or they concentrate on categorical attributes (Tim Cook,chairman,Apple). In this paper, we propose a simple feed-forward neural architecture to jointly predict numeric and categorical attributes based on embeddings learned from textual occurrences of the entities in question. Following insights from multi-task learning, our hypothesis is that due to the correlations among attributes of different kinds, joint prediction improves over separate prediction. Our experiments on seven FreeBase domains show that this hypothesis is true of the two attribute types: we find substantial improvements for numeric attributes in the joint model, while performance remains largely unchanged for categorical attributes. Our analysis indicates that this is the case because categorical attributes, many of which describe membership in various classes, provide useful 'background knowledge' for numeric prediction, while this is true to a lesser degree in the inverse direction.
Quotation Detection and Classification with a Corpus-Agnostic Model.
In: Proceedings of RANLP. Varna, Bulgaria, 2019. To appear
Sean Papay and Sebastian Padó.
[abstract]  [BibTeX] 
The detection of quotations (i.e., reported speech, thought, and writing) has established itself as an NLP analysis task. However, state-of-the-art models have been developed on the basis of specific corpora and incorporate a high degree of corpus-specific assumptions and knowledge, which leads to fragmentation. In the spirit of task-agnostic modeling, we present a corpus-agnostic neural model for quotation detection and evaluate it on three corpora that vary in language, text genre, and structural assumptions. The model (a) approaches the state-of-the-art on the corpora when using established feature sets and (b) shows reasonable performance even when using solely word forms, which makes it applicable for non-standard (e.g., historical) corpora.
An Environment for the Relational Annotation of Political Debates.
In: Proceedings of ACL System Demonstrations. Florence, Italy, 2019.
André Blessing, Nico Blokker, Sebastian Haunss, Jonas Kuhn, Gabriella Lapesa and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper describes the MARDY corpus annotation environment developed for a collaboration between political science and computational linguistics. The tool realizes the complete workflow necessary for annotating a large newspaper text collection with rich information about claims (demands) raised by politicians and other actors, including claim and actor spans, relations, and polarities. In addition to the annotation GUI, the tool supports the identification of relevant documents, text pre-processing, user management, integration of external knowledge bases, annotation comparison and merging, statistical analysis, and the incorporation of machine learning models as "pseudo-annotators".
Crowdsourcing and Validating Event-focused Emotion Corpora for German and English.
In: Proceedings of ACL. Florence, Italy, 2019.
Enrica Troiano, Sebastian Padó and Roman Klinger.
[doi]  [abstract]  [BibTeX] 
Sentiment analysis has a range of corpora available across multiple languages. For emotion analysis, the situation is more limited, which hinders potential research on cross-lingual modeling and the development of predictive models for other languages. In this paper, we fill this gap for German by constructing deISEAR, a corpus designed in analogy to the well-established English ISEAR emotion dataset. Motivated by Scherer's appraisal theory, we implement a crowdsourcing experiment which consists of two steps. In step 1, participants create descriptions of emotional events for a given emotion. In step 2, five annotators assess the emotion expressed by the texts. We show that transferring an emotion classification model from the original English ISEAR to the German crowdsourced deISEAR via machine translation does not, on average, cause a performance drop.
Who Sides With Whom? Towards Computational Construction of Discourse Networks for Political Debates.
In: Proceedings of ACL. Florence, Italy, 2019.
Sebastian Padó, André Blessing, Nico Blokker, Erenay Dayanik, Sebastian Haunss and Jonas Kuhn.
[doi]  [abstract]  [BibTeX] 
Understanding the structures of political debates (which actors make what claims) is essential for understanding democratic political decision-making. The vision of computational construction of such discourse networks from newspaper reports brings together political science and natural language processing. This paper presents three contributions towards this goal: (a) a requirements analysis, linking the task to knowledge base population; (b) a first release of an annotated corpus of claims on the topic of migration, based on German newspaper reports; (c) initial modeling results.
Frame Identification as Categorization: Exemplars vs Prototypes in Embeddingland.
In: Proceedings of IWCS, pages 295-306. Gothenburg, Sweden, 2019.
Jennifer Sikos and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Categorization is a central capability of human cognition, and a number of theories have been developed to account for properties of categorization. Even though many tasks in semantics also involve categorization of some kind, theories of categorization do not play a major role in contemporary research in computational linguistics. This paper follows the idea that embedding-based models of semantics lend themselves well to being formulated in terms of classical categorization theories. The benefit is a space of model families that enables (a) the formulation of hypotheses about the impact of major design decisions, and (b) a transparent assessment of these decisions. We instantiate this idea on the task of frame-semantic frame identification. We define four models that cross two design variables: (a) the choice of prototype vs. exemplar categorization, corresponding to different degrees of generalization applied to the input, and (b) the presence vs. absence of a fine-tuning step, corresponding to generic vs. task-adaptive categorization. We find that for frame identification, generalization and task-adaptive categorization both yield substantial benefits. Our prototype-based, fine-tuned model, which combines the best choices over these variables, establishes a new state-of-the-art in frame identification.
DERE: A task and domain-independent slot filling framework for declarative relation extraction.
In: Proceedings of EMNLP. Brussels, Belgium, 2018.
Heike Adel, Laura Ana Maria Bostan, Sean Papay, Sebastian Padó and Roman Klinger.
[doi]  [abstract]  [BibTeX] 
Most machine learning systems for natural language processing are tailored to specific tasks. As a result, comparability of models across tasks is missing and their applicability to new tasks is limited. This affects end users without machine learning experience as well as model developers. To address these limitations, we present DERE, a novel framework for declarative specification and compilation of template-based information extraction. It uses a generic specification language for the task and for data annotations in terms of spans and frames. This formalism enables the representation of a large variety of natural language processing challenges. The backend can be instantiated by different models, following different paradigms. The clear separation of frame specification and model backend will ease the implementation of new models and the evaluation of different models across different tasks. Furthermore, it simplifies transfer learning, joint learning across tasks and/or domains as well as the assessment of model generalizability. DERE is available as open-source software.
A Named Entity Recognition Shootout for German.
In: Proceedings of ACL, pages 120-125. Melbourne, Australia, 2018.
Martin Riedl and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We ask how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., a big-data and a small-data scenario. The two best-performing model families are pitted against each other (linear-chain CRFs and BiLSTM) to observe the trade-off between expressiveness and data requirements. BiLSTM outperforms the CRF when large datasets are available and performs inferior for the smallest dataset. BiLSTMs profit substantially from transfer learning, which enables them to be trained on multiple corpora, resulting in a new state-of-the-art model for German NER on two contemporary German corpora (CoNLL 2003 and GermEval 2014) and two historic corpora.
Lexical Substitution for Evaluating Compositional Distributional Models.
In: Proceedings of NAACL, pages 206-211. New Orleans, LA, 2018.
Maja Buljan, Sebastian Padó and Jan Šnajder.
[doi]  [abstract]  [BibTeX] 
Compositional Distributional Semantic Models (CDSMs) model the meaning of phrases and sentences in vector space. They have been predominantly evaluated on limited, artificial tasks such as semantic sentence similarity on hand-constructed datasets. This paper argues for lexical substitution as a means to evaluate CDSMs. Lexical substitution is a more natural task, enables us to evaluate meaning composition at the level of individual words, and provides a common ground to compare CDSMs with dedicated lexical substitution models. We create a lexical substitution dataset for CDSM evaluation from an English-language corpus with manual “all-words” lexical substitution annotation. Our experiments indicate that the Practical Lexical Function CDSM outperforms simple component-wise CDSMs and performs on par with the context2vec lexical substitution model using the same context.
Leveraging Lexical Substitutes for Unsupervised Word Sense Induction.
In: Proceedings of AAAI. New Orleans, LA, 2018.
Domagoj Alagić, Jan Šnajder and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Word sense induction is the most prominent unsupervised approach to lexical disambiguation. It clusters word instances, typically represented by their bag-of-words contexts. Therefore, uninformative and ambiguous contexts present a major challenge. In this paper, we investigate the use of an alternative instance representation based on lexical substitutes, i.e., contextually suitable, meaning-preserving replacements. Using lexical substitutes predicted by a state-of-the-art automatic system and a simple clustering algorithm, we outperform bag-of-words instance representations and compete with much more complex structured probabilistic models. Furthermore, we show that an oracle based on manually-labeled lexical substitutes yields yet substantially higher performance. Taken together, this provides evidence for a complementarity between word sense induction and lexical substitution that has not been given much consideration before.
Integrating lexical-conceptual and distributional semantics: a case report.
In: Proceedings of the Amsterdam Colloquium, pages 75-84. Amsterdam, The Netherlands, 2017.
Tillmann Pross, Antje Rossdeutscher, Gabriella Lapesa and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
By means of a case study on German verbs prefixed with the preposition über (‘over’) we compare alternation-based lexical-conceptual and usage-based distributional approaches to verb meaning. Our investigation supports the view that when distributional vectors are rendered human-interpretable by approximation of their representation with its nearest neighbour words in the semantic vector space, they reflect conceptual commonalities between verbs similar to those targeted in lexical-conceptual semantics. Moreover, our case study shows that distributional representations reveal conceptual features of verb meaning that are difficult if not impossible to detect and represent in theoretical frameworks of lexical semantics and thus that a general theory of word meaning requires a combination and complementation of lexical and distributional methods.
Living a discrete life in a continuous world: Reference in cross-modal entity tracking.
In: Proceedings of IWCS. Montpellier, France, 2017.
Gemma Boleda, Sebastian Padó, Nghia The Pham and Marco Baroni.
[doi]  [abstract]  [BibTeX] 
Reference is a crucial property of language that allows us to connect linguistic expressions to the world. Modeling it requires handling both continuous and discrete aspects of meaning. Data-driven models excel at the former, but struggle with the latter, and the reverse is true for symbolic models. This paper (a) introduces a concrete referential task to test both aspects, called cross-modal entity tracking; (b) proposes a neural network architecture that uses external memory to build an entity library: On being presented with exposures of multimodal, distributed entities combined with attributes, the model learns which exposures refer to the same underlying entities and to aggregate the information present in these exposures. Our model shows promise: it beats traditional neural network architectures on the task. However, it is still outperformed by Memory Networks, another model with external memory.
Modeling Derivational Morphology in Ukrainian.
In: Proceedings of IWCS. Montpellier, France, 2017.
Mariia Melymuka, Gabriella Lapesa, Max Kisselew and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We report on a study applying compositional distributional semantic models (CDSMs) to a set of Ukrainian derivational patterns. Ukrainian is an interesting language as it is morphologically rich, and low-resource. Our study aims at resolving inconsistent results from previous studies which employed CDSMs for derivation; we provide evidence for a cross-lingual advantage of CBOW over NMF representations, as well as a simple additive over a lexical function model. In addition, we present two case studies in which we test the capabilities of CDSMs to deal with pattern-level ambiguity and apply the same CDSMs to inflectional patterns.
Are doggies cuter than dogs? Emotional valence and concreteness in German derivational morphology.
In: Proceedings of IWCS. Montpellier, France, 2017.
Gabriella Lapesa, Sebastian Padó, Tillmann Pross and Antje Rossdeutscher.
[doi]  [abstract]  [BibTeX] 
The semantic behavior of derivational processes has been investigated with compositional distributional models relating the meaning of base, affix, and derivative (e.g., anti+capitalist → anticapitalist). While broadly successful, these approaches model how the distributional behavior generally is affected by derivation. Meanwhile, their predictions cannot be interpreted at the level of linguistic regularities. In this paper, we adopt an alternative approach and focus on the impact of derivation on finer-grained semantic properties of the base. We focus on (the psycholinguistically prominent) emotional valence, i.e., the speakers’ positive/negative evaluation of the word referent. We present two case studies on German derivational patterns, combining distributional and regression analysis. We are able to establish the broad presence of valence effects in German derivation as well as strong interactions with concreteness.
Does Free Word Order Hurt? Assessing the Practical Lexical Function Model for Croatian.
In: Proceedings of STARSEM. Vancouver, BC, 2017.
Zoran Medić, Jan Šnajder and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
The Practical Lexical Function (PLF) model is a model of computational distributional semantics that attempts to strike a balance between expressivity and learnability in predicting phrase meaning and shows competitive results. We investigate how well the PLF carries over to free word order languages, given that it builds on observations of predicate-argument combinations that are harder to recover in free word order languages. We evaluate variants of the PLF for Croatian, using a new lexical substitution dataset. We find that the PLF works about as well for Croatian as for English, but demonstrate that its strength lies in modeling verbs, and that the free word order affects the less robust PLF variant.
Distributed Prediction of Relations for Entities: The Easy, The Difficult, and The Impossible.
In: Proceedings of STARSEM. Vancouver, BC, 2017.
Abhijeet Gupta, Gemma Boleda and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Word embeddings are supposed to provide easy access to semantic relations such as “male of” (man–woman). While this claim has been investigated for concepts, little is known about the distributional behavior of relations of (Named) Entities. We describe two word embedding-based models that predict values for relational attributes of entities, and analyse them. The task is challenging, with major performance differences between relations. Contrary to many NLP tasks, high difficulty for a relation does not result from low frequency, but from (a) one-to-many mappings; and (b) lack of context patterns expressing the relation that are easy to pick up by word embeddings.
Instances and concepts in distributional space.
In: Proceedings of EACL, pages 79-85. Valencia, Spain, 2017.
Gemma Boleda, Abhijeet Gupta and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Instances ("Mozart") are ontologically distinct from concepts or classes ("composer"). Natural language encompasses both, but instances have received comparatively little attention in distributional semantics. Our results show that instances and concepts differ in their distributional properties. We also establish that instantiation detection ("Mozart – composer") is generally easier than hypernymy detection ("chemist – scientist"), and that results on the influence of input representation do not transfer from hyponymy to instantiation.
"Show me the cup": Reference with Continuous Representations.
In: Proceedings of CICLing. Budapest, Hungary, 2017.
Marco Baroni, Gemma Boleda and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates a failure, if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual, and when they are characterized by a combination of visual and linguistic properties.
Improving Zero-Shot-Learning for German Particle Verbs by using Training-Space Restrictions and Local Scaling.
In: Proceedings of STARSEM. Berlin, Germany, 2016.
Maximilian Köper, Sabine Schulte im Walde, Max Kisselew and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Recent models in distributional semantics consider derivational patterns (e.g., use → use+ful) as the result of a compositional process, where base term and affix are combined. We exploit such models for German particle verbs (PVs), and focus on the task of learning a mapping function between base verbs and particle verbs. Our models apply particle-verb motivated training-space restrictions relying on nearest neighbors, as well as recent advances from zero-shot-learning. The models improve the mapping between base terms and derived terms for a new PV derivation dataset, and also across existing derivation datasets for German and English.
Model Architectures for Quotation Detection.
In: Proceedings of ACL, pages 1736-1745. Berlin, Germany, 2016. Acceptance rate: 25%
Christian Scheible, Roman Klinger and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Quotation detection is the task of locating spans of quoted speech in text. The state of the art treats this problem as a sequence labeling task and employs linear-chain conditional random fields. We question the efficacy of this choice: The Markov assumption in the model prohibits it from making joint decisions about the begin, end, and internal context of a quotation. We perform an extensive analysis with two new model architectures. We find that (a), simple boundary classification combined with a greedy prediction strategy is competitive with the state of the art; (b), a semi-Markov model significantly outperforms all others, by relaxing the Markov assumption.
Smoothing Syntax-Based Semantic Spaces: Let The Winner Take It All.
In: Proceedings of KONVENS, pages 186-191. Bochum, Germany, 2016. Acceptance rate: 65%
Sebastian Padó, Jan Šnajder, Jason Utt and Britta Zeller.
[doi]  [abstract]  [BibTeX] 
Syntax-based semantic spaces are more flexible and can potentially better model semantic relatedness than bag-of-words spaces. Their application is however limited by sparsity and restricted coverage. We address these problems by smoothing syntax-based with word-based spaces and investigate when to choose which prediction. We obtain the best results by picking the maximal predicted similarity for each word pair, taking advantage of the tendency of unreliable models to underestimate similarity. We show that smoothing can substantially improve coverage while maintaining prediction quality on two German benchmark tasks.
Predictability of Distributional Semantics in Derivational Word Formation.
In: Proceedings of COLING, pages 1285-1296. Osaka, Japan, 2016.
Sebastian Padó, Aurélie Herbelot, Max Kisselew and Jan Šnajder.
[doi]  [abstract]  [BibTeX] 
Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on the semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity.
Combining Seemingly Incompatible Corpora for Implicit Semantic Role Labeling.
In: Proceedings of STARSEM, pages 40-50. Denver, CO, 2015.
Parvin Sadat Feizabadi and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Implicit semantic role labeling, the task of retrieving locally unrealized arguments from wider discourse context, is a knowledge-intensive task. At the same time, the annotated corpora that exist are all small and scattered across different annotation frameworks, genres, and classes of predicates. Previous work has treated these corpora as incompatible with one another, and has concentrated on optimizing the exploitation of single corpora. In this paper, we show that corpus combination is effective after all when the differences between corpora are bridged with domain adaptation methods. When we combine the SemEval-2010 Task 10 and Gerber and Chai noun corpora, we obtain substantially improved performance on both corpora, for all roles and parts of speech. We also present new insights into the properties of the implicit semantic role labeling task.
Obtaining a Better Understanding of Distributional Models of German Derivational Morphology.
In: Proceedings of IWCS, pages 58-63. London, UK, 2015. Acceptance rate: 36%
Max Kisselew, Sebastian Padó, Alexis Palmer and Jan Šnajder.
[doi]  [abstract]  [BibTeX] 
Derivationally related words (read / read+er) usually have closely related meanings. It is an interesting challenge for distributional semantics to account for this relationship by predicting the meaning (represented as a vector) of a derived term (read+er) from the meaning of its base term (read). Previous work has framed this task as an instance of compositional meaning construction, but its properties are not yet well understood. Our goal is to better understand the factors influencing performance on this task via quantitative and qualitative analysis of two existing composition models on a set of German derivation patterns (e.g., -in, durch-). We begin by introducing a rank-based evaluation metric that provides a more relevant assessment of the models’ practical value and reveals the task to be challenging due to specific properties of German (compounding, capitalization). We also find that performance varies greatly between patterns and even among base-derived term pairs of the same pattern. A regression analysis shows that semantic coherence of the base and derived terms within a pattern, as well as coherence of the semantic shifts from base to derived terms, all significantly impact prediction quality. Finally, we investigate false positives, finding that different models capture complementary aspects of the semantic shifts.
Generalization in Native Language Identification - Learners versus Scientists.
In: Proceedings of CLiC-IT, pages 264-268. Trento, Italy, 2015.
Sabrina Stehwien and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Native Language Identification (NLI) is the task of recognizing an author's native language from text in another language. In this paper, we consider three English learner corpora and one new, presumably more difficult, scientific corpus. We find that the scientific corpus is only about as hard to model as a less-controlled learner corpus, but cannot profit as much from corpus combination via domain adaptation. We show that this is related to an inherent topic bias in the scientific corpus: researchers from different countries tend to work on different topics.
Distributional vectors encode referential attributes.
In: Proceedings of EMNLP. Lisbon, Portugal, 2015. Acceptance rate: 24%
Abhijeet Gupta, Gemma Boleda, Marco Baroni and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Distributional methods have proven to excel at capturing fuzzy, graded aspects of meaning (Italy is more similar to Spain than to Germany). In contrast, it is difficult to extract the values of more specific attributes of word referents from distributional representations, attributes of the kind typically found in structured knowledge bases (Italy has 60 million inhabitants). In this paper, we pursue the hypothesis that distributional vectors also implicitly encode referential attributes. We show that a standard supervised regression model is in fact sufficient to retrieve such attributes to a reasonable degree of accuracy: When evaluated on the prediction of both categorical and numeric attributes of countries and cities, the model consistently reduces baseline error by 30% and is not far from the upper bound. Further analysis suggests that our model is able to "objectify" distributional representations for entities, anchoring them more firmly in the external world in measurable ways.
Dissecting the Practical Lexical Function Model for Compositional Distributional Semantics.
In: Proceedings of STARSEM, pages 153-158. Denver, CO, 2015.
Abhijeet Gupta, Jason Utt and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
The Practical Lexical Function model (PLF) is a recently proposed compositional distributional semantic model which provides an elegant account of composition, striking a balance between expressiveness and robustness and performing at the state-of-the-art. In this paper, we identify an inconsistency in PLF between the objective function at training and the prediction at testing which leads to an over-counting of the predicate’s contribution to the meaning of the phrase. We investigate two possible solutions of which one (the exclusion of the simple lexical vector at test time) improves performance significantly on two out of the three composition datasets.
Multi-Level Alignments As An Extensible Representation Basis for Textual Entailment Algorithms.
In: Proceedings of STARSEM, pages 193-198. Denver, CO, 2015.
Tae-Gil Noh, Sebastian Padó, Vered Shwartz, Ido Dagan, Vivi Nastase, Kathrin Eichler and Lili Kotlerman.
[doi]  [abstract]  [BibTeX] 
A major problem in research on Textual Entailment (TE) is the high implementation effort for TE systems. Recently, interoperable standards for annotation and preprocessing have been proposed. In contrast, the algorithmic level remains unstandardized, which makes component re-use in this area very difficult in practice. In this paper, we introduce multi-level alignments as a central, powerful representation for TE algorithms that encourages modular, reusable, multilingual algorithm development. We demonstrate that a pilot open-source implementation of multi-level alignment with minimal features competes with state-of-the-art open-source TE engines in three languages.
Crowdsourcing Annotation of Non-Local Semantic Roles.
In: Proceedings of EACL, pages 226-230. Gothenburg, Sweden, 2014.
Parvin Sadat Feizabadi and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper reports on a study of crowdsourcing the annotation of non-local (or implicit) frame-semantic roles, i.e., roles that are realized in the previous discourse context. We describe two annotation setups (marking and gap filling) and find that gap filling works considerably better, attaining an acceptable quality relatively cheaply. The produced data is available for research purposes.
A Language Model Sensitive to Discourse Context.
In: Proceedings of KONVENS, pages 201-206. Hildesheim, Germany, 2014.
Tae-Gil Noh and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
The paper proposes a meta language model that can dynamically incorporate the influence of wider discourse context. The model provides a conditional probability of the form P(text|context), where the context can be a text of arbitrary length, and is used to influence the probability distribution over documents. A preliminary evaluation using a 3-gram model as the base language model shows significant reductions in perplexity by incorporating discourse context.
Towards Semantic Validation of a Derivational Lexicon.
In: Proceedings of COLING, pages 1728-1739. Dublin, Ireland, 2014. Acceptance rate: 31%
Britta Zeller, Sebastian Padó and Jan Šnajder.
[doi]  [abstract]  [BibTeX] 
Derivationally related lemmas like (friend – friendly – friendship) are derived from a common stem. Frequently, their meanings are also systematically related. However, there are also many examples of derivationally related lemma pairs whose meanings differ substantially, e.g., (object – objective). Most broad-coverage derivational lexicons do not reflect this distinction, mixing up semantically related and unrelated word pairs. In this paper, we investigate strategies to recover the above distinction by recognizing semantically related lemma pairs, a process we call semantic validation. We make two main contributions: First, we perform a detailed data analysis on the basis of a large German derivational lexicon. It reveals two promising sources of information (distributional semantics and structural information about derivational rules), but also systematic problems with these sources. Second, we develop a classification model for the task that reflects the noisy nature of the data. It achieves an improvement of 13.6% in precision and 5.8% in F1-score over a strong majority class baseline. Our experiments confirm that both information sources contribute to semantic validation, and that they are complementary enough that the best results are obtained from a combined model.
What Substitutes Tell Us - Analysis of an 'All-Words' Lexical Substitution Corpus.
In: Proceedings of EACL, pages 540-549. Gothenburg, Sweden, 2014. Acceptance rate: 25%
Gerhard Kremer, Katrin Erk, Sebastian Padó and Stefan Thater.
[doi]  [abstract]  [BibTeX] 
We present the first large-scale English "all-words lexical substitution" corpus. The size of the corpus provides a rich resource for investigations into word meaning. We investigate the nature of lexical substitute sets, comparing them to WordNet synsets. We find them to be consistent with, but more fine-grained than, synsets. We also identify significant differences to results for paraphrase ranking in context reported for the SEMEVAL lexical substitution data. This highlights the influence of corpus construction approaches on evaluation results.
Polysemy index for nouns: an experiment on Italian using the PAROLE SIMPLE CLIPS lexical database.
In: Proceedings of LREC, pages 2955-2963. Reykjavík, Iceland, 2014.
Francesca Frontini, Valeria Quochi, Sebastian Padó, Jason Utt and Monica Monachini.
[doi]  [abstract]  [BibTeX] 
An experiment is presented to induce a set of polysemous basic type alternations (such as ANIMAL-FOOD, or BUILDING-INSTITUTION) by deriving them from the sense alternations found in an existing lexical resource. The paper builds on previous work and applies those results to the Italian lexicon PAROLE SIMPLE CLIPS. The new results show how the set of frequent type alternations that can be induced from the lexicon is partly different from the set of polysemy relations selected and explicitly applied by lexicographers when building it. The analysis of mismatches shows that frequent type alternations do not always correspond to prototypical polysemy relations; nevertheless, the proposed methodology offers lexicographers a useful tool to systematically check for possible gaps in their resource.
Entailment Graphs for Text Analytics in the Excitement Project.
In: Proceedings of Text, Speech and Dialogue, pages 11-18. Brno, Czech Republic, 2014.
Bernardo Magnini, Ido Dagan, Günter Neumann and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
In recent years, a relevant research line in Natural Language Processing has focused on detecting semantic relations among portions of text, including entailment, similarity, temporal relations, and, to a lesser degree, causality. The attention on such semantic relations has raised the demand to move towards more informative meaning representations, which express properties of concepts and relations among them. This demand triggered research on "statement entailment graphs", where nodes are natural language statements (propositions), comprising predicates with their arguments and modifiers, while edges represent entailment relations between nodes. We report initial research that defines the properties of entailment graphs and their potential applications. In particular, we show how entailment graphs are profitably used in the context of the European project EXCITEMENT, where they are applied for the analysis of customer interactions across multiple channels, including speech, email, chat and social media, and multiple languages (English, German, Italian).
The EXCITEMENT Open Platform for Textual Inferences.
In: Proceedings of ACL (Demonstration Papers), pages 43-48. Baltimore, MD, 2014. Acceptance rate: 26%
Ido Dagan, Omer Levy, Bernardo Magnini, Tae-Gil Noh, Sebastian Padó, Asher Stern and Roberto Zanoli.
[doi]  [abstract]  [BibTeX] 
This paper presents the Excitement Open Platform (EOP), a generic architecture and a comprehensive implementation for textual inference in multiple languages. The platform includes state-of-the-art algorithms, a large number of knowledge resources, and facilities for experimenting and testing innovative approaches. The EOP is distributed as open source software.
Derivational Smoothing for Syntactic Distributional Semantics.
In: Proceedings of ACL, pages 731-735. Sofia, Bulgaria, 2013. Acceptance rate: 26%
Sebastian Padó, Jan Šnajder and Britta D. Zeller.
[doi]  [abstract]  [BibTeX] 
Syntax-based vector spaces are used widely in lexical semantics and are more versatile than word-based spaces (Baroni and Lenci, 2010). However, they are also sparse, with resulting reliability and coverage problems. We address this problem by derivational smoothing, which uses knowledge about derivationally related words (oldish – old) to improve semantic similarity estimates. We develop a set of derivational smoothing methods and evaluate them on two lexical semantics tasks in German. Even for models built from very large corpora, simple derivational smoothing can improve coverage considerably.
Fitting, not clashing! A distributional semantic model of logical metonymy.
In: Proceedings of IWCS, pages 404-410. Potsdam, Germany, 2013. Acceptance rate: 64%
Alessandra Zarcone, Alessandro Lenci, Sebastian Padó and Jason Utt.
[doi]  [abstract]  [BibTeX] 
Logical metonymy interpretation (e.g. begin the book → writing) has received wide attention in linguistics. Experimental results have shown higher processing costs for metonymic conditions compared with non-metonymic ones (read the book). According to a widely held interpretation, it is the type clash between the event-selecting verb and the entity-denoting object (begin the book) that triggers coercion mechanisms and leads to additional processing effort. We propose an alternative explanation and argue that the extra processing effort is an effect of thematic fit. This is a more economical hypothesis that does not need to postulate a separate type clash mechanism: entity-denoting objects simply have a low fit as objects of event-selecting verbs. We test linguistic datasets from psycholinguistic experiments and find that a structured distributional model of thematic fit, which does not encode any explicit argument type information, is able to replicate all significant experimental findings. This result provides evidence for a graded account of coercion phenomena in which thematic fit accounts for both the trigger of the coercion and the retrieval of the covert event.
DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German.
In: Proceedings of ACL, pages 1201-1211. Sofia, Bulgaria, 2013. Acceptance rate: 26%
Britta D. Zeller, Jan Šnajder and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Derivational models are still an under-researched area in computational morphology. Even for German, a rather resource-rich language, there is a lack of large-coverage derivational knowledge. This paper describes a rule-based framework for inducing derivational families (i.e., clusters of lemmas in derivational relationships) and its application to create a high-coverage German resource, DERIVBASE, mapping over 280k lemmas into more than 17k non-singleton clusters. We focus on the rule component and a qualitative and quantitative evaluation. Our approach achieves up to 93% precision and 71% recall. We attribute the high precision to the fact that our rules are based on information from grammar books.
A corpus study of clause combination.
In: Proceedings of IWCS, pages 179-190. Potsdam, Germany, 2013. Acceptance rate: 42%
Olga Nikitina and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We present a corpus-based investigation of cases of clause combination that can be expressed either through coordination or through subordination. We analyse the data with a two-step computational model which first distinguishes subordination from coordination and then determines the direction for cases of subordination. We find that a wide range of features help with the prediction, notably frequency of predicate participants, presence of adjuncts and sharing of participants between the clause predicates.
A Textual Entailment Dataset from German Web Forum Text.
In: Proceedings of IWCS, pages 288-299. Potsdam, Germany, 2013. Acceptance rate: 42%
Britta Zeller and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We present the first freely available large German dataset for Textual Entailment (TE). Our dataset builds on posts from German online forums concerned with computer problems and models the task of identifying relevant posts for user queries (i.e., descriptions of their computer problems) through TE. We use a sequence of crowdsourcing tasks to create realistic problem descriptions through summarisation and paraphrasing of forum posts. The dataset is represented in RTE-5 Search task style and consists of 172 positive and over 2800 negative pairs. We analyse the properties of the created dataset and evaluate its difficulty by applying two TE algorithms and comparing the results with results on the English RTE-5 Search task. The results show that our dataset is roughly comparable to the RTE-5 data in terms of both difficulty and balancing of positive and negative entailment pairs. Our approach to create task-specific TE datasets can be transferred to other domains and languages.
Building and Evaluating a Distributional Memory for Croatian.
In: Proceedings of ACL, pages 784-789. Sofia, Bulgaria, 2013. Acceptance rate: 24%
Jan Šnajder, Sebastian Padó and Željko Agić.
[doi]  [abstract]  [BibTeX] 
We report on the first structured distributional semantic model for Croatian, DM.HR. It is constructed after the model of the English Distributional Memory (Baroni and Lenci, 2010), from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We give details on the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. The resource is freely available.
Corpus-Based Acquisition of Support Verb Constructions for Portuguese.
In: Proceedings of PROPOR 2012, pages 73-84. Coimbra, Portugal, 2012.
Britta Zeller and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We present a resource-poor approach to automatically acquire support verb constructions (SVCs) for Portuguese with a two-stage procedure. First, we apply a cross-lingual approach with a bilingual parallel corpus: starting with a Portuguese full verb, we use the translations into another language and the corresponding backtranslations to identify Portuguese verb-noun pairs with the same meaning. Since not all of these are SVCs, the candidates are ranked and filtered in a second, monolingual step based on association statistics. We discuss two parametrisations of our procedure for a high-precision and a high-recall setting. In our experiments, these parametrisations achieve a maximum precision of 91% and a maximum recall of 86%, respectively.
Automatic Identification of Motion Verbs in WordNet and FrameNet for Locational Inference.
In: Proceedings of KONVENS, pages 70-79. Vienna, Austria, 2012.
Parvin Sadat Feizabadi and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper discusses the automatic identification of motion verbs in the context of "location inference", that is, the recovery of unrealized location roles from discourse context, a special case of missing argument recovery. We first report on a small corpus study on verb classes for which location roles are particularly relevant. This includes motion, orientation and position verbs. Then, we discuss the automatic recognition of these verbs on the basis of WordNet and FrameNet. For FrameNet, we obtain results up to 67% F-Score.
Corpus-based Acquisition of German Event- and Object-Denoting Nouns.
In: Proceedings of KONVENS, pages 259-263. Vienna, Austria, 2012.
Stefan Gorzitze and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper presents a simple distributional method for acquiring event-denoting and object-denoting nouns from corpora. Its core is a bootstrapping cycle that alternates between acquiring new instances and new features, using a simple log odds ratio for filtering. We acquire 3000 German nouns for each class with precisions of 93% (events) and 98% (objects), respectively.
Inferring covert events in logical metonymies: a probe recognition experiment.
In: Proceedings of CogSci. Sapporo, Japan, 2012.
Alessandra Zarcone, Alessandro Lenci and Sebastian Padó.
[abstract]  [BibTeX] 
It has been widely acknowledged that the interpretation of logical metonymies involves the interpretation of covert events (begin the book → reading / writing). Whether this implicit content is part of our lexicon or instead comes from our world knowledge is currently a subject of debate. We present results from a probe recognition experiment, providing novel evidence in support of early metonymy processing, consistent with the hypothesis that covert events are retrieved from knowledge of typical events.
Towards a model of formal and informal address in English.
In: Proceedings of EACL 2012. Avignon, France, 2012. Acceptance rate: 26%
Manaal Faruqui and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Informal and formal (T/V) address in dialogue is not distinguished overtly in modern English, e.g. by pronoun choice, as it is in many other languages such as French (tu/vous). Our study investigates the status of the T/V distinction in English literary texts. Our main findings are: (a) human raters can label monolingual English utterances as T or V fairly well, given sufficient context; (b) a bilingual corpus can be exploited to induce a supervised classifier for T/V without human annotation. It assigns T/V at sentence level with up to 68% accuracy, relying mainly on lexical features; (c) there is a marked asymmetry between lexical features for formal speech (which are conventionalized and therefore general) and informal speech (which are text-specific).
Regular polysemy: A distributional model.
In: Proceedings of *SEM, pages 151-160. Montréal, Canada, 2012.
Gemma Boleda, Sebastian Padó and Jason Utt.
[doi]  [abstract]  [BibTeX] 
Many types of polysemy are not word specific, but are instances of general sense alternations such as Animal-Food. Despite their pervasiveness, regular alternations have been mostly ignored in empirical computational semantics. This paper presents (a) a general framework which grounds sense alternations in corpus data, generalizes them above individual words, and allows the prediction of alternations for new words; and (b) a concrete unsupervised implementation of the framework, the Centroid Attribute Model. We evaluate this model against a set of 2,400 ambiguous words and demonstrate that it outperforms two baselines.
LODifier: Generating Linked Data from Unstructured Text.
In: Proceedings of ESWC, pages 210-224. Heraklion, Greece, 2012. Acceptance rate: 25%
Isabelle Augenstein, Sebastian Padó and Sebastian Rudolph.
[doi]  [abstract]  [BibTeX] 
The automated extraction of information from text and its transformation into a formal description is an important goal in both Semantic Web research and computational linguistics. The extracted information can be used for a variety of tasks such as ontology generation, question answering and information retrieval. LODifier is an approach that combines deep semantic analysis with named entity recognition, word sense disambiguation and controlled Semantic Web vocabularies in order to extract named entities and relations between them from text and to convert them into an RDF representation which is linked to DBpedia and WordNet. We present the architecture of our tool and discuss design decisions made. An evaluation of the tool on a story link detection task gives clear evidence of its practical potential.
French and German corpora for audience-based text classification.
In: Proceedings of LREC 2012. Istanbul, Turkey, 2012.
Amalia Todirascu, Sebastian Padó, Max Kisselew, Jennifer Krisch and Ulrich Heid.
[doi]  [abstract]  [BibTeX] 
This paper presents some of the results of the CLASSYN project, which investigated the classification of text according to audience-related text types. We describe the design principles and the properties of the French and German linguistically annotated corpora that we have created. We report on tools used to collect the data and on the quality of the syntactic annotation. The CLASSYN corpora comprise two text collections for investigating general text type differences between scientific and popular science text in the two domains of medicine and computer science.
Ontology-based Distinction between Polysemy and Homonymy.
In: Proceedings of IWCS 2011. Oxford, UK, 2011. Acceptance rate: 42%
Jason Utt and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We consider the problem of distinguishing polysemous from homonymous nouns. This distinction is often taken for granted, but is seldom operationalized in the shape of an empirical model. We present a first step towards such a model, based on WordNet augmented with ontological classes provided by CoreLex. This model provides a polysemy index for each noun which (a) accurately distinguishes between polysemy and homonymy; (b) supports the analysis that polysemy can be grounded in the frequency of the meaning shifts shown by nouns; and (c) improves a regression model that predicts when the "one-sense-per-discourse" hypothesis fails.
"I Thou Thee, Thou Traitor": Predicting Formal vs. Informal Address in English Literature.
In: Proceedings of ACL/HLT 2011, pages 467-472. Portland, OR, 2011. Acceptance rate: 25%
Manaal Faruqui and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal ("T/V") address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b) lexical cues can predict T/V almost at human level.
Acquiring entailment pairs across languages and domains: A Data Analysis.
In: Proceedings of IWCS 2011. Oxford, UK, 2011. Acceptance rate: 53%
Manaal Faruqui and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Entailment pairs are sentence pairs of a premise and a hypothesis, where the premise textually entails the hypothesis. Such sentence pairs are important for the development of Textual Entailment systems. In this paper, we take a closer look at a prominent strategy for their automatic acquisition from newspaper corpora, pairing first sentences of articles with their titles. We propose a simple logistic regression model that incorporates and extends this heuristic and investigate its robustness across three languages and three domains. We manage to identify two predictors which predict entailment pairs with a fairly high accuracy across all languages. However, we find that robustness across domains within a language is more difficult to achieve.
Generalized Event Knowledge in Logical Metonymy Resolution.
In: Proceedings of CogSci 2011. Boston, MA, 2011.
Alessandra Zarcone and Sebastian Padó.
[abstract]  [BibTeX] 
The interpretation of logical metonymies like "begin the book" has traditionally been explained by assuming the existence of complex lexical entries containing information about event knowledge (qualia roles: "reading the book/writing the book"). Qualia structure provides concrete constraints on interpretation, which are however too rigid to be cognitively plausible. We suggest "generalized event knowledge" as an alternative source of interpretation. Results from a first self-paced reading experiment, where we capitalize on the verb-final word order in German subordinate phrases to create rich expectations for events, are presented to support this hypothesis. Consequences of this hypothesis for the interpretation of logical metonymies are that (a) it is primarily driven by pragmatic and world knowledge; (b) it may use the same (rather than distinct) mechanisms and resources as general incremental sentence comprehension does.
Training and Evaluating a German Named Entity Recognizer with Semantic Generalization.
In: Proceedings of KONVENS 2010. Saarbrücken, Germany, 2010.
Manaal Faruqui and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We present a freely available optimized Named Entity Recognizer (NER) for German. It alleviates the small size of available NER training corpora for German with distributional generalization features trained on large unlabelled corpora. We vary the size and source of the generalization corpus and find improvements of 6% F1-score (in-domain) and 9% (out-of-domain) over simple supervised training.
Cross-lingual Induction of Selectional Preferences with Bilingual Vector Spaces.
In: Proceedings of NAACL 2010, pages 921-929. Los Angeles, CA, 2010. Acceptance rate: 31%
Yves Peirsman and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We describe a cross-lingual method for the induction of selectional preferences for resource-poor languages, where no accurate monolingual models are available. The method uses bilingual vector spaces to "translate" foreign language predicate-argument structures into a resource-rich language like English. The only prerequisite for constructing the bilingual vector space is a large unparsed corpus in the resource-poor language, although the model can profit from (even noisy) syntactic knowledge. Our experiments show that the cross-lingual predictions correlate well with human ratings, clearly outperforming monolingual baseline models.
Assessing the Role of Discourse References in Entailment Inference.
In: Proceedings of ACL 2010, pages 1209-1219. Uppsala, Sweden, 2010. Acceptance rate: 25%
Shachar Mirkin, Ido Dagan and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Discourse references, notably coreference and bridging, play an important role in many text understanding applications, but their impact on textual entailment is yet to be systematically understood. On the basis of an in-depth analysis of entailment instances, we argue that discourse references have the potential of substantially improving textual entailment recognition, and identify a number of research directions towards this goal.
Exemplar-Based Models for Word Meaning In Context.
In: Proceedings of ACL 2010. Uppsala, Sweden, 2010. Acceptance rate: 22%
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper describes ongoing work on distributional models for word meaning in context. We abandon the usual one-vector-per-word paradigm in favor of an exemplar model that activates only relevant occurrences. On a paraphrasing task, we find that a simple exemplar model outperforms more complex state-of-the-art models.
Robust Machine Translation Evaluation with Entailment Features.
In: Proceedings of ACL 2009, pages 297-305. Singapore, 2009. Acceptance rate: 21%
Sebastian Padó, Michel Galley, Christopher D. Manning and Daniel Jurafsky.
[doi]  [abstract]  [BibTeX] 
Existing evaluation metrics for machine translation lack crucial robustness: their correlations with human quality judgments vary considerably across languages and genres. We believe that the main reason is their inability to properly capture meaning: A good translation candidate means the same thing as the reference translation, regardless of formulation. We propose a metric that evaluates MT output based on a rich set of features motivated by textual entailment, such as lexical-semantic (in-)compatibility and argument structure overlap. We compare this metric against a combination metric of four state-of-the-art scores (BLEU, NIST, TER, and METEOR) in two different settings. The combination metric outperforms the individual scores, but is bested by the entailment-based metric. Combining the entailment and traditional features yields further improvements.
A Structured Vector Space Model for Word Meaning in Context.
In: Proceedings of EMNLP 2008, pages 897-906. Honolulu, HI, 2008. Acceptance rate: 30%
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We address the task of computing vector space representations for the meaning of word occurrences, which can vary widely according to context. This task is a crucial step towards a robust, vector-based compositional account of sentence meaning. We argue that existing models for this task do not take syntactic structure sufficiently into account. We present a novel structured vector space model that addresses these issues by incorporating the selectional preferences for argument positions. This makes it possible to integrate syntax into the computation of word meaning in context. In addition, the model performs at and above the state of the art for modeling the contextual adequacy of paraphrases.
Semantic role assignment for event nominalisations by leveraging verbal data.
In: Proceedings of COLING 2008, pages 665-672. Manchester, UK, 2008. Acceptance rate: 24%
Sebastian Padó, Marco Pennacchiotti and Caroline Sporleder.
[doi]  [abstract]  [BibTeX] 
This paper presents a novel approach to the task of semantic role labelling for event nominalisations, which make up a considerable fraction of predicates in running text, but are underrepresented in terms of training data and difficult to model. We propose to address this situation by data expansion. We construct a model for nominal role labelling solely from verbal training data. The best quality results from salvaging grammatical features where applicable, and generalising over lexical heads otherwise.
Formalising Multi-layer Corpora in OWL DL - Lexicon Modelling, Querying and Consistency Control.
In: Proceedings of IJCNLP 2008. Hyderabad, India, 2008. Acceptance rate: 28%
Aljoscha Burchardt, Sebastian Padó, Dennis Spohr, Anette Frank and Ulrich Heid.
[doi]  [abstract]  [BibTeX] 
We present a general approach to formally modelling corpora with multi-layered annotation, thereby inducing a lexicon model in a typed logical representation language, OWL DL. This model can be interpreted as a graph structure that offers flexible querying functionality beyond current XML-based query languages and powerful methods for consistency control. We illustrate our approach by applying it to the syntactically and semantically annotated SALSA/TIGER corpus.
Flexible, corpus-based modelling of Human Plausibility Judgments.
In: Proceedings of EMNLP/CoNLL 2007, pages 400-409. Prague, Czech Republic, 2007. Acceptance rate: 27%
Sebastian Padó, Ulrike Padó and Katrin Erk.
[doi]  [abstract]  [BibTeX] 
In this paper, we consider the computational modelling of human plausibility judgements for verb-relation-argument triples, a task equivalent to the computation of selectional preferences. Such models have applications both in psycholinguistics and in computational linguistics. By extending a recent model, we obtain a completely corpus-driven model for this task which achieves significant correlations with human judgements. It rivals or exceeds deeper, resource-driven models while exhibiting higher coverage. Moreover, we show that our model can be combined with deeper models to obtain better predictions than from either model alone.
Annotation précise du français en sémantique de rôles par projection cross-linguistique.
In: Actes de TALN 2007. Toulouse, France, 2007.
Sebastian Padó and Guillaume Pitel.
[doi]  [abstract]  [BibTeX] 
Within the FrameNet paradigm, this article addresses the problem of precise, automatic annotation of semantic roles in a language with no existing FrameNet lexicon. We evaluate the method proposed by Padó and Lapata (2005, 2006), which is based on role projection and was initially applied to the English-German pair. We test its generalizability with respect to (a) languages, by applying it to the English-French pair, and (b) the quality of the source, by using automatic annotation on the English side. The experiments show results on a par with those obtained for German, allowing us to conclude that this approach has great potential for reducing the amount of work needed to create such resources in many languages.
Shalmaneser - a flexible toolbox for semantic role assignment.
In: Proceedings of LREC 2006. Genoa, Italy, 2006.
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This paper presents Shalmaneser, a software package for shallow semantic parsing, the automatic assignment of semantic classes and roles to free text. Shalmaneser is a toolchain of independent modules communicating through a common XML format. System output can be inspected graphically. Shalmaneser can be used either as a "black box" to obtain semantic parses for new datasets (classifiers for English and German frame-semantic analysis are included), or as a research platform that can be extended to new parsers, languages, or classification paradigms.
SALTO - A Versatile Multi-Level Annotation Tool.
In: Proceedings of LREC 2006. Genoa, Italy, 2006.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[doi]  [abstract]  [BibTeX] 
In this paper, we describe the SALTO tool. It was originally developed for the annotation of semantic roles in the frame semantics paradigm, but can be used for graphical annotation of treebanks with general relational information in a simple drag-and-drop fashion. The tool additionally supports corpus management and quality control.
The SALSA corpus: a German corpus resource for lexical semantics.
In: Proceedings of LREC 2006. Genoa, Italy, 2006.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[doi]  [abstract]  [BibTeX] 
This paper describes the SALSA corpus, a large German corpus manually annotated with role-semantic information, based on the syntactically annotated TIGER newspaper corpus (Brants et al., 2002). The first release, comprising about 20,000 annotated predicate instances (about half the TIGER corpus), is scheduled for mid-2006. In this paper we discuss the frame-semantic annotation framework and its cross-lingual applicability, problems arising from exhaustive annotation, strategies for quality control, and possible applications.
Optimal Constituent Alignment with Edge Covers for Semantic Projection.
In: Proceedings of COLING/ACL 2006, pages 1161-1168. Sydney, Australia, 2006. Acceptance rate: 23%
Sebastian Padó and Mirella Lapata.
[doi]  [abstract]  [BibTeX] 
Given a parallel corpus, semantic projection attempts to transfer semantic role annotations from one language to another, typically by exploiting word alignments. In this paper, we present an improved method for obtaining constituent alignments between parallel sentences to guide the role projection task. Our extensions are twofold: (a) we model constituent alignment as minimum weight edge covers in a bipartite graph, which allows us to find a globally optimal solution efficiently; (b) we propose tree pruning as a promising strategy for reducing alignment noise. Experimental results on an English-German parallel corpus demonstrate improvements over state-of-the-art models.
Cross-lingual projection of role-semantic information.
In: Proceedings of HLT/EMNLP 2005, pages 859-866. Vancouver, Canada, 2005. Acceptance rate: 32%
Sebastian Padó and Mirella Lapata.
[doi]  [abstract]  [BibTeX] 
This paper considers the problem of automatically inducing role-semantic annotations in the FrameNet paradigm for new languages. We introduce a general framework for semantic projection which exploits parallel texts, is relatively inexpensive and can potentially reduce the amount of effort involved in creating semantic resources. We propose projection models that exploit lexical and syntactic information. Experimental results on an English-German parallel corpus demonstrate the advantages of this approach.
Analysing models of semantic role assignment using confusability.
In: Proceedings of HLT/EMNLP 2005. Vancouver, Canada, 2005. Acceptance rate: 32%
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We analyze models for semantic role assignment by defining a meta-model that abstracts over features and learning paradigms. This meta-model is based on the concept of role confusability, is defined in information-theoretic terms, and predicts that roles realized by less specific grammatical functions are more difficult to assign. We find that confusability is strongly correlated with the performance of classifiers based on syntactic features, but not for classifiers including semantic features. This indicates that syntactic features approximate a description of grammatical functions, and that semantic features provide an independent second view on the data.
Cross-lingual Bootstrapping for Semantic Lexicons: The case of FrameNet.
In: Proceedings of AAAI 2005. Pittsburgh, PA, 2005. Acceptance rate: 29%
Sebastian Padó and Mirella Lapata.
[doi]  [abstract]  [BibTeX] 
This paper considers the problem of unsupervised semantic lexicon acquisition. We introduce a fully automatic approach which exploits parallel corpora, relies on shallow text properties, and is relatively inexpensive. Given the English FrameNet lexicon, our method exploits word alignments to generate frame candidate lists for new languages, which are subsequently pruned automatically using a small set of linguistically motivated filters. Evaluation shows that our approach can produce high-precision multilingual FrameNet lexicons without recourse to bilingual dictionaries or deep syntactic and semantic analysis.
Semantic Role Labelling With Similarity-Based Generalisation Using EM-based Clustering.
In: Proceedings of SENSEVAL 3. Barcelona, Spain, 2004.
Ulrike Baldewein, Katrin Erk, Sebastian Padó and Detlef Prescher.
[doi]  [abstract]  [BibTeX] 
We describe a system for semantic role assignment built as part of the Senseval III task, based on an off-the-shelf parser and Maxent and Memory-Based learners. We focus on generalisation using several similarity measures to increase the amount of training data available and on the use of EM-based clustering to improve role assignment. Our final system achieves Precision=73.6%, Recall=59.4% (F=65.7).
A powerful and versatile XML Format for representing role-semantic annotation.
In: Proceedings of LREC 2004. Lisbon, Portugal, 2004.
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We present two XML formats for the description and encoding of semantic role information in corpora. The TIGER/SALSA XML format provides a modular representation for semantic roles and syntactic structure. The Text-SALSA XML format is a lightweight version of TIGER/SALSA XML designed for manual annotation with an XML editor rather than a special tool. Both formats can deal with underspecification, roles crossing the sentence boundary, compound splitting, and whole-sentence tags for meta-level comments.
Querying both time-aligned and hierarchical corpora with NXT search.
In: Proceedings of LREC 2004. Lisbon, Portugal, 2004.
Ulrich Heid, Holger Voormann, Jan-Uwe Milde, Ulrike Gut, Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
One problem for the (re-)usability and exchange of annotated corpora is the lack of standards in corpus [...]. This paper reports on the NXT Search tool, which was used to query two corpora with very different annotation [...]. With automatic data format conversion, both corpora can be accessed and searched with NXT Search.
The Influence of Argument Structure on Semantic Role Assignment.
In: Proceedings of EMNLP 2004, pages 103-110. Barcelona, Spain, 2004. Acceptance rate: 24%
Sebastian Padó and Gemma Boleda Torrent.
[doi]  [abstract]  [BibTeX] 
We present a data and error analysis for semantic role labelling. In a first experiment, we build a generic statistical model for semantic role assignment in the FrameNet paradigm and show that there is a high variance in performance across frames. The main hypothesis of our paper is that this variance is to a large extent a result of differences in the underlying argument structure of the predicates in different frames. In a second experiment, we show that frame uniformity, which measures argument structure variation, correlates well with the performance figures, effectively explaining the variance.
Semantic Role Labelling for Chunk Sequences.
In: Proceedings of the CoNLL 2004 shared task. Boston, MA, 2004. Acceptance rate: 48%
Ulrike Baldewein, Katrin Erk, Sebastian Padó and Detlef Prescher.
[doi]  [abstract]  [BibTeX] 
We describe a statistical approach to semantic role labelling that employs only shallow information. We use a Maximum Entropy learner, augmented by EM-based clustering to model the fit between a verb and its argument candidate. The instances to be classified are sequences of chunks that occur frequently as arguments in the training corpus. Our best model obtains an F score of 51.70 on the test set.
Constructing Semantic Space Models from Parsed Corpora.
In: Proceedings of ACL 2003, pages 128-135. Sapporo, Japan, 2003. Acceptance rate: 20%
Sebastian Padó and Mirella Lapata.
[doi]  [abstract]  [BibTeX] 
Traditional vector-based models use word co-occurrence counts from large corpora to represent lexical meaning. In this paper we present a novel approach for constructing semantic spaces that takes syntactic relations into account. We introduce a formalisation for this class of models and evaluate their adequacy on two modelling tasks: semantic priming and automatic discrimination of lexical relations.
Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation.
In: Proceedings of ACL 2003. Sapporo, Japan, 2003. Acceptance rate: 20%
Katrin Erk, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[doi]  [abstract]  [BibTeX] 
We describe the ongoing construction of a large, semantically annotated corpus resource as a reliable basis for the large-scale acquisition of word-semantic information, e.g. the construction of domain-independent lexica. The backbone of the annotation are semantic roles in the frame semantics paradigm. We report experiences and evaluate the annotated data from the first project stage. On this basis, we discuss the problems of vagueness and ambiguity in semantic annotation.

Workshop papers

Modeling Paths for Explainable Knowledge Base Completion.
In: Proceedings of the ACL BlackboxNLP workshop. Florence, Italy, 2019.
Josua Stadelmaier and Sebastian Padó.
[doi]  [BibTeX] 
Clustering-Based Article Identification in Historical Newspapers.
In: Proceedings of the NAACL LaTeCH-CLfL workshop. Minneapolis, MN, 2019.
Martin Riedl, Daniela Betz and Sebastian Padó.
[doi]  [BibTeX] 
Using Embeddings to Compare FrameNet Frames Across Languages.
In: COLING Workshop on Linguistic Resources for Natural Language Processing. Santa Fe, NM, 2018.
Jennifer Sikos and Sebastian Padó.
[doi]  [BibTeX] 
Addressing Low-Resource Scenarios with Character-aware Embeddings.
In: Proceedings of the Second Workshop on Subword and Character Level Models. New Orleans, LA, 2018.
Sean Papay, Sebastian Padó and Ngoc Thang Vu.
[doi]  [abstract]  [BibTeX] 
Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words. In this paper, we explore a setup where only corpora with millions of words are available, and many words in any new text are out of vocabulary. This setup is both of practical interest – modeling the situation for specific domains and low-resource languages – and of psycholinguistic interest, since it corresponds much more closely to the actual experiences and challenges of human language learning and use. We evaluate skip-gram word embeddings and two types of character-based embeddings on word relatedness prediction. On large corpora, performance of both model types is equal for frequent words, but character awareness already helps for infrequent words. Consistently, on small corpora, the character-based models perform overall better than skip-grams. The concatenation of different embeddings performs best on small corpora and robustly on large corpora.
Towards Cross-Lingual Comparability of Derivational Lexicons: An Extraction Algorithm for CELEX.
In: Proceedings of DeriMo. Milan, Italy, 2017.
Elnaz Shafaei, Diego Frassinelli, Gabriella Lapesa and Sebastian Padó.
[doi]  [BibTeX] 
Evaluating and Improving a Derivational Lexicon with Graph-theoretical Methods.
In: Proceedings of DeriMo. Milan, Italy, 2017.
Sean Papay, Gabriella Lapesa and Sebastian Padó.
[doi]  [BibTeX] 
Annotation, Modelling and Analysis of Fine-Grained Emotions on a Stance and Sentiment Detection Corpus.
In: Proceedings of the EMNLP WASSA workshop. Copenhagen, Denmark, 2017.
Hendrik Schuff, Jeremy Barnes, Julian Mohme, Sebastian Padó and Roman Klinger.
[doi]  [abstract]  [BibTeX] 
There is a rich variety of data sets for sentiment analysis (viz., polarity and subjectivity classification). For the more challenging task of detecting discrete emotions following the definitions of Ekman and Plutchik, however, there are far fewer data sets, and notably no resources for the social media domain. This paper contributes to closing this gap by extending the SemEval 2016 stance and sentiment dataset with emotion annotation. We (a) analyse annotation reliability and annotation merging; (b) investigate the relation between emotion annotation and the other annotation layers (stance, sentiment); (c) report modelling results as a baseline for future work.
Investigating the Relationship between Literary Genres and Emotional Plot Development.
In: Proceedings of the ACL LaTeCH-CLfL workshop. Vancouver, BC, 2017.
Evgeny Kim, Sebastian Padó and Roman Klinger.
[doi]  [abstract]  [BibTeX] 
Literary genres are commonly viewed as being defined in terms of content and stylistic features. In this paper, we focus on one particular class of lexical features, namely emotion information, and investigate the hypothesis that emotion-related information correlates with particular genres. Using genre classification as a testbed, we compare a model that computes lexicon-based emotion scores globally for complete stories with a model that tracks emotion arcs through stories on a subset of Project Gutenberg with five genres. Our main findings are: (a) the global emotion model is competitive with a large-vocabulary bag-of-words genre classifier (80% F1); (b) the emotion arc model shows a lower performance (59% F1) but shows complementary behavior to the global model, as indicated by very good performance of an oracle ensemble (94% F1); (c) genres differ in the extent to which stories follow the same emotional arcs, with particularly uniform behavior for anger (mystery) and fear (adventures, romance, humor, science fiction).
Predicting the Direction of Derivation in English conversion.
In: Proceedings of the ACL SIGMORPHON workshop, pages 93-98. Berlin, Germany, 2016.
Max Kisselew, Laura Rimell, Alexis Palmer and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Conversion is a word formation operation that changes the grammatical category of a word in the absence of overt morphology. Conversion is extremely productive in English (e.g., tunnel, talk). This paper investigates whether distributional information can be used to predict the diachronic direction of conversion for homophonous noun–verb pairs. We aim to predict, for example, that tunnel was used as a noun prior to its use as a verb. We test two hypotheses: (1) that derived forms are less frequent than their bases, and (2) that derived forms are more semantically specific than their bases, as approximated by information theoretic measures. We find that hypothesis (1) holds for N-to-V conversion, while hypothesis (2) holds for V-to-N conversion. We achieve the best overall account of the historical data by taking both frequency and semantic specificity into account. These results provide a new perspective on linguistic theories regarding the semantic specificity of derivational morphemes, and on the morphosyntactic status of conversion.
Same Same but Different: Type and Typicality in a Distributional Model of Complement Coercion.
In: Proceedings of NetWords, pages 91-94. Pisa, Italy, 2015.
Alessandra Zarcone, Sebastian Padó and Alessandro Lenci.
[doi]  [abstract]  [BibTeX] 
We aim to model the results from a self-paced reading experiment, which tested the effect of semantic type clash and typicality on the processing of German complement coercion. We present two distributional semantic models to test if they can model the effect of both type and typicality in the psycholinguistic study. We show that one of the models, without explicitly representing type information, can account both for the effect of type and typicality in complement coercion.
Measuring Semantic Content To Assess Asymmetry in Derivation.
In: Proceedings of the IWCS Workshop on Advances in Distributional Semantics. London, UK, 2015.
Sebastian Padó, Alexis Palmer, Max Kisselew and Jan Šnajder.
[doi]  [BibTeX] 
Mapping conceptual features to referential properties.
In: Proceedings of the 3rd international ESSENCE workshop: Algorithms for processing meaning. Barcelona, Spain, 2015.
Abhijeet Gupta, Gemma Boleda, Marco Baroni and Sebastian Padó.
[BibTeX] 
Morphological Priming in German: The Word is Not Enough (Or Is It?).
In: Proceedings of NetWords, pages 42-45. Pisa, Italy, 2015.
Sebastian Padó, Britta Zeller and Jan Šnajder.
[doi]  [BibTeX] 
GermEval 2014 Named Entity Recognition Shared Task: Companion Paper.
In: Proceedings of the KONVENS GermEval workshop, pages 104-112. Hildesheim, Germany, 2014.
Darina Benikova, Chris Biemann, Max Kisselew and Sebastian Padó.
[BibTeX] 
The Curious Case of Metonymic Verbs: A Distributional Characterization.
In: Proceedings of the IWCS workshop "Towards A Formal Distributional Semantics". Potsdam, Germany, 2013.
Jason Utt, Alessandro Lenci, Sebastian Padó and Alessandra Zarcone.
[doi]  [abstract]  [BibTeX] 
Logical metonymy combines an event-selecting verb with an entity-denoting noun (e.g., The writer began the novel), triggering a covert event interpretation (e.g., reading, writing). Experimental investigations of logical metonymy must assume a binary distinction between metonymic (i.e. event-selecting) verbs and non-metonymic verbs to establish a control condition. However, this binary distinction (whether a verb is metonymic or not) is mostly made on intuitive grounds, which introduces a potential confounding factor. We describe a corpus-based approach which characterizes verbs in terms of their behavior at the syntax-semantics interface. The model assesses the extent to which transitive verbs prefer event-denoting objects over entity-denoting objects. We then test this “eventhood” measure on psycholinguistic datasets, showing that it can distinguish not only metonymic from non-metonymic verbs, but that it can also capture more fine-grained distinctions among different classes of metonymic verbs, putting such distinctions into a new graded perspective.
Using UIMA to Structure An Open Platform for Textual Entailment.
In: Proceedings of the 3rd Workshop on Unstructured Information Management Architecture, pages 26-33. Darmstadt, Germany, 2013.
Tae-Gil Noh and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
EXCITEMENT is a novel, open software platform for Textual Entailment (TE) which uses the UIMA framework. This paper discusses the design considerations regarding the roles of UIMA within the EXCITEMENT Open Platform (EOP). We focus on two points: (a) how best to design the representation of entailment problems within the UIMA CAS and its type system, and (b) the integration and usage of UIMA components among non-UIMA components.
Modeling covert event retrieval in logical metonymy: probabilistic and distributional accounts.
In: Proceedings of the NAACL Workshop on Cognitive Modeling in Computational Linguistics, pages 70-79. Montreal, QC, 2012.
Alessandra Zarcone, Jason Utt and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Logical metonymies (The student finished the beer) represent a challenge to compositionality since they involve semantic content not overtly realized in the sentence (covert events -> drinking the beer). We present a contrastive study of two classes of computational models for logical metonymy in German, namely a probabilistic model and a distributional, similarity-based model. We build both models from the SDEWAC corpus and evaluate them against a dataset from a self-paced reading and a probe recognition study for their sensitivity to thematic fit effects via their accuracy in predicting the correct covert event in a metonymical context. The similarity-based models allow for better coverage while maintaining the accuracy of the probabilistic models.
A Distributional Memory for German.
In: Proceedings of the KONVENS workshop on recent developments and applications of lexical-semantic resources. Vienna, Austria, 2012.
Sebastian Padó and Jason Utt.
[doi]  [abstract]  [BibTeX] 
This paper describes the creation of a Distributional Memory (Baroni and Lenci 2010) resource for German. Distributional Memory is a generalized distributional resource for lexical semantics that does not have to commit to a particular vector space at the time of creation. We induce a resource from a German corpus, following the original design decisions as closely as possible, and discuss the steps necessary for a new language. We evaluate the German DM model on a synonym selection task, finding that it can compete with existing models.
Dependency-based Question Validation for German.
In: CLEF Working Notes. Amsterdam, Netherlands, 2011.
Svitlana Babych, Alexander Henn, Jan Pawellek and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This article describes the Heidelberg contribution to the CLEF 2011 QA4MRE task for German. We focus on the objective of not using any external resources, building a system that represents questions, answers and texts as formulae in propositional logic derived from dependency structure. Background knowledge is extracted from the background corpora using several knowledge extraction strategies. We answer questions by attempting to infer answers from the test documents complemented by background knowledge, with a distance measure as fall-back. The main challenge is to specify the translation from dependency structure into a logical representation. For this step, we suggest different rule sets and evaluate various configuration parameters that tune accuracy and coverage. All runs exceed a random baseline, but show different coverage/accuracy profiles (accuracy up to 44%, coverage up to 65%).
Soundex-based Translation Correction in Urdu-English Cross-Language Information Retrieval.
In: Proceedings of the IJCNLP Workshop on Cross-Lingual Information Retrieval, pages 25-29. Chiang Mai, Thailand, 2011.
Manaal Faruqui, Prasenjit Majumdar and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Cross-language information retrieval is difficult for languages with few processing tools or resources such as Urdu. An easy way of translating content words is provided by Google Translate, but due to lexicon limitations named entities (NEs) are transliterated letter by letter. The resulting NE errors (zynydyny zdn for Zinedine Zidane) hurt retrieval. We propose to replace English non-words in the translation output. First, we determine phonetically similar English words with the Soundex algorithm. Then, we choose among them by a modified Levenshtein distance that models correct transliteration patterns. This strategy yields an improvement of 4% MAP (from 41.2 to 45.1, monolingual 51.4) on the FIRE-2010 dataset.
Multi-Way Classification of Semantic Relations Between Pairs of Nominals.
In: Proceedings of the 5th SIGLEX Workshop on Semantic Evaluation, pages 33-38. Uppsala, Sweden, 2010.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano and Stan Szpakowicz.
[doi]  [abstract]  [BibTeX] 
We describe the SemEval-2 task 8 (multi-way classification of semantic relations between pairs of nominals). The task was designed to compare different approaches to the problem and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.
"I like work: I can sit and look at it for hours" - Type clash vs. plausibility in covert event recovery.
In: Proceedings of the VERB 2010 workshop. Pisa, Italy, 2010.
Alessandra Zarcone and Sebastian Padó.
[BibTeX] 
Multi-word expressions in Textual Entailment: Much ado about nothing?.
In: Proceedings of the ACL TextInfer workshop, pages 1-9. Singapore, 2009.
Marie-Catherine de Marneffe, Sebastian Padó and Christopher D. Manning.
[doi]  [abstract]  [BibTeX] 
Multi-word expressions (MWE) have seen much attention from the NLP community. In this paper, we investigate their impact on the recognition of textual entailment (RTE). Using the manual Microsoft Research annotations, we first manually count and classify MWEs in RTE data. We find few, most of which are arguably unlikely to cause processing problems. We then consider the impact of MWEs on a current RTE system. We are unable to confirm that entailment recognition suffers from wrongly aligned MWEs. In addition, MWE alignment is difficult to improve, since MWEs are poorly represented in state-of-the-art paraphrase resources, the only available sources for multi-word similarities. We conclude that RTE should concentrate on other phenomena impacting entailment, and that paraphrase knowledge is best understood as capturing general lexico-syntactic variation.
Paraphrase assessment in structured vector space: Exploring parameters and datasets.
In: Proceedings of the EACL Workshop on Geometrical Methods for Natural Language Semantics, pages 57-65. Athens, Greece, 2009.
Katrin Erk and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
The appropriateness of paraphrases for words often depends on context: "grab" can replace "catch" in "catch a ball", but not in "catch a cold". Structured Vector Space (SVS) is a model that computes word meaning in context in order to assess the appropriateness of such paraphrases. This paper investigates "best-practice" parameter settings for SVS, and it presents a method to obtain large datasets for paraphrase assessment from corpora with WSD annotation.
The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages.
In: Proceedings of CoNLL-2009, pages 1-18. Boulder, CO, 2009.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria A. Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue and Yi Zhang.
[doi]  [abstract]  [BibTeX] 
For the 11th straight year, the Conference on Computational Natural Language Learning has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.
SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals.
In: Proceedings of the NAACL Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 94-99. Boulder, CO, 2009.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano and Stan Szpakowicz.
[doi]  [abstract]  [BibTeX] 
We present a brief overview of the main challenges in the extraction of semantic relations from English text, and discuss the shortcomings of previous data sets and shared tasks. This leads us to introduce a new task, which will be part of SemEval-2010: multi-way classification of mutually exclusive semantic relations between pairs of common nominals. The task is designed to compare different approaches to the problem and to provide a standard testbed for future research, which can benefit many applications in Natural Language Processing.
Textual Entailment Features for Machine Translation Evaluation.
In: Proceedings of the EACL Workshop on Machine Translation, pages 37-41. Athens, Greece, 2009.
Sebastian Padó, Michel Galley, Christopher D. Manning and Daniel Jurafsky.
[doi]  [abstract]  [BibTeX] 
We present two regression models for the prediction of pairwise preference judgments among MT hypotheses. Both models are based on feature sets that are motivated by textual entailment and incorporate lexical similarity as well as local syntactic features and specific semantic phenomena. One model predicts absolute scores; the other one direct pairwise judgments. We find that both models are competitive with regression models built over the scores of established MT evaluation metrics. Further data analysis clarifies the complementary behavior of the two feature sets.
Deciding Entailment and Contradiction with Stochastic and Edit Distance-based Alignment.
In: Proceedings of the Text Analysis Conference. Gaithersburg, MD, 2008.
Sebastian Padó, Marie-Catherine de Marneffe, Bill MacCartney, Anna N. Rafferty, Eric Yeh and Christopher D. Manning.
[BibTeX] 
Cross-lingual Parallelism and Translational Equivalence: The Case of FrameNet Frames.
In: Proceedings of the NODALIDA Workshop on Building Frame Semantics Resources for Scandinavian and Baltic Languages. Tartu, Estonia, 2007.
Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
Annotation projection is a strategy for the cross-lingual transfer of annotations which can be used to bootstrap linguistic resources for low-density languages, such as role-semantic databases similar to FrameNet. In this paper, we investigate the main assumption underlying annotation projection, cross-lingual parallelism, which states that annotation is parallel across languages. Concentrating on the level of frames, we provide a qualitative and quantitative characterisation of the relationship between translation and cross-lingual parallelism on the basis of a trilingual English–French–German corpus. We link frame (non)-parallelism to different kinds of translational shifts and show that a simple heuristic can detect the majority of such shifts.
Inducing a Computational Lexicon from a Corpus with Syntactic and Semantic Annotation.
In: Proceedings of IWCS-7. Tilburg, The Netherlands, 2007.
Dennis Spohr, Aljoscha Burchardt, Sebastian Padó, Anette Frank and Ulrich Heid.
[doi]  [abstract]  [BibTeX] 
To date, linguistically annotated corpora are mainly exploited for feature-based training of automatic labelling systems. In this paper, we present a general approach for the Description Logics-based modelling of multi-layered annotated corpora which offers (i) flexible and enhanced querying functionality that goes beyond current XML-based query languages, (ii) a basis for consistency checking, and (iii) a general method for defining abstractions over corpus annotations. We apply this method to the syntactically and semantically annotated SALSA/TIGER corpus. By defining abstractions over the corpus data, we generalise from a large set of individual corpus annotations to a corresponding lexicon model. We discuss issues arising from modelling multi-layered corpus annotations in Description Logics and illustrate the benefits of our approach with concrete examples.
Towards a Computational Model of Gradience in Word Sense.
In: Proceedings of IWCS-7. Tilburg, The Netherlands, 2007.
Katrin Erk and Sebastian Padó.
[BibTeX] 
To cause or not to cause: Cross-lingual semantic matching for paraphrase modelling.
In: Proceedings of the Cross-Language Knowledge Induction Workshop. Cluj-Napoca, Romania, 2005.
Sebastian Padó and Katrin Erk.
[BibTeX] 
PropBank, SALSA and FrameNet: How Design Determines Product.
In: Proceedings of the Workshop on Building Lexical Resources From Semantically Annotated Corpora, LREC-2004. Lisbon, Portugal, 2004.
Michael Ellsworth, Katrin Erk, Paul Kingsbury and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
We compare three projects that annotate semantic roles: PropBank, FrameNet, and SALSA. The first part of our analysis is a comparison of the different word sense distinction criteria underlying the annotation. Then, we study the effects of these criteria at the level of actual phenomena that require annotation. In particular, we discuss metaphor, support constructions, words with multiple meaning aspects, phrases realizing more than one semantic role, and nonlocal semantic roles.
Towards a better understanding of frame element assignment errors.
In: Proceedings of the Workshop on Prospects and Advances in the Syntax/Semantics Interface. Nancy, France, 2003.
Sebastian Padó and Gemma Boleda Torrent.
[BibTeX] 
The SALSA Annotation Tool.
In: Proceedings of the Workshop on Prospects and Advances in the Syntax/Semantics Interface. Nancy, France, 2003.
Katrin Erk, Andrea Kowalski and Sebastian Padó.
[BibTeX] 
Building a Resource for Lexical Semantics.
In: Proceedings of the Workshop on Frame Semantics, XVII. International Congress of Linguists. Prague, Czech Republic, 2003.
Katrin Erk, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[BibTeX] 

Book chapters

Statistical Machine Translation Support Improves Human Adjective Translation.
In: O. Culo and S. Hansen-Schirra, editors, Crossroads between Contrastive Linguistics, Translation Studies and Machine Translation: TC3 II, pages 121-152. Language Science Press, 2017.
Gerhard Kremer, Matthias Hartung, Sebastian Padó and Stefan Riezler.
[doi]  [abstract]  [BibTeX] 
In this paper we present a study in computer-assisted translation, investigating whether non-professional translators can profit directly from automatically constructed bilingual phrase pairs. Our support is based on state-of-the-art statistical machine translation (SMT), consisting of a phrase table that is generated from large parallel corpora, and a large monolingual language model. In our experiment, human translators were asked to translate adjective–noun pairs in context in the presence of suggestions created by the SMT model. Our results show that SMT support results in an acceptable slowdown in translation time while significantly improving translation quality.
Textual Entailment.
In: R. Mitkov, editor, Oxford Handbook of Computational Linguistics, 2nd edition. Oxford University Press, 2016.
Sebastian Padó and Ido Dagan.
[doi]  [abstract]  [BibTeX] 
Textual entailment is a binary relation between two natural-language texts (called ‘text’ and ‘hypothesis’), where readers of the ‘text’ would agree the ‘hypothesis’ is most likely true (Peter is snoring → A man sleeps). Its recognition requires an account of linguistic variability (an event may be realized in different ways, e.g. Peter buys the car ↔ The car is purchased by Peter) and of relationships between events (e.g. Peter buys the car → Peter owns the car). Unlike logics-based inference, textual entailment also covers cases of probable but still defeasible entailment (A hurricane hit Peter’s town → Peter’s town was damaged). Since human common-sense reasoning often involves such defeasible inferences, textual entailment is of considerable interest for real-world language processing tasks, as a generic, application-independent framework for semantic inference. This chapter discusses the history of textual entailment, approaches to recognizing it, and its integration in various NLP tasks.
Semantics in Computational Lexicons.
In: C. Maienborn, K. von Heusinger and P. Portner, editors, Semantics: An International Handbook of Natural Language Meaning, pages 2887-2917. De Gruyter, 2012.
Anette Frank and Sebastian Padó.
[doi]  [abstract]  [BibTeX] 
This chapter gives an overview of work on the representation of semantic information in lexicon resources for computational natural language processing (NLP). It starts with a broad overview of the history and state of the art of different types of semantic lexicons in Computational Linguistics, and discusses their main use cases. Section 2 is devoted to questions of how to construct semantic lexicons for Computational Linguistics. We discuss diverse modelling principles for semantic lexicons and methods for their construction, ranging from largely manual resource creation to automated methods for learning lexicons from text, semi-structured or unstructured. Section 3 addresses issues related to the cross-lingual and multi-lingual creation of broad-coverage semantic lexicon resources. Section 4 discusses interoperability, i.e., the combination of lexical (and other) resources describing different meaning aspects. Section 5 concludes with an outlook on future research directions.
Machine Translation Evaluation and Optimization.
In: J. Olive, C. Christianson and J. McCary, editors, Handbook of Natural Language Processing and Machine Translation: DARPA Global Autonomous Language Exploitation, pages 745-843. Springer, 2011.
B. Dorr, Y. Al-Onaizan, M. Galley, N. Habash, D. Jones, S. Kulick, A. Lavie, G. Leusch, N. Madnani, C. Manning, M. Marcus, A. Mauser, M. Ostendorf, S. Padó, M. Przybocki, A. Rosti, R. Schwartz, M. Snover, C. Tate, S. Vogel and C. Voss.
[doi]  [BibTeX] 
Using FrameNet for the Semantic Analysis of German: Annotation, Representation, and Automation.
In: H. C. Boas, editor, Multilingual FrameNets in Computational Lexicography - Methods and Applications, pages 209-244. De Gruyter, 2009.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[doi]  [BibTeX] 

Abstracts

Distributional Analysis of Function Words.
In: Proceedings of the 13th International Tbilisi Symposium on Language, Logic and Computation. Batumi, Georgia, 2019. To appear
Daniel Hole and Sebastian Padó.
[doi]  [BibTeX] 
Learning Trilingual Dictionaries for Urdu - Roman Urdu - English.
In: Proceedings of the ACL Workshop on Widening NLP. Florence, Italy, 2019.
Moiz Rauf and Sebastian Padó.
[BibTeX] 
Supporting Discourse Network Analysis through Machine Learning for Claim Detection and Classification.
In: Proceedings of the 4th European Conference on Social Networks. Zurich, Switzerland, 2019. To appear
Sebastian Haunss, Nico Blokker, Sebastian Padó, Jonas Kuhn, Andre Blessing, Gabriella Lapesa and Erenay Dayanik.
[BibTeX] 
Digitale Modellierung von Figurenkomplexität am Beispiel des Parzival von Wolfram von Eschenbach.
In: Digital Humanities im Deutschsprachigen Raum. Cologne, Germany, 2018.
Manuel Braun, Roman Klinger, Sebastian Padó and Gabriel Viehhauser.
[BibTeX] 
Type disambiguation of English -ment derivatives.
In: Proceedings of the 11th Mediterranean Morphology Meeting. Nicosia, Cyprus, 2017.
Gabriella Lapesa, Lea Kawaletz, Marios Andreou, Max Kisselew, Sebastian Padó and Ingo Plag.
[BibTeX] 
Characterizing the pragmatic component of distributional vectors in terms of polarity: Experiments on German über verbs.
In: ESSLLI DISSALT Workshop: Distributional Semantics and Semantic Theory. Bolzano, Italy, 2016.
Gabriella Lapesa, Max Kisselew, Sebastian Padó, Tillmann Pross and Antje Rossdeutscher.
[BibTeX] 
Instance-based disambiguation of English -ment derivatives.
In: Proceedings of the Conference on Cognitive Structures: Linguistic, Philosophical and Psychological Perspectives. Düsseldorf, Germany, 2016.
Marios Andreou, Lea Kawaletz, Max Kisselew, Gabriella Lapesa, Sebastian Padó and Ingo Plag.
[BibTeX] 
CRETA (Centrum für reflektierte Textanalyse) - Fachübergreifende Methodenentwicklung in den Digital Humanities.
In: Digital Humanities im Deutschsprachigen Raum. Leipzig, Germany, 2016.
Jonas Kuhn, Artemis Alexiadou, Manuel Braun, Thomas Ertl, Sabine Holtz, Cathleen Kantner, Catrin Misselhorn, Sebastian Padó, Sandra Richter, Achim Stein and Claus Zittel.
[BibTeX] 
Quantifying regularity in morphological processes: An ongoing study on nominalization in German.
In: ESSLLI DISSALT Workshop: Distributional Semantics and Semantic Theory. Bolzano, Italy, 2016.
Rossella Varvara, Gabriella Lapesa and Sebastian Padó.
[BibTeX] 
'Over reference': A comparative study on German prefix verbs.
In: ESSLLI SemRefPlus Workshop: Referential semantics one step further: Incorporating insights from conceptual and distributional approaches to meaning. Bolzano, Italy, 2016.
Tillmann Pross, Antje Rossdeutscher, Sebastian Padó, Gabriella Lapesa and Max Kisselew.
[BibTeX] 
Logical metonymy: Disentangling object type and thematic fit.
In: Architecture and Mechanisms of Language Processing. Aix-en-Provence, France, 2013.
Alessandra Zarcone and Sebastian Padó.
[BibTeX] 
Consistency and Coverage: Challenges for exhaustive semantic annotation.
In: Deutsche Gesellschaft für Sprachwissenschaft. Bielefeld, Germany, 2006.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[BibTeX] 
Challenges in lexical semantics: Non-compositionality in SALSA corpus annotation.
In: Deutsche Gesellschaft für Sprachwissenschaft. Bielefeld, Germany, 2006.
Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Padó and Manfred Pinkal.
[BibTeX] 

Edited volumes

Proceedings of the ACL Workshop on Relational Models of Semantics.
Portland, OR, 2011.
S. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó and S. Szpakowicz.
[BibTeX] 
Proceedings of the EMNLP Workshop on Geometrical Models of Natural Language Semantics.
Edinburgh, UK, 2011.
S. Padó and Y. Peirsman.
[BibTeX] 
Proceedings of the EMNLP TextInfer Workshop on Textual Entailment.
Edinburgh, UK, 2011.
S. Padó and S. Thater.
[BibTeX] 

PhD Thesis

Cross-Lingual Annotation Projection Models for Role-Semantic Information.
Institute for Computational Linguistics, Saarland University, 2007.
Sebastian Padó
[a4]  [a5]