Research interests (last updated Jul 2024)

Put briefly, my interests have centered on the development of representations for the meaning of words, phrases and documents ("natural language") that can be acquired from corpora (or at least from normal language users). I am interested in such representations -- and the models used to construct them -- from three main angles:

(a) various linguistic and psycholinguistic phenomena (such as ambiguity, semantic relations, inference, and cognitive processing cost);
(b) how language data can provide insight into processes of (human) negotiation and knowledge transfer (opening a window to computational social sciences and humanities);
(c) how language data is represented by machine learning models for unstructured input and how such representations relate (and can be related) to structured formats such as knowledge graphs.

The following paragraphs make these core interests more concrete and link to the most recent (or most canonical) papers.

Meaning Representations

Modeling Word Meaning and Semantic Relations

Historically, I started out by developing a "distributional" framework for building word embeddings from dependency graphs for tasks such as synonymy detection and prediction of priming effects, following the intuition that these models can outperform pure bag-of-words models (Padó and Lapata CL 2007). We then used these dependency-based models for the representation of selectional preferences (Erk, Padó, Padó CL 2010).

I'm also interested in cross-lingual models, and have investigated strategies to induce bilingual embedding spaces from comparable corpora fairly early on. Back then, we found that this process can profit not only from cross-lingual synonymy (=translation) but also from "looser" semantic relations (Peirsman and Padó TSNLP 2011). Later, I looked into transferring embeddings using bilingual dictionaries (Utt and Padó TACL 2014). Lately, we have worked on multilingual data in the context of semantic frame identification (see below).

Linguistic Aspects

Semantics and Morphology

Derivation (build + er -> builder, build + ing -> building) is situated at the boundary between morphology and semantics, an interesting area to explore with analysis methods from both sides. We have built DErivBase, a large derivational lexicon for German (Zeller, Šnajder, Padó ACL 2013). We have added information about semantic (in-)transparency to the resource, finding that both lexical (distributional) information and structural (paradigmatic) information contribute (Zeller, Padó, Šnajder COLING 2014). We have linked the difficulty of predicting the meaning of derived words to aspects of their argument structure (Padó et al. COLING 2016) and have worked on disambiguating novel nominalizations (Lapesa et al. Word Structure 2018) as well as characterizing differences between nominalization paradigms (Varvara, Lapesa, Padó Morphology 2021).

Discourse and Document Level Properties

Lexical knowledge also interfaces in interesting ways with structures at the discourse and document levels. We performed a corpus-based investigation of the concept of "formal vs. informal" address (tu/vous, tú/usted, du/Sie, etc.) in English literature. Since English does not overtly mark formality, we took a bilingual approach (Faruqui and Padó EACL 2012). A second task I have worked on is the detection of reported speech ("quotations"), for which we developed a probabilistic model (Scheible, Klinger, Padó ACL 2016) and a corpus-agnostic neural model (Papay and Padó RANLP 2019). My more recent activities in this area mostly center around questions from political science (see below).

Compositionality and Polysemy

An early study I contributed to looked at how the meaning of a predicate is influenced by its arguments, which we modeled by way of expectations (Erk and Padó EMNLP 2008) -- this can be seen as a precursor of today's contextualized embeddings.

An interesting alternative way to look at word meaning in context is to characterize it in terms of single-word substitutions that are only possible in particular contexts (the "lexical substitution" paradigm). We have constructed a large, "all-words" lexical substitution corpus for English (Kremer et al. 2014) and showed that substitution data is helpful for sense discrimination (Alagic et al. AAAI 2018).

Psycholinguistics of Semantic Processing

My interest in meaning and ambiguity also extends to the psycholinguistic side. In an investigation of the time course of metonymic sentence interpretation, we have found effects from subject (actor) choice that are consistent with a primarily world knowledge-based interpretation process (Zarcone, Padó, Lenci J. CogSci 2014) or at least with an interaction between lexical and world knowledge (Zarcone et al. Frontiers in Psychology 2017).

Semantic Lexicons and Frame Semantics

Even though the paradigm of inducing lexical-semantic knowledge from corpora is our best shot at large-scale resource building, it has its own share of problems. There may be fundamental disagreement on what semantic dimensions to use for the description of meaning, be it with respect to frameworks for semantic role annotation (Ellsworth et al. LREC 2006) or general-purpose semantic verb classifications (Culo et al. LRE 2008). Formal semantic analysis can also be useful in the context of Semantic Web / Information Extraction (Augenstein, Rudolph, Padó ESWC 2012).

Following up on my PhD thesis, where I worked on cross-lingual projection of frame-semantic information (Padó and Lapata JAIR 2009), I have a renewed interest in the analysis of frame-semantic frames. We investigated their ability to capture monolingual paraphrases (Sikos and Padó Constructions and Frames 2018), and to transfer across languages (Sikos, Roth, Padó LiLT 2022). We have also argued that frame semantics is a suitable framework to approach the analysis of adverbs (Nikolaev, Baker, Petruck, Padó STARSEM 2023).

Text and Understanding

I am involved in two broad application areas where NLP is mainly an enabling technology to investigate domain-specific questions: Digital Humanities and Computational Political Science.

Digital Humanities

In digital humanities, one phenomenon I have looked at is emotions, with a focus on the relationship between events and emotions, taking a cue from semantic role research. We have elicited event-related emotions (Troiano, Klinger, Padó ACL 2019), and have looked at emotions in translation (Troiano, Klinger, Padó COLING 2020). We've also looked at emotions diachronically, analyzing travelogues over time (Ehrlicher et al. Liinc 2019). Most recently, we have established a nice correspondence between emotions and frame-semantic frames via appraisals (Troiano, Klinger, Padó NEJLT 2023).

A second phenomenon is genre, a very helpful but also elusive concept. We looked at the relationship between literary genres and emotions, with mixed success (Kim, Klinger, Padó LaTeCH/CLfL 2017) and investigated to what extent genre 'falls out' of distributional regularities in Spanish 'siglo de oro' plays (Lehmann and Padó ZfDG 2022).

A final topic I've worked on is historical newspaper data and the technical challenges it raises. This includes article segmentation (Betz et al. LaTeCH-CLfL 2019) and named entity recognition (Riedl and Padó ACL 2019).

Computational Political Science

In political science, I have worked on NLP methods to automatically construct discourse networks from newspaper texts. This is a relatively complex span/relation extraction task (Padó et al. ACL 2019). We have created a large annotated corpus for the migration domain, DebateNet (Blokker et al. LRE 2022), and have investigated a number of technical aspects: integrating automatic and manual annotation (Haunss et al. Politics & Governance 2020), fairness (Dayanik and Padó ACL 2020), and hierarchical classification (Dayanik et al. ACL Findings 2022).

I am also interested in computational models of the positioning of political parties and other actors, looking at the capabilities of different model classes (Ceron, Blokker and Padó CoNLL 2022). We have extended the base question to fine-grained scaling (Ceron, Nikolaev and Padó ACL Findings 2023) and to multi-lingual scaling (Nikolaev, Ceron and Padó EMNLP 2023).

Language and Knowledge Representation

Model analysis and attribution

Given the rapid advances of word embedding models in recent years, we need to analyze the systematic effects inherent in such models. We have looked at bias (Dayanik, Vu, Padó NEJLT 2022), at interactions between properties of tasks, data, and models (Papay, Klinger, Padó EMNLP 2021), and at representation bias in sentence embeddings, asking what it is that different models pay attention to. For the latter, we considered Siamese encoders (Nikolaev and Padó EACL 2023) as well as BERT (Nikolaev and Padó IWCS 2023). We also presented a provably correct method for attribution in bi-encoders (Möller, Nikolaev and Padó EMNLP 2023) and used it to analyse shortcomings of news recommenders (Möller and Padó ACM TIST 2024).

I have also worked with a student to try and understand the processes behind compositional generalization in transformer-based LLMs (Han and Padó LREC-COLING 2024).

Knowledge graphs and relation extraction

In order to represent multiple layers of linguistic analysis (e.g., syntax and semantics), we have developed an OWL-DL model that addresses granularity and reliability and supports comfortable querying (Burchardt et al. LiLT 2008).

I have also looked at the interface between distributional and symbolic knowledge and found that distributional representations are surprisingly good at predicting very fine-grained properties of entities, such as the GDP of countries (Gupta et al. EMNLP 2015). We have generalized this idea to represent categories in terms of the distributions of their instances rather than in terms of the class name (Westera et al. J. CogSci 2021).

In a 2022 paper, we went in the opposite direction and showed how knowledge about admissible relational structure can be fed back into probabilistic graphical models, specifically linear-chain CRFs (Papay, Klinger, Padó ICLR 2022).

Textual Entailment

The framework of "Textual Entailment" (TE) tries to cast the semantic processing needs of NLP applications in terms of common-sense entailment decisions. We have approached Machine Translation evaluation through TE (Padó et al. MT 2009) and have defined a generic platform that supports various kinds of approaches to computing TE (Padó et al. JNLE 2014). Furthermore, we have shown that a substantial number of "difficult" entailments could be solved with discourse knowledge, in particular coreference and bridging (Mirkin et al. ACL 2010).

Sebastian Padó