Computational Linguistics Datasets

A number of datasets that I've used in studies are available for research purposes.


The data on this page and its subpages is made available under the Creative Commons BY-NC-SA (Attribution-NonCommercial-ShareAlike) license unless specified otherwise. By downloading the data, you acknowledge the terms and conditions of the license. If you use the data, please cite the papers indicated on the respective pages.


  • 2013. Croatian Distributional Memory. (Snajder et al. 2013). See the TakeLab data page.
  • 2013. German Distributional Families. (Zeller et al. 2013). See Britta Zeller's page.
  • 2013. German Social Media Data. (Zeller and Pado 2013). See Britta Zeller's page.
  • 2012. German Distributional Memory. (Utt and Pado 2012). See Jason Utt's data page.
  • 2012. Locational inference annotation. (Feizabadi and Pado 2012). Guidelines as PDF, FrameNet motion verb list. Contact us for the data.
  • 2012. Regular polysemy evaluation dataset. (Boleda, Pado, and Utt 2012). The dataset used for the experiments is available for download: 28 kB zip archive.
  • 2012. Parallel literary corpus with T/V pronoun labels. (Faruqui and Pado 2012). The dataset used for the experiments is available for download.
  • 2010. Textual Entailment Data with Discourse Annotation. (Mirkin, Dagan, and Pado 2010). The dataset and guidelines are stored externally. Please continue to
  • 2010. Manual Named Entity annotation for German EUROPARL data. German classifiers for the Stanford CRF-based NER system (optimized in April 2010 and reported in Faruqui and Pado 2010) and manually annotated EUROPARL data as an out-of-domain test set. See the German NER page.
  • 2010. Selectional Preferences for German and Spanish. (Peirsman and Pado 2010). Contact me.
  • 2009. Projection of semantic roles. The 1000-sentence bilingual English-German corpus with role-semantic annotation (Pado and Lapata 2009) is now available for download.
  • 2008. Semi-supervised SRL for event nouns. The specification of Pado, Pennacchiotti, and Sporleder 2008 is here.
  • 2007. Projection of frame-semantic classifications. Projected FrameNet predicate classes (Pado 2007) are available for German and French. Contact me.