GERMAN NAMED ENTITY RECOGNIZER
Release May 2011 [compatible with Stanford NER version 1.2]
===========================================

Manaal Faruqui, manaalfar@gmail.com

LICENSE
=======

Both the Stanford NER system and the German classifiers are available under the GNU GPL. That is, they can be used for academic (or any other) research purposes, but cannot be integrated into commercial software. By downloading the software and data, you accept the terms and conditions of the GPL.

TUTORIAL
========

This document contains quickstart guidelines for end users who wish to apply the pretrained NER models. For instructions on training your own NER model, see http://www-nlp.stanford.edu/software/crf-faq.shtml and the README.txt distributed with Stanford NER.

USAGE
=====

The Stanford NER system requires Java 1.5 or later. We have only tested it on the Sun JVM.

(1) Download Stanford NER version 1.2 from http://www-nlp.stanford.edu/software/CRF-NER.shtml and unpack the archive.

(2) cd stanford-ner-2011-05-18/

(3) Download the chosen German classifier from the website.

(4) Move the classifier to stanford-ner-2011-05-18/classifiers/

(5) Tokenize the test file with the following command:

    * java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer test > test.tok

(6) Pad each token with dummy columns so the file matches the format the classifier expects:

    * perl -ne 'chomp; print "$_ O O O O\n"' test.tok > test.tok.ready

(7) Tag your file with the following command:

    * java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/classifierName -testFile test.tok.ready

(8) Output format: Token NE-tag O (tab-separated columns)

GERMAN NER CLASSIFIERS
======================

(1) HGC-generalized classifier: This classifier has been trained on the CoNLL 2003 German data and generalized with a distributional similarity lexicon built from the 175 million tokens of the Huge German Corpus (HGC), clustered into 600 clusters.
The HGC is a collection of newswire text, so we suggest using this classifier for tagging newswire documents.

(2) deWac-generalized classifier: This classifier has been trained on the CoNLL 2003 German data and generalized with a distributional similarity lexicon built from the 175 million tokens of the deWac corpus, clustered into 400 clusters. The deWac corpus was created by crawling content from the web; it is noisy and contains data from all genres, so we suggest using this classifier for tagging all other kinds of documents.

 -------------------------------------------------------------
                 |      HGC       |         deWac            |
 -------------------------------------------------------------
 Type of corpus  | Newswire text  | Data from the web (raw)  |
 -------------------------------------------------------------
 Amount of data  | 175M tokens    | 175M tokens              |
 -------------------------------------------------------------
 #Clusters       | 600            | 400                      |
 -------------------------------------------------------------

CONTACT
=======

For more information, bug reports, and fixes, contact:

Manaal Faruqui
Dept. of Computer Science & Engg., IIT Kharagpur
Kharagpur 721302, West Bengal, INDIA
manaalfar@gmail.com
http://cse.iitkgp.ac.in/~manaalf