SIGF V2: Significance testing for evaluation statistics

An important issue in empirical computational linguistics is how to assess the difference between the performance figures of two models m and n on the same dataset. For many popular evaluation metrics (such as Precision, Recall, or F-Score), assessing the significance of such differences is non-trivial, since the assumptions that standard tests make are not met. (For example, the chi-square test makes an independence assumption between the categories.)

According to the statistics literature, one general way to go about this is to generate a population of new models from the predictions of the existing models. There are two ways of doing so: the bootstrap (i.e. sampling with replacement) or (approximate) randomization. Since the bootstrap assumes that the sample is representative, we have implemented an assumption-free randomization framework (Yeh 2000, Noreen 1989).

The (rough) idea is that if the difference in performance between two sets of predictions, m and n, is significant, randomly shuffling the predictions between the two sets will only very infrequently result in a larger performance difference. The relative frequency of this actually happening can be interpreted as the significance level of the difference.
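
To make the idea concrete, here is a minimal Python sketch of an approximate randomization test. It is not the sigf code (which is written in Java), and all names in it (approximate_randomization, average, trials, seed) are made up for illustration. It assumes that both models produce one output per item of the same dataset, that shuffling means swapping the two models' outputs for an item with probability 0.5, and that the p-value is estimated with the commonly used (count + 1) / (trials + 1) correction, counting differences at least as large as the observed one.

```python
import random


def approximate_randomization(outputs_m, outputs_n, statistic, trials=10000, seed=0):
    """Estimate how often random shuffling yields a difference at least as
    large as the observed one (two-sided, illustrative sketch only)."""
    if len(outputs_m) != len(outputs_n):
        raise ValueError("both models must be evaluated on the same items")
    rng = random.Random(seed)
    # Observed absolute difference between the two models' statistics.
    observed = abs(statistic(outputs_m) - statistic(outputs_n))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_m, shuffled_n = [], []
        for a, b in zip(outputs_m, outputs_n):
            # Swap the two predictions for this item with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_m.append(a)
            shuffled_n.append(b)
        difference = abs(statistic(shuffled_m) - statistic(shuffled_n))
        if difference >= observed:
            at_least_as_large += 1
    # Add-one correction (an assumption of this sketch) keeps the estimate above zero.
    return (at_least_as_large + 1) / (trials + 1)


def average(values):
    """Example statistic: the mean of per-item scores."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical per-item correctness scores (1 = correct, 0 = wrong).
    m = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
    n = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    print(approximate_randomization(m, n, average))
```

The statistic is passed in as a function because the shuffling procedure itself does not depend on it. For a statistic such as F-Score, one would presumably pass per-item counts (e.g. true positives, false positives, false negatives) and a function that aggregates them over the whole shuffled set.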

Version 2 of sigf (June 08) provides implementations of F-Score and Average together with a revamped architecture that makes extending the framework for new statistics much easier. See the supplied documentation for details.


The system can be downloaded here: sigf-v2.tgz (24 KB).
Documentation is included in the archive. Note that SIGF V2 needs Java 1.5. If you would like to use the old SIGF on Java 1.4, drop me a line.

A much more recent implementation of the same idea is available from Dmitry Ustalov's GitHub page.


Feedback is always welcome at


If you use this software, please cite it as
  title = 	 {User's guide to \texttt{sigf}: Significance testing by 
                  approximate randomisation},
  author =	 {Sebastian Pad\'o},
  year =	 2006


A. Yeh. 2000. More accurate tests for the statistical significance of result differences. Proceedings of COLING 2000, pp. 947--953.

E. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons Inc.