# SIGF V2: Significance testing for evaluation statistics

An important issue in empirical computational linguistics is assessing
the difference between the performance figures of two models
**m** and **n** on the same dataset. For many popular evaluation metrics
(such as Precision, Recall, or F-Score), assessing the significance
of these differences is non-trivial, since the assumptions that
standard tests make are not met. (For example, the chi-square test
assumes independence between the categories.)

According to the statistics literature, one general way to address this is to generate a population of new models from the predictions of the existing models. There are two ways of doing so: the bootstrap (i.e. sampling with replacement) and (approximate) randomization. Since the bootstrap assumes that the sample is representative, we have implemented an assumption-free randomization framework (Yeh 2000, Noreen 1989).

The (rough) idea is that if the difference in performance between two
sets of predictions, **m** and **n**, is significant, then randomly
shuffling the predictions will only very infrequently result in an
equally large or larger performance difference. The relative frequency
with which this actually happens can be interpreted as the significance
level of the difference.
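To make the idea concrete, here is a minimal Python sketch of the stratified-shuffling variant of approximate randomization, using binary F-score as the test statistic. This is an illustration of the general technique, not the SIGF implementation itself; the function names, the choice of 10,000 trials, and the binary encoding of predictions are all assumptions made for this sketch.

```python
import random

def f_score(preds, gold):
    """F-score for binary predictions (1 = positive class)."""
    tp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(preds, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(preds, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def approx_randomization(preds_m, preds_n, gold, trials=10000, seed=0):
    """Approximate-randomization p-value for the F-score difference
    between two lists of predictions on the same gold standard."""
    rng = random.Random(seed)
    observed = abs(f_score(preds_m, gold) - f_score(preds_n, gold))
    at_least_as_large = 0
    for _ in range(trials):
        # Stratified shuffling: for each instance, swap the two
        # models' predictions with probability 0.5.
        a, b = list(preds_m), list(preds_n)
        for i in range(len(gold)):
            if rng.random() < 0.5:
                a[i], b[i] = b[i], a[i]
        if abs(f_score(a, gold) - f_score(b, gold)) >= observed:
            at_least_as_large += 1
    # Add-one smoothing of the relative frequency (Noreen 1989).
    return (at_least_as_large + 1) / (trials + 1)
```

A returned value below the chosen significance threshold (e.g. 0.05) indicates that a difference at least as large as the observed one rarely arises by chance, so the difference between **m** and **n** can be considered significant.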

**Version 2 of sigf** (June 2008) provides implementations of F-Score
and Average together with a revamped architecture that makes extending
the framework with new statistics much easier. See the supplied
documentation for details.

## Download

The system can be downloaded here: sigf-v2.tgz (24 KB).

Documentation is included in the archive. Note that SIGF V2 needs Java 1.5. If you would like to use the old SIGF on Java 1.4, drop me a line.

A **much more recent implementation** of the same idea is available from Dmitry Ustalov's GitHub page.

## Contact

Feedback is always welcome at pado%40cl.uni-heidelberg.de.

## Reference

If you use this software, please cite it as:

    @Manual{sigf06,
      title  = {User's guide to \texttt{sigf}: Significance testing
                by approximate randomisation},
      author = {Sebastian Pad\'o},
      year   = 2006
    }

## Literature

A. Yeh. 2000. More accurate tests for the statistical significance of result differences. Proceedings of COLING 2000, pp. 947--953.

E. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons Inc.