How should we estimate the significance of R-precision results? We would like a hypothesis test that tells us, for instance, whether two R-precision results are significantly different at the 95% confidence level. Recall that R-precision is computed as the average of the individual R-precision scores for each item in the data set, which is $\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i$, where $p_i$ is the R-precision score of item $i$ and $N$ is the number of items.
When comparing R-precision results from two different experimental conditions, we are seeking to reject the null hypothesis that the mean of $p_i$ is the same under both conditions. If the distribution of $p_i$ is Gaussian, then the classic statistical test for this case is the two-sample z-test. However, if we look at the empirical distribution (histogram) of $p_i$, we can see that it does not appear to be Normal. This stands to reason, because $p_i$ is a ratio of a small number (the number of correct items) to a slightly larger number ($R_i$, the number of items in the same class), so it can take only a handful of discrete values between 0 and 1. Even so, because of the central limit theorem, the z-test can be used for non-Gaussian data if the sample size is large enough (over two dozen samples, as a rule of thumb).
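As a rough illustration of the z-test described above, here is a minimal sketch in Python using only the standard library. The function name `two_sample_z_test` and its arguments are our own for illustration; `scores_a` and `scores_b` are assumed to be lists of per-item $p_i$ values, one list per experimental condition.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_z_test(scores_a, scores_b):
    """Return (z, two-sided p-value) for the difference in mean scores.

    Hypothetical helper: each argument is a list of per-item R-precision
    scores from one experimental condition.
    """
    n_a, n_b = len(scores_a), len(scores_b)
    # Standard error of the difference between the two sample means,
    # using the sample variances in place of the unknown true variances.
    se = math.sqrt(variance(scores_a) / n_a + variance(scores_b) / n_b)
    z = (mean(scores_a) - mean(scores_b)) / se
    # Two-sided p-value from the standard Normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value
```

A p-value below 0.05 would correspond to rejecting the null hypothesis at the 95% confidence level. Note that this sketch treats the two samples as independent.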
However, there may be better choices. First, this situation may call for a paired test, because the same data items are used in both experimental conditions. Second, there is a nonparametric alternative to the paired t-test, known as the Wilcoxon signed rank test, which is more appropriate given the non-Gaussian distribution of the statistic. This test relies on the fact that if both members of each pair are drawn from the same distribution, the within-pair differences should be symmetric about zero: there should be roughly as many cases where the first is greater than the second as vice versa, and the positive and negative differences should be of similar size. If one member of the pair is consistently larger or smaller than the other, that is evidence that they are drawn from distributions with different locations (loosely speaking, different means).
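To make the mechanics of the signed rank test concrete, here is a minimal standard-library sketch. It uses the large-sample Normal approximation without tie or continuity corrections, so it is only a rough stand-in for a proper statistics package. The function name `wilcoxon_signed_rank` is our own, and `x` and `y` are assumed to be per-item scores for the same items under the two conditions.

```python
import math
from statistics import NormalDist


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (Normal approximation).

    Returns (w_plus, p_value), where w_plus is the sum of the ranks of
    the positive differences. Zero differences are dropped, and tied
    absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: no evidence either way

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis, w_plus has this mean and standard deviation.
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean_w) / sd_w
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return w_plus, p_value
```

In practice, `scipy.stats.wilcoxon` implements the same test and is the safer choice when SciPy is available.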
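Results of the paired test for several comparisons, given as the p-value followed in parentheses by the R-precision scores of the two conditions, in the order named:
PPK vs PPK w/ covariance floor: 0.6377 (0.3948, 0.3940)
Normalized vs unnormalized (uspop1000, EMD): 0.148 (0.3959, 0.4009)
PPK vs EMD (uspop37): 9e-17 (0.3948, 0.3007)
PPK vs PPK-MIW (30g): 6e-15 (0.3948, 0.3213)
PPK vs MC: 3e-11 (0.3948, 0.4395)
PPK 50g vs 30g (uspop1000): 0.596 (0.440, 0.439)
PPK 50g vs 20g (uspop1000): 0.457 (0.440, 0.438)
PPK vs EMD (uspop1000): 1e-17 (0.440, 0.396)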
However, the situation is different when comparing R-precision scores from two different data sets. In that case, we can no longer compute a paired test. Furthermore, if different techniques were used to compute each score, we don't know whether the difference in scores is due to the difference in data sets or to the difference in techniques. To get an idea of the inherent variability in the R-precision score due to the difference between data sets, we randomly chose 30 subsets of 350 songs each from uspop and computed the R-precision of PPK kernels on each subset. The mean was 0.448 and the standard deviation was 0.035.
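The subset experiment can be sketched as follows. This is only an outline under stated assumptions: `score_fn` is a hypothetical callable that takes a list of songs and returns the R-precision obtained on that subset (in our case, by building PPK kernels), and `items` stands in for the full song collection. Neither name comes from our actual code.

```python
import random
from statistics import mean, stdev


def subset_score_spread(items, score_fn, n_subsets=30, subset_size=350, seed=0):
    """Mean and standard deviation of score_fn over random subsets.

    score_fn is a hypothetical placeholder for the full evaluation
    pipeline: it receives one subset and returns a single score.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    scores = [
        score_fn(rng.sample(items, subset_size))  # sample without replacement
        for _ in range(n_subsets)
    ]
    return mean(scores), stdev(scores)
```

The defaults of 30 subsets and 350 songs match the experiment described above. The subsets are drawn independently, so they may overlap with one another.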
If we assume that the score is distributed Normally (the histogram looks roughly Normal), then we can use a one-sample t-test against the null hypothesis that the mean is 0.448. Even better would be to compute a paired t-test across all random subsets, pairing the scores of the two techniques on each subset. However, this would be quite computationally expensive. More importantly, it would require having the same data sets available for both techniques, which may not be possible when trying to compare a new technique to a result previously published by another researcher.
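For reference, the one-sample t statistic mentioned above can be written out as follows. The symbols are our own notation: $\bar{x}$ and $s$ are the sample mean and sample standard deviation of $n$ R-precision scores from the technique being compared, and $\mu_0 = 0.448$ is the mean under the null hypothesis.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mu_0 = 0.448$$

Under the null hypothesis and the Normality assumption, $t$ follows a Student's t distribution with $n - 1$ degrees of freedom, so the test needs more than one score from the technique being compared.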
Posted by madadam at August 16, 2006 03:04 PM