The question then becomes what to use as a measure of frame "usefulness". Several possibilities come to mind:
I implemented option 3, which will be called the Mutual Information Weighted GMM. This plot shows the MFCC frames and the corresponding weights for each frame. As an implementation detail, I had to force the weight to zero if the MFCC frame energy drops below a threshold, because silence and fadeout frames were winding up with very high weights. It's not obvious what types of frames are assigned high weight; it may be that we're giving too much weight to very unlikely frames, and some kind of compromise is needed.
On uspop37 with a PPK kernel, using the R-precision metric, performance is not better than with the standard GMMs. Sharded 3 ways, model training takes 160 minutes per shard on uspop37 (350 songs) using old uspop37 models.
GMMs: .3948; max hubness 128
MI-weighted GMMs: .3213, max-h 162
PPK in C++: 330 seconds, R-precision: .394. That's with a covariance floor of 0.00001.
Monte Carlo KL-d in C++, 2000 samples: 2 hours, R-precision .439. Best so far.
KL-EMD in C++: 20 minutes, R-precision .301.
On uspop1000:
PPK in C++: 45 minutes, R-precision: .440
KL-EMD: +8 hours(?), R-precision: .396
Finally, a few PPK kernels built from models with different numbers of Gaussians, on uspop1000:
50: .4402
30: .4396
20: .4382
I implemented it in C++ using Torch, which had most of the machinery I needed, and it takes 150 ms to compute one pair - two orders of magnitude faster. A 350x350 kernel will take about 2 hours. And that's without any of the obvious optimizations I can think of, like not resampling from each model across each kernel row.
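As a rough illustration of the Monte Carlo estimate (not the C++/Torch code itself), here is a minimal Python sketch. It assumes diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples; that layout and the function names are hypothetical, chosen only for this example.

```python
import math
import random

def gmm_logpdf(x, gmm):
    """Log-density of point x under a diagonal GMM [(weight, mean, var), ...]."""
    logs = []
    for w, mu, var in gmm:
        lp = math.log(w)
        for xi, m, v in zip(x, mu, var):
            lp += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(lp)
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def gmm_sample(gmm, rng):
    """Draw one point: pick a component by weight, then sample each dimension."""
    _, mu, var = rng.choices(gmm, weights=[c[0] for c in gmm])[0]
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]

def mc_kl(p, q, n_samples=2000, rng=None):
    """Estimate KL(p || q) as the sample mean of log p(x) - log q(x), x ~ p."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        x = gmm_sample(p, rng)
        total += gmm_logpdf(x, p) - gmm_logpdf(x, q)
    return total / n_samples
```

The estimate is asymmetric; a symmetric distance can be formed as `mc_kl(p, q) + mc_kl(q, p)`. Because the samples depend only on `p`, they can be drawn once and reused for every `q` in the same kernel row, which is the optimization mentioned above.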
Results coming soon...
Some kind of correlation between hubness and local R-precision is visible; the correlation coefficient turns out to be .25 for this kernel (negligible p-value). So superficially it seems the opposite from what we'd expect: hubs help precision. But I think this is a misleading correlation, because hubs may have high precision for themselves, but at the expense of precision for many other items.
The first plot is for the second-highest hub, and the next plot is taken from the middle of the hub distribution. I chose this pair because they seem fairly similar in that the smallest component is modeling silence at the beginning and ending of the track (which is pretty typical, but not universal).



I was also wondering whether simply enforcing a minimum variance floor would help things. I computed a PPK kernel over uspop37, and there doesn't appear to be any difference. R-precision is still .394, the max hubness is still 128.
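For concreteness, a variance floor on a diagonal GMM can be sketched as below. The `(weight, mean, variance)` tuple layout and the function name are assumptions made for this example, and the default floor of 1e-5 simply mirrors the 0.00001 quoted elsewhere in these notes. The second return value counts how many variances were actually clamped; if that count is zero, flooring is a no-op and the kernel cannot change.

```python
def apply_variance_floor(gmm, floor=1e-5):
    """Clamp every per-dimension variance to at least `floor`.

    gmm is a list of (weight, mean, var) tuples (hypothetical layout).
    Returns (floored_gmm, number_of_clamped_variances).
    """
    floored = []
    n_clamped = 0
    for w, mu, var in gmm:
        new_var = []
        for v in var:
            if v < floor:
                n_clamped += 1
                v = floor
            new_var.append(v)
        floored.append((w, list(mu), new_var))
    return floored, n_clamped
```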
Returning to the three questions about small-variance components, it doesn't seem that there is much significant difference between the statistics for minimum-variance components and randomly chosen components. The following table shows the means of various statistics computed over the 3 minimum variance components from each song (top row), and 3 randomly-chosen components from each (bottom row). The priors are close, and min-variance components seem to be closer to the mixture mean and further from the origin.
So the remaining question to be tackled is the most interesting: what kind of MFCC frames do these small-variance components model?
Looking at the size of the determinants in this and the last post, and at the plots of minimum determinant vs hubness, it seems that one of the few obvious differences between the models for strong hubs and normal songs is that normal songs have one or two mixture components with a much smaller covariance matrix than the rest, while hubs have a larger minimum determinant. Interestingly, the spread of the determinants of the remaining components in the normal songs is about the same as the spread of the hubs' components. There are two ways to think about what this means: (I) The "normal" songs have a few tight components which do strange things when compared to other models. The songs without this strange property are actually behaving as they ought to, but since they're the only ones doing so, they appear to be hubs. (II) Very tight components are very specific, while broader components are more forgiving. Therefore, models which don't have any tight components are a more likely match to any arbitrary model than a model with a very tight covariance. In other words, the unlikelihood of a component with a tight covariance matrix dominates, and two models that both have tight components generally seem more dissimilar than a pair in which one model has no tight components.
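For a diagonal covariance matrix the determinant is the product of the per-dimension variances, so it is safer to work with the log-determinant (a sum of logs) than with the raw product, which underflows quickly in 20 dimensions. A minimal sketch of the "minimum determinant" statistic, assuming diagonal GMMs stored as `(weight, mean, variance)` tuples (a hypothetical layout for this example):

```python
import math

def log_det_diag(var):
    """Log-determinant of a diagonal covariance given its variances."""
    return sum(math.log(v) for v in var)

def min_log_det(gmm):
    """Smallest component log-determinant in a GMM [(weight, mean, var), ...]."""
    return min(log_det_diag(var) for _, _, var in gmm)
```

Plotting `min_log_det` for each song against its hubness gives the kind of comparison described above.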
Questions to be answered:
And assuming that these tight covariance components are indeed the culprit, what is the appropriate response? If they're modeling silence, that's easy enough. More generally, what happens if we just discard them? And how does this relate to Aucouturier's observations about "homogenizing" the models?
The first two figures are for the top two hubs under the EMD-KL kernel, and the second two are average hubs drawn from the middle of the hub distribution.
For starters, I implemented the PPK for diagonal GMMs in MATLAB. It takes about half the time of Pampalk's ma_cms.m KL-EMD code, even though that code uses Rubner's C implementation of EMD underneath (140 ms vs 270 ms for a 50-mixture diagonal GMM w/ 20 dimensions). If it performs well, I'll implement it in C++/Torch; hopefully it will be fast enough to compute full-size kernels for uspop in a reasonable amount of time.
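To make the computation concrete, here is a minimal Python sketch of a probability product kernel between two diagonal GMMs. The exponent used in the MATLAB code is not stated here, so this sketch assumes rho = 1 (the expected likelihood kernel), because that case has a closed form for mixtures: the integral of the product of two Gaussians is N(mu_a; mu_b, Sigma_a + Sigma_b). The `(weight, mean, variance)` tuple layout and the function names are hypothetical, and all weights are assumed to be strictly positive.

```python
import math

def log_overlap(mu_a, var_a, mu_b, var_b):
    """log of the integral of N(x; mu_a, var_a) * N(x; mu_b, var_b) dx.

    For diagonal Gaussians this equals log N(mu_a; mu_b, var_a + var_b).
    """
    total = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        s = va + vb
        total += -0.5 * (math.log(2.0 * math.pi * s) + (ma - mb) ** 2 / s)
    return total

def log_ppk(gmm_a, gmm_b):
    """log K(a, b) for rho = 1: sum over component pairs of w_a * w_b * overlap."""
    terms = [math.log(wa) + math.log(wb) + log_overlap(ma, va, mb, vb)
             for wa, ma, va in gmm_a
             for wb, mb, vb in gmm_b]
    peak = max(terms)  # log-sum-exp to avoid underflow in high dimensions
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def normalized_ppk(gmm_a, gmm_b):
    """K(a, b) / sqrt(K(a, a) * K(b, b)), so that every model has K(a, a) = 1."""
    return math.exp(log_ppk(gmm_a, gmm_b)
                    - 0.5 * (log_ppk(gmm_a, gmm_a) + log_ppk(gmm_b, gmm_b)))
```

The cost is one closed-form term per pair of components, i.e. 2500 terms for two 50-component models, with no sampling and no transport problem to solve.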
To examine the performance of the PPK kernel, I first created a new subset of uspop to match the size of the test database used in JJ Aucouturier's thesis. To compare R-precision results from my experiments with those in Aucouturier's thesis, ideally I would use the same database. However, the audio for the full database is currently not available to me. The next best thing is simply to match the size of the clusters and dataset, so that my results are roughly in the same ballpark. Note that I expect my results to be somewhat (if not significantly) worse than Aucouturier's, because he hand-picked the test database clusters to be musically homogeneous.
Aucouturier's test database had 350 songs coming from 37 clusters (artists), averaging around 9 songs per cluster. I randomly chose a new subset of uspop, which I'll call uspop37, which also has 37 artist-clusters, and only one album per artist.
Finally, some early results. Using the PPK on uspop37, the R-precision was .3948. I then computed an EMD-KL kernel on the same dataset for comparison, and PPK is actually better: the R-precision for EMD-KL was .3007. However, it's worse than JJ's best results of around 0.6. My guess is that my data set is simply more difficult, because of the lengths JJ went to in order to get homogeneous clusters. However, I should also try the Monte Carlo sampling technique, which is what he used, to be sure.
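For readers unfamiliar with the metric, here is a minimal sketch of R-precision over a similarity kernel. It assumes the usual definition: for each query song, R is the number of other songs with the same label (here, artist), and the score is the fraction of the query's top-R neighbors that share that label, averaged over all queries. The function name and argument layout are hypothetical.

```python
def r_precision(sim, labels):
    """Mean R-precision of a square similarity matrix (larger = more similar).

    sim[q][j] is the similarity between items q and j; labels[j] is the
    cluster (e.g. artist) of item j. Queries with no relevant items are skipped.
    """
    n = len(labels)
    scores = []
    for q in range(n):
        r = sum(1 for j in range(n) if j != q and labels[j] == labels[q])
        if r == 0:
            continue
        # Rank all other items by decreasing similarity to the query.
        ranked = sorted((j for j in range(n) if j != q),
                        key=lambda j: sim[q][j], reverse=True)
        hits = sum(1 for j in ranked[:r] if labels[j] == labels[q])
        scores.append(hits / r)
    return sum(scores) / len(scores) if scores else 0.0
```

For a distance matrix such as KL-EMD, negate the entries (or sort in ascending order) before ranking.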
Hubness was pretty bad for the PPK kernel, considering we're using a smaller dataset - the worst hub has hubness@20 of 128 (out of 350). Here is the hubness histogram and the top hubs. Clapton unplugged dominates, but I don't see anything overly strange.
'eric_clapton/Unplugged/13-Old_Love.gmm'
'everclear/So_Much_For_The_Afterglow/10-White_Men_In_Black_Suits.gmm'
'chicago/Chicago_17/02-We_Can_Stop_The_Hurtin_.gmm'
'eric_clapton/Unplugged/07-Layla.gmm'
'eric_clapton/Unplugged/08-Running_on_Faith.gmm'
'eric_clapton/Unplugged/06-Nobody_Knows_You_When_You_re_Down_and_Out.gmm'
'peter_gabriel/Secret_World_Live_Disk_1_/08-Kiss_That_Frog.gmm'
'ricky_martin/Ricky_Martin/04-Shake_Your_Bon-Bon.gmm'
'everclear/So_Much_For_The_Afterglow/13-Like_A_California_King.gmm'
'ricky_martin/Ricky_Martin/12-I_Count_The_Minutes.gmm'
Interestingly, the PPK kernel performs better than the EMD kernel despite having a much worse hub. The top hub for EMD has hubness@20 of 88, compared to 128 for the worst PPK hub.
<script id=eq1 language=TeX>
\documentclass[12pt]{article}
\pagestyle{empty}
\begin{document}
\begin{displaymath}
\int H(x,x')\psi(x')dx' = -\frac{\hbar^2}{2m}\frac{d^2}{dx^2} \psi(x)+V(x)\psi(x)
\end{displaymath}
\end{document}
</script>
<script> tex_img('eq1',200); </script>
Or you can inline like this, but then you need to do the escaping:
<script> math_img("\\frac{a^2+b^2}{c^2+d^2}=e^2"); </script>
For each plot below, 1000 GMMs were fit to the artificial data, then KL-EMD kernels were computed between the models, and then the hubness histogram of the kernel was plotted. All GMMs have 2 mixture components, and the data each one was fit to was generated from two Gaussians. Two types of artificial data were generated:
The first plot shows the hubness@20 histograms for Non-UBM data in the left column, and UBM data in the right column, for increasing dimensionality from 2 to 32. In this plot, alpha is set to 0.5.
The second plot shows the hubness@20 histograms for UBM data with alpha increasing from left to right and dimensionality increasing top to bottom.
Weird... why aren't I seeing the exponential-like bad hubness histogram that I see in single-Gaussian simulations? I suspect it has to do with the covariance matrices. Real MFCC data leads to hubs, and so do single-point Euclidean distance kernels, but here I'm seeing that GMMs with nice round covariance matrices do not. Or maybe the way that I'm generating the artificial data resists hubness because I'm intentionally spreading data out along the 2-sigma sphere, rather than filling the space more completely, which, as I noticed earlier, generates more hubs because of central points. I'll try the same experiment, but distributing the centers of the artificial Gaussians uniformly in the 2-sigma sphere.
Yup, hubness returns: When the data is distributed uniformly rather than deliberately spread out, hubness gets worse as the dimensionality increases.
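The two ways of placing centers can be sketched as follows. This is an illustrative reconstruction, not the code used for the plots: it assumes "2-sigma sphere" means a sphere of radius 2*sigma around the origin, and the function names are hypothetical. A uniform direction is obtained by normalizing a standard Gaussian vector; a point uniform in the ball additionally scales the radius by u^(1/dim) with u uniform on [0, 1).

```python
import math
import random

def random_direction(dim, rng):
    """Unit vector drawn uniformly from the sphere's surface."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]

def sample_on_sphere(dim, radius, rng):
    """Center placed exactly on the sphere of the given radius."""
    return [radius * c for c in random_direction(dim, rng)]

def sample_in_ball(dim, radius, rng):
    """Center placed uniformly inside the ball of the given radius."""
    r = radius * rng.random() ** (1.0 / dim)
    return [r * c for c in random_direction(dim, rng)]
```

With `sigma = 1.0`, `sample_on_sphere(dim, 2.0 * sigma, rng)` gives the "deliberately spread out" placement and `sample_in_ball(dim, 2.0 * sigma, rng)` the uniform one.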
Now I should go back and do the non-UBM vs. UBM experiment using data that isn't deliberately spread around the sphere.
The plots in the left column are histograms of inter-point distances, and the right column are histograms of "hubness@20", i.e. the number of other points for which the point appears in the top-20 neighbors. Finally, I was curious whether hubness is correlated with centrality; it seems that hubs should be the points closest to the center of the distribution from which points are drawn. The third column plots hubness as the dependent axis against squared distance from the origin (the distribution had zero mean), and indeed the negative correlation is apparent and noted above with p values.
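The hubness@k definition above translates directly into code. The sketch below is a plain-Python illustration (the function name and the list-of-lists distance matrix are assumptions for this example): for every point it finds the k nearest other points and increments a counter for each of them.

```python
def hubness_at_k(dist, k=20):
    """Count, for each point j, how many other points have j in their top-k neighbors.

    dist is a square matrix of pairwise distances (smaller = closer).
    """
    n = len(dist)
    counts = [0] * n
    for i in range(n):
        # Rank every other point by its distance from point i.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist[i][j])
        for j in neighbors[:k]:
            counts[j] += 1
    return counts
```

The counts always sum to n*k (when k < n), so the mean hubness is exactly k; what the histograms show is how unevenly that total is shared. For a similarity kernel rather than a distance matrix, sort in descending order instead.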
As we get into higher dimensions, the right column looks more and more exponential and stretched out, as the distribution of distances gets wider (note the x-axis scale in the left column); overall, points are getting further and further apart, so that fewer points are central, but those that are central have become even more extreme hubs because there is less competition in the central region.
Also, as the dimensionality goes up, the hubness histograms start to look like the hubness histograms for the actual kernels I've seen. But still, for the same dimensionality, the real kernels have worse hubs than the random Gaussian-distributed points. Below is a hubness histogram for the same dimensionality.
Maybe GMMs act differently than single Gaussians (which is what I was simulating). For a fairer contest, I should generate nice single-Gaussian points, then fit a GMM (with the same number of components?) and do the KL-EMD on that...