July 27, 2006

Mutual-Information weighted GMMs

If we take the perspective that background frames hurt performance, that is to say, there are frames which are fairly useless for distinguishing song models because they are generic, shared by all models, and thus part of the "universal background", then what can we do to improve the situation? One approach is to assign a weight to each frame based on its usefulness, and retrain the models taking into account this weight. The aim is to be thrifty with modeling power, spending it only on those frames that make each song distinct from the others.

The question then becomes what to use as a measure of frame "usefulness". Several possibilities come to mind:

  1. UBM likelihood ratio weighting: Train a Universal Background Model, as well as each song model normally. Then, for each frame, compute the ratio , where p(x|y) is the probability of frame x under its correct song model y, and p(x) is the overall likelihood of frame x under the universal background model. Then retrain, weighting each frame by this ratio.
  2. Entropy weighting: Train all song models first, and then for each frame of song i, compute the entropy of P_j(x), the vector of probabilities of that frame under each of the song models j. then weight by the inverse of that.
  3. Mutual info weighting: Weight by the mutual information between the class label and the mfcc frame. The class label is the song ID. If Y represents the class label, and X represents the frame, then

    Say the classes have uniform prior, so p(y) is 1/N, and p(x|y) is just the GMM for class y evaluated at the point x. Finally, p(x) is the overall likelihood of frame x, which we can compute either as the sum of the probability under each model, e.g. , or as the probability under the UBM which is trained on all the data together. Furthermore, we don't actually want to sum over x, but rather we want to compute for each of the frames x that we see, and use this value as the frame weight during retraining.
Interestingly, option (3) is a bit like options (1) and (2) combined, because p(x|y)/p(x) is like the UBM ratio of (1), and then we're taking the expected log of that under the distribution p(x|y), which is similar to computing the entropy in (2).

I implemented option 3, which will be called the Mutual Information Weighted GMM. This plot shows the MFCC frames and the corresponding weights for each frame. As an implementation detail, I had to force the weight to zero if the MFCC frame energy drops below a threshold, because silence and fadeout frames were winding up with very high weights. It's not obvious what types of frames are assigned high weight; it may be that we're giving too much weight to very unlikely frames, and some kind of compromise is needed.

On uspop37 with a PPK kernel, using the R-precision metric, performance is not better than with the standard GMMs. Sharded 3 ways, model training takes 160 minutes per shard on uspop37 (350 songs) using old uspop37 models.
GMMs: .3948; max hubness 128
MI-weighted GMMs: .3213, max-h 162

Posted by madadam at 01:22 AM

July 24, 2006

kernel performance

On uspop37:

PPK in C++: 330 seconds, R-precision: .394. That's with a covariance floor of 0.00001.
Monte Carlo KL-d in C++, 2000 samples: 2 hours, R-precision .439. Best so far.
KL-EMD in C++: 20 minutes, R-precision .301.

On uspop1000:

PPK in C++: 45 minutes, R-precision: .440
KL-emd: +8 hours(?), R-precision: .396

Finally, a few PPK kernels built from models with different numbers of Gaussians, on uspop1000:
50: .4402
30: .4396
20: .4382

Posted by madadam at 11:36 AM

July 21, 2006

Monte Carlo KL-d

In matlab, the Monte Carlo method of estimating the KL-divergence between GMMS is too slow, at least using the netlab implementation of sampling and computing gmm probability. Using 2000 samples, it takes over 1.5 seconds to compute a single model pair, so a 350x350 kernel would take over 2 days, and forget about a 1000x1000 kernel.

I implemented it using in C++ using Torch, which had most of the machinery I needed, and it takes 150 ms to compute one pair - two orders of magnitude faster. A 350x350 kernel will take about 2 hours. And that's without doing any of the obvious optimizations I can think of like not resampling from each model across each kernel row.

Results coming soon...

Posted by madadam at 10:55 AM

July 20, 2006

Hubness and r-precision

What is the relationship between hubness and R-precision? Intuitively, it seems that extreme hubs should hurt R-precision because they are frequent false positives. Here's an interesting visualization of the relationship. The black-and-white matrix represents near neighbors (the top 20 neighbors are white in each row), and plotted above that are both hubness@20 (which equals the sum of each column) along with the local R-precision value (fraction of that item's cluster that actually appear as neighbors). [This is a fragment, zoomed in so you can see the detail. So the rows may not all add up to the 20, which they would on the entire matrix. ] Some kind of correlation between hubness and local R-precision is visible; the correlation coefficient turns out to be .25 for this kernel (negligible p-value). So superficially it seems the opposite from what we'd expect: hubs help precision. But I think this is a misleading correlation, because hubs may have high precision for themselves, but at the expense of precision for many other items.
Posted by madadam at 12:03 AM

July 18, 2006

state sequences

These plots are designed to help clarify what MFCC frames are being captured by the small-variance components. The top pane is the cepstragram, and the bottom is the maximum activation state sequence. Not that this does not take into account the prior, so it's not the same as the maximum likelihood state sequence. The states are sorted by the determinant of the component, which is plotted along the left edge.

The first plot is for the second-highest hub, and the next plot is taken from the middle of the hub distribution. I chose this pair because they seem fairly similar in that the smallest component is modeling silence at the beginning and ending of the track (which is pretty typical, but not universal).

Posted by madadam at 07:09 PM

small variances

These plots were constructed to further examine what's happening with the smallest-variance mixture components. The first figure shows scatterplots of several statistics of the 3 smallest-variance (smallest determinant) components across all songs in uspop1000, as computed by the EMD-KL kernel on unnormed data. The determinant is plotted against hubness, prior, distance from the global mixture mean, and distance from the origin. It appears to be multimodal, which is especially visible in the following 3-d plot, which contains hubness, determinant, and distance from the origin.

I was also wondering whether simply enforcing a minimum variance floor would help things. I computed a PPK kernel over uspop37, and there doesn't appear to be any difference. R-precision is still .394, the max hubness is still 128.

Returning to the three questions about small-variance components, it doesn't seem that there is much significant difference between the statistics for minimum-variance components and randomly chosen components. The following table shows the means of various statistics computed over the 3 minimum variance components from each song (top row), and 3 randomly-chosen components from each (bottom row). The priors are close, and min-variance components seem to be closer to the mixture mean and further from the origin.

So the remaining question to be tackled is the most interesting: what kind of mfcc frames do these small-variance components model?

Posted by madadam at 12:21 PM

July 17, 2006

GMM visualization, part 2

Here's a similar set of figures, but using the kernel and models from normalized data (the MFCCs were globally normalized to zero mean and unit variance in all dimensions before modelling).

Looking at the size of the determinants in this and the last post, and at the plots of minimum determinant vs hubness, it seems that one of the few obvious differences between the models for strong hubs and normal songs is that normal songs have one or two mixture components with a much smaller covariance matrix than the rest, while hubs have a larger minumum determinant. Interestingly, the spread of the determinants of the remaining components in the normal songs is about the same as the spread of the hubs' components. There are two ways to think about what this means: (I) the "normal" songs have a few tight components which do strange things when compared to other models. The songs without this strange property are actually behaving as they ought to, but since they're the only ones doing so, they appear to be hubs. (II) Very tight components are very specific, while broader components are more forgiving. Therefore, models which don't have any tight components are a more likely match to any arbitrary model than a model with a very tight covariance. In other words, the unlikelihood of a component with a tight covariance matrix dominates, and two models with tight components in general seem more dissimilar than if one of the models doesn't have any tight components.

Questions to be answered:

  • Where are these components in respect to the rest? on the fringe? in the center?
  • Are these components important to the mixture - ie do they have high priors?
  • What type of MFCC frames are being modelled by these components? Is it silence? Something else that should be filtered out?

And assuming that these tight covariance components are indeed the culprit, what is the appropriate response? If they're modeling silence, that's easy enough. More generally, what happens if we just discard them? And how does this relate to Aucouturier's observations about "homogenizing" the models.

Posted by madadam at 05:20 PM

July 14, 2006

GMM visualization

To get a better idea of that the GMMS look like, I'm playing with some visualizations. First, the mixture components have been sorted by the prior, which that brings out the fact that most of the high-likelihood components are close to each other as can be seen in the upper-left plots. The four plots are:
  • The inter-center distance matrix (of distances between all pairs of mixture centers) in log scale (upper left);
  • The priors (lower left), in the same order as the inter-center distance matrix above;
  • A scatterplot of the distance from the global mixture mean against the determinant for each mixture component, both in log scales (upper right);
  • A scatter plot of the norm of the mixture mean against the prior of that mixture component (lower right).

The first two figures are for the top two hubs under the EMD-KL kernel, and the second two are average hubs drawn from the middle of the hub distribution.

Posted by madadam at 05:55 PM

Hub stability, early result

One question was whether hubs are intrinsically related to the distribution of mfcc frames underneath, or whether it's an artifact of GMM modelling. I computed several kernels over the same data (uspop1000), and since the GMMs are randomly initialized with kmeans (and initial centers are taken as random data points), the models are slightly different for each kernel. But it appears that the same songs reappear as hubs, with very slight differences each time. Below is a plot of the number of times each song appears as one of the top 20 hubs (cut off after the first 30 because it's all zeros after that). Because it takes quite a while to compute a 1000 x 1000 EMD-KL kernel with my computational setup, I only managed to compute 7 kernels.
Posted by madadam at 02:07 AM

July 13, 2006

Enter the PPK

The Product Probability Kernel is another way to compare distributions. Intuitively, it is like taking the dot product between the two distributions, and it has nice theoretical properties, including connections to the KL-divergence (it is related to a bound on the KL-d). Most importantly for practical considerations, it is much faster to compute for mixture models than the KL-EMD because there is a closed form.

For starters, I implemented the PPK for diagonal GMMs in matlab. It's 50% faster than Pampalk's ma_cms.m KL-EMD code, and that even uses Rubner's implementation of EMD in C underneath. (270 ms vs 140 ms for a 50-mixture diagonal GMM w/ 20 dimensions.) If it performs well, I'll implement in C++/Torch, hopefully it will be fast enough to compute full-size kernels for uspop in a reasonable amount of time.

To examine the performance of the PPK kernel, I first created a new subset of uspop to match the size the test database used in JJ Aucouturier's thesis. In order to compare R-precision performance results from my experiments with the results in Aucouturier's thesis, ideally I would be able to use the same database. However, currently the audio for the full database is not available to me. The next best thing is simply to match the size of the clusters and dataset, so that my results can roughly be in the same ballpark. However, note that I expect our results to be somewhat (if not significantly) worse than Aucouturier's, because he hand-picked the test database clusters to be musically homogenous.

Aucouturier's test database had 350 songs coming from 37 clusters (artists), averaging around 9 songs per cluster. I randomly chose a new subset of uspop, which I'll call uspop37, which also has 37 artist-clusters, and only one album per artist.

Finally, some early results. Using the PPK on uspop37, the R-precision was .3948. I then computed an EMD-KL kernel on the same dataset for comparison, and PPK is actually better: the R-precision for EMD-KL was .3007. However, it's worse than JJ's best results of around 0.6. My guess is that my data set is simply more difficult, because of the lengths JJ went to in order to get homogenous clusters. However, I should also try the monte carlo sampling technique, which is what he used, to be sure.

Hubness was pretty bad for the PPK kernel, considering we're using a smaller dataset - the worst hub has hubness@20 of 128 (out of 350). Here is the hubness histogram and the top hubs. Clapton unplugged dominates, but I don't see anything overly strange.

    'eric_clapton/Unplugged/13-Old_Love.gmm'
    'everclear/So_Much_For_The_Afterglow/10-White_Men_In_Black_Suits.gmm'
    'chicago/Chicago_17/02-We_Can_Stop_The_Hurtin_.gmm'
    'eric_clapton/Unplugged/07-Layla.gmm'
    'eric_clapton/Unplugged/08-Running_on_Faith.gmm'
    'eric_clapton/Unplugged/06-Nobody_Knows_You_When_You_re_Down_and_Out.gmm'
    'peter_gabriel/Secret_World_Live_Disk_1_/08-Kiss_That_Frog.gmm'
    'ricky_martin/Ricky_Martin/04-Shake_Your_Bon-Bon.gmm'
    'everclear/So_Much_For_The_Afterglow/13-Like_A_California_King.gmm'
    'ricky_martin/Ricky_Martin/12-I_Count_The_Minutes.gmm'

Interestingly, the PPK kernel performs better than the EMD kernel despite having a much worse hub. The top hub for EMD has hubness@20 of 88, compared to 128 for the worst PPK hub.

Posted by madadam at 04:55 PM

July 10, 2006

stumped

Well, when the data isn't nicely spread out, hubness is actually better with strong background radiation. So does that refute the thesis that hubs are caused by centrality due to a strong common background (and therefore could be fixed by taking into account and possibly ignoring the universal background)? There's conflicting evidence, because the correlation between hubness and centrality was very strong with single-point artificial data. Here, with GMM-modelled artificial data, it seems that there are fewer hubs when more data is shared between the models. I don't know what to make of that for the moment. Maybe it's because I'm only using two mixture components, so that when one of them is the background model, we're basically comparing single Gaussians, so the whole background issue goes away? Whereas with real data, not all songs are equally similar to the background; the hypothesis is that it's those songs that happen to be more like the background that become hubs.
Posted by madadam at 03:40 PM

inline latex

Finally found a good solution for embedding latex in the blog. I stole the idea from someone at work, and wrote a cgi script (adam/tex2gif.cgi) that accepts tex as a url parameter, converts it to a gif using code stolen from textogif by John Walker, and serves the image. It does server-side caching because the conversion process is pretty slow. Then there's some convenience javascript (adam/html_embedding.js) that makes it easier to write latex in the html without escaping backslashes and stuff.
<script id=eq1 language=TeX>
  \documentclass[12pt]{article}
  \pagestyle{empty}
  \begin{document}
  \begin{displaymath}
  \int H(x,x')\psi(x')dx' = -\frac{\hbar^2}{2m}\frac{d^2}{dx^2} \psi(x)+V(x)\psi(x)
  \end{displaymath}
  \end{document}
</script>
<script> tex_img('eq1',200); </script>

Or you can inline like this, but then you need to do the escaping:

<script> math_img("\\frac{a^2+b^2}{c^2+d^2}=e^2"); </script>
Posted by madadam at 03:21 PM

July 09, 2006

hubness and UBM

Continuing to explore the question of whether hubs are a natural consequence of high-dimensional data, or whether they are caused (or aggravated) by a strong background "radiation". In this simulation, the goal is to use GMMs and the same KL-EMD kernel that was used in earlier experiments with real MFCC data, but with artificial data that is either well-separated or has a strong background model component.

For each plot below, 1000 GMMs were fit to the artificial data, then KL-EMD kernels were computed between the models, and then the hubness histogram of the kernel was plotted. All GMMs have 2 mixture components, and the data it was fit to was generated from two Gaussians. Two types of artifical data was generated:

  1. Non-UBM data, in which all Gaussian components are centered on a random point on the sphere with distance 2*sigma away from the origin. The idea is that the components will be nicely spread out in the space.
  2. UBM data, in which one Gaussian from each model is centered on the origin, and that component is given a prior of alpha (0

    The first plot shows the hubness@20 histograms for Non-UBM data in the left column, and UBM data in the right column, for increasing dimensionality from 2 to 32. In this plot, alpha is set to 0.5. The second plot shows the hubness@20 histograms for UBM data with alpha increasing from left to right and dimensionality increasing top to bottom. Weird... why aren't I seeing the exponential-like bad hubness histogram that I see in single-gaussian simulations? I suspect it has to do with the covariance matrices. Real mfcc data leads to hubs, and so does single-point Euclidean distance kernels, but here I'm seeing that GMMs with nice round covariance matrices do not. Or maybe the way that I'm generating the artifical data resists hubness because I'm intentionally spreading data out along the 2-sigma sphere, rather than filling the space more completely which as I noticed earlier generates more hubs because of central points. I'll try the same experiment but by distributing the centers of the artifical gaussians uniformly in the 2-sigma sphere.

    Yup, hubness returns: When the data is distributed uniformly rather than deliberately spread out, hubness gets worse as the dimensionality increases. Now I should go back and do the non-UBM vs. UBM experiment using data that isn't deliberately spread around the sphere.

    Posted by madadam at 08:52 AM

July 07, 2006

distribution of distances in a random kernel

We were wondering what a "good" kernel would look like, particularly whether hubness simply is an inevitable result of the curse of dimensionality. I generated Gaussian points (zero mean, identity covariance) and plotted a histogram of the interpoint distances. For dimensions from 2 up to 32, it's below. Apparently it's Chi-squared with 2d degrees of freedom, where d is the dimensionality. That (almost) makes sense because assuming each dimension is independent and normal, then the euclidean norm is a sum of d gaussians squared, which is distributed according to . The Euclidean distance is , so I would have thought it would be \chi^2 with 3d degrees, not 2d. What am I missing? As the dimensionality goes up, it becomes more Gaussian, as sums of random variables of any kind tend to do.

The plots in the left column are histograms of inter-point distances, and the right column are histograms of "hubness@20", i.e. the number of other points for which the point appears in the top-20 neighbors. Finally, I was curious whether hubness is correlated with centrality; it seems that hubs should be the points closest to the center of the distribution from which points are drawn. The third column plots hubness as the dependent axis against squared distance from the origin (the distribution had zero mean), and indeed the negative correlation is apparent and noted above with p values.

As we get into higher dimensions, the right column looks more and more exponential and stretched out, as the distribution of distances gets wider (note the x-axis scale in the left column); overall, points are getting further and further apart, so that fewer points are central, but those that are central have become even more extreme hubs because there is less competition in the central region.

Also as the dimensionality goes up, the hubness histograms starts to look like the hubness histogram for the actual kernels I've seen. But still, for the same dimensionality, the real kernels have worse hubs then the random gaussian-distributed points. Below is a hubness histogram for the same dimensionality. Maybe GMMs act differently then single gaussians (which is what I was simulating). For a fairer contest, I should generate nice single gaussian points, then fit a GMM (with the same number of components?) and do the KL-EMD on that...

Posted by madadam at 10:57 PM