July 27, 2006

Mutual-Information weighted GMMs

Suppose background frames hurt performance: some frames are fairly useless for distinguishing song models because they are generic, shared by all models, and thus part of the "universal background". What can we do to improve the situation? One approach is to assign each frame a weight based on its usefulness and retrain the models taking this weight into account. The aim is to be thrifty with modeling power, spending it only on those frames that make each song distinct from the others.

The question then becomes what to use as a measure of frame "usefulness". Several possibilities come to mind:

  1. UBM likelihood ratio weighting: Train a Universal Background Model (UBM), as well as each song model normally. Then, for each frame, compute the ratio p(x|y)/p(x), where p(x|y) is the probability of frame x under its correct song model y, and p(x) is the overall likelihood of frame x under the universal background model. Then retrain, weighting each frame by this ratio.
  2. Entropy weighting: Train all song models first, and then for each frame of song i, compute the entropy of P_j(x), the vector of probabilities of that frame under each of the song models j. Then weight each frame by the inverse of that entropy.
  3. Mutual info weighting: Weight by the mutual information between the class label and the MFCC frame. The class label is the song ID. If Y represents the class label, and X represents the frame, then

    $$I(X;Y) = \sum_y p(y) \sum_x p(x|y) \log \frac{p(x|y)}{p(x)}$$

    Say the classes have a uniform prior, so p(y) is 1/N, and p(x|y) is just the GMM for class y evaluated at the point x. Finally, p(x) is the overall likelihood of frame x, which we can compute either as the sum of the probability under each model, e.g. $p(x) = \sum_y p(y)\, p(x|y) = \frac{1}{N} \sum_y p(x|y)$, or as the probability under the UBM, which is trained on all the data together. Furthermore, we don't actually want to sum over x, but rather we want to compute $w(x) = \sum_y p(y)\, p(x|y) \log \frac{p(x|y)}{p(x)}$ for each of the frames x that we see, and use this value as the frame weight during retraining.
Interestingly, option (3) is a bit like options (1) and (2) combined, because p(x|y)/p(x) is like the UBM ratio of (1), and then we're taking the expected log of that under the distribution p(x|y), which is similar to computing the entropy in (2).
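
To make the three options concrete, here is a minimal sketch of how each weight could be computed for a single frame from its log-likelihoods under the N song models. It is not the code used for the experiments below. The function names are hypothetical, p(x) is taken as the uniform-prior mixture over song models rather than a separately trained UBM, the probability vector in option (2) is assumed to be normalized to sum to one before taking its entropy, and the option (3) weight is the mutual information sum with the sum over x dropped.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def frame_weights(log_liks, true_song):
    """Candidate weights for one frame x.

    log_liks[j] is log p(x | song j); true_song is the index y of the
    song the frame came from. Returns (ratio, inv_entropy, mi).
    """
    n = len(log_liks)
    # Assumption: p(x) = (1/N) * sum_j p(x | j), i.e. a uniform prior over
    # songs stands in for the UBM.
    log_px = log_sum_exp(log_liks) - math.log(n)

    # Option 1: likelihood ratio p(x|y) / p(x) for the frame's own song.
    ratio = math.exp(log_liks[true_song] - log_px)

    # Option 2: inverse entropy of the normalized vector of per-song
    # probabilities (normalizing is an assumption of this sketch).
    posterior = [math.exp(l - log_px) / n for l in log_liks]
    entropy = -sum(p * math.log(p) for p in posterior if p > 0.0)
    inv_entropy = 1.0 / max(entropy, 1e-12)  # guard against zero entropy

    # Option 3: sum_y p(y) p(x|y) log(p(x|y) / p(x)) with p(y) = 1/N.
    # math.exp(l) can underflow to 0 for extremely unlikely frames.
    mi = sum(math.exp(l) / n * (l - log_px) for l in log_liks)
    return ratio, inv_entropy, mi
```

A frame that is equally likely under every song model gets a ratio of 1 and a mutual information weight of 0, which is the behaviour we want for generic background frames.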

I implemented option 3, which will be called the Mutual Information Weighted GMM. This plot shows the MFCC frames and the corresponding weights for each frame. As an implementation detail, I had to force the weight to zero if the MFCC frame energy drops below a threshold, because silence and fadeout frames were winding up with very high weights. It's not obvious what types of frames are assigned high weight; it may be that we're giving too much weight to very unlikely frames, and some kind of compromise is needed.
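
For illustration, here is one way the energy gating and the weighted retraining could look. This is a sketch under assumptions, not the actual implementation: it assumes diagonal-covariance GMMs, assumes the frame weight simply multiplies each frame's responsibilities in the EM M-step, and the names gate_by_energy and weighted_m_step are hypothetical.

```python
def gate_by_energy(weights, energies, threshold):
    """Force the weight to zero for frames whose energy is below threshold."""
    return [w if e >= threshold else 0.0 for w, e in zip(weights, energies)]


def weighted_m_step(frames, resp, weights, var_floor=1e-6):
    """One M-step of EM for a diagonal-covariance GMM with per-frame weights.

    frames:  list of T feature vectors (each a list of D floats)
    resp:    T x K responsibilities from the E-step (rows sum to 1)
    weights: T non-negative frame weights
    Returns (mix, means, variances) for the K components.
    """
    num_frames, dim, num_comp = len(frames), len(frames[0]), len(resp[0])
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("all frame weights are zero")
    mix, means, variances = [], [], []
    for k in range(num_comp):
        # Assumption: the frame weight scales the responsibility, so a
        # zero-weight frame contributes nothing to any component.
        g = [weights[t] * resp[t][k] for t in range(num_frames)]
        nk = sum(g)
        if nk <= 0.0:
            raise ValueError("component %d received no weighted frames" % k)
        mu = [sum(g[t] * frames[t][d] for t in range(num_frames)) / nk
              for d in range(dim)]
        var = [max(sum(g[t] * (frames[t][d] - mu[d]) ** 2
                       for t in range(num_frames)) / nk, var_floor)
               for d in range(dim)]
        mix.append(nk / total)
        means.append(mu)
        variances.append(var)
    return mix, means, variances
```

With this scheme a gated silence frame drops out of the mean and variance estimates entirely, no matter how large its mutual information weight was before gating.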

On uspop37 with a PPK kernel, using the R-precision metric, the MI-weighted GMMs do worse than the standard GMMs (.3213 vs. .3948), and their maximum hubness is higher (162 vs. 128). Sharded 3 ways, model training takes 160 minutes per shard on uspop37 (350 songs) using old uspop37 models.
GMMs: .3948; max hubness 128
MI-weighted GMMs: .3213; max hubness 162

Posted by madadam at July 27, 2006 01:22 AM