Mutual-Information weighted GMMs
If we take the perspective that background frames hurt performance, that is to say, there are frames which are fairly useless for distinguishing song models because they are generic, shared by all models, and thus part of the "universal background", then what can we do to improve the situation? One approach is to assign a weight to each frame based on its usefulness, and retrain the models taking into account this weight. The aim is to be thrifty with modeling power, spending it only on those frames that make each song distinct from the others.
The question then becomes what to use as a measure of frame "usefulness". Several possibilities come to mind:
- UBM likelihood ratio weighting: Train a Universal Background Model, as well as each song model normally. Then, for each frame, compute the ratio p(x|y) / p(x), where p(x|y) is the probability of frame x under its correct song model y, and p(x) is the overall likelihood of frame x under the universal background model. Then retrain, weighting each frame by this ratio.
- Entropy weighting: Train all song models first, and then for each frame of song i, compute the entropy of P_j(x), the vector of probabilities of that frame under each of the song models j (normalized to sum to one). Then weight each frame by the inverse of that entropy.
- Mutual info weighting: Weight by the mutual information between the class label and the MFCC frame. The class label is the song ID. If Y represents the class label, and X represents the frame, then
$$I(X;Y) = \sum_x \sum_y p(x|y)\,p(y)\,\log\frac{p(x|y)}{p(x)}$$
Say the classes have a uniform prior, so p(y) is 1/N, and p(x|y) is just the GMM for class y evaluated at the point x. Finally, p(x) is the overall likelihood of frame x, which we can compute either as the sum of the probability under each model, e.g. $p(x) = \sum_y p(x|y)\,p(y)$, or as the probability under the UBM, which is trained on all the data together. Furthermore, we don't actually want to sum over x, but rather we want to compute
$$\sum_y p(x|y)\,p(y)\,\log\frac{p(x|y)}{p(x)}$$
for each of the frames x that we see, and use this value as the frame weight during retraining.
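As a rough sketch of how the three candidate weights could be computed, suppose we already have a matrix of per-frame log-likelihoods under every song model. The function and argument names here (`frame_weights`, `log_px_given_y`, `true_model`, `log_px_ubm`) are made up for illustration, the class prior is assumed uniform, and all log-likelihoods are assumed finite. This is not the code behind the experiments in this post.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def frame_weights(log_px_given_y, true_model, log_px_ubm=None):
    """Compute the three candidate frame weights for one song.

    log_px_given_y : (T, N) array, log p(x_t | y=j) for T frames, N song models
    true_model     : index of the song model the frames actually belong to
    log_px_ubm     : optional (T,) array, log p(x_t) under a UBM; if None,
                     p(x) is the uniform mixture of the N song models
    """
    T, N = log_px_given_y.shape
    log_prior = -np.log(N)  # uniform prior p(y) = 1/N (assumption)
    if log_px_ubm is None:
        log_px = log_sum_exp(log_px_given_y, axis=1) + log_prior
    else:
        log_px = np.asarray(log_px_ubm, dtype=float)

    # (1) likelihood ratio p(x|y) / p(x) for the correct song model
    ratio = np.exp(log_px_given_y[:, true_model] - log_px)

    # (2) inverse entropy of the normalized per-model probabilities
    log_post = log_px_given_y - log_sum_exp(log_px_given_y, axis=1)[:, None]
    post = np.exp(log_post)
    entropy = -np.sum(post * log_post, axis=1)
    inv_entropy = 1.0 / np.maximum(entropy, 1e-12)  # guard against zero entropy

    # (3) per-frame mutual information term:
    #     sum_y p(x|y) p(y) log( p(x|y) / p(x) )
    log_ratio = log_px_given_y - log_px[:, None]
    mi = np.sum(np.exp(log_px_given_y + log_prior) * log_ratio, axis=1)

    return ratio, inv_entropy, mi
```

A frame that every model explains equally well gets a ratio of 1 and a mutual information term of 0, while a frame that one model explains much better than the rest gets a larger value on all three measures.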
Interestingly, option (3) is a bit like options (1) and (2) combined,
because p(x|y)/p(x) is like the UBM ratio of (1), and then we're
taking the expected log of that under the distribution p(x|y), which
is similar to computing the entropy in (2).
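The connection can be made a little more explicit with a short derivation. It assumes the per-frame weight is the mutual information sum with the outer sum over x dropped, and that p(x) is the mixture of the song models rather than a separately trained UBM. The name w(x) is only a label used here.

```latex
\begin{align*}
w(x) &= \sum_y p(y)\, p(x|y) \log\frac{p(x|y)}{p(x)} \\
     &= p(x) \sum_y p(y|x) \log\frac{p(y|x)}{p(y)}
        && \text{(Bayes: } p(x|y)/p(x) = p(y|x)/p(y)\text{)} \\
     &= p(x)\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y)\big) \\
     &= p(x)\,\big[\log N - H(Y \mid X = x)\big]
        && \text{(uniform prior } p(y) = 1/N\text{)}
\end{align*}
```

Under these assumptions the weight is never negative. It grows as the entropy of the per-model posterior falls, which is the entropy idea of (2), and the log term is the likelihood ratio of (1). It is also scaled by the overall density p(x).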
I implemented option 3, which will be called the Mutual Information Weighted GMM. This plot shows the MFCC frames and the corresponding weights for each frame. As an implementation detail, I had to force the weight to zero if the MFCC frame energy drops below a threshold, because silence and fadeout frames were winding up with very high weights. It's not obvious what types of frames are assigned high weight; it may be that we're giving too much weight to very unlikely frames, and some kind of compromise is needed.
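The post does not spell out how the weights enter retraining. One common way is to scale each frame's responsibilities by its weight in the M-step of EM. The sketch below does that for a diagonal-covariance GMM, and also zeroes the weight of low-energy frames as described above. All names (`gate_low_energy`, `weighted_m_step`, `energy`, `threshold`, `var_floor`) are hypothetical, and how frame energy is measured is left open.

```python
import numpy as np

def gate_low_energy(weights, energy, threshold):
    """Return a copy of weights with entries zeroed where energy < threshold."""
    gated = np.asarray(weights, dtype=float).copy()
    gated[np.asarray(energy) < threshold] = 0.0
    return gated

def weighted_m_step(X, resp, weights, var_floor=1e-6):
    """One weighted M-step for a diagonal-covariance GMM.

    X       : (T, D) frames
    resp    : (T, K) responsibilities from the E-step
    weights : (T,) non-negative per-frame weights
    Returns mixing proportions (K,), means (K, D), variances (K, D).
    """
    wr = resp * weights[:, None]              # weight-scaled responsibilities
    nk = np.maximum(wr.sum(axis=0), 1e-12)    # effective count per component
    mix = nk / nk.sum()
    means = (wr.T @ X) / nk[:, None]
    second_moment = (wr.T @ (X ** 2)) / nk[:, None]
    variances = np.maximum(second_moment - means ** 2, var_floor)
    return mix, means, variances
```

A frame with weight zero drops out of every sum, so gated silence or fadeout frames have no influence on the retrained means and variances.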
On uspop37 with a PPK kernel, using the R-precision metric, performance is no better than with the standard GMMs; in fact R-precision drops and the maximum hubness rises (numbers below). Sharded 3 ways, model training takes 160 minutes per shard on uspop37 (350 songs) using old uspop37 models.
GMMs: R-precision .3948; max hubness 128
MI-weighted GMMs: R-precision .3213; max hubness 162
Posted by madadam at July 27, 2006 01:22 AM