June 28, 2006

crappy kernels

A typical row, sorted, and a histogram of the distances:
Posted by madadam at 12:20 AM

June 27, 2006

covariance size vs hubness

Some (unsatisfyingly inconclusive?) results. Below are some correlation plots of various measures of covariance size vs. hubness across all 1000 songs in this subset of uspop.

The top row is from unnormalized data, and the bottom from normalized data. The idea was that normalization was necessary in order to be able to compare sizes of covariance matrices between models; but it doesn't appear to have made a big difference. Perhaps because the data had pretty similar distributions across all songs?

In this figure, in the left column covariance size is computed as the weighted sum of the log determinants of all mixture components, weighted by the component's prior. Similarly, the right column uses the weighted sum of (not-log) traces. All of these models are diagonal-covariance GMMs, so the trace and determinant of each component are the sum and product, respectively, of the diagonal.
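The two size measures described above can be sketched for a single diagonal-covariance GMM as follows. This is a minimal illustration, not the code that produced the plots; the function name `covariance_sizes` and the list-of-lists layout for the variances are assumptions made for the example.

```python
import math

def covariance_sizes(priors, variances):
    """Return (weighted log-determinant, weighted trace) for a diagonal GMM.

    priors    -- list of K mixture weights (assumed to sum to 1)
    variances -- list of K lists, each holding the D diagonal variances
                 of one component (hypothetical layout for this sketch)
    """
    weighted_logdet = 0.0
    weighted_trace = 0.0
    for weight, diag in zip(priors, variances):
        # For a diagonal covariance the determinant is the product of the
        # diagonal, so its log is the sum of the logs.
        logdet = sum(math.log(v) for v in diag)
        # The trace is simply the sum of the diagonal.
        trace = sum(diag)
        weighted_logdet += weight * logdet
        weighted_trace += weight * trace
    return weighted_logdet, weighted_trace
```

Summing logs rather than multiplying variances avoids underflow when the dimensionality is high and the variances are small.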

The numbers above each plot are the correlation coefficient and p-value.

In the next figure, instead of the weighted sum of determinants (or traces), I take the max or min of the determinants across all mixture components.

So overall, it looks like the only decent lead is that the minimum determinant is slightly correlated with hubness. But does this support the "tight covariance caused by repetition" theory of hubs? There's a twist: because this is the minimum, a positive correlation with hubness says that models whose smallest component is relatively large are more likely to be hubs. That's actually the opposite of what the "tight covariance" theory predicts. More thought required about why large minimums would be more of an indication of hubness than large maximums. Are large minimums confounded with something else that's more intuitive?

There's probably a "correct" way to measure covariance size of GMMs in this context... wonder what it is.

Posted by madadam at 05:23 PM

normalized vs not

[Updated on 7/19 after I fixed the bug in computing R-precision.]

Unnormalized features: R-prec: .3959
Normalized features: R-prec: .4009

JJ got up to .65 on a smaller dataset, but keep in mind there are 340 classes and only 1000 songs here; in other words, very small classes. i should try a subset with fewer classes and more songs per artist: something matched to the size of JJ's test database (350 songs, 37 artists), but not necessarily as "easy" as his set (his was chosen to be well-separated timbrally, and homogeneous within the classes).

To use one of JJ's hubness measures, i computed the number of times each song appears among the 20 nearest neighbors of the other songs (out of the 1000-song subset of uspop). The maximum hubness number was 202 with the unnormalized data, and 221 with the normalized data.
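As a rough sketch of that hubness count, assuming the kernel is available as a square matrix of distances where smaller means more similar (the name `hub_counts` and the nested-list input are made up for this example):

```python
def hub_counts(dist, k=20):
    """Count how often each item appears in the k nearest neighbors of the others.

    dist -- square list of lists; dist[i][j] is the distance from item i
            to item j (assumption: smaller values mean more similar)
    k    -- neighborhood size (20 in the experiment described above)
    """
    n = len(dist)
    counts = [0] * n
    for i in range(n):
        # Rank every other item by its distance from item i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: dist[i][j])
        # Each of the k closest items gets one "vote" from item i.
        for j in others[:k]:
            counts[j] += 1
    return counts
```

The counts always sum to n times k, so the mean hubness is k by construction; hubs are the items far above that mean.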

Here's a plot showing the correlation between the hubness scores for each of the 1000 songs under the unnormalized (x-axis) and normalized (y-axis) kernels. The correlation coefficient is .726 with a negligible p-value.
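For reference, a correlation coefficient between two hubness vectors can be computed as below. This assumes the reported coefficient is Pearson's r, which the post does not state explicitly, and it leaves out the p-value. The name `pearson_r` is just for this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```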

Here's a plot of the distribution of hubness numbers (sorted), showing the power-law-likeness:

There are 25 songs that we'll consider hubs, i.e. those with a hubness number above 100:

   'carly_simon/No_Secerts/03-You_re_So_Vain.gmm'
   'police/Live_Disc_Two_-_Atlanta_Synchronicity_Concert/15-So_Lonely.gmm'
   'cat_stevens/Teaser_And_The_Firecat/10-Peace_Train.gmm'
   'sting/Ten_Summoner_s_Tales/06-Seven_Days.gmm'
   'tina_turner/Wildest_Dreams/08-Confidential.gmm'
   'george_michael/Faith/07-Look_at_Your_Hands.gmm'
   'michael_jackson/Thriller/05-Beat_It.gmm'
   'tina_turner/Twenty_Four_Seven/08-Without_You.gmm'
   'joe/My_Name_Is_Joe/07-Get_Crunk_Tonight.gmm'
   'paula_abdul/Forever_Your_Girl/10-One_Or_The_Other.gmm'
   'deep_purple/Nobody_s_Perfect_Disc_2_2_/06-Smoke_On_The_Water.gmm'
   'dave_matthews_band/Crash/02-Two_Step.gmm'
   'bob_marley/Confrontation/09-I_Know.gmm'
   'queen/The_Game/09-Coming_Soon.gmm'
   'phil_collins/Face_Value_/09-Thunder_And_Lightning.gmm'
   'styx/The_Grand_Illusion/05-Miss_America.gmm'
   'richard_marx/Repeat_Offender/11-Children_Of_The_Night.gmm'
   'nick_cave_and_the_bad_seeds/Murder_Ballads/09-O_Malley_s_Bar.gmm'
   'everclear/So_Much_For_The_Afterglow/10-White_Men_In_Black_Suits.gmm'
   'def_leppard/Adrenalize/01-Let_s_Get_Rocked.gmm'
   '112/Room_112/04-Love_Me_Feat_Mase_.gmm'
   'r_kelly/TP-2_COM/16-I_wish_Remix_To_the_homies_that_we_lost_.gmm'
   'soul_asylum/Grave_Dancers_Union/07-New_World.gmm'
   'cake/Fashion_Nugget/11-Nugget.gmm'
   'paul_simon/Concert_in_the_Park_Disc_1/11-Proof.gmm'

For the kernel derived from normalized MFCCs, there are only 17 songs with a hubness number above 100. The list is similar (recall the hubnesses are well-correlated), but there are some new additions.

    'carly_simon/No_Secerts/03-You_re_So_Vain.gmm'
    'michael_jackson/Thriller/01-Wanna_Be_Startin_Somethin_.gmm'
    'culture_club/Greatest_Moments/09-Move_Away.gmm'
    'sting/Ten_Summoner_s_Tales/06-Seven_Days.gmm'
    'deep_purple/Nobody_s_Perfect_Disc_2_2_/06-Smoke_On_The_Water.gmm'
    'michael_jackson/Thriller/05-Beat_It.gmm'
    'dave_matthews_band/Crash/02-Two_Step.gmm'
    'def_leppard/Adrenalize/01-Let_s_Get_Rocked.gmm'
    'styx/The_Grand_Illusion/05-Miss_America.gmm'
    'r_kelly/TP-2_COM/16-I_wish_Remix_To_the_homies_that_we_lost_.gmm'
    'aerosmith/Live_Bootleg/14-I_Ain_t_Got_You.gmm'
    'cat_stevens/Teaser_And_The_Firecat/10-Peace_Train.gmm'
    'zz_top/Tres_Hombres/02-Jesus_Just_Left_Chicago.gmm'
    'soul_asylum/Grave_Dancers_Union/07-New_World.gmm'
    'nick_cave_and_the_bad_seeds/Murder_Ballads/09-O_Malley_s_Bar.gmm'
    'elton_john/Too_Low_For_Zero/11-Earn_While_You_Learn_Bonus_Track_.gmm'
    'nelly/Country_Grammar/15-Never_Let_Em_C_U_Sweat.gmm'
Posted by madadam at 02:52 PM

June 23, 2006

hubness experiment howto

1. Create the listfile of mp3s for the set we want to compute the kernel over.

2. If we want to do global normalization, compute the normfile using work/globalnorm.pl

3. Run mothra, which computes pfiles on the fly from mp3s and then trains models. Optionally shard it if you want to run it on several machines:

work/run-mothra.pl -list LISTFILE -pfile_dir PFILE_DIR -model_root MODEL_ROOT [-norm_file NORMFILE -nshards N -shard M]

4. a. Create the model listfile by copying the listfile and s/\.mp3$/\.gmm/
b. Run mocomp, which compares models and outputs a kernel:

work/Torch3/mothra/Linux_opt_float/mocomp -model_root MODEL_ROOT MODEL_LISTFILE KERNEL
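Step 4a is just a suffix substitution on each line of the listfile. A minimal Python equivalent of the `s/\.mp3$/\.gmm/` substitution, with a hypothetical helper name, might look like this:

```python
import re

def to_model_list(mp3_paths):
    """Map each mp3 path to its model path by replacing a trailing .mp3 with .gmm."""
    # The $ anchor ensures only the final extension is rewritten.
    return [re.sub(r"\.mp3$", ".gmm", path) for path in mp3_paths]
```

Lines that do not end in `.mp3` pass through unchanged, so it is worth checking the output listfile for stray entries before running mocomp.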

5. Create the labelfile for the artist-classification task:

work/make-labels.pl < LISTFILE > LABELFILE

6. a. Load the kernel and the labels into matlab, e.g.: kernel = load('KERNEL_FILE');
b. Compute the R-precision and get the binary N-neighbor matrix and compute hub count:

[p,nn] = rprec(kernel, labels); h=sum(nn);
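The `rprec` function itself is not shown here. For readers unfamiliar with the measure, the sketch below implements the usual textbook definition of R-precision in Python; it is an assumption that `rprec` follows this definition, and the sketch treats the kernel as a distance matrix (smaller means closer).

```python
def r_precision(dist, labels):
    """Mean R-precision over all queries.

    dist   -- square list of lists of distances (assumption: smaller = closer)
    labels -- class label (e.g. artist) for each item

    For each query, R is the number of other items sharing its label, and
    the score is the fraction of its R nearest neighbors with that label.
    """
    n = len(dist)
    scores = []
    for i in range(n):
        r = sum(1 for j in range(n) if j != i and labels[j] == labels[i])
        if r == 0:
            continue  # no relevant items exist for this query; skip it
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])
        hits = sum(1 for j in ranked[:r] if labels[j] == labels[i])
        scores.append(hits / r)
    return sum(scores) / len(scores) if scores else 0.0
```

With many classes and few songs per class, R is small for most queries, so each query's score is coarse and a single wrong neighbor costs a lot.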

Posted by madadam at 03:36 PM

global normalization

ok, i have most of the machinery in place to do the experiment about covariance size and hubness: look at the sizes of the covariance matrices, and plot them against the song "hubness". The idea was that under one hypothesis (hubs are caused by universal background radiation), hubs will have very broad covariances because they're similar to the background, but under another (hubs are caused by very localized tight gaussians fit to common repetitive frames) the covariances will be small.

But in order to compare the norms of covariance matrices across models, the data has to be globally normalized to unit variance in all dimensions. Otherwise, gaussians that stretch out along dimensions with a naturally larger scale will seem bigger than gaussians that stretch out along dimensions with a smaller scale.

since there are a lot of songs, it would be unwieldy to create one giant (15G) pfile and run qnnorm on it. instead, I took a sampling approach, by computing qnnorm over each file, then averaging the results into a global norm file. It was a bit rough & ready; ideally I would compensate for the different lengths (frame counts) of different songs, but it probably won't make a big difference.

The command was:

~/work/globalnorm.pl -nsamples 500 -output uspop-global.norm < libraries/uspop_trans.list
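To make the difference between plain averaging and frame-count compensation concrete, here is a small sketch. It works on per-song (mean, variance) vectors rather than on actual qnnorm output, whose file format is not assumed here, and both function names are invented for the example. It does not reproduce globalnorm.pl.

```python
def average_norms(stats):
    """Unweighted average of per-song statistics (the rough-and-ready approach).

    stats -- list of (mean_vector, variance_vector) pairs, one per song
    """
    n = len(stats)
    dim = len(stats[0][0])
    mean = [sum(s[0][d] for s in stats) / n for d in range(dim)]
    var = [sum(s[1][d] for s in stats) / n for d in range(dim)]
    return mean, var

def pooled_norms(stats, counts):
    """Frame-count-weighted pooling, equivalent to computing over all frames.

    counts -- number of frames in each song
    """
    total = sum(counts)
    dim = len(stats[0][0])
    mean = [sum(c * s[0][d] for s, c in zip(stats, counts)) / total
            for d in range(dim)]
    # Per song, E[x^2] = variance + mean^2; pool that, then subtract the
    # squared global mean to get the global variance.
    var = [sum(c * (s[1][d] + s[0][d] ** 2) for s, c in zip(stats, counts)) / total
           - mean[d] ** 2
           for d in range(dim)]
    return mean, var
```

The pooled variance also picks up the spread between song means, which the plain average of variances ignores; if songs have similar means and lengths, the two give close results.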

Posted by madadam at 01:25 PM

June 19, 2006

playing with GMMs

I'm starting to look at the GMMs more closely to figure out what's going on with hubs. One thing I wondered was how well-spread the mixture components are; are they all on top of each other, are there a few outliers that might be dominating, or what?

here's a plot of a GMM for one song (wham/freedom). the top is the euclidean distance kernel between mixture centers, the middle is the trace (sum of eigenvalues) of the covariance matrices, and the bottom is the determinant (product of eigenvalues). The determinant is what we ought to be interested in; it's generally thought of as the size of the matrix. but curiously, the outlier shows up as a peak in the trace. i'll have to think about that.

[thought about it for a minute and realized that they're closely related by a log, but i'm not sure if there should be any major difference.]
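One way to write down the relationship, for a covariance matrix with positive eigenvalues \lambda_1, ..., \lambda_D (for the diagonal models used here, these are just the diagonal variances):

```latex
\operatorname{tr}(\Sigma) = \sum_{i=1}^{D} \lambda_i,
\qquad
\log\det(\Sigma) = \sum_{i=1}^{D} \log \lambda_i,
\qquad
\det(\Sigma)^{1/D} \le \frac{\operatorname{tr}(\Sigma)}{D}.
```

The last inequality is the arithmetic-geometric mean inequality. The trace is dominated by the largest eigenvalues, while the log determinant is pulled down strongly by small ones, so a component that is very broad in only a few dimensions can peak in the trace without standing out in the determinant. Whether that explains the outlier in this plot is a guess, not something checked here.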

Next step would be to look at which mfcc frames activate which components, and what happens with these outliers. silence? or are they catch-all components?

Posted by madadam at 01:15 PM