August 06, 2003

Artist/Album/Track name normalization

Here, for posterity, are my recommendations for normalizing the names of artists, albums, and tracks:

It is particularly important that every variant of an artist's name normalizes to the same form, so artist names get the most severe normalization:

  1. Map all names to lower case.
  2. Delete apostrophes ("'") and periods (".").
  3. Map every remaining character other than a-z and 0-9 to "_". Runs of consecutive _'s fold into a single _. Trailing _'s are dropped.
  4. Don't reorder proper names - it's just too hard, and there's no clear boundary between proper names and band names. No more deejay_alice, cyrus_billy_ray, amos_tori etc.
  5. Always drop leading "the". the_beatles and the_verve were the only ones who escaped this in aset400, but uspop2002 had lots of *_the. I guess always drop a leading indefinite article too, although I think "A New Found Glory" (new_found_glory) is the only one.

Other examples:

   N'sync -> nsync
   D'Angelo -> dangelo
   R. Kelly -> r_kelly
   P.J. Harvey -> pj_harvey
   Run-D.M.C. -> run_dmc
   The Presidents of the United States of America ->
                                presidents_of_the_united_states_of_america
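
For concreteness, the artist rules above could be sketched in Python roughly as follows. The function name normalize_artist is made up for illustration, and two details are assumptions rather than part of the rules: "an" is treated as a leading indefinite article alongside "a", and leading _'s are left alone because the rules only mention trailing ones.

```python
import re

def normalize_artist(name: str) -> str:
    """Sketch of the artist-name rules; the function name is hypothetical."""
    s = name.lower()                          # rule 1: lower case
    s = s.replace("'", "").replace(".", "")   # rule 2: delete ' and .
    s = re.sub(r"[^a-z0-9]+", "_", s)         # rule 3: map the rest to _, folding runs
    s = s.rstrip("_")                         # rule 3: drop trailing _'s
    # rule 4: no reordering of proper names, so nothing to do here
    # rule 5: drop a leading "the" or indefinite article
    # ("an" is an assumption; the post only names "the" and "A")
    s = re.sub(r"^(?:the|an|a)_(?=.)", "", s)
    return s

print(normalize_artist("Run-D.M.C."))         # run_dmc
print(normalize_artist("P.J. Harvey"))        # pj_harvey
print(normalize_artist("A New Found Glory"))  # new_found_glory
```

The article is stripped after the character mapping so that it only has to match a "the_" style prefix; a name that consists solely of "The" or "A" is left untouched.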

Finally, there are some special cases for semantically-equivalent names:

   Bruce Springsteen and the E Street Band -> bruce_springsteen
   Tom Petty and the Heartbreakers -> tom_petty
   Bob Marley and the Wailers -> bob_marley
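
One way to handle such special cases is a small lookup table applied after the general rules. The sketch below is only an illustration: the names ARTIST_ALIASES and canonical_artist are hypothetical, the keys assume the long names have already been through the artist rules, and the table holds just the three cases listed above.

```python
# Hypothetical alias table: normalized long form -> canonical artist name.
# Keys assume the artist rules have already been applied to the raw name.
ARTIST_ALIASES = {
    "bruce_springsteen_and_the_e_street_band": "bruce_springsteen",
    "tom_petty_and_the_heartbreakers": "tom_petty",
    "bob_marley_and_the_wailers": "bob_marley",
}

def canonical_artist(normalized: str) -> str:
    """Return the canonical name for a normalized artist, if one is listed."""
    return ARTIST_ALIASES.get(normalized, normalized)

print(canonical_artist("bob_marley_and_the_wailers"))  # bob_marley
print(canonical_artist("r_kelly"))                     # r_kelly
```

Spelling variants such as "&" in place of "and" normalize to different keys, so each would need its own entry in a table like this.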

Album and Track names are less severely normalized:

  • no case folding for album and track names (really problematic, but still)
  • dashes are allowed in addition to alphanumerics
  • everything else maps to _, including period and apostrophe
  • multiple _'s merge into one
  • leading/trailing _'s are OK
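
The album and track rules reduce to a single substitution. Here is a minimal Python sketch, with normalize_title as a made-up name; it assumes that underscores already present in a title are merged along with the mapped ones.

```python
import re

def normalize_title(title: str) -> str:
    """Sketch of the album/track rules; the function name is hypothetical."""
    # Keep letters (case preserved), digits and dashes. Every run of other
    # characters, including periods and apostrophes, becomes one "_".
    # Leading and trailing _'s are deliberately kept.
    return re.sub(r"[^A-Za-z0-9-]+", "_", title)

print(normalize_title("Semi-Charmed Life"))              # Semi-Charmed_Life
print(normalize_title("(I Can't Get No) Satisfaction"))  # _I_Can_t_Get_No_Satisfaction
```

Note the contrast with artist names: here the apostrophe in "Can't" becomes an _ rather than being deleted.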

I'm preparing versions of aset400 (the reference list of artist names) and uspop2002 (list of 8772 track names used in the ISMIR03 experiments) based on these rules.

The idea is to update this entry as and when necessary.

(Not to undermine the exclusivity of this blog, but I've copied this information to my musicsim site here.)

    DAn.
Posted by dpwe at August 6, 2003 07:07 PM