Sunday, April 30, 2006

Inadvertent Hashing and CDDB

Most people use the CDDB either knowingly or unknowingly. It's the backend to most computer CD players that downloads track/title information from a central site when you're playing a CD. Most player software allows you to add in information if CDDB doesn't return a hit, and I've done this every now and then. But overall I have been amazed by the hit rate of CDDB.

I assumed that CDs contained all the track information, and many do. But apparently, CDDB exploits an inadvertent hashing mechanism: the sequence of track times.
The CDDB database has information that allows your computer to identify a particular music CD in the CD drive and list its album title and track titles. Their service is used by RealJukebox, MusicMatch, WinAmp, and others. The title information is not stored on most CDs. The only information in the CD data is the number of tracks (songs) and the length of each. This is the information your CD player displays. What CDDB does is let the software on your PC take that track information, send a CD signature to CDDB through Internet protocols (if you're connected) and get back the titles. It works because songs are of relatively random length. The chances are good almost all albums are unique. (Figure there are about 10 songs on an album, and they each run from a minute and a half or so to three and a half minutes long, so the times vary by 100 seconds. There are 100x100x...x100 = 100**10 = 10**11 = 1 hundred billion = an awful lot of possible combinations.) An album is identified by a signature that is a special arithmetic combination of the times of all the tracks.

You'd figure that CDDB just bought a standard database with all the times and titles. Well, there wasn't one. What they did was accept Internet-relayed postings with the track timing information and the titles typed in by a volunteer.


Categories

4 comments:

  1. Is it CDDB or FreeDB that they use? Original CDDB screwed a lot of people over when they became commercial. More information in the Wikipedia entry: CDDB Wikipedia entry 

    MusicBrainz is a very cool alternative with a different approach:
    MusicBrainz Wikipedia entry
    http://musicbrainz.org/ 

    Posted by D

    ReplyDelete
  2. well you have to remember that CDDB is not being hacked. if your only goal is to get a decent hash function, it's a lot easier than trying to get a hash function that can't be hacked.

    Moreover, mathematicians don't own the copyright on good ideas :) 

    Posted by Suresh

    ReplyDelete
  3. CDDB (or Briklin) might not need a mathematician, but a calculator would help. 100^10 = 10^20, not 10^11. However, even if you start with this assumption (which is dubious, since track times come from a lower entropy distribution), the "birthday paradox" means that the probability of a collision is roughly 10^20/(#cds^2), so once there are 10^10 CDs or so, a collision becomes likely.  

    Posted by David Applegate

    ReplyDelete
  4. "so once there are 10^10 CDs or so, a collision becomes likely. "

    NOT EVEN THAT. I have several hundred CDs and I run into collisions about 1 in 100 times. 

    Posted by Anonymous

    ReplyDelete

Disqus for The Geomblog