We explored several existing datasets, including the Million Song Dataset (MSD), its complementary datasets, and the Yahoo music dataset, as well as several music APIs, including the Spotify API, YouTube API, and Genius API. The data used in this project are summarized as follows:

  • Last.fm Dataset

This dataset provides tags and the most similar songs for most of the tracks in MSD. The tags are generated by Last.fm users and retrieved through the Last.fm API. There are 33,355 distinct tags in total for the 9,330 songs in the training subset. A song can be associated with multiple tags, which cover information such as genre, emotion, and occasion.

  • Million Song Dataset (MSD)

The core of MSD is the feature analysis and metadata for one million songs. The derived features include sample rate, duration, loudness, energy, etc. Other metadata describe the song, album, and artist, such as release date and artist location. There are also algorithmically estimated features: artist familiarity and artist hotness.

  • Spotify API

Spotify provides online streaming for millions of songs. We match the songs in our selected database to Spotify through their Echo Nest and Spotify IDs, and use the streaming links to connect the tracks to the returned recommendation list.

  • MusiXmatch (MXM) Dataset

The MXM dataset provides lyrics for 77% of the MSD tracks. The lyrics come in bag-of-words format: each track is described by its word counts over a dictionary of the top 5,000 words across the set. All lyrics can be matched directly to MSD using MSD IDs and MXM IDs.
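To make the bag-of-words format concrete, here is a minimal Python sketch of reading one track's word counts. It assumes each record is a comma-separated line of the form `track_id,mxm_id,idx:count,...` with 1-based indices into the word dictionary; that layout, the function name `parse_bow_line`, the tiny vocabulary, and the IDs in the example are illustrative assumptions, not taken from the project.

```python
from collections import Counter


def parse_bow_line(line, vocab):
    """Parse one assumed 'track_id,mxm_id,idx:count,...' record.

    Returns (track_id, mxm_id, Counter mapping word -> count).
    """
    fields = line.strip().split(",")
    track_id, mxm_id = fields[0], fields[1]
    counts = Counter()
    for pair in fields[2:]:
        idx, cnt = pair.split(":")
        # Assumption: indices are 1-based positions in the word dictionary.
        counts[vocab[int(idx) - 1]] = int(cnt)
    return track_id, mxm_id, counts


# Hypothetical 4-word dictionary and record, for illustration only.
vocab = ["i", "the", "you", "love"]
track_id, mxm_id, counts = parse_bow_line("TRHYPOTHETICAL01,12345,1:6,4:2", vocab)
```

Because only counts are stored, word order is lost, which is one reason raw lyrics are collected separately.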

  • Azlyrics Raw Lyrics

Azlyrics.com contains the raw lyrics we need. Given a song name and artist name, we scrape the raw lyrics from azlyrics.com. We have scraped approximately 80,000 raw lyrics with their song and artist names, and around 14,000 of them match Last.fm-tagged songs.
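Matching scraped lyrics to tagged songs by song name and artist name could look roughly like the sketch below. It is only an illustration under stated assumptions: the normalization rule (lowercase, keep letters and digits only), the function names `normalize` and `match_lyrics`, and the sample records are hypothetical, and the project's actual matching procedure may differ.

```python
import re


def normalize(text):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def match_lyrics(scraped, tagged):
    """Match scraped lyrics to tagged songs on normalized (artist, title).

    scraped: iterable of (artist, title, lyrics)
    tagged:  iterable of (track_id, artist, title)
    Returns {track_id: lyrics} for the songs found in both.
    """
    index = {(normalize(a), normalize(t)): tid for tid, a, t in tagged}
    matched = {}
    for artist, title, lyrics in scraped:
        tid = index.get((normalize(artist), normalize(title)))
        if tid is not None:
            matched[tid] = lyrics
    return matched


# Hypothetical records, for illustration only.
tagged = [("TR_A", "The Beatles", "Hey Jude"), ("TR_B", "Queen", "Bohemian Rhapsody")]
scraped = [("the beatles", "Hey Jude!", "lyrics text"), ("Unknown", "Song", "other text")]
matched = match_lyrics(scraped, tagged)
```

Exact matching on normalized names is strict, so spelling variants or featured-artist suffixes would go unmatched; a fuzzy comparison would be a possible alternative.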
