MODELING

BASELINE

AND

BASELINE MODEL

BIG PICTURE

FURTHER EXPLORATION

Further Exploration

Our modeling process uses two primary sources of input data: existing bag-of-words data from MusiXmatch and raw lyrics from AzLyrics.

We start with manual feature extraction and dimensionality reduction on both sources of data, and then apply traditional predictive models, including Support Vector Machines, Random Forest, and Gaussian Naïve Bayes.

To improve on these results, we used an unsupervised method, Latent Dirichlet Allocation (LDA), to explore the most appropriate targets for prediction, and Word2Vec to create word embeddings as better predictors. We also implemented Long Short-Term Memory (LSTM), a recurrent neural network model well suited to sentiment analysis, as our advanced predictive model. The final model we propose combines Word2Vec and LSTM, and it is also the one that gives us the highest accuracy.

For the baseline model, we drew on multiple existing studies that use moods as prediction targets, and selected 4 major mood tags: happy, sad, energetic and relax. Our work consists of feature engineering followed by common classification methods, including Gaussian Naive Bayes, Support Vector Machine (SVM), and Random Forest.

The focus of our work is to turn the original bag-of-words format of 5000 words into a collection of features with appropriate dimensionality and enough predictive power. We took two main feature engineering steps:

Word selection:

To ensure that we deal only with words that have strong predictive power, we considered removing the standard stop words provided by a Python library, as well as the most common words across all targeted tags. We also tried converting the word counts into term frequency-inverse document frequency (tf-idf) weights, which account for the importance of a word to a song based on how many documents in the entire corpus the word appears in. Based on the model outputs, removing only the stop words gave the greatest improvement in classification accuracy for two of the three classification models we ran, so we applied the dimensionality reduction algorithms to word counts with stop words excluded.
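
The two word-selection ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the project: the stop-word list, the toy songs, and the function name `tfidf` are all hypothetical, and the weighting is the textbook form tf × log(N / df) without the smoothing that some libraries apply.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the write-up only says "standard stop words".
STOP_WORDS = {"the", "a", "and", "i", "you", "to"}

def tfidf(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    # Drop stop words before counting.
    counts = [Counter(w for w in doc if w not in STOP_WORDS) for doc in docs]
    n_docs = len(docs)
    # Document frequency: the number of songs that contain each word.
    df = Counter(w for c in counts for w in c)
    weights = []
    for c in counts:
        total = sum(c.values()) or 1
        weights.append(
            {w: (n / total) * math.log(n_docs / df[w]) for w, n in c.items()}
        )
    return weights

# Toy corpus, made up for illustration.
songs = [
    "i love you and the night".split(),
    "the night is cold".split(),
    "love love love".split(),
]
weights = tfidf(songs)
```

In this toy corpus, "cold" appears in only one song and therefore receives a higher weight in that song than "night", which appears in two.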

Dimensionality Reduction:
- Principal Component Analysis (PCA)
- Hashing Vectorizer

With PCA, we could capture 90% of the variance in the data with 800 principal components, but the best classification accuracy dropped to 41.93%, obtained with SVM.
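
Choosing the number of components for a variance target amounts to a cumulative sum over the per-component variances. The sketch below assumes those variances are already available (for example, from a fitted PCA); the helper name `components_for_variance` and the toy numbers are hypothetical.

```python
from itertools import accumulate

def components_for_variance(variances, target=0.90):
    """Smallest k whose top-k components explain at least `target` of the variance."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    return len(ordered)

# Toy per-component variances, made up for illustration.
toy = [5.0, 3.0, 1.5, 0.3, 0.2]
k = components_for_variance(toy)
```

With these toy values, the first three components already cover 95% of the total variance, so `k` is 3.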

Using raw lyrics as inputs, the hashing vectorizer tokenizes the words into different bins according to the chronological order of the words in the sentence. The number of bins is the reduced dimension of the data. Taking the ordering of words into account when reducing dimension, we achieved a classification accuracy of 48% with 50 bins on approximately 3000 songs.
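
For readers unfamiliar with the hashing trick, the following is a minimal sketch of how tokens can be folded into a fixed number of bins. It is an illustration under assumptions, not the project's code: it uses CRC32 from the standard library as the hash, the function name `hash_features` is hypothetical, and in this sketch the bin of a token depends only on the token itself.

```python
import zlib

def hash_features(tokens, n_bins=50):
    """Map a token list to a fixed-length count vector via the hashing trick."""
    vec = [0] * n_bins
    for tok in tokens:
        # crc32 is stable across runs (Python's built-in hash() is salted per process).
        vec[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return vec

# Toy lyric line, made up for illustration.
lyrics = "dance dance all night long".split()
features = hash_features(lyrics)
```

Whatever the vocabulary size, the output always has `n_bins` entries, which is what makes the bin count the reduced dimension. The trade-off is that distinct words can collide in the same bin.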

Based on the above analysis, we define our baseline model as the combination of a hashing vectorizer and a Random Forest classifier, which gives us 48% accuracy on 4 mood tags.

The selected mood tags are somewhat too specific to one particular area, and they ignore the great majority of possibilities among the original 30,000 distinct tags, so we feel the need to expand to a wider and more reasonable range of prediction targets. At the same time, since the baseline model gives less than 50% accuracy with only 4 tags, we can expect the accuracy to be even lower once we increase the number of response classes. It is therefore also necessary to seek improvement on the modeling side, as the traditional methods may not be good enough for such a complicated problem.

We therefore attempted to improve our results in the following two ways:

1. Reconsider our choice of target groups and expand the set of predicted tags. The problem we are trying to solve is how to select from the universe of 30,000 tags so that the selected tags better reflect the topics in the lyrics. We explore the natural separation of topics in our lyric corpus with an unsupervised learning method, Latent Dirichlet Allocation (LDA), a generative topic model that finds latent topics in a text corpus. The model assumes that each document is a mixture over an underlying set of topics, and that each topic is a probability distribution over words.

We varied the number of topics n from 1 to 100 and used perplexity to measure the performance of the LDA model. The results show that perplexity decreases as the number of topics increases, indicating that the model fits the data better. However, the rate of decrease slows as n grows. We therefore chose n=25 as a balance between accuracy and model complexity.
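
As a reminder of what the metric measures, perplexity is the exponential of the negative average log-likelihood per token, so lower values mean the model assigns higher probability to the observed words. The sketch below uses made-up token probabilities rather than output from a fitted LDA model.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the negative mean log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Toy cases: a model that spreads probability over 8 equally likely words,
# and a sharper model that gives each observed word probability 1/2.
uniform = [math.log(1 / 8)] * 20
sharper = [math.log(1 / 2)] * 20
```

Here `perplexity(uniform)` is about 8 and `perplexity(sharper)` is about 2, which matches the intuition of perplexity as an effective number of equally likely choices per word.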

We got some very interesting results from LDA clustering with n=25. The clusters are illustrated in the interactive graph below. By visual inspection, we can guess that cluster 17 is related to “party”, with the most relevant words “dance”, “party”, “boys”, “hot”, “hey”, etc., and that cluster 9 is related to “religion”, with the most relevant words “god”, “heaven”, “Jesus”, “lord”, “holy”, etc. It seems that we can get quite good clusters for these special topics even with an unsupervised learning method. However, topics like “love” tend to include multiple emotions, both happy and sad. These results suggest that the information encoded in lyric data might be more powerful for predicting theme than emotion.

More formally, we look at the songs that belong to each cluster and check the tags associated with those songs. To match a topic to a cluster, we simply take the majority vote over the song tags. We then manually removed some non-informative cluster topics and ended up with 17 selected tags suggested by LDA.
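
The majority-vote matching can be sketched as follows. The song identifiers, tags, and the function name `label_clusters` are hypothetical, and the sketch assumes each song has already been assigned to its dominant LDA cluster.

```python
from collections import Counter

def label_clusters(song_clusters, song_tags):
    """Assign each cluster the most common tag among its songs."""
    votes = {}
    for song, cluster in song_clusters.items():
        votes.setdefault(cluster, Counter()).update(song_tags.get(song, []))
    # Clusters whose songs carry no tags at all are left unlabeled.
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items() if counter}

# Toy assignments and tags, made up for illustration.
song_clusters = {"s1": 17, "s2": 17, "s3": 9, "s4": 9, "s5": 9}
song_tags = {
    "s1": ["party", "dance"],
    "s2": ["party"],
    "s3": ["gospel"],
    "s4": ["gospel", "rock"],
    "s5": ["rock", "gospel"],
}
labels = label_clusters(song_clusters, song_tags)
```

In this toy data, cluster 17 is labeled "party" and cluster 9 is labeled "gospel". A manual review step, as described above, would then discard labels that carry little information.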

Finally, we re-ran the baseline model with the new responses and achieved a test accuracy of 19%. Although this accuracy is not particularly unsatisfactory given this many response classes, we would like a better accuracy that can be helpful when tagging a single track or making song recommendations. This encourages us to explore more informative predictors with Word2Vec and a better predictive model such as LSTM.

2. Improve the modeling approach on both the predictors and the model:

Predictors - Create word embeddings as more informative predictors, instead of using simple word counts.

Word2Vec is a shallow neural network model with a powerful ability to uncover the underlying semantic and logical relationships between words. It projects words into a lower-dimensional continuous vector space, where the hidden layer captures the joint distribution of the words.

To create word vector embeddings from the original lyrics, we first attempted to train our own Word2Vec model on the lyrics corpus of over 80,000 songs we had scraped, but the resulting word embedding was not very powerful. Major constraints include the limited variety of words and sentence structures in lyrics, and such a model does not seem to capture semantic relationships as well as we expected. A visualization of our self-trained Word2Vec model is shown below; the word vectors are concentrated in only certain regions of the space, and the coverage of words is low. The picture below is a screenshot of the resulting embedding, and the full dynamic version can be accessed at http://projector.tensorflow.org/?config=https://gist.githubusercontent.com/anonymous/22b39cbd7463081d720b0aca73d90be8/raw/e0e812ea9bb480d99724fad9c94d59301e1f3a70/tensor.json if interested.

In contrast, we can look at a relatively informative and well-trained Word2Vec visualization at http://projector.tensorflow.org. Two screenshots are again shown below. We should expect to see more words and better language structure captured by the model. For instance, if we search for "happy", we should see related keywords like "funny" and "lucky" ranked near the top in closeness. As another example, if we look up "Berlin", we should see not only "Germany" but also "Paris" and "Vienna" in the neighbor set, suggesting that Word2Vec is able to capture some logical relationships between words.
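
Closeness in an embedding space is commonly measured with cosine similarity. The sketch below shows a nearest-neighbor lookup on that basis; the 3-dimensional vectors are made up for illustration and are not taken from any trained model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [(cosine(query, vec), w) for w, vec in vectors.items() if w != word]
    return [w for _, w in sorted(scored, reverse=True)[:k]]

# Toy vectors, hand-made so that the city names point in similar directions.
toy_vectors = {
    "berlin": [0.9, 0.8, 0.1],
    "paris": [0.8, 0.9, 0.1],
    "vienna": [0.85, 0.7, 0.2],
    "happy": [0.1, 0.2, 0.9],
}
neighbors = nearest("berlin", toy_vectors)
```

With these toy vectors, the two nearest neighbors of "berlin" are the other two cities, while "happy" is far away.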

After some research, we chose a model pre-trained on the Google News dataset, which contains approximately 100 billion words; the model maps each word to a 300-dimensional vector. We use this model to transform all the original song lyrics into the desired vector space and use the resulting vectors as the new predictors.
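
Transforming lyrics with a pre-trained embedding is essentially a table lookup per word. The sketch below assumes the embedding is available as a plain word-to-vector dictionary and that unknown words are mapped to zero vectors; both choices, as well as the toy 3-dimensional vectors and the name `embed_lyrics`, are assumptions for illustration.

```python
def embed_lyrics(tokens, vectors, dim):
    """Replace each token with its vector; unknown words become zero vectors."""
    zero = [0.0] * dim
    return [vectors.get(tok, zero) for tok in tokens]

# Toy embedding table, made up for illustration (a real one would use dim=300).
toy_vectors = {"love": [0.2, 0.7, 0.1], "night": [0.6, 0.1, 0.3]}
sequence = embed_lyrics("love tonight night".split(), toy_vectors, dim=3)
```

The result is one vector per word in lyric order, which is the form a sequence model expects as input.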

Model - Use an advanced predictive model that is better suited to language analysis

Long Short-Term Memory is a recurrent neural network (RNN) architecture. With loops in the network, RNNs take as input not just the current example but also what they perceived earlier in time. For language processing, the challenge is that sometimes the context is clear from nearby information alone, so we only need the recent past to make sense of the current meaning, while at other times we need further context from earlier text. The structure of LSTM memory units solves this problem of an unknown time lag in the network. Moreover, generic RNNs can be very computationally intensive, as adding loops increases computation time exponentially. LSTM is much more efficient and is thus able to learn over very long time steps.
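
To make the role of the memory unit concrete, here is a minimal sketch of one time step of a standard LSTM cell, reduced to scalars for readability. The weights are hand-picked for illustration (not learned), and the function and gate names are hypothetical. With the forget gate nearly open and the input gate nearly closed, the cell state survives many steps almost unchanged, which is how information can be carried across a long time lag.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hand-picked weights: forget gate close to 1, input gate close to 0.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8] * 10:  # 30 time steps of toy input
    h, c = lstm_step(x, h, c, weights)
```

After 30 steps the cell state `c` is still close to its initial value of 1.0. In a trained network the gates are functions of the input and the previous hidden state, so the cell can decide per step what to keep and what to overwrite.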

Our model set-up generally follows a previous LSTM-based IMDB movie review model described by Brownlee. We take the first 150 words of each song, use the 300-dimensional Word2Vec model to create word embeddings for the lyrics, pass them through a hidden layer of 100 LSTM memory units, and finish with a softmax activation layer mapped to the 17 target tags. Because of time constraints, as we were approaching the end of the year when we finalized our modeling solution, we trained for 3 epochs overall, at which point the trade-off between the increase in validation set accuracy and the additional training time became less favorable. The model could benefit from further parameter tuning and longer training, but overall we find the final model a good prototype for illustrating the topical and sentiment information available in lyrics and how to extract it.
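
The sizes quoted above are enough to estimate how large the trainable part of such a network is. The sketch below is back-of-the-envelope arithmetic, assuming a standard LSTM layer (four gates, each with input weights, recurrent weights, and a bias, no peephole connections) and a fully connected softmax layer, with the pre-trained embeddings kept fixed. The helper names and the padding token are hypothetical.

```python
def lstm_params(input_dim, units):
    """Trainable weights of a standard LSTM layer (four gates, no peepholes)."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, outputs):
    """Weights plus biases of a fully connected output layer."""
    return input_dim * outputs + outputs

def pad_or_truncate(tokens, max_len=150, pad="<pad>"):
    """Keep the first max_len tokens, padding short songs to the same length."""
    return (tokens + [pad] * max_len)[:max_len]

lstm = lstm_params(input_dim=300, units=100)   # 300-d inputs, 100 memory units
dense = dense_params(input_dim=100, outputs=17)  # 100 units -> 17 tags
fixed = pad_or_truncate("la la la".split())
```

Under these assumptions the LSTM layer has 160,400 weights and the output layer 1,717, so the model is small compared with the embedding table it reads from.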

As discussed above, our final model is Word2Vec+LSTM, which we used to predict the 17 tags mostly suggested by LDA.

MUSIC FOR ALL

SPOTIFY