
Previous work on lyrics text processing has elaborated on the unique challenges researchers encounter. Lyric texts, unlike prose or news articles, tend to be short, repetitive, and often ambiguous, and the literature on lyrics analysis broadly agrees on the limitations of lyrics data and content. One of the earliest works on extracting topics from lyric corpora is Logan et al. (2004), who explored latent semantic analysis (LSA) on a lyric corpus and found that lyrics can be used to uncover natural genre clusters. However, the accuracy of the method was low, and Logan et al. concluded that classification could benefit from the additional information carried by acoustic features. Streck et al. (2014) applied latent Dirichlet allocation (LDA) to a subset of the Million Song Dataset and investigated how well automatically generated topics fit manual topic annotations.
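
To make the approach concrete, the sketch below applies LDA to a toy lyric corpus using scikit-learn; the corpus, the number of topics, and the parameter values are illustrative assumptions, not the setup used in the works cited above.

# A minimal sketch of LDA topic extraction on a toy lyric corpus (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

lyrics = [
    "love me tender love me true",
    "dancing all night under neon lights",
    "broken heart and lonely nights",
    "party lights dancing till the morning",
]

# LDA operates on word counts, so the corpus is vectorized first.
counts = CountVectorizer().fit_transform(lyrics)

# n_components is the number of latent topics to learn.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)        # (4 documents, 2 topics)
print(lda.components_.shape)   # (2 topics, vocabulary size) word weights per topic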

 

More recently, Lukic (2015) compared the most widely used topic modeling approaches and pointed out that the lack of topical alignment between the unsupervised and supervised topics was likely a symptom of insufficient data in the human-annotated validation set.

 

One challenge often cited in language analysis is the high dimensionality of text representations produced by one-hot encoding or word count approaches. McCormick (2017) introduced two vectorization methods in Document Clustering Example in SciKit-Learn. The hashing vectorizer is a fast and space-efficient way to vectorize features: it hashes each word to a vector index and stores the word counts using feature hashing. One drawback of this methodology is the risk of hash collisions, but this is a relatively minor concern because the chance of collision tends to be low, and the method works well in practice.
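
As a rough illustration of feature hashing, the sketch below uses scikit-learn's HashingVectorizer on two made-up lyric lines; the number of features and the example texts are assumptions for demonstration only.

# A minimal sketch of hashing-based vectorization for lyric text (scikit-learn assumed).
from sklearn.feature_extraction.text import HashingVectorizer

lyrics = [
    "love me tender love me true",
    "dancing all night under neon lights",
]

# Each token is hashed to one of n_features column indices and its count is
# stored there, so no in-memory vocabulary dictionary is needed.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
X = vectorizer.transform(lyrics)

print(X.shape)  # (2, 262144) sparse matrix; distinct words only rarely collide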

 

Mikolov et al. (2013a) introduced the word2vec algorithm, which addresses the high dimensionality of text data by projecting the universe of words into a continuous vector space of lower dimension. Neural network language models (NNLMs) had already been explored by many researchers: prior to word2vec, Bengio et al. (2003) documented the popular approach of a feedforward neural network with a linear projection layer and a non-linear hidden layer, used to jointly learn the word vector representation and a statistical language model. Mikolov et al. (2013a) introduced the Skip-gram model, an efficient method for learning high-quality vector representations of words from large amounts of unstructured text data, and showed that their methodology learns many syntactic and semantic regularities with high accuracy while producing higher-quality output at much lower computational cost than previous methods such as LDA or NNLM. Mikolov et al. (2013b) further introduced several extensions to the original algorithm that improve both the quality of the vectors and the training speed. That paper also presents a simple method for finding phrases in text and shows that learning good vector representations for millions of phrases is possible.
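
As an illustration only, the sketch below trains a Skip-gram word2vec model on a toy lyric corpus with the gensim library (version 4.x API assumed); the corpus and every parameter value are placeholders rather than the settings from the papers above.

# A minimal Skip-gram word2vec sketch on a toy lyric corpus (gensim 4.x assumed).
from gensim.models import Word2Vec

sentences = [
    ["love", "me", "tender", "love", "me", "true"],
    ["lonely", "night", "without", "your", "love"],
    ["dancing", "all", "night", "under", "neon", "lights"],
]

# sg=1 selects the Skip-gram architecture (sg=0 would be CBOW);
# vector_size is the dimensionality of the learned embedding space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Words that appear in similar contexts end up close together in the vector space.
print(model.wv.most_similar("night", topn=3))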

 

Long Short-Term Memory (LSTM) is a recurrent neural network architecture proposed by Hochreiter and Schmidhuber (1997). Recurrent networks are artificial neural networks with loops, which allow them to take as input not just the current example but also what they perceived earlier in the sequence. This chain-like structure is especially appropriate for sequential data, including language. By adding a final softmax layer, we can take advantage of the chain-like structure of the LSTM and use the model for tag classification.
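
A rough sketch of this idea, assuming PyTorch and an integer-encoded lyric corpus, is shown below; the vocabulary size, hidden dimension, and tag count are illustrative placeholders rather than values used in this project.

# A minimal sketch of an LSTM tag classifier with a final softmax (PyTorch assumed).
import torch
import torch.nn as nn

class LyricsTagger(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=128, num_tags=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded lyric tokens
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)          # final hidden state summarizes the sequence
        logits = self.classifier(hidden.squeeze(0))   # (batch, num_tags)
        return torch.softmax(logits, dim=-1)          # probability distribution over tags

model = LyricsTagger()
dummy_batch = torch.randint(0, 10000, (4, 20))  # 4 songs, 20 tokens each
print(model(dummy_batch).shape)                  # torch.Size([4, 10])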

References

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.

Brown, W. Taylor. "Introduction to Word Embedding Models with Word2Vec." https://taylorwhitten.github.io/blog/word2vec

Graves, Alex. "Supervised Sequence Labelling with Recurrent Neural Networks." http://www.cs.toronto.edu/~graves/preprint.pdf

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long Short-Term Memory." Neural Computation 9.8 (1997): 1735-1780. http://www.bioinf.jku.at/publications/older/2604.pdf

Mikolov, Tomas, et al. "Efficient Estimation of Word Representations in Vector Space." (2013). https://arxiv.org/pdf/1301.3781.pdf

Rong, Xin. "word2vec Parameter Learning Explained." (2014). https://arxiv.org/pdf/1411.2738.pdf

Sievert, Carson, and Kenneth E. Shirley. "LDAvis: A Method for Visualizing and Interpreting Topics." Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces (2014).

Olah, Christopher. "Understanding LSTM Networks." (2015). http://colah.github.io/posts/2015-08-Understanding-LSTMs/
