The objective of this project is to expand my music selection through logistic regression. Typically I explore music similar to what I already listen to, but I want to see if machine learning can build a playlist of songs outside my usual sphere that I still have a chance of enjoying.
Metrics for Success: Although accuracy is important for the model, I will determine the success of this project based on two things: music exposure and likability. For exposure, I ideally want a playlist entirely of artists new to me, in genres I would not normally listen to. Ideal likability would be at least 50%, since my intention is to compose the playlist of songs I may or may not like.
The difficult part will be categorizing songs as Liked, Might Like, or Don't Like, since I can tell which songs I like, but which songs I dislike is not as clear. To categorize the music, I will have to do some data exploration and define these categories.
The data I am using consists of two separate CSV files. The first file I created by accessing the Spotify API and gathering my top 50 tracks from the past 6 months; documentation on this API endpoint can be read here. The second file comes from a Kaggle dataset a user uploaded which contains about 233,000 tracks.
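As a rough sketch, the pull looked something like the following using the spotipy wrapper (the client credentials and output file name are placeholders, and spotipy itself is just one convenient way to call the endpoint):

import spotipy
import pandas as pd
from spotipy.oauth2 import SpotifyOAuth

# Authenticate with the scope needed to read top tracks
# (client ID, secret, and redirect URI are placeholders)
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    redirect_uri='http://localhost:8888/callback',
    scope='user-top-read'))

# 'medium_term' covers roughly the last 6 months
top = sp.current_user_top_tracks(limit=50, time_range='medium_term')

# Grab the audio features for each track and save them
track_ids = [t['id'] for t in top['items']]
pd.DataFrame(sp.audio_features(track_ids)).to_csv('my_top_tracks.csv', index=False)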
All of the code for this project can be found at this GitHub repository. Check out some of my other work here.
Begin by importing all necessary modules and getting a look at the data.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stat
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
%matplotlib inline
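Loading the two files and pulling descriptive statistics looks like this (the file names are placeholders for the two CSVs described earlier):

# My top-50 tracks and the ~233,000-track Kaggle set
my_tracks = pd.read_csv('my_top_tracks.csv')
all_tracks = pd.read_csv('spotify_tracks.csv')

# Descriptive statistics across the audio features
my_tracks.describe()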
Simply observing the descriptive statistics, the first feature that caught my eye is instrumentalness, due to the apparently large gap between the 50th and 75th percentiles.
By visualizing the distribution, it appears I largely favor songs on the low end of instrumentalness, which will be helpful in defining songs that I like.
The speechiness distribution doesn't give too much insight on its own, since Spotify's documentation states that "values below 0.33 most likely represent music and other non-speech-like tracks" and the entire distribution falls below this value. Even so, I will use this measurement to help define songs I will like.
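Each of the distribution plots in this section follows the same basic pattern; for instrumentalness, for example:

# Histogram (with a kernel density overlay) of one audio feature
sns.histplot(my_tracks['instrumentalness'], bins=20, kde=True)
plt.title('Instrumentalness Distribution')
plt.xlabel('instrumentalness')
plt.show()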
Spotify documentation defines danceability as a measurement of how suitable a track is for dancing, with 0.0 being least danceable and 1.0 being most danceable. This measurement is based on features such as tempo, regularity, rhythm stability, and beat strength; however, it is not clear how each of these affects the final calculation.
It appears my personal preference for danceability follows an approximately normal distribution, so I think standard deviations will be especially useful in defining songs I like, might like, and don't like.
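Since the distribution is roughly normal, a band one standard deviation either side of the mean is a natural starting point (the width of the band is my own choice here, not a fixed rule):

# "Typical" danceability range for songs I like: mean ± 1 std
mean_dance = my_tracks['danceability'].mean()
std_dance = my_tracks['danceability'].std()
dance_low, dance_high = mean_dance - std_dance, mean_dance + std_dance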
The histogram and quantile lines show that I favor songs on the lower end of the tempo range with 75% of the songs being between 75 BPM and just above 125 BPM.
From the visualization, I think the percentile values will be good measurements for defining songs I like.
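The quartile cutoffs come straight from the data:

# 25th, 50th, and 75th percentile tempo values (BPM)
tempo_quartiles = my_tracks['tempo'].quantile([0.25, 0.50, 0.75])
print(tempo_quartiles)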
One other interesting observation I noted is despite tempo being one of the factors in danceability, the distributions are different. It's possible this indicates tempo actually plays a small role in the calculation of danceability.
The distribution for loudness is negatively skewed, so I think the first and third quartile values will be most useful in accurately defining my preferences.
According to Spotify, loudness values typically range between -60 and 0 decibels, so my preference for loudness falls toward the higher end of that range.
Spotify defines energy as a "measure of intensity and activity", with typical energetic tracks being fast, loud, and noisy.
As seen from the positively skewed distribution, I tend to prefer songs which are on the lower end of the energy spectrum with 75% of these songs being less than 0.6 on a scale from 0.0 to 1.0.
Despite my preference for louder songs, it makes sense that I tend toward lower-energy tracks: energy also accounts for how fast a song is, and the tempo distribution showed that I prefer slower songs.
Valence is described as a measurement of how positive a song sounds ranging from 0.0 to 1.0, where values closer to 0 represent songs which convey sadness and values closer to 1 represent songs which sound more cheerful. However, Spotify documentation does not give any more details on the actual calculation of this metric.
As for my personal preference, it looks like I favor songs which convey more positivity, with 75% of the songs scoring greater than 0.5 and a mean a little above 0.6.
Spotify gives duration in milliseconds; however, I decided to convert to minutes:seconds and group the songs for better interpretability. At first I noticed a large portion of songs in the 3- to 5-minute range, but I was hesitant to attribute duration as a determinant of my liking a song.
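The conversion and grouping can be done with pandas, assuming the column is named duration_ms as Spotify returns it:

# Convert milliseconds to minutes and bucket into one-minute bins
my_tracks['duration_min'] = my_tracks['duration_ms'] / 60000
bins = range(0, int(my_tracks['duration_min'].max()) + 2)
my_tracks['duration_group'] = pd.cut(my_tracks['duration_min'], bins=bins)
print(my_tracks['duration_group'].value_counts().sort_index())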
To test this skepticism, I observed the distribution of a much larger sample of songs, and in terms of duration my liked songs are distributed similarly to the larger sample.
In addition to observing the different distributions, I conducted a Chi-Squared test at a 95% confidence level with the null hypothesis:
H0 : The two distributions are independent of one another, and duration can be attributed to my liking a song
and the alternative hypothesis:
H1 : The two distributions are not independent, and my liked songs follow the same duration distribution as the larger population
Since the p-value is less than .05, I can reject the null hypothesis and ignore duration as a determinant of whether I like a song.
Performing a similar test for the key and mode features:
The p-value is greater than .05 for key and less than .05 for mode, so I will take my preference for key into account but not mode.
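A sketch of how these tests look with scipy, using duration as the example (this assumes a combined frame flagging which rows are my liked tracks):

# Contingency table of duration group vs. liked / larger sample
combined = pd.concat([my_tracks.assign(liked=True),
                      all_tracks.assign(liked=False)])
combined['duration_group'] = pd.cut(combined['duration_ms'] / 60000,
                                    bins=range(0, 12))
table = pd.crosstab(combined['duration_group'], combined['liked'])

# Chi-squared test of independence at the 95% confidence level
chi2, p_value, dof, expected = stat.chi2_contingency(table)
print(f'chi2 = {chi2:.2f}, p-value = {p_value:.4f}')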
Some of the higher correlations in the data involve valence, so I will look into those relationships first.
Although the correlation coefficients show moderate correlation, there is still a good amount of visible variability, indicating these features do not have a linear relationship strong enough to significantly affect the regression model.
Looking at loudness versus energy, there is a linear relationship, but again it is not strong enough to be considered collinear.
Another interesting observation is the tempo-danceability relationship seen which confirms my earlier suspicion of tempo's small role in danceability.
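The correlation values come from the standard pairwise matrix:

# Pairwise correlations across the numeric audio features
feature_cols = ['danceability', 'energy', 'loudness', 'valence',
                'tempo', 'instrumentalness', 'speechiness']
sns.heatmap(my_tracks[feature_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Audio Feature Correlations')
plt.show()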
Now I will have to categorize the tracks based on what I found from the data.
0 will represent a song I don't like, 1 will represent a song I might like, and 2 will represent a song I like.
The criteria for a song I like are as follows:
The criteria for a song I might like are as follows:
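Expressed as code, a labeling function along these lines might look like the sketch below; the threshold values are illustrative stand-ins drawn from the exploration above, not my exact criteria:

def label_track(row):
    # Thresholds here are illustrative only; the real cutoffs come from
    # the quartiles and standard-deviation bands found during exploration
    like = (row['instrumentalness'] < 0.33 and
            dance_low <= row['danceability'] <= dance_high and
            row['valence'] > 0.5)
    might_like = row['instrumentalness'] < 0.33 and row['valence'] > 0.35
    if like:
        return 2  # Like
    if might_like:
        return 1  # Might Like
    return 0      # Don't Like

all_tracks['label'] = all_tracks.apply(label_track, axis=1)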
After some preprocessing, the data was ready for training and testing. The initial logistic regression model gave poor results, so I adjusted the number of training iterations and the class weights to get the (temporary) final model.
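The adjustments amounted to raising the iteration cap and re-weighting the classes, roughly like this (the exact split, weights, and iteration count here are stand-ins):

# Train/test split, then fit with class weights to offset the heavy
# imbalance toward the Don't Like class
X = all_tracks[feature_cols]
y = all_tracks['label']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)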
From the confusion matrix as well as the precision and recall scores, this model didn't perform too well. Looking at the matrix and the low recall, the model produces a lot of false negatives, especially for the Might Like class, which is the main target for this project.
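These scores come from the standard sklearn metrics:

# Per-class precision and recall, plus the confusion matrix
y_pred = model.predict(X_test)
print('precision:', precision_score(y_test, y_pred, average=None))
print('recall:   ', recall_score(y_test, y_pred, average=None))

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()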
Still, I don't want to get too caught up on these metrics compared to the results of the actual playlist.
After training the model and using it to predict on a separate dataset, I extracted 50 songs from the Might Like category and created a playlist through Spotify's API.
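With spotipy, the playlist creation is roughly two calls; this sketch assumes a new_tracks frame of unseen songs carrying Spotify track IDs in an 'id' column, and it requires the playlist-modify-private scope:

# Score the separate dataset and keep 50 tracks predicted as Might Like
preds = model.predict(new_tracks[feature_cols])
might_like_ids = new_tracks.loc[preds == 1, 'id'].head(50)

# Create an empty playlist and fill it
user_id = sp.current_user()['id']
playlist = sp.user_playlist_create(user_id, 'Might Like',
                                   public=False,
                                   description='Model-generated picks')
sp.playlist_add_items(playlist['id'], might_like_ids.tolist())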
Looking through the 50 tracks, there were 9 artists I already knew of and 1 track which I would consider part of my active listening.
Another interesting thing I noted is that 6 of the tracks in the playlist are actually segments from stand-up specials, which could indicate the criteria I set for speechiness was too high.
After subtracting the 6 stand-up segments (which I did enjoy) and the 1 song already in my active listening, the total number of tracks to evaluate went down to 43. Of those 43, I ended up liking 19 of the songs, just a little less than half.
I ended up falling short of the metrics I set; however, I still discovered some good music outside my normal listening.
Some causes of the underwhelming performance could be the small sample size used for defining the classes, insufficient hyperparameter tuning, and the simple difficulty of classifying something as subjective as taste.
In the future it could be worth exploring other classification algorithms, such as decision trees, or even digging into clustering for this use case.