Building a Spotify Recommendation System
In this post I’ll walk through the steps I took to build my own Spotify recommendation system, which automatically adds new playlists to my account based on the songs I like.
Using the Spotify API and gathering data
Importing all the packages we are going to need:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
Defining personal information that we will later need in order to work with the API:
client_ID ='YOUR_CLIENT_ID'
client_SECRET='YOUR_CLIENT_SECRET'
redirect_URL='http://localhost:9000'
user = 'YOUR_USERNAME' #you can get this by clicking on your account at the top right corner of spotify app/web
def authentication(scope):
    # Build a Spotify client authorized for the requested scope
    return spotipy.Spotify(auth_manager=SpotifyOAuth(client_id=client_ID,
                                                     client_secret=client_SECRET,
                                                     redirect_uri=redirect_URL,
                                                     scope=scope))
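Authenticating and getting the ids of all my playlists:
# note: private playlists are only returned when the token has the 'playlist-read-private' scope
scope = "user-read-recently-played"
sp = authentication(scope)
playlists = sp.current_user_playlists()
playlists_ids = []
for item in playlists['items']:
    playlists_ids.append(item['id'])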
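One thing to keep in mind: current_user_playlists returns a single page of results, so an account with many playlists needs paging. Here is a minimal sketch of that idea, not the code I ran for this post. The helper name collect_all_items is hypothetical, and fetch_next is assumed to behave like spotipy's sp.next, returning the next page or None when there is none. A small fake API stands in for Spotify so the example runs offline.

```python
def collect_all_items(first_page, fetch_next):
    """Gather the 'items' of every page of a paged API response.

    fetch_next is assumed to take the current page and return the next
    page, or None when there are no more pages.
    """
    items = []
    page = first_page
    while page is not None:
        items.extend(page['items'])
        page = fetch_next(page)
    return items


# Tiny stand-in for the API so the sketch runs without network access
fake_pages = {
    'p1': {'items': [{'id': 'a'}, {'id': 'b'}], 'next': 'p2'},
    'p2': {'items': [{'id': 'c'}], 'next': None},
}


def fake_next(page):
    # Return the page named by 'next', or None on the last page
    return fake_pages[page['next']] if page['next'] else None


all_ids = [item['id'] for item in collect_all_items(fake_pages['p1'], fake_next)]
print(all_ids)
```

With a real client, the call would presumably look like collect_all_items(sp.current_user_playlists(), sp.next).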
Getting a list of all track ids from my playlists:
all_tracks = []
# Client-credentials auth is enough to read playlist contents, so create the client once
auth_manager = SpotifyClientCredentials(client_ID, client_SECRET)
sp = spotipy.Spotify(auth_manager=auth_manager)
for pid in playlists_ids:
    playlist_tracks = sp.playlist_items(playlist_id=pid, fields='items,name,id')
    for item in playlist_tracks['items']:
        track = item['track']
        # Keep the first listed artist, the track name, the track id and the source playlist
        all_tracks.append((track['artists'][0]['name'], track['name'], track['id'], pid))
all_tracks = list(map(list, all_tracks))
Converting the list to a dataframe and adding audio features to each track, such as acousticness, energy, instrumentalness, liveness and loudness:
# Feature names, taken from the first track's response
keys = (sp.audio_features(all_tracks[0][2]))[0].keys()
for i in range(len(all_tracks)):
    r = sp.audio_features(all_tracks[i][2])
    # Append this track's feature values to its row
    all_tracks[i] = all_tracks[i] + list(r[0].values())
df = pd.DataFrame(all_tracks, columns=['artist_name', 'track_name', 'track_id', 'playlist_id'] + list(keys))
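Calling audio_features once per track means one HTTP request per song. The endpoint also accepts a list of track ids (up to 100 per call, to my knowledge), so the requests can be batched. Below is a rough sketch of the batching logic only. fetch_features_in_batches is a hypothetical helper, and the fetch argument stands in for sp.audio_features; a fake function replaces it here so the example runs offline.

```python
def fetch_features_in_batches(track_ids, fetch, batch_size=100):
    """Call `fetch` on slices of at most `batch_size` ids and join the results."""
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        features.extend(fetch(batch))
    return features


# Stand-in for sp.audio_features: records each batch size
# and returns one made-up feature dict per id.
batch_sizes = []


def fake_fetch(batch):
    batch_sizes.append(len(batch))
    return [{'id': tid, 'energy': 0.5} for tid in batch]


ids = ['track_{}'.format(i) for i in range(250)]
features = fetch_features_in_batches(ids, fake_fetch)
print(len(features), batch_sizes)
```

For 250 tracks this makes three requests instead of 250, which matters once the playlists get large.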
Instead of rating my tracks by hand, I gave a rating of 1 to all tracks in one specific playlist and a rating of 0 to all the others.
df['ratings'] = df['playlist_id'].apply(lambda x: 1 if x == '37i9dQZF1DWYbUY40ZDAwb' else 0)
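Because the positive class is a single playlist, the labels are probably imbalanced, which is worth checking before training. A small sketch on made-up data: the playlist ids 'liked' and 'other_*' are placeholders, not real Spotify ids, and the vectorized comparison is meant to be equivalent to the apply call.

```python
import pandas as pd

# Toy frame with placeholder playlist ids
toy = pd.DataFrame({'playlist_id': ['liked', 'other_1', 'other_2', 'liked', 'other_1']})

# 1 for tracks from the liked playlist, 0 for everything else
toy['ratings'] = (toy['playlist_id'] == 'liked').astype(int)

# Share of each class, to see how imbalanced the labels are
balance = toy['ratings'].value_counts(normalize=True)
print(balance)
```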
Splitting the data into train and test sets:
y = df.iloc[:, 22:23]
X1 = df.iloc[:, 4:14]
X2 = df.iloc[:, 20:22]
X = pd.concat([X1,X2], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
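Positional slices such as 4:14 break silently if the column order ever changes. A hedged alternative is to select the features by name and to stratify the split on the label. The column list below is my reading of what those positions correspond to, assuming the audio-features response keeps its usual key order; the toy values are random and only there to make the sketch runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed names of the columns picked by the positional slices
feature_cols = ['danceability', 'energy', 'key', 'loudness', 'mode',
                'speechiness', 'acousticness', 'instrumentalness',
                'liveness', 'valence', 'duration_ms', 'time_signature']

# Random toy data with 10 positive and 40 negative labels
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, len(feature_cols))), columns=feature_cols)
toy['ratings'] = [1] * 10 + [0] * 40

X_toy = toy[feature_cols]
y_toy = toy['ratings']

# stratify keeps the share of positives the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy)
print(y_tr.mean(), y_te.mean())
```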
The ML models
Choosing the optimal number of features for the random forest classifier:
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rfecv = RFECV(estimator=rf, step=1, n_jobs=-1, cv=StratifiedKFold(2), verbose=1, scoring='roc_auc')
rfecv.fit(X_train, y_train.values.ravel())
print("Optimal number of features: {}".format(rfecv.n_features_))
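RFECV reports how many features it kept, but it can also tell which ones. A minimal sketch on synthetic data, assuming a much smaller forest than the one in this post to keep it fast; the column names f0 to f5 are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the track features
X_arr, y_arr = make_classification(n_samples=120, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(6)])

selector = RFECV(estimator=RandomForestClassifier(n_estimators=20, random_state=42),
                 step=1, cv=StratifiedKFold(2), scoring='roc_auc')
selector.fit(X_toy, y_arr)

# support_ is a boolean mask over the input columns
selected = list(X_toy.columns[selector.support_])
print(selected)

# Keep only the selected columns for later training
X_reduced = X_toy[selected]
```

In the post's setting, the same mask could be applied to X_train and X_test before fitting the final model.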
Rescaling all the features to be between 0 and 1:
minmax_scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(minmax_scaler.fit_transform(X_train), columns=X_train.columns)
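Scaling matters most for distance-based models like k-nearest-neighbors, where a column such as duration_ms would otherwise dominate the distances. One way to make sure the same scaling is applied at fit time and at prediction time is to put the scaler and the model in a scikit-learn Pipeline. This is a sketch on synthetic data, not the setup used for the results in this post; the step names 'scale' and 'knn' are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy[:, 0] *= 100000  # mimic a large-scale column such as duration_ms
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

# The scaler is fit on the training folds only and reused when predicting
pipe = Pipeline([('scale', MinMaxScaler()), ('knn', KNeighborsClassifier())])
params = {'knn__n_neighbors': range(1, 10)}
search = GridSearchCV(pipe, params, cv=StratifiedKFold(2), scoring='roc_auc')
search.fit(X_tr, y_tr)

print(search.best_params_, search.score(X_te, y_te))
```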
Implementing the random forest classifier:
rfc = RandomForestClassifier(n_estimators=1000, random_state=42)
rfc_gcv_parameters = {'min_samples_leaf': [1, 3, 5, 8], 'max_depth': [3, 4, 5, 8, 12, 16, 20],}
rfe_gcv = GridSearchCV(rfc, rfc_gcv_parameters, n_jobs=-1, cv=StratifiedKFold(2), verbose=1, scoring='roc_auc')
rfe_gcv.fit(X_train, y_train.values.ravel())
rfe_gcv.best_estimator_, rfe_gcv.best_score_
print('random forest classifier:')
print(classification_report(y_test.values.ravel(), rfe_gcv.predict(X_test)))
Implementing the k-nearest-neighbors classifier:
knn = KNeighborsClassifier(n_jobs=-1)
knn_gcv_params = {'n_neighbors': range(1, 10)}
knn_gcv = GridSearchCV(knn, knn_gcv_params, n_jobs=-1, cv=StratifiedKFold(2), verbose=1, scoring='roc_auc')
knn_gcv.fit(X_train, y_train.values.ravel())
knn_gcv.best_params_, knn_gcv.best_score_
print('knn:')
print(classification_report(y_test.values.ravel(), knn_gcv.predict(X_test)))
Using the Random Forest model to recommend:
I used a Kaggle dataset (https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify?select=genres_v2.csv) as the pool of candidate tracks, from which the model picks the tracks with the highest predicted probability of belonging in my playlist.
spotify_dataset = pd.read_csv(r'address_of_your_file')
X = X[X.columns.intersection(spotify_dataset.columns)]
X_recommend = spotify_dataset.copy()
X_recommend = X_recommend[X_recommend.columns.intersection(X.columns)]
spotify_dataset['ratings'] = rfe_gcv.predict(X_recommend)
spotify_dataset['prob_ratings'] = rfe_gcv.predict_proba(X_recommend)[:,1]
result = (spotify_dataset[spotify_dataset['ratings'] == 1].sort_values(by='prob_ratings', ascending=False))
result = result.drop_duplicates(subset=['song_name'], keep='last')
print(result[['song_name', 'id']].head(20))
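The ranking step relies on the candidate table having the same feature columns, in the same order, as the training data. A small sketch of making that explicit by indexing with the training column list. LogisticRegression stands in for the tuned random forest only to keep the example fast, and all values and song names are made up.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data: high energy and valence means "liked"
train = pd.DataFrame({'energy': [0.1, 0.2, 0.8, 0.9],
                      'valence': [0.2, 0.1, 0.9, 0.7]})
labels = [0, 0, 1, 1]
feature_cols = list(train.columns)
model = LogisticRegression().fit(train, labels)

# Candidate table with an extra column and a different column order
candidates = pd.DataFrame({'song_name': ['A', 'B', 'C'],
                           'valence': [0.9, 0.1, 0.5],
                           'energy': [0.95, 0.05, 0.6]})

# Indexing with feature_cols restores the training column order
candidates['prob_ratings'] = model.predict_proba(candidates[feature_cols])[:, 1]
top = candidates.sort_values(by='prob_ratings', ascending=False)
print(top[['song_name', 'prob_ratings']])
```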
The output:
       song_name                                id
14041  Bet                                      1hv5qektVlfqlu1P858s5G
21346  Ride                                     0yNi8hNyv0DzMVFtKIAP1S
20300  RACECAR                                  7JdarpX08FLhmfNZpxqgUO
9418   Blackout                                 5TEQB7WfKZZNhNKoaMRiB7
9715   Grandma's Porch.                         7r7O86q51J9iHzfjOcwlTj
9575   Cash App                                 4bMLzfbjYJ9v3wvlpI6wtE
18412  Prisoner (feat. Dua Lipa)                5JqZ3oqF00jkT81foAFvqg
9895   C4                                       3NiqzLj51KIwM9yMfai276
18404  Watermelon Sugar                         6UelLqGlWMcVH1E5c4H7lY
7992   Take One                                 6fI3tBVND8zUXZi9rr2Yps
20390  First Person                             6bvV5L5afKykg819xAIJWt
7583   Let My People Go                         07ZzLIfJvo14UJyhKjN3z4
15873  Cp24                                     356tIL4ewLC8zHQCjhMrrF
6322   Do What I Want                           4IWGnyOHDrVZEtPWfs4s7q
10927  Laid To Rest                             1iDaAHOQvaxWGXx0VMYwAd
287    Analog Keys                              7yMvF3mjdsFStdOiMpiFNx
1126   Over the Rainbow                         5ocuRCDSWiUMZcWI4Utd9g
15130  Out Of Love (feat. Internet Money)       0IJA9KP6rT55jrP1YpTdhx
20067  Creepshow (feat. IDK & Chip tha Ripper)  0iLfMB2S2ilazjeFH91NiT
18420  You Got Me Like                          2oygttOZA8dTFxHevUYGKm
Add the recommended tracks to my account
tracks_to_add = result[['id']].head(20)
scope = "playlist-modify-private"
sp = authentication(scope)
add_new_playlist = sp.user_playlist_create(user=user, name="recommended #1", public=False)
# Pass the ids as a plain Python list rather than a pandas Series
sp.playlist_add_items(playlist_id=add_new_playlist['id'], items=tracks_to_add['id'].tolist())
Final result
All tracks were added to my account successfully, in a newly created playlist. The results are not perfect (the dataset I used does not contain the latest hits, and my model does not take the ‘popularity’ factor into account), but I do like some of the recommended songs, though I may be biased 😉.