INSPIRE: Insightful Spotify Recommendations
James Chen, Max Chen, Shalin Mehta, Brandon Truong, Sam Fuchs
Spotify’s song recommendations can be confusing: compared to other media streaming platforms, Spotify provides little insight into why particular songs are suggested to a user. Netflix, for example, attributes a recommendation to specific content that one has previously watched (i.e. “watch these shows because you have watched this show”), while Spotify simply displays songs “based on this playlist.” Given the wealth of analytics and visualization that Spotify has built around demographic data, song recommendations should not be any less insightful. Our project, INSPIRE, seeks to bridge the gap between song recommendations and the inner workings of a recommendation system by providing additional analysis of why a specific song is recommended to the user.
2. Data Collection - Brandon
Using SpotiPy, a Python client library for the Spotify API, we initially pulled data for 2,000 pop songs. Spotify’s API provides metadata for each song, including aggregate measurements such as danceability, acousticness, speechiness, and loudness; there were 18 features in total. At first, we wanted to explore whether we could make meaningful clusterings based on the features within the pop genre alone, but the resulting clusterings were not insightful. As a result, we decided to request more diverse song data. In the second data pull, we gathered 63,217 songs with the same 18 features. Analysis on this dataset yielded better results, discussed in the following sections.
Figure 1. Visual Depiction of the Feature Matrix
3. Methodology and Results
3.1 Exploratory Data Analysis
The first step involved exploring the distributions of each feature pulled from the Spotify API. We first noticed that there were a few categorical features: mode (i.e. modality, whether a song is in a major or minor key), time signature, and key. The majority of the features (speechiness, danceability, etc.) were ratios on a scale from 0.0 to 1.0. However, two features, tempo and loudness, were not ratios (Figure 2). Tempo was measured in beats per minute and ranged from 0 to 230, while loudness was measured in decibels and ranged from -60 to 0. In order to reasonably use all of these features in a machine learning model, we normalized tempo (by dividing all values by the maximum) and loudness (by dividing all values by the minimum) to effectively create ratios.
Figure 2. Distributions of Tempo and Loudness
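The two normalizations described above can be sketched in a few lines of NumPy; the sample values below are hypothetical stand-ins for the report’s actual feature columns:

```python
import numpy as np

# Hypothetical samples: tempo in BPM (0 to 230), loudness in dB (-60 to 0).
tempo = np.array([120.0, 90.0, 230.0, 60.0])
loudness = np.array([-5.0, -20.0, -60.0, -0.5])

# Dividing tempo by its maximum maps it into [0, 1].
tempo_norm = tempo / tempo.max()
# Dividing loudness by its (negative) minimum also maps it into [0, 1],
# since both numerator and denominator are non-positive.
loudness_norm = loudness / loudness.min()
```

Note that dividing by the minimum flips the sense of loudness: the quietest song (-60 dB) maps to 1.0 and the loudest maps toward 0.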
3.1.1 Design Decisions
As shown in Figure 2, we created two histograms to show the distributions of tempo and loudness. For the tempo histogram, we chose green, as green conveys a sense of liveliness. For the loudness histogram, we chose orange, a softer version of red, to convey harshness or noise. We also flipped the loudness axis so that noisier songs extend further in the negative direction.
3.2 Recommendation Creation
In theory, a Spotify user listening to a given song would want to listen to other songs that have similar characteristics. As such, we should be able to use a clustering algorithm to find patterns in how different songs group together: songs that fall into the same cluster can be considered “similar.” Within each cluster, we could then compute the ten smallest pairwise distances for each song, i.e. find the ten songs that are closest to each song in terms of Euclidean distance.
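The within-cluster nearest-neighbor step can be sketched as follows; the function name is ours, and the feature matrix is assumed to hold the songs of a single cluster:

```python
import numpy as np
from scipy.spatial.distance import cdist

def top_k_within_cluster(features, k=10):
    """For each song (row), return the indices of its k nearest songs
    within the same cluster, by Euclidean distance."""
    d = cdist(features, features)    # all pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)      # a song should not recommend itself
    k = min(k, len(features) - 1)    # small clusters may hold fewer than k songs
    return np.argsort(d, axis=1)[:, :k]
```

Running `argsort` per row keeps the neighbors ordered from most to least similar, which matches the ordered recommendation list discussed later.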
We decided to use the K-Means clustering algorithm. Although K-Means is a simple algorithm, our dataset is not especially high-dimensional, so the curse of dimensionality should not be an issue; additionally, all of our features had been normalized to comparable scales. In order to determine the ideal value of k for K-Means, we used the elbow plot method, which plots the number of clusters against the amount of explained variance at each cluster count. The “elbow,” the point after which adding clusters yields only a marginal gain in explained variance, indicates the optimal cluster count for the given dataset.
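The elbow procedure can be sketched with scikit-learn; the synthetic stand-in data and variable names below are ours, not the project’s:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the normalized feature matrix: three blobs.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in (0.0, 0.5, 1.0)])

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Plotting k against inertia and locating the bend (the "elbow"),
# where further clusters stop reducing inertia much, suggests k.
```

In practice, inertia falls steeply up to the true cluster count and flattens afterward, which is what the elbow plot makes visible.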
Our initial clustering approach included all of the numerical features, after normalizing the features that were not ratios, such as tempo and loudness. We also only tested the initial approach on a set of songs from the “pop” genre. Our initial approach also made the mistake of including mode, the binary feature representing whether a song is major or minor. As a result, the elbow plot method chose K = 2 clusters, which should clearly raise red flags: the algorithm was clustering solely on whether a song was major or minor, and we had to change our approach.
As we would eventually have to expand our model to a larger dataset anyway, we decided to repeat the clustering on a dataset of approximately 60,000 songs, rather than 2,000 pop songs. This time, we removed mode, along with other features that did not logically work as ratios (such as duration). We then used the elbow plot method to check the optimal number of clusters again, and the result was K = 8 clusters. K = 8 is more reasonable than 2, as it is plausible that there are roughly eight distinct genres in the set of 60,000 songs. We then successfully found the top ten pairwise distances for each song within each cluster. The process of creating the visualization of the clustering is discussed in section 3.3.2.
3.3 User Interface / User Experience
One of the main focuses of this project was the user experience, as the goal was to allow a user to quickly gain insight into their song recommendations. As such, the user experience is multifaceted, and in section 3.3 we break it down into the following: the search bar (3.3.1), the scatter plot (3.3.2), the radar chart (3.3.3), the list of song recommendations (3.3.4), the comparison graph (3.3.5), and playlist summarization (3.3.6). For each of these features, we discuss individual design decisions. In terms of color scheme, we used a light green background with black or white text in order to resemble the colors of Spotify, while appropriately choosing text colors that would be easy on the users’ eyes.
3.3.1 Search Bar
We wanted to keep our search bar tool simple and easy to understand. As a result, we decided to adopt a Google-like search bar, where the user can simply type in the song for which he or she wants to find recommendations and similar songs. Currently, the user can enter one song into the search bar. In the future, however, we want the user to be able to enter both a song and potentially a few song features (such as the ones we use to showcase similarity in the radar chart in section 3.3.3), so that he or she can find more nuanced similar songs with special emphasis on specific song characteristics.
Figure 3: Search bar for songs
3.3.2 Scatter Plot
In order to present this clustering to an audience, we needed a simple and quick visualization that would immediately show the user how far apart a recommended song is from the song used as the search term. However, the clustering was done on a set of approximately 15 features, so it was impossible to show the distances between songs in 15-dimensional space. We therefore projected the clustered data onto its first two principal components, which serve as the axes of a scatter plot, allowing us to quickly show the user how far apart a recommended song is (Figure 4).
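The projection can be sketched with scikit-learn’s PCA; the random matrix below is a stand-in for the real ~15-feature dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 15))  # stand-in for 100 songs x ~15 normalized features

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # 2-D coordinates used as scatter-plot axes
```

Each row of `coords` gives one song’s position on the scatter plot, so visual distance on the plot approximates (but does not exactly preserve) distance in the full feature space.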
However, this visualization’s interpretability could be improved. While removing the axes provides a cleaner look, a casual observer often prefers labeled axes in order to understand what exactly the distances mean. Similarly, we could add songs that are not similar to the search term, with their nodes faded out to indicate the dissimilarity. In its current state, the plot can be hard to interpret, as there is no point of reference for each song: a song that is indeed similar to the search term may appear otherwise, because there is no dissimilar song to compare against.
Figure 4: Scatterplot in which each dot represents a song and its position in the 2D-clustering output
3.3.3 Radar Chart - Song Comparison Visualization
Multidimensional data is always a challenge to visualize effectively. However, we wanted to make an effort to show the user the feature composition of his or her song choice. As a result, we decided to use a radar chart to visualize the different values of each feature of the song that the user typed into the search bar.
Originally, our radar chart visualization looked messy and complicated because there was no logic to the order in which we placed the feature labels around the diagram. We therefore improved this by placing the features in order of decreasing mean, making the chart more interpretable and aesthetically pleasing. The feature with the largest mean is energy, while the feature with the lowest mean is speechiness. In Figure 5, the features are ordered clockwise.
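Ordering the radar axes by decreasing mean is a one-liner with pandas; the small table below is hypothetical, while the real one holds all nine chart features:

```python
import pandas as pd

# Hypothetical feature table (rows = songs, columns = features).
df = pd.DataFrame({
    "energy":       [0.90, 0.80, 0.70],
    "speechiness":  [0.05, 0.10, 0.08],
    "danceability": [0.60, 0.50, 0.70],
})

# Axes for the radar chart, in order of decreasing dataset-wide mean.
axis_order = df.mean().sort_values(ascending=False).index.tolist()
```

Laying out `axis_order` clockwise around the chart reproduces the ordering shown in Figure 5.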
The radar chart also includes an interactive component, where the user can mouse over a specific point and it will pop out to signify that it has been selected, while also displaying the actual normalized value of the feature. At the same time, the corresponding feature label turns a different color as well, just to make it more clear to the user which feature they are currently inspecting.
All of these design decisions can be seen in Figure 5.
Figure 5: Radar Plot
In addition to just placing the user’s selected song on the radar chart, we want to provide greater insight into why our algorithm recommends the nine or ten specific songs returned by the user’s search. This part of the radar chart is still being integrated into the web application.
Locally, we are in the process of allowing the user to select a song (or multiple songs) from the list of recommended songs, which will then be overlaid on the radar chart to showcase how those songs relate to the user’s song of choice (and to each other) across the nine features. In Figure 6, the user’s song is shown in blue and a recommended song in red. The danceability, valence, and duration_ms values of these two songs are quite similar, which provides some additional information about why this song may have been recommended to the user.
Figure 6: Comparative Radar Plot
3.3.4 Recommended Song List
In section 3.3.2, we discussed the usage of a scatter plot in order to show a user “how similar” two songs are. However, we also needed a user to instantly see what song they should explore next, as other recommendation systems do. On Netflix, the recommendation explains that you should watch a show based on a show you previously watched. Similarly, our application explains that you are suggested ten songs because you listened to the song in the search term.
We implemented a scrollable list of songs, as ten Spotify embeds in a row tend to be an eyesore; instead, a user only has to see four songs at a given time. A future improvement would be listing the songs next to the scatterplot, which would eliminate the need to scroll down to see all of the songs. The songs are presented in order of how similar each is to the search term.
Figure 7: Suggestions from Song Input
3.3.5 Comparison Graph

While the radar chart showcases the normalized values for each song feature, we also decided to include a bar graph showing how the feature values of the user’s chosen song compare to those of all the songs in the data set. The x-axis ranges from -1 to 1 (the exact range varies with the chosen song, for viewing purposes). Each value represents the percentage of songs that have a lower value for that specific feature. For instance, in the figure below, this specific song has a valence percentile of roughly 5%, which means that it has a higher valence than only 5% of the songs in the data set. This graph is useful for comparing the user’s song choice to the entire data set, as it provides insight into how the chosen song’s features stand up to the rest of the songs available.
Figure 8: Comparison between a single song and aggregate playlist
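The per-feature percentile behind this graph reduces to a single comparison; a minimal sketch, where the function name and sample values are ours:

```python
import numpy as np

def feature_percentile(value, all_values):
    """Fraction of songs in the dataset with a strictly lower feature value."""
    all_values = np.asarray(all_values)
    return (all_values < value).mean()
```

For example, a song whose valence exceeds that of only 5% of the dataset would return roughly 0.05 here, matching the reading of Figure 8.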
3.3.6 Playlist Analysis Feature
The newest feature is a work in progress: playlist analysis. The goal of this feature is to allow a user to directly gain insight into their own listening habits, which are often not well defined; how well a person knows their own taste in music can vary drastically from person to person. Currently, the feature’s only functionality is identifying the three most common genres within the playlist. This should allow a user to figure out what genre of music they listen to most, overall or in certain settings (such as going to the gym or studying). The feature could be expanded to provide insight into the most common themes of a playlist, such as low liveness or high acousticness.
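Selecting the three most common genres is a straightforward counting exercise; a sketch using the standard library, with the function name and sample genres being ours:

```python
from collections import Counter

def top_genres(genres, n=3):
    """Most common genres among a playlist's tracks,
    ordered from most to least frequent."""
    return [g for g, _ in Counter(genres).most_common(n)]
```

For a playlist whose tracks map to `["pop", "rock", "pop", "jazz", "pop", "rock"]`, this yields pop, rock, and jazz, in that order.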
In order to present these findings to the user, the layout has four parts. At the top of the feature, we clearly tell the user that they can select a playlist to have its features analyzed. On the left side of the screen, the user sees a list of their playlists (obtained through Spotify authentication), while on the right, the playlist being analyzed displays the songs it contains. We chose this orientation because most users are used to reading from left to right, so it feels more natural to select a playlist on the left and then read its songs on the right. At the bottom, we state the most common genres within the playlist, so that this does not distract too much from the rest of the playlist options while still clearly explaining the characteristic to the user. This could be expanded with a small radar chart beneath the list of songs showing how the characteristics of the playlist (such as acousticness) compare to those of other songs on average.
Figure 9: Analyzing Playlist Features
4. Conclusions and Future Work
On a rather subjective basis, the quality of recommendations was decent: it made sense, for example, that songs by Taylor Swift could lead to recommendations from Shawn Mendes. Additionally, because clustering was based solely on song characteristics, recommendations were not biased toward similarly popular songs; ideal recommendations should help a user discover new music that they haven’t heard of, rather than songs that are already very popular. However, all of these remarks come with the caveat that we do not possess user listening history, which is a much more powerful indicator of which songs are likely to be “good” recommendations.
On that note, this work could be extended to include a history of song searches. While we cannot use listening history, we can eventually track which songs a user searches in our recommendation system. This could potentially lead to stronger recommendations.
The search bar could also be extended to include additional parameters, so that a user can act on the insight they gain from our recommendation system. For instance, if a user notices that they frequently listen to songs with a particular feature (e.g. high liveness), they could search for songs with a high liveness value, finding more songs to add to their own playlists.
The playlist summarization feature also has a lot of room for growth and a lot of potential. It could be extended to include the features that are most representative of the playlist (e.g. high speechiness or low liveness), rather than simply the playlist’s primary genres. This could provide a very important insight into a user’s own listening habits.
Insightful Spotify recommendations are possible: it is certainly feasible to create strong recommendations while also providing a user with insight into their recommendations and listening history. Sometimes it is more valuable to understand the “why” behind a specific group of results than the results themselves, because it can uncover discoveries that provide even more useful insight to the user. That is exactly what INSPIRE aims to achieve: answering the missing “why” in Spotify’s song recommendation feature.