Spotify Music Data Analysis

Published in Web Mining [IS688, Spring 2021]

May 8, 2021

By Ajay kumar Selvaraj Rajagopal, Himavarsha and Saleena John

The definition of success is different for every musician. For some musicians, success means being able to put food on the plate and pay the bills by making music. For some, seeing their music played on several music platforms might count as success. For others, the wish might be to find their song on the Billboard Top 100. Spotify gives all musicians an opportunity to prove themselves successful. Are artistic excellence and quality of music alone enough for a piece of music to be successful?

Spotify is one of today’s leading commercial music streaming services, offering its music content through a two-tiered service. The app streams music to a wide audience with varied preferences. We consider it one of the best free online radio options on the market, with an easy user interface. Its features include browsing and searching by album, artist, genre, or record label, and adding tracks to playlists, with playback always kept front and center. Spotify users can subscribe to the premium version or simply enjoy the free version. The free version supports online listening at reduced audio quality, interrupted by multiple back-to-back advertisements. The premium version provides advertisement-free music at high audio quality, along with offline listening. Spotify has over 155 million premium subscribers and offers over 70 million songs across various genres and albums. Unlike competing music streaming services, Spotify has something to offer all of its users. It also draws on data from music industry professionals, artists, and consumers to identify users’ needs and improve its software.

Spotify is successful for several reasons. First, it provides a great user experience. Using Spotify is really simple, as the application design is centred around playlists. Users can add and play a song inside a playlist. Next, Spotify integrates with all devices, meaning that music playing on a phone can easily be switched to a computer without interrupting playback. Last but not least, its simple monthly premium pricing attracts many customers, giving them full control over their music, which can be shared with family and friends.

In this project, we analyze what makes an album a success by looking at the similarities among popular albums: the top albums of each year, the sound profile of their songs, the number of songs, the album duration, the genre, and the popularity level.

To begin our project, we extracted information about albums, tracks, and artists from Spotify and other web sources. Spotify gives developers access to some of its data on playlists, users, and artists through its Web API. We used Spotipy, a Python library that provides full access to all of the music data provided by the Spotify platform. We also used Billboard.py to scrape basic details from the Billboard website, such as the chart details and the albums for every year. We created a Spotify developer account and generated our own client ID and client secret, which we use to access the endpoints. One of the features we explored is the popularity of tracks and artists. Initially, we grouped data on the top 50 tracks over 5 years for the exploratory analysis, but the results were too vague. We finally decided to analyze data from 2000 to 2020. The data consisted of basic track details: the album name, artist name, song name, artist genre, release date, track length, and the song's popularity on Spotify. We also collected the audio features of the songs: danceability, acousticness, energy, instrumentalness, liveness, loudness, speechiness, tempo, and time signature. The data is in a compressed format, and the song IDs were in JSON format. For each year, we made one API call, extracted the data, and merged all the files for the analysis. We also tried to check whether the intervention of technology affected artists’ popularity.

With the data we have, we wanted to find out how the audio features affect the popularity of a song. To collect popularity data, we created empty lists to store the results. We then searched for tracks from the year 2020, limited to 50 results, and recorded each artist’s name, track name, track ID, and popularity.
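The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the original project code: the helper names `make_client`, `parse_tracks`, and `collect_tracks` are hypothetical, and only the first listed artist of each track is kept, which is an assumption. The Spotipy import sits inside `make_client` so that the parsing helper can be tried without Spotipy installed.

```python
def make_client(client_id, client_secret):
    """Build a Spotipy client using the client-credentials flow."""
    # Imported here so parse_tracks() below works without Spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    return spotipy.Spotify(auth_manager=auth)


def parse_tracks(results):
    """Pull artist name, track name, track ID, and popularity out of a search response."""
    artist_names, track_names, track_ids, popularities = [], [], [], []
    for item in results["tracks"]["items"]:
        # Assumption: keep only the first listed artist of each track.
        artist_names.append(item["artists"][0]["name"])
        track_names.append(item["name"])
        track_ids.append(item["id"])
        popularities.append(item["popularity"])
    return artist_names, track_names, track_ids, popularities


def collect_tracks(sp, year, limit=50):
    """Search for tracks released in `year` and return the parsed lists."""
    results = sp.search(q=f"year:{year}", type="track", limit=limit)
    return parse_tracks(results)
```

With real credentials, `collect_tracks(make_client(my_id, my_secret), 2020)` would return four parallel lists for up to 50 tracks from 2020; `my_id` and `my_secret` stand in for the values from the developer dashboard.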

Data Description

The snapshot above shows the data we collected. It contains a total of 17 columns and 2100 entries of top songs over the years. Most of the columns are self-explanatory. The audio features are:

Danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements such as tempo, rhythm stability, beat strength, and overall regularity. As you can see in the graph, a value of 0.0 is the least danceable and 1.0 is the most danceable.
Acousticness: This value describes how acoustic the song is. A score of 1.0 means the song is most likely an acoustic one.
Energy: Energy is defined as the sense of forwarding motion in music that will keep the listener engaged and listening. We can identify the music’s energy when the drums get busier and play louder, and the singer sings higher in an intense tone. An increase in volume and instrumentation, changing of performance style, basic beat, rhythm, and change in lyrics are contributions to the energy of music.
Instrumentalness: This predicts whether a song contains no vocals. The closer the value is to 1.0, the more likely the song is instrumental.
Liveness: This value denotes the probability that the song is recorded with a live audience. According to the official documentation, “a value above 0.8 provides a strong likelihood that the track is live”.
Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
Speechiness: It denotes the presence of spoken words in a song. If the speechiness of a song is above 0.66, it is probably made almost entirely of spoken words; a score between 0.33 and 0.66 indicates a song that may contain both music and speech; and a score below 0.33 most likely indicates music and other non-speech-like content.

Exploratory Data Analysis

Clustering

The approach we took to find key aspects that are common among the top albums is clustering. Clustering helps us identify groups of songs in our dataset that are similar to each other, and we do this using the audio features that we got from Spotify. Before clustering, we had to standardize the data. Standardization is the process of putting different features on the same scale, with each scaled feature having a mean of 0 and a standard deviation of 1. This is important because the model is not familiar with the context of the data. If we do not standardize our data, the model would place a much larger weight on tempo and loudness, since those variables vary by much more than the variables that are distributed in the range from 0 to 1. Standardization allows all features to be treated equally by the model. We accomplished this step using sklearn’s StandardScaler.
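To make the standardization step concrete, here is a small dependency-free sketch of the same computation StandardScaler performs by default: subtract each column's mean and divide by its population standard deviation. The function name `standardize` and the sample rows are hypothetical, chosen only for illustration.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation, as StandardScaler uses by default.
    # A constant column gets a divisor of 1 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical rows of (danceability, tempo in BPM, loudness in dB).
songs = [
    [0.80, 120.0, -5.0],
    [0.60, 90.0, -8.0],
    [0.40, 150.0, -11.0],
]
scaled = standardize(songs)
for row in scaled:
    print([round(v, 3) for v in row])
```

After scaling, tempo no longer dominates danceability simply because it is measured in larger numbers.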

Song Feature

After standardization, we move on to k-means clustering. To implement k-means clustering, we must select a number of clusters, k, that distinctly splits the data. To do this, we use the elbow method. We ran the clustering algorithm for values of k ranging from 1 to 20 and plotted the clustering error (typically the within-cluster sum of squared distances) against the number of clusters k. We then looked for a kink, or elbow, in the graph. Usually, the part of the graph before the elbow declines steeply, while the part after it is much smoother.
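The elbow method can be sketched as follows. This is an illustration only: it swaps in a small pure-Python version of Lloyd's k-means algorithm rather than the library implementation the project most likely used, it runs on six hypothetical points, and it tries k from 1 to 4 instead of 1 to 20. The quantity recorded for each k is the within-cluster sum of squared distances.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans_inertia(points, k, iterations=50, seed=0):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        new_centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(squared_distance(p, centroids[nearest(p, centroids)]) for p in points)


# Hypothetical standardized (danceability, energy) pairs forming two loose groups.
points = [(-1.0, -1.1), (-0.9, -1.0), (-1.1, -0.9), (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 3))
```

Plotting these values against k gives the elbow curve; the k at which the drop flattens out is the candidate number of clusters.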

The elbow in the graph points to 4 clusters, or distinct groups. Because k-means clustering is an unsupervised machine learning technique, we cannot verify the results against known labels. We can, however, visualize the clusters on a 2D plane to check that there is some sensible separation of the data based on the sources of signal captured by the components.

In the graph, all the clusters seem fairly distinct except for the orange one. It is also clear that two clusters hold the majority of the songs compared to the others.

In the graph below we can see the number of songs in each cluster. Clusters 0 and 2 each have more than 1000 songs, while the other two have around 300 songs each. This suggests that there is some kind of common factor among the majority of the songs that take a spot in the Billboard top 100 charts.

Now let's see what the audio features look like for each of these clusters. We mainly look at danceability, acousticness, and energy.

Danceability

Pretty much all four clusters follow a bell curve and contain songs with high danceability. From this, we can assume that the majority of the songs in the Billboard top 100 score high on tempo, rhythm stability, beat strength, and overall regularity.

Energy

Energy behaves much like danceability for these four clusters: the distribution is roughly normal, and a large portion of these songs are high-energy ones. The Billboard top 100 seems to have a large number of songs that feel fast, loud, and noisy.

Acousticness

Acousticness, though, seems to differ from the other two audio features. The clusters with the most songs, clusters 0 and 2, seem to follow the same acousticness pattern, and many songs in these clusters appear to have high acousticness as well. This may mean that these clusters contain songs with less speechiness.

Genre

Next, let’s talk about the genre. We wanted to see how much of an impact genre has on the way songs appear in the top Billboard charts. But there was one problem we had to solve before we could do anything with the genre. Spotify's genre categories are not straightforward and are sometimes quite bizarre. In some instances, songs in the dataset fell into as many as 23 different Spotify genres. So we had to figure out a way to map the songs to a common and more reasonable genre, since we were planning to do exploratory data analysis on how genre plays a role in the Billboard rankings. We used a two-step process to translate Spotify’s genres to our own genre definition. First, we compared the list of genres that each song had with our own list of overarching genres like “metal”, “rock”, “pop” and 5 others. In this way, we were able to convert “atmospheric post-rock” into rock and turn “deep northern soul” into soul. Then, for any artist that fell into multiple Spotify genres, we wrote a short Python script to “vote” on which genre to place the artist in. For instance, a genre definition like “atmospheric rock, psychedelic rock, alternate” would translate to “rock, rock, alternate” in the first step and then “rock” in the second step. Some of the genres had to be updated manually because this method did not prove fully effective. After mapping the songs' genres to a common list, we proceeded to do some exploratory analysis with it.
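The two-step translation described above might look something like the sketch below. The original script is not shown in the article, so this is a reconstruction under assumptions: the contents of `MAJOR_GENRES` are hypothetical beyond the genres the article names, step one is assumed to be a simple substring match, and ties in the vote go to the genre seen first.

```python
from collections import Counter

# Hypothetical list of overarching genres; the article names "metal", "rock",
# and "pop" and mentions others without giving the full list.
MAJOR_GENRES = ["metal", "rock", "pop", "country", "soul", "rap", "hip hop", "r&b"]


def to_major_genre(spotify_genre, major_genres=MAJOR_GENRES):
    """Step 1: map a Spotify genre to the first overarching genre it contains."""
    for major in major_genres:
        if major in spotify_genre:
            return major
    # Nothing matched, so the genre is left unchanged.
    return spotify_genre


def vote_genre(spotify_genres):
    """Step 2: pick the most common overarching genre for an artist."""
    mapped = [to_major_genre(genre) for genre in spotify_genres]
    if not mapped:
        return None
    return Counter(mapped).most_common(1)[0][0]


print(to_major_genre("atmospheric post-rock"))
print(vote_genre(["atmospheric rock", "psychedelic rock", "alternate"]))
```

A substring match like this is order-sensitive: a label such as "pop rock" maps to whichever of the two genres comes first in the list, which is one reason manual corrections can still be needed.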

Here’s a look at songs from the top 20 albums of Billboard that have reached the Hot 100, from 2000–2020. Each tiny rectangle denotes a different song. The larger the rectangle, the longer the song stayed on the Hot 100. We grouped the 3000+ songs into around 8 major genres, including rock, pop, metal, country, R&B, and rap/hip-hop.

Once each song was categorized, we looked at how a genre’s representation on the Hot 100 chart changed over time. To do so, we calculated the total number of Hot 100 spots occupied by each of our 8 major genres in a given year. Consider rock for a second. Even though it was most popular in the 70s, there was still hype around it in the early 2000s, but it quickly declined. The same goes for metal: popular in the early 2000s, it then lost its hype. The opposite goes for hip hop, which has gradually increased in popularity and now regularly hits the charts. The same goes for R&B.
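Counting the Hot 100 spots per genre per year comes down to a grouped tally. A minimal sketch, assuming each chart entry is a dictionary with hypothetical `year` and `genre` keys (the article does not describe the actual data layout):

```python
from collections import Counter, defaultdict


def genre_spots_by_year(chart_entries):
    """Count how many chart spots each genre occupies in each year."""
    counts = defaultdict(Counter)
    for entry in chart_entries:
        counts[entry["year"]][entry["genre"]] += 1
    return counts


# Hypothetical chart entries, one per occupied chart spot.
chart_entries = [
    {"year": 2001, "genre": "rock"},
    {"year": 2001, "genre": "rock"},
    {"year": 2001, "genre": "pop"},
    {"year": 2019, "genre": "hip hop"},
    {"year": 2019, "genre": "pop"},
]
spots = genre_spots_by_year(chart_entries)
for year in sorted(spots):
    print(year, dict(spots[year]))
```

Plotting one line per genre from these yearly counts gives the trend view discussed above.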

This bubble graph shows the artists that made their genre popular. Over the twenty years, country, pop, and hip hop artists seem to have consistently held their place in the Billboard Hot 100.

Next, we look at the duration of songs over the last two decades. The values are in milliseconds. Songs were around four and a half minutes long for a long time, and then in the last two years they suddenly dropped below 3 minutes. Reports say this is because record labels deliberately ask artists to keep songs short in order to increase streaming numbers on both radio and media services such as Apple Music, Tidal, and Pandora.

Mean Value of Audio Features

We plotted the mean values of audio features and got the bar chart shown below.

From the graph, it is clear that songs on Spotify are high in energy and danceability. This may suggest that Spotify is used by a comparatively younger demographic. Speechiness and acousticness are low in most of the songs.

Song Trends Over the Years

The song trends over the years match what the histograms and the clustering show: songs are high in energy and danceability, while acousticness and liveness remain low.

Features Heatmap

We plotted a heatmap of all the audio features and got the resulting graph.

From the heatmap, we can see that there is a strong positive correlation between energy and loudness. We then plotted a scatter plot of energy vs loudness.

The heatmap also shows a negative correlation between energy and danceability.
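Each cell of such a heatmap holds the correlation between two features. The sketch below computes the Pearson correlation coefficient by hand, assuming that is the measure behind the heatmap (the article does not say which one was used). The energy and loudness values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical feature values for four songs.
energy = [0.9, 0.7, 0.5, 0.3]
loudness = [-4.0, -6.0, -9.0, -12.0]
print(round(pearson(energy, loudness), 3))
```

A value near 1 means the two features rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.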

Likes and dislikes

Using the popularity column, we created another column that indicates whether a song is liked or disliked. Popularity is on a scale of 0 to 100; songs with a popularity greater than 55 are considered liked, and the rest are considered disliked. The like/dislike histograms across various features offer some interesting insights.
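The derived like/dislike column amounts to a simple threshold on popularity. A minimal sketch, with hypothetical names (`LIKE_THRESHOLD`, `label_song`) and made-up popularity values:

```python
LIKE_THRESHOLD = 55  # cutoff stated in the article


def label_song(popularity, threshold=LIKE_THRESHOLD):
    """Return 'liked' when popularity is above the threshold, else 'disliked'."""
    return "liked" if popularity > threshold else "disliked"


# Hypothetical popularity scores on Spotify's 0-100 scale.
popularities = [82, 55, 40, 56]
labels = [label_song(p) for p in popularities]
print(labels)
```

Note that a score of exactly 55 falls on the disliked side, since the rule is strictly greater than 55.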

From these graphs, we can conclude that

People like songs with high danceability, energy and loudness
People prefer songs that have a duration of around 3 minutes
People dislike songs with high acousticness and speechiness
People also tend to like songs that are recorded live

Top Genre and Artists

The first pie chart shows the top 10 genres of the top-charted songs every year.

The second chart shows the top artist with the most songs in each genre. The colours in the second chart correspond to the genres in the first chart. For example, Taylor Swift had the most pop songs, Carrie Underwood had the most country songs, and so on.

Conclusion

We set out to find how a song or album becomes successful on Spotify and which factors determine this success.

One of the limitations we faced while working on the project was that the Spotify Web API does not provide genre data for individual tracks, even though genre largely determines the popularity of songs and albums. We incorporated genre data from other sources.

We had another question to analyze at the start of the project: whether users are influenced by technological innovations in music. But when we analyzed the top charts of the 21st century, we were surprised to find that albums by The Beatles made it into the top charts.

At the same time, if we observe the acousticness cluster, we can see a sharp decline. This can be attributed to technological advances in sound engineering. Bands in the 60s and early 70s had to rely heavily on their mastery of their instruments, but as time passed, music production came to play a vital role in music. When artists saw that they could produce entire songs with studio effects alone, natural sounds (which contribute to acousticness) decreased. Hence the sharp decline. This new technology can also explain the increase in the energy quotient of a track and its danceability.

Time is an important factor: to keep listeners' attention for the entire album, artists tend to keep songs under three minutes. So, for an album to be successful, make it a high-energy dance number and keep the song duration to approximately 3 minutes.

Lessons Learned

We learned how to use clustering to find interesting patterns in the dataset.
Web scraping allowed us to get the data that we needed rather than using pre-defined data sets which may not include all the information we are looking for.

Restrictions

We couldn’t find a better way to sort the songs' genres for exploratory analysis and had to make some manual corrections.
Our dataset was small because we couldn’t get data from the Spotify API beyond certain years. This restricted us from doing a broader analysis of the evolution of music.
This project relies on data that we collected from the Spotify Web API. But the world of music extends beyond the reach of this API alone. There are many other external, unquantifiable factors that could determine the success of certain albums and songs. Getting that information is far beyond the scope of this project. This project could help upcoming music artists improve their skills and techniques in order to attract an audience and become successful.