Netflix Movies and TV Shows recommender using cosine similarity

Saleena John
Web Mining [IS688, Spring 2021]
6 min read · May 6, 2021


image courtesy: Netflix

One of the hardest parts of deciding to binge-watch Netflix is choosing what to watch. And if I have just watched a good movie and want another one like it, how do I find it? To solve this, I decided to build a recommender system for movies and TV shows on Netflix, based either on something I feel like watching or on a title I have already watched.

This project recommends movies and shows on the Netflix streaming platform depending upon what you have watched.

I am using the Netflix Movies and TV Shows dataset from Kaggle for this project. The recommender system uses cosine similarity, and I also built some visualizations in Python along the way. The dataset contains 7787 records and 12 columns.

Libraries used

NumPy: NumPy is a Python library for working with arrays. It also has functions for linear algebra, Fourier transforms, and matrices. The dataset I used is a .csv file, and all the array operations in this project are done using the NumPy library.

Pandas: Pandas is an open-source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of NumPy, which provides support for multi-dimensional arrays. All data analysis operations in this project are performed using the pandas library.

Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits. Most of the visualizations are done using this library.

Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

NLTK: NLTK is a leading platform for building Python programs that work with human language data. I used this library to process the text fields in the dataset.

Sklearn: Sklearn (scikit-learn) is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbors, and it works with Python numerical and scientific libraries like NumPy and SciPy.

Stop-words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Data Analysis

Let’s take a look at the dataset we got.

import pandas as pd

df = pd.read_csv("/Users/saleenajohn/Desktop/WebMining/Assignment3/netflix_titles.csv")
df.head(5)
First five rows of the netflix_titles.csv dataset

The dataset contains 12 columns and 7787 entries. I did not use all of them: some rows were dropped because of insufficient data, and only the columns required for the recommender system were used.

I did some visualizations before going into the actual recommender system. First, I wanted to check whether the dataset contains more movies or more TV shows.

Movie and TV Shows count

Around 30% of the titles in the dataset are TV shows.
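The share quoted above can be checked with a couple of pandas calls. The sketch below is a minimal illustration on a made-up five-row frame, not the plotting code used for the chart; it assumes the type column holds the values 'Movie' and 'TV Show'.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"]})

counts = toy["type"].value_counts()                # absolute counts per type
shares = toy["type"].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print((shares * 100).round(1))
```

On the real data, the same two calls on df['type'] give the numbers behind the chart, and counts.plot(kind='bar') would draw them.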

Top 10 Countries in film production on Netflix

The bar chart above shows the top 10 countries in film production on Netflix. The top country is the United States followed by India and the United Kingdom.
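One way to get such a ranking is sketched below. It assumes that a title produced in several countries lists them in one comma-separated string in the country column, and it counts each listed country once per title; the chart above may have been computed differently. The rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": [
        "United States",
        "India",
        "United States, United Kingdom",
        "United Kingdom",
        "United States",
        None,  # missing country, ignored below
    ]
})

top_countries = (
    toy["country"]
    .dropna()
    .str.split(", ")   # "A, B" -> ["A", "B"]
    .explode()         # one row per (title, country) pair
    .str.strip()
    .value_counts()
    .head(10)          # keep the ten most frequent
)
print(top_countries)
```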

Now, let’s get to the real work. Skimming through the data I collected, I saw some null fields. Rows that are missing key information are not useful for the recommender system, so the next step is to find the null values and decide which rows to drop.

df.isnull().sum()
Information of null value columns

Looking at the output above, date_added is not an important column for the recommendation. The rating column in this dataset is the viewer-discretion rating rather than a user score, so it is not useful for the recommendation either.

Cast and country are two important features, so if either one is missing, it is better to drop the row.

The director column has the most null values, so dropping those rows would take away a huge chunk of the data. Instead of leaving them null, I decided to populate them with ‘Unknown’.

df = df.dropna(subset=['cast', 'country'], axis=0)
df['director'] = df['director'].fillna("Unknown")
df = df.reset_index(drop=True)

The above code snippet drops the rows with a null cast or country, populates ‘Unknown’ for null director fields, and resets the index so that the rows are numbered consecutively again.

Recommendation

Since we don't have a user rating field, I decided to use cosine similarity to come up with recommendations.

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them in a multidimensional space. It can be applied to the items in a dataset to compute their similarity to one another via keywords or other metrics. The similarity between two vectors A and B is calculated by taking their dot product and dividing it by the product of their magnitudes: cos(θ) = (A · B) / (‖A‖ ‖B‖). Put simply, the cosine similarity score of two vectors increases as the angle between them decreases.
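To make the formula concrete, here is a small worked example with NumPy. The helper name cosine_sim and the three vector pairs are made up for illustration; the project itself uses the sklearn function shown later.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_sim([1, 2, 0], [2, 4, 0])        # same direction
partial = cosine_sim([1, 1, 0], [1, 0, 1])     # one shared component
orthogonal = cosine_sim([1, 0, 0], [0, 1, 0])  # nothing in common

print(same, partial, orthogonal)
```

The three scores come out at approximately 1, 0.5, and 0: the smaller the angle between the vectors, the higher the score.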

I created another column named ‘overall_infos’ by joining the data from the type, title, director, cast, description, and country columns. The dataset now looks like this:

Data set after adding overall_infos
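A combined column like this can be built in a single step. The following is a minimal sketch on two made-up rows; joining the fields with a single space is an assumption, since the separator actually used is not shown.

```python
import pandas as pd

# Two made-up rows with the six columns named above
toy = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "title": ["Example Film", "Example Series"],
    "director": ["Jane Doe", "Unknown"],
    "cast": ["Actor One, Actor Two", "Actor Three"],
    "description": ["A heist goes wrong.", "A small town hides a secret."],
    "country": ["India", "United States"],
})

cols = ["type", "title", "director", "cast", "description", "country"]

# Join the six text columns into one string per row, separated by spaces
toy["overall_infos"] = toy[cols].astype(str).apply(" ".join, axis=1)

print(toy.loc[0, "overall_infos"])
```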

Now, we have to clean and process this overall_infos column. First, we remove the stop words from it, then convert all words to lower case.

overall_infos before text processing
overall_infos after text processing
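The cleaning step could look roughly like the sketch below. It is not the code used in the project: the stop-word set is a tiny hand-written stand-in for the much longer NLTK list, the function name clean_text is hypothetical, and stripping punctuation is an extra assumption. It lowercases before filtering so that a capitalized word such as ‘The’ still matches the lowercase stop-word list.

```python
import re

# Tiny stand-in stop-word list; the real NLTK list is much longer
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def clean_text(text):
    """Lowercase, keep only letters/digits, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_text("The Witcher is a Show in the Woods!")
print(cleaned)  # witcher show woods
```

Applied to a whole column, this would be something like df['overall_infos'].apply(clean_text).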

Now, applying cosine similarity to the processed text (the cleaned_infos column of df_new):

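from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Turn each title's cleaned text into a vector of word counts
CV = CountVectorizer()
converted_matrix = CV.fit_transform(df_new['cleaned_infos'])
# Pairwise similarity between every pair of titles.
# Note: this name shadows the imported function, so re-running this line fails.
cosine_similarity = cosine_similarity(converted_matrix)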

The cosine similarity matrix looks like this

Cosine similarity matrix
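To see what such a matrix contains, here is the same pipeline on three made-up one-line descriptions. The result is stored under the name sim (my choice, to keep the sklearn function usable afterwards).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "crime drama detective",
    "crime drama police",
    "romantic comedy wedding",
]

counts = CountVectorizer().fit_transform(docs)  # sparse matrix, one row per text
sim = cosine_similarity(counts)                 # dense 3x3 array

print(np.round(sim, 2))
```

The matrix is square and symmetric, with ones on the diagonal because every text is identical to itself. The first two texts share two of their three words and score about 0.67, while the third shares none and scores 0.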

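Now, say you want to watch a comedy movie or show:

df[df['description'].str.contains('comedy')]

This returns the result below. Note that the match is case-sensitive, so only descriptions containing the lowercase word ‘comedy’ are included.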

results returned for comedy in description
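If descriptions that start with a capitalized ‘Comedy’ should match too, str.contains accepts a case flag. The sketch below uses made-up rows and is a variation on the search above, not the query behind the screenshot.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": [
        "A stand-up comedy special.",
        "Comedy legends reunite on stage.",
        "A detective hunts a killer.",
        None,  # missing description
    ],
})

# case=False matches "Comedy" as well as "comedy";
# na=False treats a missing description as "no match"
mask = toy["description"].str.contains("comedy", case=False, na=False)
matches = toy[mask]
print(matches["title"].tolist())  # ['A', 'B']
```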

And if you want to watch a movie similar to a movie you have watched:

# Look up the id of the title so we can compare it with every other title.
# Assumes df has an integer 'id' column matching each row's position in the similarity matrix.
title = 'Stranger Things'

movie_id = df[df['title'] == title]['id'].values[0]
score = list(enumerate(cosine_similarity[movie_id]))
sorted_score = sorted(score, key=lambda x: x[1], reverse=True)

# Skip the first entry, which is the title itself
sorted_score = sorted_score[1:]

# Print the five most similar titles
for rank, item in enumerate(sorted_score[:5], start=1):
    movie_title = df[df['id'] == item[0]]['title'].values[0]
    print(rank, movie_title)

This prints the top 5 recommendations for the chosen title.
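The lookup can also be wrapped in a reusable function. The sketch below is one possible shape, not the code used in the project: the helper name recommend is hypothetical, the three titles and their texts are made up, and it relies on the row position in a freshly reset index instead of a separate id column.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title, df, sim, top_n=5):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    matches = df.index[df["title"] == title]
    if len(matches) == 0:
        raise ValueError(f"Title not found: {title!r}")
    pos = matches[0]  # usable as a matrix row only because the index is 0..n-1
    order = np.argsort(-sim[pos], kind="stable")    # highest similarity first
    order = [i for i in order if i != pos][:top_n]  # drop the title itself
    return df.loc[order, "title"].tolist()

# Made-up titles and cleaned texts for illustration
toy = pd.DataFrame({
    "title": ["Dark Woods", "Night Woods", "Wedding Bells"],
    "cleaned_infos": [
        "mystery town missing boy woods",
        "mystery town missing girl woods",
        "romantic comedy wedding planner",
    ],
})
sim = cosine_similarity(CountVectorizer().fit_transform(toy["cleaned_infos"]))

top = recommend("Dark Woods", toy, sim, top_n=2)
print(top)  # ['Night Woods', 'Wedding Bells']
```

Raising an error for an unknown title avoids the IndexError that a bare .values[0] lookup would produce when the title is misspelled.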

References:

https://www.kaggle.com/sierram/cosine-similarity-wine-descriptions
