Netflix Movies and TV Shows recommender using cosine similarity

Published in

Web Mining [IS688, Spring 2021]

6 min read · May 6, 2021

One of the hardest parts of settling in to binge-watch Netflix is deciding what to watch. And if I have just watched a good movie and want to watch something similar, how do I find it? To solve this, I decided to build a recommender system for movies and TV shows on Netflix that suggests titles based on something I want to watch or on a movie I have previously watched.

This project recommends movies and shows on the Netflix streaming platform depending upon what you have watched.

I am using the Netflix Movies and TV Shows dataset from Kaggle for this project. The recommender system uses cosine similarity, and I also made some visualizations of the data in Python. The data set contains 7787 records and 12 columns.

Libraries used

NumPy: NumPy is a Python library used for working with arrays. It also has functions for linear algebra, Fourier transforms, and matrices. All the array operations in this project are done using NumPy.

Pandas: Pandas is an open-source Python package that is widely used for data science, data analysis, and machine learning tasks. It is built on top of NumPy, which provides support for multi-dimensional arrays. The dataset I used is in .csv format, and all data loading and analysis operations are performed using pandas.

Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits. Most of the visualizations are done using this library.

Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Nltk: NLTK is a leading platform for building Python programs to work with human language data. I used this library to work with various text in the dataset.

Sklearn: Scikit-learn (sklearn) is a free machine learning library for Python. It features various algorithms such as support vector machines, random forests, and k-nearest neighbors, and it is designed to work with the Python numerical and scientific libraries NumPy and SciPy.

Stop-words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

Data Analysis

Let’s take a look at the data set we got.

import pandas as pd
# Load the Kaggle CSV; adjust the path for your machine.
df = pd.read_csv("/Users/saleenajohn/Desktop/WebMining/Assignment3/netflix_titles.csv")
df.head(5)

First five rows of the netflix_titles.csv dataset

The dataset contains 12 columns and 7787 entries. I did not use all of it: some rows were dropped because of missing data, and only the columns required for the recommender system were used.

I did some visualizations before going into the actual recommender system. First, I wanted to check whether the data set contains more movies or more TV shows.

Around 30% of the titles in the dataset are TV shows.
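A share like this can be read straight off the type column. Here is a minimal sketch, assuming the column holds the values 'Movie' and 'TV Show'; the tiny DataFrame is made up for illustration and stands in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 10 titles, 3 of them TV shows.
sample = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

# normalize=True returns fractions of the total instead of raw counts.
type_share = sample["type"].value_counts(normalize=True)
print(type_share)
```

On the real data the same call would be made on df['type'], and the resulting Series can be passed to a bar or pie chart.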

The bar chart above shows the top 10 countries in film production on Netflix. The top country is the United States followed by India and the United Kingdom.

Now, let’s get to the real work. Skimming through the data, I saw some null fields. We don’t want these in the recommender system, so the next step is to find the null values and drop the rows that contain them.

df.isnull().sum()

From the output above, date_added is not an important column for the recommendation. And rating in this data set is the viewer discretion rating, which is not useful for the recommendation either.

Cast and country are two important features, so if this information is missing, it is better to drop those rows.

The director column has the largest number of null values, so dropping those rows would take away a huge chunk of data. Instead, I decided to fill the null values with ‘Unknown’.

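# Drop rows where cast or country is missing.
df = df.dropna(subset=['cast', 'country'], axis=0)
# Keep rows with a missing director, but label them 'Unknown'.
df['director'] = df['director'].fillna("Unknown")
df = df.reset_index(drop=True)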

The above code snippet drops the cast and country null values and populates ‘Unknown’ for null director fields.

Recommendation

Since the dataset has no user rating field, I decided to use cosine similarity to come up with recommendations.

Cosine similarity measures how similar two vectors are by computing the cosine of the angle between them in a multidimensional space. It can be applied to items in a dataset to compute their similarity to one another via keywords or other metrics. The similarity between two vectors A and B is calculated by taking their dot product and dividing it by the product of their magnitudes: cos(θ) = (A · B) / (‖A‖ ‖B‖). Put simply, the cosine similarity score of two vectors increases as the angle between them decreases.
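To make the formula concrete, here is a small sketch in NumPy. The function name cosine_sim and the example vectors are my own, chosen for illustration; it does not handle all-zero vectors, which would cause a division by zero.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the two magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_sim([1, 0, 0], [0, 1, 0])      # nothing in common
partial = cosine_sim([1, 1, 0], [1, 0, 0])         # 45 degree angle

print(same_direction, orthogonal, partial)
```

Parallel vectors score close to 1, vectors with nothing in common score 0, and everything else falls in between. For word-count vectors, which are never negative, the score always stays between 0 and 1.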

I created another column named ‘overall_infos’ by joining the data from the type, title, director, cast, description, and country columns into a single text field for each title.
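One way to build such a column is sketched below. The two sample rows are made up, and joining with a single space is my assumption; the original notebook may have combined the columns differently.

```python
import pandas as pd

# Hypothetical two-row sample; the real dataset has many more rows and columns.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "title": ["Example Film", "Example Series"],
    "director": ["Jane Doe", "Unknown"],
    "cast": ["Actor One, Actor Two", "Actor Three"],
    "description": ["A heist goes wrong.", "Friends solve a mystery."],
    "country": ["India", "United States"],
})

feature_columns = ["type", "title", "director", "cast", "description", "country"]

# Join the six text columns row by row, separated by a single space.
sample["overall_infos"] = sample[feature_columns].astype(str).agg(" ".join, axis=1)

print(sample["overall_infos"].iloc[0])
```

Casting to str first keeps the join from failing if a column holds non-text values.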

Now, we have to do some cleaning and processing of this overall_infos column. First, we have to remove the stopwords from it, then convert all words to lower case.

overall_infos before text processing

overall_infos after text processing
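The cleaning step can be sketched roughly as follows. The stop-word set here is a short hand-picked list for illustration, not the full list from NLTK or the stop-words package, and the helper name clean_text is hypothetical. I also strip punctuation, which is an assumption on my part.

```python
import re

# Small illustrative stop-word set; a real run would load a fuller list
# (for example from NLTK or the stop-words package).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def clean_text(text):
    """Lowercase the text, keep only letter/digit runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = clean_text("The Haunting of a House in the Hills")
print(cleaned)
```

This sketch lowercases before filtering so that a capitalized "The" still matches the lowercase stop word. Applied to the whole column, it would look like df['overall_infos'].apply(clean_text).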

Now, applying cosine similarity,

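from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Turn each title's cleaned text into a vector of word counts.
CV = CountVectorizer()
converted_metrix = CV.fit_transform(df_new['cleaned_infos'])
# Pairwise cosine similarity between all titles.
# Note: this rebinds the name cosine_similarity from the function to the result matrix.
cosine_similarity = cosine_similarity(converted_metrix)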

The cosine similarity matrix looks like this
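On a small made-up corpus the structure of the matrix is easy to see. The three strings below are hypothetical stand-ins for rows of the cleaned_infos column; the sketch stores the result under a different name (similarity) so the imported function is not overwritten.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cleaned descriptions standing in for the cleaned_infos column.
docs = [
    "teen friends monster small town",
    "teen friends mystery small town",
    "chef travels world tasting street food",
]

counts = CountVectorizer().fit_transform(docs)  # sparse (3 x vocabulary) count matrix
similarity = cosine_similarity(counts)          # dense (3 x 3) similarity matrix

print(similarity.round(2))
```

Row i holds the similarity of title i to every other title. The diagonal is 1 because each title is identical to itself, the matrix is symmetric, and titles that share no words score 0.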

Now, say you want to watch a comedy movie or show:

df[df['description'].str.contains('comedy')]

will return the below result

results returned for comedy in description
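Note that str.contains is case-sensitive by default, so a search for 'comedy' skips descriptions that only contain 'Comedy'. A hedged variant on made-up rows, using the case and na parameters of pandas' str.contains:

```python
import pandas as pd

# Hypothetical sample rows; the last description is missing on purpose.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": [
        "A dark comedy about a family reunion.",
        "Comedy special filmed live.",
        "A detective hunts a serial killer.",
        None,
    ],
})

# case=False matches 'comedy' and 'Comedy'; na=False treats missing text as no match.
matches = sample[sample["description"].str.contains("comedy", case=False, na=False)]
print(matches["title"].tolist())
```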

And if you want to watch a movie similar to a movie you have watched:

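# This is how we get the id of the movie so we can check the similarity between it and other movies.
# Assumes df has an 'id' column holding each row's position in the similarity matrix.
title = 'Stranger Things'
movie_id = df[df['title'] == title]['id'].values[0]

# Pair every title id with its similarity to the chosen title, highest score first.
score = list(enumerate(cosine_similarity[movie_id]))
sorted_score = sorted(score, key=lambda x: x[1], reverse=True)

# Skip the first entry, which is normally the chosen title itself.
sorted_score = sorted_score[1:]
sorted_score[0:10]

# Print the 5 most similar titles.
for i, item in enumerate(sorted_score[:5], start=1):
    movie_title = df[df['id'] == item[0]]['title'].values[0]
    print(i, movie_title)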

Running this prints the top 5 recommendations for the chosen title.
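The lookup and ranking steps can also be wrapped in a small reusable function. This is only a sketch: the name recommend, its parameters, and the toy titles and scores are all hypothetical, and it assumes the list of titles is in the same order as the rows of the similarity matrix.

```python
import numpy as np

def recommend(title, titles, similarity, top_n=5):
    """Return up to top_n titles most similar to `title`.

    titles: list of titles, where titles[i] corresponds to row i of similarity.
    similarity: square matrix of pairwise cosine similarities.
    """
    idx = titles.index(title)                        # raises ValueError if unknown
    scores = np.asarray(similarity[idx], dtype=float)
    order = np.argsort(-scores, kind="stable")       # highest score first
    ranked = [i for i in order if i != idx][:top_n]  # skip the title itself
    return [titles[i] for i in ranked]

# Hypothetical titles and similarity values, for illustration only.
titles = ["Show A", "Show B", "Show C", "Show D"]
similarity = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.4, 0.3, 0.0, 1.0],
])

print(recommend("Show A", titles, similarity, top_n=2))
```

Excluding the chosen title by its index, rather than dropping the first sorted entry, avoids returning it by mistake when another title happens to have an identical score.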



Written by Saleena John