Exploratory Analysis of Yelp API Data

Saleena John
Web Mining [IS688, Spring 2021]
Apr 5, 2021


Yelp is an application that lets users rate and review businesses and events. Customers rate on a scale of one to five stars. Yelp helps customers find well-reviewed businesses close to them. I got the idea for this project while my friends and I were searching for good places to eat on our road trip to Washington DC, where we will be staying for three days. This project therefore focuses on the food and restaurant categories of the Yelp Web API.

Yelp Web API

Before describing the Yelp API, let me give the gist of what an API is. API stands for Application Programming Interface. It is a software intermediary that allows applications to communicate with each other. In software engineering, APIs help developers integrate software components without writing code from scratch. Data analysts, on the other hand, can use the data an API exposes to support decisions and predictions. Numerous companies make a selected set of their data available to the public via an API. Yelp is one such company.

To access the data, you need to create a Yelp developer account, which provides an API key and a client ID.

I am using Yelp’s Fusion API for this project. The Fusion documentation provides detailed steps on how to make calls to its various endpoints.
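As a minimal sketch of how the API key might be handled: Fusion expects the key in an `Authorization: Bearer ...` header. The environment variable name `YELP_API_KEY` and both helper names are assumptions made for illustration, not something the original notebook is known to use.

```python
import os


def load_api_key(env_var="YELP_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def auth_headers(api_key):
    """Build the Authorization header that Yelp Fusion expects."""
    return {"Authorization": f"Bearer {api_key}"}
```

Keeping the key in the environment rather than in the notebook avoids publishing it by accident when the notebook is shared.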

Data Collection

After creating the developer account, I used a Jupyter notebook to make calls to the API and extract data. The HTTP GET requests were made using the requests library in Python.
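A call of this kind could look roughly like the sketch below. The helper name `yelp_get` and its `http_get` parameter are hypothetical; the parameter exists only so the function can be tried with a stand-in instead of a live network call.

```python
import requests

API_BASE = "https://api.yelp.com/v3"


def yelp_get(path, headers, params=None, timeout=10, http_get=requests.get):
    """GET a Fusion endpoint such as "/categories" and return the parsed JSON."""
    response = http_get(
        f"{API_BASE}{path}", headers=headers, params=params, timeout=timeout
    )
    response.raise_for_status()  # surface 4xx/5xx errors instead of bad data
    return response.json()


# Example (needs a real key and network access):
# data = yelp_get("/categories", {"Authorization": "Bearer <your key>"})
```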

Making calls to API

This is the screenshot of the code I used.

First, I retrieved all the categories available in the Yelp Web API. The categories endpoint is described in the Fusion documentation.

I used a for loop to fetch up to 10,000 rows, 50 at a time.
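The loop could be sketched as follows. `fetch_page` is a hypothetical callable standing in for whatever makes the actual request; it takes an offset and a page size and returns a list of rows. Stopping on an empty page is an added assumption so the loop does not keep calling the API after the data runs out.

```python
def fetch_all(fetch_page, max_rows=10_000, page_size=50):
    """Collect rows page by page until max_rows is reached or a page is empty."""
    rows = []
    for offset in range(0, max_rows, page_size):
        page = fetch_page(offset, page_size)
        if not page:  # nothing left to fetch
            break
        rows.extend(page)
    return rows[:max_rows]
```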

Exploratory Analysis

Getting data from API in JSON format

The output of the code fragment above is in JSON format, as shown below.

JSON output fragment

The output lists all the business categories on Yelp. The main fields are alias, title, and parent_aliases. First, I wanted to find the top ten categories on Yelp. For that, I needed a CSV file with all the parent aliases.

The code above returns all the parent aliases and writes them into a CSV file named ‘parent_categories.csv’.

Next, I needed to find the top ten parent aliases and plot them as a bar chart.
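A minimal sketch of this step, assuming the response has a `categories` list whose entries each carry a `parent_aliases` list. The three sample entries and the `parent_alias` column header are illustrative, and the function names are hypothetical.

```python
import csv
import io


def collect_parent_aliases(categories):
    """Flatten the parent_aliases lists of all categories into one list."""
    parents = []
    for category in categories:
        parents.extend(category.get("parent_aliases", []))
    return parents


def write_parents(parents, handle):
    """Write one parent alias per row to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["parent_alias"])  # header name is an assumption
    writer.writerows([parent] for parent in parents)


# Illustrative sample, not real API output.
sample = [
    {"alias": "pizza", "title": "Pizza", "parent_aliases": ["restaurants"]},
    {"alias": "sushi", "title": "Sushi Bars", "parent_aliases": ["restaurants"]},
    {"alias": "bakeries", "title": "Bakeries", "parent_aliases": ["food"]},
]
buffer = io.StringIO()
write_parents(collect_parent_aliases(sample), buffer)
csv_text = buffer.getvalue()
```

To write the real file, open it with `open("parent_categories.csv", "w", newline="")` and pass that handle instead of the in-memory buffer.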

Top ten business categories

I used the value_counts() function from the pandas library to do this. Then, using the matplotlib library, I plotted the following bar chart.

Top ten business categories on Yelp
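The counting and plotting steps might look like the sketch below. The function names, axis labels, and the short sample series are assumptions; in practice the series would come from the parent-alias column of the CSV file.

```python
import pandas as pd


def top_parent_categories(parents, n=10):
    """Return the n most frequent parent aliases with their counts."""
    return parents.value_counts().head(n)


def plot_top_categories(counts, title="Top ten business categories on Yelp"):
    """Draw the counts as a bar chart and return the matplotlib Axes."""
    # Imported here so the counting step works without matplotlib installed.
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="bar")
    ax.set_xlabel("Parent category")
    ax.set_ylabel("Number of sub-categories")
    ax.set_title(title)
    plt.tight_layout()
    return ax


# Illustrative sample, not real data.
sample = pd.Series(
    ["restaurants", "food", "restaurants", "shopping", "restaurants", "food"]
)
counts = top_parent_categories(sample, n=2)
```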

I did this analysis to understand how to work with the API data.

Now for the main goal of the project: finding the five best restaurants in Washington DC. For this, I used the ‘Business search’ endpoint.

Using the requests library, I pulled restaurants within a 2-mile radius of the place where we were going to stay. The code fragment is as follows.
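As a rough illustration of such a request, the query parameters could be assembled like this. Fusion takes `radius` in meters and `price` as a level number, where "2" corresponds to ‘$$’. The function name and the location string in the usage comment are placeholders, not the author's actual values.

```python
METERS_PER_MILE = 1609.344


def build_search_params(location, radius_miles=2.0, price="2",
                        limit=50, offset=0, categories=None):
    """Assemble query parameters for the business search endpoint."""
    params = {
        "location": location,
        "radius": round(radius_miles * METERS_PER_MILE),  # API expects meters
        "price": price,    # "2" corresponds to '$$'
        "limit": limit,    # results per call
        "offset": offset,  # position of the first result to return
    }
    if categories:
        params["categories"] = categories
    return params


# Example: params = build_search_params("Washington, DC")
# These would be sent with a GET request to /businesses/search.
```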

For the business search endpoint, I used 1000 as the offset. The price was also limited to ‘$$’ since we were looking for economical places to eat. The code returned a JSON response. I used a for loop to extract all the fields required for the analysis.

The distance returned was in meters, which I converted to miles.
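A loop of this kind could be sketched as follows. The choice of fields (name, rating, review count, price, distance) is an assumption based on what the later analysis uses; the function name is hypothetical.

```python
def extract_fields(search_response):
    """Pull the fields needed for the analysis out of a search response."""
    rows = []
    for business in search_response.get("businesses", []):
        rows.append({
            "name": business.get("name"),
            "rating": business.get("rating"),
            "review_count": business.get("review_count"),
            "price": business.get("price"),
            "distance_m": business.get("distance"),  # meters from search point
        })
    return rows
```

Using `.get()` keeps the loop from failing when a business is missing an optional field such as price.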
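The conversion itself is a single division, since one mile is 1609.344 meters. A small helper (the name is hypothetical) might look like this:

```python
METERS_PER_MILE = 1609.344


def meters_to_miles(meters):
    """Convert a distance in meters to miles."""
    return meters / METERS_PER_MILE
```

For example, a restaurant reported at 3218.688 meters is exactly 2 miles away.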

I encountered two issues at this stage. The first was that the business search endpoint returns only 50 results per call, even with the offset set to 1000. I cannot increase the number of results a single call returns, but I came up with two ways around this limit: make multiple requests with different offsets and combine them, or refine the request by adding more parameters.

I decided to collect 100 restaurants within a 2-mile radius of the place of stay by making two calls to the API. I also added one more parameter, ‘categories’, to get more refined data.
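Combining the two calls could be sketched like this. `fetch_page` is a hypothetical callable that performs one search request for a given offset and limit and returns the parsed JSON. De-duplicating on the business `id` is an added precaution, in case the same business appears on both pages.

```python
def collect_businesses(fetch_page, target=100, page_size=50):
    """Fetch target results in page_size chunks and merge them by business id."""
    seen = {}
    for offset in range(0, target, page_size):  # offsets 0 and 50 by default
        page = fetch_page(offset=offset, limit=page_size)
        for business in page.get("businesses", []):
            seen.setdefault(business["id"], business)  # keep first occurrence
    return list(seen.values())
```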

In the data I collected, the number-of-reviews and rating columns caught my attention. I used Tableau to plot the number of reviews against the rating of each restaurant and got the following graph.

Number of Reviews Vs. Rating

From this chart, we can see that even though restaurants like Oyamel and El Centro have a higher number of reviews, their ratings are lower. The number of reviews simply adds weight to the ‘rating’ feature of a business. I decided to select restaurants in the middle ground. I dropped restaurants with many reviews but lower ratings, restaurants with few reviews, and chain restaurants, and then sorted the result from closest to farthest.

The second issue was that we wanted to explore local restaurants rather than chains, and the results returned by the code above include chain restaurants too. Yelp Fusion doesn’t have a setting to filter out chain restaurants. I identified the chain restaurants using the groupby() function and dropped those rows.
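One way to express this selection in pandas is sketched below. The thresholds are not stated in the analysis, so `min_reviews` and `min_rating` are hypothetical parameters, and the column names and sample rows are made up for illustration.

```python
import pandas as pd


def shortlist(df, min_reviews, min_rating):
    """Keep well-reviewed, well-rated restaurants, sorted closest first."""
    kept = df[(df["review_count"] >= min_reviews) & (df["rating"] >= min_rating)]
    return kept.sort_values("distance_miles").reset_index(drop=True)


# Illustrative rows, not real Yelp data.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "review_count": [3000, 40, 800, 600],
    "rating": [3.5, 4.5, 4.0, 4.5],
    "distance_miles": [0.4, 0.2, 1.1, 0.6],
})
selected = shortlist(sample, min_reviews=100, min_rating=4.0)
```

Here A is dropped for its rating and B for its low review count, leaving D and C in order of distance.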
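A possible version of this step treats any name that appears more than once in the results as a chain. That rule is an assumption: it misses a chain that has only one branch inside the search radius. The function name, column name, and sample rows are illustrative.

```python
import pandas as pd


def drop_chains(df, name_col="name"):
    """Remove every restaurant whose name occurs more than once."""
    occurrences = df.groupby(name_col)[name_col].transform("count")
    return df[occurrences == 1].reset_index(drop=True)


# Illustrative rows, not real Yelp data.
sample = pd.DataFrame({
    "name": ["Local Bistro", "Chain Grill", "Chain Grill", "Corner Cafe"],
    "rating": [4.5, 4.0, 3.5, 4.0],
})
locals_only = drop_chains(sample)
```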

I finalized six restaurants from this list.

  • Busboys and Poets — 450K
  • Il Canale
  • Roti
  • Ristorante Piccolo
  • Mexicue
  • Mi Vida Restaurante

Among these, we plan to visit only three, which we will choose based on reservation availability and dine-in options.
