Quantifying LA Cuisine
Comparing cuisines in Los Angeles by measures of popularity and cleanliness


Preface

Recently I came across a dataset on LA County's open data portal containing health inspection results for the majority of restaurants and markets in the greater Los Angeles area. As a lover of culinary experiences and food in general, my interest was piqued immediately. What could this dataset tell me about my favorite restaurants and cuisines? Only one way to find out!

Initial Exploration

The health inspection dataset from LA's data portal has just over 390k observations of 18 variables, each row containing the details of a single recorded health violation. Here's a sample of the variables that I'll be using in this analysis.

ACTIVITY_DATE NAME SITE_CITY SITE_ZIP VIOLATION_CODE SCORE
1 2014-07-29 TIN TIN ROSEMEAD 91770 16F027 87
2 2014-08-05 LA PAZ BAKERY CANOGA PARK 91304 16F009 91
3 2014-08-28 LA ZOO LOS ANGELES 90027 16F009 90
4 2014-08-28 LA ZOO LOS ANGELES 90027 16F004 90
5 2014-07-01 CRIMINAL COURT BLDG - KITCHEN LOS ANGELES 90012 16F004 91

Since I'm interested in working with this data at the inspection level (slightly less granular than individual violations), the first step will be to aggregate each inspection by business name, location, and assessed health score. I'll also add a "violations" variable to count the number of violations an establishment received. Since some locations have been inspected multiple times, I've decided to keep only the most recent inspection for each establishment. First, let's examine the relationship between health scores and violation counts.

library(dplyr)

data <- read.csv("lac_violation_data.csv")

data <- data %>%
  # roll individual violations up to the inspection level
  group_by(NAME, ACTIVITY_DATE, SITE_CITY, SITE_ZIP) %>%
  summarise(SCORE = mean(SCORE),
            VIOLATIONS = n()) %>%
  # keep only the most recent inspection for each establishment
  group_by(NAME, SITE_CITY, SITE_ZIP) %>%
  arrange(desc(ACTIVITY_DATE)) %>%
  slice(1) %>%
  select(NAME, SITE_CITY:VIOLATIONS)

summary(lm(VIOLATIONS ~ SCORE, data = data))
Call:
lm(formula = VIOLATIONS ~ SCORE, data = data)

Residuals:
   Min     1Q Median     3Q    Max
-6.230 -0.600 -0.109  0.564 31.678

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 62.200104   0.191456   324.9   <2e-16 ***
SCORE       -0.612127   0.002039  -300.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.407 on 37107 degrees of freedom
Multiple R-squared:  0.7084,    Adjusted R-squared:  0.7084
F-statistic: 9.015e+04 on 1 and 37107 DF,  p-value: < 2.2e-16

An interesting distribution. It looks like the vast majority of inspections result in a score of 90 or higher, with a heavy concentration of scores falling right at the 90 cutoff. There are a handful of scores in the 80-90 range and virtually no scores under 75 (probably the point at which you're at risk of getting shut down). The strange spike at 90, along with the inspections scoring above 90 that deviate from the dominant linear trend, leads me to believe that there is some flexibility in the scoring and evaluation process.
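
For reference, the shape described above can be reproduced with a couple of quick plots. This is just a minimal sketch (assuming the ggplot2 package is installed and that data is the aggregated data frame from the previous step):

library(ggplot2)

# distribution of health scores (note the pile-up at the 90 cutoff)
ggplot(data, aes(x = SCORE)) +
  geom_histogram(binwidth = 1)

# violation counts against scores, with the fitted regression line overlaid
ggplot(data, aes(x = SCORE, y = VIOLATIONS)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth(method = "lm")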

Gathering More Data

After aggregating the dataset we are down to about 37k observations with the following five variables:
Name, City, Zip, Score, Violations
I would like to add cuisine type to the dataset so that I can determine whether certain cuisines tend to score higher than others. Since manually reviewing 37k different places to add cuisine labels is out of the question, I turned to Yelp's API for help. Yelp's business categorization labels are exactly what I'm looking for, and even though I may only be able to retrieve a subset of the entire dataset, that's definitely preferable to the alternative! Yelp's API also provides other data that might be interesting (lat/long coordinates, ratings, reviews), so I'll try to grab these attributes as well. First I'm going to move my dataset into MySQL to help facilitate chunking up the dataset for API calls and to efficiently pipe data between R and Python.

library(RMySQL)

mydb = dbConnect(MySQL(), user='xxxx',
              password='xxxx',
              dbname='scraping',
              host='localhost')

dbWriteTable(mydb,
             name='la_health',
             value=data,
             field.types=list(NAME="VARCHAR(100)",
                              SITE_CITY="VARCHAR(50)",
                              SITE_ZIP="VARCHAR(20)",
                              SCORE="INT(7)",
                              VIOLATIONS="INT(7)"),
             row.names=FALSE)

Now we'll bring the data into our Python environment in chunks using MySQL's LIMIT and OFFSET clauses, then apply some cleaning functions to the name, city, and zip variables so that they can be used to make API calls.

import sqlalchemy
import pandas as pd
import re

db = sqlalchemy.create_engine('mysql+pymysql://xxxx:xxxx@localhost:3306/scraping')

# pull one chunk at a time; subsequent chunks add an OFFSET to the query
la_data = pd.read_sql("SELECT * FROM la_health ORDER BY NAME LIMIT 5000", db)

def cleantxt(text):
    # strip trailing store numbers (e.g. "#2"), drop punctuation,
    # and join words with '+' for use in the API query string
    text2 = re.sub(r'#[\s0-9][0-9]*$', '', text)
    text2 = re.sub(r'[^A-Za-z0-9]+', ' ', text2)
    text2 = re.sub(' ', '+', text2)
    return text2

def cleanzip(zipcode):
    # keep only the five-digit zip
    ziptext = str(zipcode)
    ziptext = ziptext[:5]
    return ziptext

la_data['NAME_CLN'] = la_data['NAME'].map(cleantxt)
la_data['CITY_CLN'] = la_data['SITE_CITY'].map(cleantxt)
la_data['ZIP_CLN'] = la_data['SITE_ZIP'].map(cleanzip)

Yelp uses OAuth and returns API results as JSON, so we'll use the requests_oauthlib and json libraries to help manage this. The getYelp function makes a call to the Yelp API with a business name, city, and zipcode. Not all of the results will be a match, so we need to validate them. We check the call result for zipcode and city compatibility, and successful calls receive further processing. Fuzzy string matching is used to score the similarity of the original name against Yelp's returned name, to help weed out false matches at a later point. The function returns a dictionary of original and Yelp-supplied attributes that we will use to form an updated dataset.

import json
from requests_oauthlib import OAuth1Session
from fuzzywuzzy import fuzz

# Yelp OAuth credentials are defined elsewhere
oauth = OAuth1Session(client_key, client_secret, resource_owner_key, resource_owner_secret)

def getYelp(name, city, zipcode, violations, score):
    try:
        # search for the single best match near the given city/zip
        r = oauth.get(url=r'https://api.yelp.com/v2/search?term='
            +name+r'&location='+city+r'+'+zipcode+r'&limit=1')
        obj = r.json().get('businesses')[0]
        yelpName = obj['name'].upper()
        yelpCity = obj['location']['city'].upper()
        yelpZip = obj['location']['postal_code']
        # only accept results whose zip and city agree with the inspection record
        if yelpZip == zipcode and cleantxt(yelpCity) == city:
            # fuzzy similarity between the two names, used to weed out false matches later
            matchscore = fuzz.ratio(cleantxt(yelpName), name)
            catlist = []
            for x in obj['categories']:
                # first element of each category pair is the display name
                catlist.append(x[0])
            # businesses may have fewer than three categories
            try:
                yelpCat2 = catlist[1]
            except IndexError:
                yelpCat2 = "NA"
            try:
                yelpCat3 = catlist[2]
            except IndexError:
                yelpCat3 = "NA"
            yelpdict = {'yelpCat1' : catlist[0],
                        'yelpCat2' : yelpCat2,
                        'yelpCat3' : yelpCat3,
                        'yelpCity' : yelpCity,
                        'yelpZip' : yelpZip,
                        'yelpName' : yelpName,
                        'laName' : name,
                        'yelpLatitude' : obj['location']['coordinate']['latitude'],
                        'yelpLongitude' : obj['location']['coordinate']['longitude'],
                        'yelpRating' : obj['rating'],
                        'yelpReviews' : obj['review_count'],
                        'laViolations' : violations,
                        'laScore' : score,
                        'matchScore' : matchscore
                       }
            return yelpdict
    except Exception:
        print("request failed")
        pass

To stay within Yelp's rate limits, I ran the getYelp function over a couple of days against 3k-5k records at a time and appended the results to a new table in the database.

yelpList = []
for row in range(len(la_data)):
    data = la_data.loc[row, ]
    n = data['NAME_CLN']
    c = data['CITY_CLN']
    z = str(data['ZIP_CLN'])
    v = data['VIOLATIONS']
    s = data['SCORE']
    gy = getYelp(n, c, z, v, s)
    if gy is not None:
        yelpList.append(gy)

# append this chunk's results to the yelp_data table
yelpdf = pd.DataFrame(yelpList)
yelpdf.to_sql('yelp_data', db, if_exists='append')

Of the 37k+ observations, about 55% were matched on Yelp, and I concluded that about 83% of these matches were valid by applying a cutoff threshold to the fuzzy match score of the business name. The resulting dataset has just under 17k observations. A higher percentage of matches would have been nice, but I feel that 17k observations is still a sufficient sample of the original population (albeit not exactly random).
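
For illustration, the validity filter boils down to something like the sketch below once the Yelp table is read back into R (as yelp_df in the next section). The cutoff of 80 is just a placeholder; the actual threshold I used isn't reproduced here.

# keep only rows whose fuzzy name-match score clears the cutoff
# (80 is a placeholder value, not the threshold actually used)
yelp_valid <- yelp_df %>%
  filter(matchScore >= 80)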

Developing a Metric

At first I was seeking to compare cuisines by health scores alone, but now we have Yelp ratings and review counts to work with as well. Since cleanliness and popularity are both valuable traits, creating a metric that encompasses both was a logical next step. Through regression and experimentation I created a weighted score that is bounded between 0 and 1 and takes the following inputs (summarized in the formula after the list):

  1. Health Score: Mapped to an exponential function (perfect/near-perfect scores are rewarded immensely)
  2. Violation Count: Health score modifier mapped to a linear function. Receiving more violations than is typical for a given score will lower the overall health score, and vice versa
  3. Yelp Rating: Mapped to a linear function using Yelp's 1-5 star rating system
  4. Yelp Review Count: Yelp rating modifier mapped to a logarithmic function. Very few reviews will lower the Yelp rating significantly, but the penalty has increasingly less effect after a certain threshold
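
Putting those pieces together, the weighted score computed in the code below works out to the following (a restatement of the R code, with the violation coefficients taken from the earlier regression; s = health score, v = violation count, r = Yelp rating, n = review count):

$$\text{health} = \frac{b^{\,s}}{b^{100}} + \frac{(62.201 - 0.6121\,s) - v}{10\,(62.201 - 0.6121\,s)}, \qquad b \approx 1.1659$$

$$\text{popularity} = \frac{r}{5} \times 0.1162\,\ln(1.0898\,n)$$

$$\text{weightedScore} = \frac{\text{health} + \text{popularity}}{2}$$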

Now I'll create a new variable "weighted score" using the logic above and plot the distribution.

yelp <- dbSendQuery(mydb, 'SELECT * FROM yelp_data')
yelp_df <- fetch(yelp, n = -1)

# refit the violations ~ score regression to supply the expected-violation coefficients
lm(VIOLATIONS ~ SCORE, data = data)

coeff_score_b <- 1.1659144
coeff_score_a <- 1 / (coeff_score_b ^ 100)  # scales a perfect score of 100 to 1
coeff_vio_yint <- 62.201                    # intercept from the regression above
coeff_vio_score <- -.6121                   # slope from the regression above
coeff_reviews_a <- .1162
coeff_reviews_b <- 1.0898

yelp_df <- yelp_df %>%
  mutate(transfScore = coeff_score_a * (coeff_score_b ^ laScore),          # exponential health score
         transfVio1 = coeff_vio_yint + (coeff_vio_score * laScore),        # expected violations at this score
         transfVio2 = (transfVio1 - laViolations) / transfVio1 / 10,       # reward/penalty vs. expectation
         transfRate = yelpRating / 5,                                      # rating rescaled to 0-1
         transfRev = coeff_reviews_a * log(coeff_reviews_b * yelpReviews), # review-count modifier
         weightedScore = ((transfScore + transfVio2) +
             (transfRate * transfRev)) / 2) %>%
  select(index:yelpZip, weightedScore)

Nice, now we have a normally distributed metric that can be used to determine which Los Angeles cuisines tend to serve up the tastiest food in the most sanitary environments.

Visualizing the Results

Now I'll compare the scores of a variety of cuisine types by plotting density distributions. The mean weighted score is approximately 0.4 so I'll draw a vertical line on each density plot to make comparison easier. Distributions concentrated to the right of the line should indicate higher scoring cuisines and distributions concentrated to the left should indicate lower scoring cuisines on average. The plotting code for the image below and all other graphics in this analysis can be viewed on github.
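
The published plotting code lives on GitHub as noted above; as a rough sketch of the idea, something like the following would produce the faceted density plots with the reference line (assuming ggplot2 and that yelpCat1 holds the cuisine label):

library(ggplot2)

# density of weighted scores for each cuisine category,
# with a reference line at the overall mean (~0.4)
ggplot(yelp_df, aes(x = weightedScore)) +
  geom_density(fill = "grey80") +
  geom_vline(xintercept = mean(yelp_df$weightedScore), linetype = "dashed") +
  facet_wrap(~ yelpCat1) +
  labs(x = "Weighted score", y = "Density")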

Interesting! European cuisines like Greek, French, and Spanish seem to score a little higher than average, while many Asian cuisines tend to have lower scores on average. Asian fusion and Japanese are exceptions to this observation, with relatively standard-looking distributions. Overall, Greek and Healthy establishments tend to score the highest using this metric, while Markets and Southeast Asian food have the lowest average scores.

Let's wrap up this analysis by mapping the data colored by score to see if this new metric reveals any regional trends.
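
Again, the actual mapping code is on GitHub; a bare-bones sketch of the idea (no basemap, just the Yelp coordinates colored by weighted score, assuming the latitude/longitude columns survived into yelp_df):

# plot establishments by location, colored by weighted score
ggplot(yelp_df, aes(x = yelpLongitude, y = yelpLatitude, color = weightedScore)) +
  geom_point(alpha = 0.4, size = 0.8) +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", color = "Weighted score")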