Ingredient Networks
Analyzing a corpus of 30k+ beer recipes
Preface

If you're a fan of beer, you know that there are a wide variety of styles to choose from. You may also know that there are thousands of different ingredients that can go into these styles, each producing distinct colors, bodies and flavors. There are multiple tools that homebrewers use to manage their recipes, and one of the most prominent is BeerSmith. BeerSmith's online recipe database has over 34K recipes that contain detailed information on beer characteristics and ingredients used. As an amateur homebrewer and professional fan of beer, I thought it would be interesting to collect these recipe details and learn about the preferences of BeerSmith's homebrewer community.

Building a Scraper

BeerSmith has a "recipes by style" webpage that I will use as the jumping-off point for my scraper. Since all of the links to style pages on this page follow a specific pattern, the function below uses the findAll method from Python's BeautifulSoup package and a simple regex search to seek them out and return them as a list.

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.error import HTTPError, URLError
import re

def scrapeStyleUrls(url):
    try:
        req = Request(url, headers={'User-Agent' : "Magic Browser"})
        html = urlopen(req)
    except HTTPError as htper:
        return htper
    except URLError as urler:
        return urler
    bsObj = BeautifulSoup(html.read(),'lxml')
    # style links all point at the recipe search page, so match on that url pattern
    urls = [x["href"] for x in bsObj.findAll("a",
        href = re.compile(r"^(http://beersmithrecipes.com/searchrecipe\?term=).*$"))]
    return urls

The next piece of the scraper needs to visit the style page links collected in the previous section and search for links to the actual recipes. The style pages are paginated, so the function takes a recursive approach: append all recipe urls on the current page to the url list, search for a "Next Page" button, call the function again with the next page's link, and repeat until no next page button is found.

def scrapeRecipeUrls(url):
    try:
        req = Request(url, headers={'User-Agent' : "Magic Browser"})
        html = urlopen(req)
    except HTTPError as htper:
        return htper
    except URLError as urler:
        return urler
    try:
        bsObj = BeautifulSoup(html.read(),'lxml')
        # append every recipe link on the current page to the global recipeurls list
        for x in bsObj.findAll("a", {"title":"View Recipe"}):
            recipeurls.append(x["href"])
    except:
        print("index failure")
    try:
        # recurse into the next page; raises when no "Next Page" link exists
        nxtPg = bsObj.find("a", text = "Next Page >>")
        scrapeRecipeUrls(nxtPg["href"])
    except:
        print("End of style - "+str(url))

Now we need a function to scrape the recipe information from each of the urls collected above. A quick study of a single recipe page reveals that almost everything we're looking for is contained in two tables - one with recipe information (color, bitterness, ABV, etc.) and one with a list of ingredients. In order to reduce redundant information in the final dataset, I decided to keep this two-table structure and use the recipe's ID number as a primary/foreign key. The recUrl regex is used to extract the recipe ID and a clean version of the recipe name from the url itself, and the recTbl regex is used to extract key/value pairs from the recipe info table, since that table does not use headers to define its field names. The function returns recipe information as a dictionary and recipe ingredients as a list of dictionaries.

recTbl_regex = re.compile(
    r'(^<td><b>(?P<header>[A-Za-z]+\s?[A-Za-z]*):\s?</b>\s?(?P<data>[^<]+)</td>$)')
recUrl_regex = re.compile(
    r'(^http://beersmithrecipes.com/viewrecipe/(?P<recid>[0-9]+)/(?P<recname>[a-z0-9\-]+))')

def scrapeRecipe(url):
    try:
        req = Request(url, headers={'User-Agent' : "Magic Browser"})
        html = urlopen(req)
    except HTTPError as htper:
        return htper
    except URLError as urler:
        return urler
    try:
        # pull the recipe ID and clean name out of the url itself
        recDict = {}
        urlparse = recUrl_regex.search(str(url))
        recDict['Rec_ID'] = str(urlparse.groupdict().get('recid'))
        recDict['Rec_Name'] = str(urlparse.groupdict().get('recname'))
        bsObj = BeautifulSoup(html.read(),'lxml')
        recType = bsObj.find("h3")
        recDict['Type'] = recType.get_text()
        # recipe info table: extract key/value pairs from each <td>
        recTbl = bsObj.find("table", {"class": "r_hdr"}).findAll("td")
        for x in recTbl:
            tdparse = recTbl_regex.search(str(x))
            if tdparse is not None:
                header = str(tdparse.groupdict().get('header'))
                data = str(tdparse.groupdict().get('data'))
                recDict[header] = data
        # ingredient table: one dictionary per ingredient row, keyed back to the recipe ID
        ingList = []
        ingTbl = bsObj.find("table", {"class" : "recipes"}).findAll("tr")
        for x in ingTbl:
            try:
                ingList.append({"Ingredient" : x.findAll('td')[1].get_text(),
                               "Ing_Type" : x.findAll('td')[2].get_text(),
                               "Rec_ID" : str(urlparse.groupdict().get('recid'))})
            except:
                pass
        return(recDict, ingList)
    except:
        print("recipe error "+str(url))
        pass

Now we're ready to scrape! After running the scrapeStyleUrls, scrapeRecipeUrls, and scrapeRecipe functions, we'll have a list of recipe dictionaries with recipe attributes and a list of ingredient dictionaries with ingredient attributes and recipe IDs.

styleUrls = scrapeStyleUrls(r'http://beersmithrecipes.com/styles')

recipeurls = []
for url in styleUrls:
    scrapeRecipeUrls(url)

recipeList = []
ingredientList = []
for url in recipeurls:
    try:
        recDict,ingList = scrapeRecipe(url)
        recipeList.append(recDict)
        for ingDict in ingList:
            ingredientList.append(ingDict)
    except:
        pass

Recipe Collection Analysis

The scraper was able to collect 30,267 recipes in total, each with 19 different attributes. After briefly scanning the dataset I found a couple of things that needed to be addressed before moving on with the analysis. First, the scraper picked up a handful of duplicate recipes that need to be removed. Second, the values of some variables led me to believe that a few of the recipes in this dataset are bogus. This makes sense because the recipes are user submitted and I'm fairly certain that BeerSmith does not maintain a strict review/approval process for their database. Discovering this also seemed like a perfect time to pare down the dataset to the variables that I'm truly interested in. I settled on four primary variables of interest that could also help me weed out unrealistic recipes.

  1. ABV - Alcohol by volume expressed as a percentage. Realistic values fall between 2% and 25%.
  2. Bitterness - Expressed in international bitterness units (IBUs). Realistic values fall between 0 and 150.
  3. Color - Expressed via the standard reference method (SRM). Realistic values fall between 0 and 100.
  4. Style Master - Generalized beer style of the recipe.

ABV, color and bitterness values in the dataset contain some unwanted text so below I'll clean them up, convert them to numeric, and filter out records that do not meet the criteria above.

library(dplyr)

recipes <- read.csv("data/bsrecipes.csv")

# clean and select fields of interest
recipes <- recipes %>%
  group_by(Rec_ID) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  mutate(ABV_pct = as.numeric(gsub("%","",ABV)),
         Bitterness_ibu = as.numeric(gsub(" IBUs","",Bitterness)),
         Color_srm = as.numeric(gsub(" SRM","",Color))) %>%
  select(ABV_pct, Bitterness_ibu, Color_srm, Style_Master, Rec_Type = Type)

# remove unrealistic recipes
recipes <- recipes %>%
  filter(ABV_pct >= 2 & ABV_pct <= 25,
         Bitterness_ibu > 0 & Bitterness_ibu <= 150,
         Color_srm < 100)

After cleaning we are down to a final count of 26,578 recipes. Now let's find out which styles are the most popular. I'll utilize SRM values to color my visualizations, so I've mapped some approximate hex values that I grabbed from here to my dataset. The plotting code isn't shown below but can be viewed on github.
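For a rough idea of the approach, here's a minimal ggplot2 sketch of the style count chart; the srm_hex lookup and its hex values are stand-ins, not the actual mapping or plotting code from the repo.

library(dplyr)
library(ggplot2)

# hypothetical SRM -> hex lookup; these hex values are rough stand-ins for the
# approximate values mapped from the reference chart
srm_hex <- data.frame(
  srm_bin = 0:40,
  hex = colorRampPalette(c("#F8F4B4", "#F39C00", "#8D4C32", "#0F0B0A"))(41),
  stringsAsFactors = FALSE)

# count recipes per style, attach an approximate color, keep the top 13 styles
style_counts <- recipes %>%
  group_by(Style_Master) %>%
  summarise(count = n(), avg_srm = mean(Color_srm)) %>%
  mutate(srm_bin = pmin(round(avg_srm), 40)) %>%
  left_join(srm_hex, by = "srm_bin") %>%
  arrange(desc(count)) %>%
  slice(1:13)

ggplot(style_counts, aes(x = reorder(Style_Master, count), y = count, fill = hex)) +
  geom_col() +
  scale_fill_identity() +
  coord_flip() +
  labs(x = NULL, y = "Number of recipes")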

It seems that IPAs are the overwhelming favorite on BeerSmith, and it's interesting to note that 13 styles account for 85% of all recipes in the database. Even though beers with mid-range SRMs are the most popular (thanks to IPAs and pale ales), there still seems to be plenty of love for dark beers, with stouts and porters taking third and fifth place respectively.

Now let's take a look at preferences related to alcohol level and bitterness by creating a scatterplot. We'll plot all of the recipes as points and overlay aggregate measurements for popular styles (> 200 recipes) as rectangular labels to see where each falls on average.
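A minimal sketch of that scatterplot, assuming the cleaned recipes data frame from above (the actual plotting code is in the repo):

library(dplyr)
library(ggplot2)

# average ABV/IBU labels for styles with more than 200 recipes
style_avgs <- recipes %>%
  group_by(Style_Master) %>%
  filter(n() > 200) %>%
  summarise(ABV_pct = mean(ABV_pct), Bitterness_ibu = mean(Bitterness_ibu))

ggplot(recipes, aes(x = ABV_pct, y = Bitterness_ibu)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_label(data = style_avgs, aes(label = Style_Master)) +
  labs(x = "ABV (%)", y = "Bitterness (IBUs)")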

An interesting spread. Light beers like pilsners and kolsches are expectedly tame and have a fairly limited range of both bitterness and alcohol content. Dark beers seem to have the most variation in alcohol content, while beers with mid-range SRMs come in a wide variety of bitterness levels. All in all, the sweet spot for this corpus seems to be around 3%-9% ABV with 15-45 IBUs.

Visualizing the Ingredients

I explored a number of options for digging into the ingredients portion of this dataset and concluded that network diagrams would be a great way to see which ingredients are the most popular while also getting an idea of how they are typically used together.

In order to build a network visualization for each style of beer, I need to transform the ingredients of each recipe into a dataset of nodes and edges. Each node will be an ingredient, and I want the size of the node to correspond with how often the ingredient appears in the recipe corpus. The size & charge of the edges between nodes will then correspond to the frequency with which two ingredients appear together. In order to count these appearances, I'll increment the value of an edge by one each time two ingredients are used together. For example, a recipe with three ingredients would have three pairs counted - (ingredient 1, ingredient 2) / (ingredient 1, ingredient 3) / (ingredient 2, ingredient 3).
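Base R's combn function, which the edge-building code further down relies on, produces exactly these pairs; for a three-ingredient recipe:

# each column of the result is one ingredient pair
combn(c("ingredient 1", "ingredient 2", "ingredient 3"), 2)
#      [,1]           [,2]           [,3]
# [1,] "ingredient 1" "ingredient 1" "ingredient 2"
# [2,] "ingredient 2" "ingredient 3" "ingredient 3"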

The final output needs to be a json object that I can feed to the D3 powered visualization I'll be building. The commented code chunk below shows the process of converting the raw csv files into a json object with R.

library(dplyr)
library(tidyr)

ingredients <- read.csv("data/bsingredients.csv", stringsAsFactors = F)
recipes <- read.csv("data/bsrecipes.csv", stringsAsFactors = F)

### Remove duplicates / select cols for join
recipes <- recipes %>%
  group_by(Rec_ID) %>%
  filter(row_number() == 1) %>%
  select(Rec_ID, Style_Master)


### Join style_master to ingredients, fix names
ingredients <- ingredients %>%
  left_join(recipes, by = 'Rec_ID') %>%
  filter(!(is.na(Style_Master))) %>%
  select(ing_type = Ing_Type, ing = Ingredient, ing_simple = Ingredient_simple,
         rec_id = Rec_ID, style = Style_Master)

### style_df dataframe imported from beersmith_recipes.r, used to keep only the most frequent styles
style_df <- style_df %>%
  arrange(desc(count)) %>% slice(1:15) %>% select(Style_Master)

### join style_df table to filter out low volume styles
### remove combos with less than 20 appearances to prune nodes & filter some edge cases
ingredient_nodes <- ingredients %>%
  group_by(ing_simple, ing_type, style) %>%
  filter(ing_type %in% c("Grain","Hops","Yeast")) %>%
  summarise(count = n()) %>%
  inner_join(style_df, by = c('style' = 'Style_Master')) %>%
  arrange(desc(count)) %>%
  filter(count > 20,
         ing_simple != "None",
         !(ing_simple == "Brewer" & ing_type == "Grain"),
         !(ing_simple == "Crystal" & ing_type == "Grain"))

### Manual cleanup in Excel to merge similar nodes
# in_cleanup <- ingredient_nodes %>%
#   group_by(ing_simple, ing_type) %>%
#   summarise()
#
# write.csv(in_cleanup, 'in_cleanup.csv', row.names = F)
in_cleanup <- read.csv('data/in_cleanup.csv', stringsAsFactors = F)

### join cleaned ingredient column, recalc the count, spread to table format by style
ingredient_nodes <- ingredient_nodes %>%
  left_join(in_cleanup, by = c('ing_simple','ing_type')) %>%
  filter(ing_final != '') %>%
  group_by(ing_final, ing_type, style) %>%
  summarise(count = sum(count))

### check for overlapping names regardless of ing_type
ingredient_nodes %>% group_by(ing_final) %>% summarise(count = n()) %>% arrange(desc(count))

### filter out ingredient/style combinations not in node list + add ing_final
pre_edge_df <- ingredients %>%
  inner_join(in_cleanup, by = c('ing_simple','ing_type')) %>%
  filter(ing_final != '') %>%
  inner_join(ingredient_nodes, by = c('ing_final','style')) %>%
  select(ing_final, rec_id, style) %>%
  group_by(ing_final, rec_id, style) %>%
  summarise() %>%
  arrange(rec_id,ing_final)

### dplyr filter too slow (3.6 sec per recipe); create min/max row indexing of recipes instead
rec_index_df <- pre_edge_df %>%
  ungroup() %>%
  mutate(row_num = row_number()) %>%
  group_by(rec_id) %>%
  mutate(min_row = min(row_num),
         max_row = max(row_num)) %>%
  group_by(rec_id, min_row, max_row) %>%
  summarise()

### Function for getting combo dataframes and adding to final list
get_combos <- function(min_idx, max_idx) {

  filtered <- pre_edge_df[min_idx:max_idx,]
  rec_style <- unique(filtered$style)

  combo_matrix <- combn(filtered$ing_final, 2)

  combo_df <- as.data.frame(t(combo_matrix), stringsAsFactors = F) %>%
    mutate(style = rec_style)

  return(combo_df)
}
### Iterate through each recipe's min/max in rec_index_df and get combos
### add combos to final list and concatenate
combo_list <- list()
i <- 1

for(row_idx in seq_len(nrow(rec_index_df))){

  min <- as.numeric(rec_index_df[row_idx, 2])
  max <- as.numeric(rec_index_df[row_idx, 3])

  if(max > min){
    combo_list[[i]] <- get_combos(min,max)
    i <- i + 1
    }
}
combo_df_final <- bind_rows(combo_list)

ingredient_edges <- combo_df_final %>%
  group_by(V1, V2, style) %>%
  summarise(count = n())

saveRDS(ingredient_nodes, 'data/ingredient_nodes.rds')
saveRDS(ingredient_edges, 'data/ingredient_edges.rds')

# ### Nodes and edges complete ###

library(jsonlite)
# Example of how to create a json object to feed the force simulation
write_json(list(nodes = filter(ingredient_nodes, style == 'Porter'),
                links = filter(ingredient_edges, style == 'Porter')), 'data/porter.json')

The network visualization below was created using R Shiny and D3.js. Shiny passes json objects of nodes and edges to D3, and a force simulation is created in the browser. The repo for this visualization can be found on github and a full screen version can be viewed here.
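As a rough illustration of that hand-off, here's a minimal Shiny server sketch; the style input, the "network_data" message type, and the JavaScript handler they imply are hypothetical names, not the actual code from the repo.

library(shiny)
library(jsonlite)
library(dplyr)

# load the node and edge tables built above
ingredient_nodes <- readRDS('data/ingredient_nodes.rds')
ingredient_edges <- readRDS('data/ingredient_edges.rds')

# hypothetical sketch: when the selected style changes, send that style's nodes
# and edges to the browser, where a handler registered with
# Shiny.addCustomMessageHandler("network_data", ...) rebuilds the D3 force simulation
server <- function(input, output, session) {
  observeEvent(input$style, {
    payload <- list(nodes = filter(ingredient_nodes, style == input$style),
                    links = filter(ingredient_edges, style == input$style))
    session$sendCustomMessage("network_data", toJSON(payload))
  })
}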