A Delicious Analysis! (aka topic modelling using recipes)

A few months ago, I saw a link on Twitter to an awesome graph charting the similarities of different foods based on their flavour compounds, in addition to their prevalence in recipes (see the whole study, The Flavor Network and the Principles of Food Pairing).  I thought this was really neat and became interested in using the data for something slightly different: figuring out which ingredients tend to correlate across recipes.  I emailed one of the authors, Yong-Yeol Ahn, who is a real mensch by the way, and he let me know that the raw recipe data is readily available on his website!

Given my goal of looking for which ingredients correlate across recipes, I figured this would be the perfect opportunity to use topic modelling (here I use Latent Dirichlet Allocation or LDA).  Usually in topic modelling you have a lot of filtering to do (stop words, punctuation, and so on).  Not so with these recipe data, where all the words (ingredients) involved in the corpus are of potential interest, and there aren’t even any punctuation marks!  The topics coming out of the analysis would represent clusters of ingredients that co-occur with one another across recipes, and would possibly teach me something about cooking (of which I know precious little!).

All my code is at the bottom, so all you’ll find up here are graphs and my textual summary.  The first thing I did was to put the 3 raw recipe files together using Python.  Each file consisted of one recipe per line, with the cuisine of the recipe as the first entry on the line, and all other entries (the ingredients) separated by tab characters.  In my Python script, I separated out the cuisines from the ingredients and created two files: one for the recipes, and one for the cuisines of the recipes.
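
To make the file layout concrete, here is a minimal sketch of splitting one such line into its cuisine and its ingredients.  The function name `split_recipe_line` and the sample line are made up for illustration; this is not the script I actually ran (that one is at the bottom).

```python
def split_recipe_line(line):
    """Split one tab-separated recipe line into (cuisine, [ingredients])."""
    fields = line.rstrip("\n").split("\t")
    cuisine = fields[0]
    # Skip empty fields, e.g. those left behind by a trailing tab
    ingredients = [field for field in fields[1:] if field]
    return cuisine, ingredients


# Made-up sample line in the layout described above
sample = "Italian\ttomato\tgarlic\tbasil\n"
cuisine, ingredients = split_recipe_line(sample)
print(cuisine)      # the cuisine label
print(ingredients)  # the list of ingredients
```

Keeping the cuisine and the ingredients as separate values is what makes it easy to write them out to two parallel files, one line per recipe in each.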

Then I loaded up the recipes into R and got word/ingredient counts.  As you can see below, the 3 most popular ingredients were egg, wheat, and butter.  It makes sense, considering the fact that roughly 70% of all the recipes fall under the “American” cuisine.  I did this analysis for novelty’s sake, and so I figured I would take those ingredients out of the running before I continued on.  Egg makes me fart, wheat is not something I have at home in its raw form, and butter isn’t important to me for the purpose of this analysis!

[Figure: Recipe Popularity of Top 30 Ingredients]

Here are the top ingredients without the three filtered-out ones:

[Figure: Recipe Popularity of Top 30 Ingredients - No Egg, Wheat or Butter]
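
The counting and filtering behind these two charts can be sketched in a few lines of Python.  This is only an illustration of the idea, not what I ran (I did this step in R with the tm package); the helper names `ingredient_popularity` and `drop_ingredients` and the toy recipes are assumptions made up for the example.

```python
from collections import Counter


def ingredient_popularity(recipes):
    """Count how many recipes each ingredient appears in."""
    counts = Counter()
    for ingredients in recipes:
        counts.update(set(ingredients))  # set(): count each recipe at most once
    return counts


def drop_ingredients(recipes, unwanted):
    """Remove unwanted ingredients, discarding recipes that end up empty."""
    unwanted = set(unwanted)
    filtered = ([i for i in ingredients if i not in unwanted] for ingredients in recipes)
    return [ingredients for ingredients in filtered if ingredients]


# Toy recipes, made up for illustration (not taken from the real data)
toy_recipes = [
    ["egg", "wheat", "butter", "milk"],
    ["egg", "wheat", "vanilla"],
    ["egg", "onion", "garlic"],
    ["butter"],
]

popularity = ingredient_popularity(toy_recipes)
print(popularity.most_common(1))  # the single most widespread ingredient

trimmed = drop_ingredients(toy_recipes, ["egg", "wheat", "butter"])
print(trimmed)  # recipes with the three staples removed
```

Dropping recipes that become empty matters later on, since a recipe with no ingredients left has nothing to contribute to a topic model.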

Finally, I ran the LDA, extracting 50 topics and, for each one, its 5 most characteristic ingredients (the terms the model weights most heavily in that topic).  You can see the full complement of topics at the bottom of my post, but I thought I’d review some that I find intriguing.  You will, of course, find other topics intriguing, or some to be bizarre and inappropriate (feel free to tell me in the comment section).  First, topic 4:

[1] "tomato"  "garlic"  "oregano" "onion"   "basil"

Here’s a cluster of ingredients that seems decidedly Italian.  The ingredients seem to make perfect sense together, and so I think I’ll try them together next time I’m making pasta (although I don’t like tomatoes in their original form, just tomato sauce).

Next, topic 19:

[1] "vanilla" "cream"   "almond"  "coconut" "oat"

This one caught my attention, and I’m curious if the ingredients even make sense together.  Vanilla and cream make sense… Adding coconut would seem to make sense as well.  Almond would give it that extra crunch (unless it’s almond milk!).  I don’t know whether it would be tasty however, so I’ll probably pass this one by.

Next, topic 20:

[1] "onion"         "black_pepper"  "vegetable_oil" "bell_pepper"   "garlic"

This one looks tasty!  I like spicy foods and so putting black pepper in with onion, garlic and bell pepper sounds fun to me!

Next, topic 23:

[1] "vegetable_oil" "soy_sauce"     "sesame_oil"    "fish"          "chicken"

Now we’re into the meaty zone!  I’m all for putting sauces/oils onto meats, but putting vegetable oil, soy sauce, and sesame oil together does seem like overkill.  I wonder whether soy sauce shows up with vegetable oil or sesame oil separately in recipes, rather than linking them all together in the same recipes.  I’ve always liked the extra salty flavour of soy sauce, even though I know it’s horrible for you as it has MSG in it.  I wonder what vegetable oil, soy sauce, and chicken would taste like.  Something to try, for sure!
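
That question about soy sauce could be checked directly by tallying, among the recipes that contain soy sauce, how many also contain both oils, just one of them, or neither.  Here is a rough sketch of such a tally; the function `cooccurrence_breakdown` and the toy recipes are hypothetical, and I have not run this against the real data.

```python
def cooccurrence_breakdown(recipes, anchor, first, second):
    """Among recipes containing `anchor`, tally which of `first`/`second` also appear."""
    tally = {"both": 0, "only_first": 0, "only_second": 0, "neither": 0}
    for ingredients in recipes:
        present = set(ingredients)
        if anchor not in present:
            continue  # only recipes with the anchor ingredient are of interest
        has_first = first in present
        has_second = second in present
        if has_first and has_second:
            tally["both"] += 1
        elif has_first:
            tally["only_first"] += 1
        elif has_second:
            tally["only_second"] += 1
        else:
            tally["neither"] += 1
    return tally


# Toy recipes, made up for illustration (not taken from the real data)
toy_recipes = [
    ["soy_sauce", "vegetable_oil", "chicken"],
    ["soy_sauce", "sesame_oil", "fish"],
    ["soy_sauce", "vegetable_oil", "sesame_oil"],
    ["onion", "vegetable_oil"],
]

breakdown = cooccurrence_breakdown(toy_recipes, "soy_sauce", "vegetable_oil", "sesame_oil")
print(breakdown)
```

If "both" turned out small next to the two "only" counts on the real recipes, that would suggest the topic is stitching together two separate habits rather than describing one dish.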

Now, topic 26:

[1] "cumin"      "coriander"  "turmeric"   "fenugreek"  "lemongrass"

These are a whole lot of spices that I never use on my food.  Not for lack of wanting, but rather out of ignorance and laziness.  One of my co-workers recently commented that cumin adds a really nice flavour to food (I think she called it “Middle Eastern”).  I’ve never heard a thing about the other spices here, but why not try them out!

Next, topic 28:

[1] "onion"       "vinegar"     "garlic"      "lemon_juice" "ginger"

I tend to find that anything with an intense flavour can be very appetizing for me.  Spices, vinegar, and anything citric are what really register on my tongue.  So, this topic does look very interesting to me, probably as a topping or a sauce.  It’s interesting that ginger shows up here, as that neutralizes other flavours, so I wonder whether I’d include it in any sauce that I make?

Last one!  Topic 41:

[1] "vanilla"  "cocoa"    "milk"     "cinnamon" "walnut"

These look like the kinds of ingredients for a nice drink of some sort (would you crush the walnuts?  I’m not sure!)

Well, I hope you enjoyed this as much as I did!  It’s not a perfect analysis, but it definitely is a delicious one 🙂  Again, feel free to leave any comments about any of the ingredient combinations, or questions that you think could be answered with a different analysis!

# Python: combine the raw recipe files and split cuisines from ingredients
import os
import re

rfiles = os.listdir('.')
rc = []
for f in rfiles:
    if '.txt' in f:
        # The recipes come in 3 txt files consisting of 1 recipe per line, the
        # cuisine of the recipe as the first entry in the line, and all subsequent ingredient
        # entries separated by a tab
        with open(f, 'r') as infile:
            rc.append(infile.read())
all_rs = '\n'.join(rc)
line_pat = re.compile('[A-Za-z]+\t.+\n')
recipe_lines = line_pat.findall(all_rs)
new_recipe_lines = []
cuisine_lines = []
for r in recipe_lines:
    # First we find the cuisine of the recipe
    cuisine = r[:r.find('\t')]
    # Then we append the ingredients without the cuisine (everything after
    # the first tab, so a cuisine name is only ever removed from the front)
    new_recipe_lines.append(r[r.find('\t') + 1:])
    # I saved the cuisines to a different list in case I want to do some
    # cuisine analysis later
    cuisine_lines.append(cuisine + '\n')
# Write the two lists out, one recipe per line in each file
with open('recipes combined.tsv', 'w') as outfile1:
    outfile1.writelines(new_recipe_lines)
with open('cuisines.csv', 'w') as outfile2:
    outfile2.writelines(cuisine_lines)

# R: ingredient counts, the popularity plot, and the LDA
library(tm)
library(topicmodels)
library(ggplot2)

recipes = readLines('recipes combined.tsv')
# Once I read it into R, I have to get rid of the \t
# characters so that it's more acceptable to the tm package
recipes.new = gsub('\t', ' ', recipes)
recipes.corpus = Corpus(VectorSource(recipes.new))
recipes.dtm = DocumentTermMatrix(recipes.corpus)
# Now I filter out any terms that show up fewer than 10 times
# (recent tm versions take a plain character vector as the dictionary;
# older versions wanted it wrapped in Dictionary())
recipes.dict = findFreqTerms(recipes.dtm, 10)
recipes.dtm.filtered = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))
# Here I get a count of number of ingredients in each document
# with the intent of deleting any documents with 0 ingredients
ingredient.counts = apply(recipes.dtm.filtered, 1, sum)
recipes.dtm.filtered = recipes.dtm.filtered[ingredient.counts > 0, ]
# Here I get some simple ingredient frequencies so that I can plot them and decide
# which I'd like to filter out
recipes.m = as.matrix(recipes.dtm.filtered)
popularity.of.ingredients = sort(colSums(recipes.m), decreasing=TRUE)
popularity.of.ingredients = data.frame(ingredients = names(popularity.of.ingredients), num_recipes=popularity.of.ingredients)
popularity.of.ingredients$ingredients = reorder(popularity.of.ingredients$ingredients, popularity.of.ingredients$num_recipes)
ggplot(popularity.of.ingredients[1:30,], aes(x=ingredients, y=num_recipes)) +
  geom_point(size=5, colour="red") + coord_flip() +
  ggtitle("Recipe Popularity of Top 30 Ingredients") +
  theme(axis.text.x=element_text(size=13, face="bold", colour="black"),
        axis.text.y=element_text(size=13, face="bold", colour="black"),
        axis.title.x=element_text(size=14, face="bold"),
        axis.title.y=element_text(size=14, face="bold"))
# Having found wheat, egg, and butter to be the three most frequent ingredients
# (and not caring too much about them as ingredients in general) I remove them
# from the corpus and redo the document term matrix, once again dropping
# any recipes left with 0 ingredients
recipes.corpus = tm_map(recipes.corpus, removeWords, c("wheat","egg","butter"))
recipes.dtm.final = DocumentTermMatrix(recipes.corpus, list(dictionary = recipes.dict))
recipes.dtm.final = recipes.dtm.final[apply(recipes.dtm.final, 1, sum) > 0, ]
# Finally, I run the LDA and extract the 5 most
# characteristic ingredients in each topic… yummy!
recipes.lda = LDA(recipes.dtm.final, 50)
t = terms(recipes.lda, 5)
     Topic 1         Topic 2   Topic 3         Topic 4   Topic 5        Topic 6    Topic 7       Topic 8     Topic 9
[1,] "onion"         "pepper"  "milk"          "tomato"  "olive_oil"    "milk"     "milk"        "tomato"    "garlic"
[2,] "rice"          "vinegar" "vanilla"       "garlic"  "garlic"       "nutmeg"   "pepper"      "cayenne"   "cream"
[3,] "cayenne"       "onion"   "cocoa"         "oregano" "onion"        "vanilla"  "yeast"       "olive_oil" "vegetable_oil"
[4,] "chicken_broth" "tomato"  "onion"         "onion"   "black_pepper" "cinnamon" "potato"      "garlic"    "pepper"
[5,] "olive_oil"     "milk"    "cane_molasses" "basil"   "vinegar"      "cream"    "lemon_juice" "pepper"    "milk"
     Topic 10        Topic 11              Topic 12        Topic 13       Topic 14    Topic 15   Topic 16        Topic 17
[1,] "milk"          "soy_sauce"           "vegetable_oil" "onion"        "milk"      "tamarind" "milk"          "vegetable_oil"
[2,] "cream"         "scallion"            "milk"          "black_pepper" "cinnamon"  "onion"    "vanilla"       "pepper"
[3,] "vanilla"       "sesame_oil"          "pepper"        "vinegar"      "onion"     "garlic"   "cream"         "cream"
[4,] "cane_molasses" "cane_molasses"       "cane_molasses" "bell_pepper"  "cayenne"   "corn"     "vegetable_oil" "black_pepper"
[5,] "cinnamon"      "roasted_sesame_seed" "cinnamon"      "bacon"        "olive_oil" "vinegar"  "garlic"        "mustard"
     Topic 18        Topic 19  Topic 20        Topic 21        Topic 22    Topic 23        Topic 24        Topic 25       Topic 26
[1,] "cane_molasses" "vanilla" "onion"         "garlic"        "onion"     "vegetable_oil" "onion"         "cream"        "cumin"
[2,] "onion"         "cream"   "black_pepper"  "cane_molasses" "garlic"    "soy_sauce"     "garlic"        "tomato"       "coriander"
[3,] "vinegar"       "almond"  "vegetable_oil" "vinegar"       "tomato"    "sesame_oil"    "cane_molasses" "chicken"      "turmeric"
[4,] "olive_oil"     "coconut" "bell_pepper"   "black_pepper"  "olive_oil" "fish"          "tomato"        "lemon_juice"  "fenugreek"
[5,] "pepper"        "oat"     "garlic"        "soy_sauce"     "basil"     "chicken"       "vegetable_oil" "black_pepper" "lemongrass"
     Topic 27       Topic 28      Topic 29        Topic 30    Topic 31   Topic 32        Topic 33       Topic 34        Topic 35
[1,] "onion"        "onion"       "onion"         "onion"     "vanilla"  "garlic"        "onion"        "onion"         "garlic"
[2,] "garlic"       "vinegar"     "celery"        "pepper"    "milk"     "onion"         "pepper"       "garlic"        "basil"
[3,] "black_pepper" "garlic"      "chicken"       "garlic"    "garlic"   "vegetable_oil" "garlic"       "vegetable_oil" "pepper"
[4,] "tomato"       "lemon_juice" "vegetable_oil" "parsley"   "cinnamon" "cayenne"       "black_pepper" "black_pepper"  "tomato"
[5,] "olive_oil"    "ginger"      "carrot"        "olive_oil" "cream"    "beef"          "beef"         "chicken"       "olive_oil"
     Topic 36        Topic 37       Topic 38        Topic 39  Topic 40      Topic 41   Topic 42        Topic 43   Topic 44
[1,] "onion"         "onion"        "onion"         "cayenne" "garlic"      "vanilla"  "vanilla"       "scallion" "milk"
[2,] "garlic"        "garlic"       "cream"         "garlic"  "onion"       "cocoa"    "cane_molasses" "garlic"   "tomato"
[3,] "cayenne"       "black_pepper" "tomato"        "ginger"  "bell_pepper" "milk"     "cocoa"         "ginger"   "garlic"
[4,] "vegetable_oil" "lemon_juice"  "cane_molasses" "rice"    "olive_oil"   "cinnamon" "oat"           "soybean"  "vegetable_oil"
[5,] "oregano"       "scallion"     "milk"          "onion"   "milk"        "walnut"   "milk"          "pepper"   "cream"
     Topic 45       Topic 46        Topic 47        Topic 48 Topic 49        Topic 50
[1,] "onion"        "cream"         "pepper"        "cream"  "milk"          "olive_oil"
[2,] "cream"        "black_pepper"  "vegetable_oil" "tomato" "vanilla"       "tomato"
[3,] "black_pepper" "chicken_broth" "garlic"        "beef"   "lard"          "parmesan_cheese"
[4,] "milk"         "vegetable_oil" "onion"         "garlic" "cocoa"         "lemon_juice"
[5,] "cinnamon"     "garlic"        "olive_oil"     "carrot" "cane_molasses" "garlic"