reuben-dutton/recipe-processor

Scrapes and parses online recipes into a usable format

Background

This project is an attempt to create a set of programs that can:

  • Scrape a recipe from an internet site (e.g. taste.com)
  • Parse that recipe into constituent components (ingredients, units, quantities, notes)
  • Scrape current grocery store product prices from Coles
  • Identify which product can be used as each ingredient
  • Compose a detailed shopping list for a set of recipes, including information such as:
    • Price per serving
    • Overall price
    • Possible ingredient alternatives (use a different brand of chicken, perhaps?)

These goals are attained through the use of three types of models:

  1. Conditional Random Fields
  2. Metric learning
  3. Neural networks

These models are used, respectively, to:

  1. Parse recipes into constituent components
  2. Convert Coles products into a vector space with more useful characteristics
  3. Convert from an ingredient (e.g. "basil") to a Coles product in the above metric space (e.g. "Coles Basil Herb Punnet")
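
Roughly, these pieces chain together into a single flow. Below is a minimal sketch of that flow; the function names are hypothetical, for illustration only, and are not this repository's actual API.

# Hypothetical end-to-end flow -- names are illustrative, not the real API.
def build_shopping_list(recipe_lines, coles_products, crf, embedder, matcher):
    # 1. CRF: split each raw ingredient line into tagged components
    parsed = [crf.parse(line) for line in recipe_lines]
    # 2. Metric learning: place every Coles product in the learned vector space
    product_vectors = [(product, embedder.embed(product)) for product in coles_products]
    # 3. Neural network: map each ingredient name to the nearest product
    return [matcher.nearest(ingredient["NAME"], product_vectors) for ingredient in parsed]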

Setup

You'll need to install CRFSuite on your system to generate and use the CRF models in this project.

Alternatively, you can use CRF++, but you will have to modify the bash scripts in bin/ (specifically bin/train-model, bin/parse-ingredients and bin/parse-products) to call CRF++ instead.

Training and tagging are done by converting annotations and input into CRF++ format as an intermediate step, so all you need to do is remove the command that converts from CRF++ to CRFSuite format, and use CRF++'s crf_learn and crf_test to train and tag instead.

Specifically, these lines are of interest:

pipenv run python bin/crfpp_to_crfsuite.py 
crfsuite tag ... 
crfsuite train ... 

You may also need to modify bin/read_results.py, as CRFSuite only outputs tags, not the items that were tagged (unlike CRF++, which outputs each item and its tag together).
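
For reference, here is a minimal sketch of the kind of rejoining bin/read_results.py needs to do. It assumes one token per line in the tagging input (token in the first column) and one tag per line in CRFSuite's output; that layout is an assumption for illustration, not the script's actual code.

def rejoin_tokens_and_tags(items_path, tags_path):
    # CRFSuite's `tag` command prints one label per line, so pair each
    # label back up with the corresponding line of the tagging input.
    with open(items_path) as items, open(tags_path) as tags:
        for item_line, tag_line in zip(items, tags):
            token = item_line.split("\t")[0].strip()  # first column holds the token
            yield token, tag_line.strip()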

Current pipelines:

These instructions assume that you are using CRFSuite. If you're using CRF++, sorry: some of these might need additional effort to work properly.

Training a model:

  1. Pick an annotations file (e.g. "data/annotations/annotated-ingredients.json")
  2. Pick an output model location (e.g. "models/ingredients.crfmodel")
  3. Set these environment variables respectively, e.g.
ANNOTATIONS_FILE="data/annotations/annotated-ingredients.json"
MODEL_FILE="models/ingredients.crfmodel"
  4. Train the model:
bin/train-model "$ANNOTATIONS_FILE" "$MODEL_FILE"

The bin/train-model script does these things in sequence:

  1. Converts the annotations to an intermediate CRF++ friendly format.
  2. Converts this intermediate format to a CRFSuite friendly format.
  3. Uses CRFSuite to train a model on the data in the CRFSuite friendly format.
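
For a rough idea of the two intermediate formats: CRF++ expects one token per line with the label in the last column, while CRFSuite's native format puts the label first, followed by tab-separated attributes. The feature columns below are made up for illustration; the real ones are produced by the scripts above.

CRF++-friendly (token, feature columns, label last):

1       NUM     QTY
cup     WORD    UNIT
flour   WORD    NAME

CRFSuite-friendly (label first, then attributes):

QTY     w=1     shape=NUM
UNIT    w=cup   shape=WORD
NAME    w=flour shape=WORD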

Parsing ingredients

  1. Pick a model that you've previously trained on ingredient annotations (e.g. "models/ingredients.crfmodel")
  2. Put all of your desired ingredients in a .txt file, one per line. This will look like:
1 cup plain flour
250g shallots, diced finely
500mL beef stock
Salt, to taste (optional)
  3. Run bin/parse-ingredients with your model file and ingredients file specified in the following order:
bin/parse-ingredients "models/ingredients.crfmodel" "ingredients.txt" > "output.json"

The output is a JSON file with the parsed ingredients, as well as the probability given by the model for each particular set of tags.

{"prob": 0.983397, "QTY": ["1"], "UNIT": ["cup"], "VARIANT": ["plain"], "NAME": ["flour"]}

Scraping and parsing products

  1. Run pipenv run python scraping/coles.py. This scrapes the Coles website for products listed online, and dumps the data into data/products-full.json.
  2. Pick a model that you've previously trained on product annotations (e.g. "models/metric/products.crfmodel")
  3. Run bin/parse-products with your model file and data/products-full.json specified in the following order:
bin/parse-products "models/metric/products.crfmodel" "data/products-full.json" > "parsed-products.json"

This process generates a JSON file identical to the scraped products file, with the addition of each product's tags and the probability of each tag as given by the model.

Creating a product metric space

This step takes a .csv file of triplets, which are used to train a metric learning algorithm. This generates a lower-dimensional vector space that can express the similarities between products in a more space-efficient way.
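
The exact column layout of that .csv depends on how the triplets were annotated, but conceptually each row names one triplet. A plausible layout (an assumption, not the project's actual schema) is:

anchor,closer,further
Beef Sausage 500g,Beef Mince 500g,Greek Yoghurt 1kg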

Metric Learning @ metric-learn

Similarity learning @ Wikipedia

We are using triplets instead of annotated singles or pairs because we assume that products can fit into multiple classes (which can be supersets or subsets of each other, e.g. meat, beef, beef + burger, beef + sausage), and that we have no way to specify the similarity between two items directly.

Using triplets allows us to define the similarity measure relatively, through comparisons between products:

Beef Sausage is closer to Beef than Greek Yoghurt

instead of:

Beef Sausage is 0.28494 close to Beef
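
As a concrete starting point, metric-learn's SCML learns directly from relative triplets of the first form. Below is a minimal sketch on toy data; it assumes products have already been converted to numeric feature vectors, and the parameters are illustrative rather than tuned for this project.

import numpy as np
from metric_learn import SCML

rng = np.random.RandomState(0)
# Toy stand-ins for (anchor, closer, further) product feature vectors,
# e.g. (Beef Sausage, Beef, Greek Yoghurt).
anchor = rng.randn(100, 5)
closer = anchor + 0.1 * rng.randn(100, 5)  # similar products: small offset
further = rng.randn(100, 5) + 3.0          # dissimilar products: far away
triplets = np.stack([anchor, closer, further], axis=1)  # shape (100, 3, 5)

scml = SCML(n_basis=40, max_iter=1000, random_state=0)
scml.fit(triplets)

# Project products into the learned space, where distance reflects similarity
embedded = scml.transform(anchor)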


Q&A

If you have questions about this project, or want to use parts of it in your own projects, feel free to email me at reubendutton@gmail.com

Todo

  • Add more annotations for the metric learner and neural network models
  • Add more annotations for the ingredient and product parser CRF models
  • Create more accurate conversions between units (e.g. cups of basil to grams of Coles Basil Herb Punnet)
  • General optimization (slow :( )