
Import Open Food Facts Ingredients without Mongo #1540

Open
wants to merge 3 commits into master

Conversation

strawpants

Proposed Changes

Currently, as I understand it, importing the ingredients from the Open Food Facts server requires a running Mongo server. This pull request avoids both Mongo and unpacking the database file: it uses the JSONL data dump format directly, as documented here, and loops over the JSON entries extracted from a gzip stream.
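For illustration, here is a minimal sketch of that streaming loop (not the PR's actual code); the URL is the one used in the diff below, and the fields read from each product are just examples:

```python
import json
from gzip import GzipFile

import requests

OFF_URL = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'

with requests.get(OFF_URL, stream=True) as response:
    response.raise_for_status()
    # GzipFile wraps the raw socket stream, so lines are decompressed
    # on the fly instead of the whole dump being loaded into memory
    with GzipFile(fileobj=response.raw, mode='rb') as gz:
        for i, line in enumerate(gz):
            product = json.loads(line)  # one JSON object per JSONL line
            print(product.get('code'), product.get('product_name'))
            if i >= 4:  # only peek at the first few entries in this sketch
                break
```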

The pull request adds an option to the management command import-off-products.py.

Furthermore, during the import I noticed that many entries for license_authors were too long for the varchar(600) field, so this pull request also changes that field to a TextField so it can accommodate longer strings. This means the pull request requires a database migration.
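For reference, a sketch of roughly what such a migration could look like; the app label, model name, dependency, and field options here are assumptions for illustration, not the PR's actual migration file:

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('nutrition', '0001_initial'),  # placeholder dependency
    ]

    operations = [
        # Widen license_authors from varchar(600) to unbounded text
        migrations.AlterField(
            model_name='ingredient',  # assumed model name
            name='license_authors',
            field=models.TextField(blank=True, null=True),  # options illustrative
        ),
    ]
```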

Please check that the PR fulfills these requirements

  • Tests for the changes have been added (for bug fixes / features)
  • Added yourself to AUTHORS.rst

(I did not do this yet, in light of soliciting your thoughts on this first)

Other questions

  • I have not yet tested whether the Mongo import still works as before (it is not trivial to test here, because wger itself runs inside a Docker ecosystem).
  • Do users need to run any commands in their local instances due to this PR
    (e.g. a database migration)?
    Yes: the license_authors field has changed from varchar(600) to TextField.

@rolandgeider
Member

Cool idea! How does this work performance/memory-wise? (Also, my first thought would be to download the dump somewhere very obvious, so that we don't forget to delete it afterwards.)

@strawpants
Author

My understanding is that, since the gzipped file is opened as a stream, it doesn't load the entire file into memory; the code just loops over the lines as they are decompressed on the fly. The download itself is also done in chunks, so memory-wise I don't expect big issues here. The gzipped archive is kept for now (and not re-downloaded on subsequent runs), but automatically removing it after the content has been loaded into the database could be an option.
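To illustrate the chunked download, a sketch under assumed values (the tmp/ path and chunk size are illustrative, not the PR's actual choices):

```python
import os

import requests

OFF_URL = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
target = 'tmp/openfoodfacts-products.jsonl.gz'

# Skip the download if the archive was kept from a previous run
if not os.path.exists(target):
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with requests.get(OFF_URL, stream=True) as response:
        response.raise_for_status()
        with open(target, 'wb') as f:
            # iter_content keeps memory usage bounded by the chunk size
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
```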

I haven't tried it yet, but ideally one would run regular (e.g. weekly) updates using the delta files that Open Food Facts provides. I suspect they use the same format, and since they are much smaller they can be applied incrementally with little effort. In that more advanced setup it would make sense to add a new database table, so that Open Food Facts import events are registered and can be used to decide which files to download.

@rolandgeider
Member

Using the delta files would indeed be a huge improvement; at the moment the import is run... very sporadically. I would suggest that we don't even need a new table: just writing the date of the last import into a text file in the script folder, or something similarly simple, would be enough.
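Something like this minimal sketch would do, assuming a marker file next to the import script (the filename is hypothetical):

```python
from datetime import date, datetime
from pathlib import Path

# Hypothetical marker file next to the import script
MARKER = Path(__file__).parent / 'last-off-import.txt'


def read_last_import():
    """Return the date of the last import, or None if the script never ran."""
    if MARKER.exists():
        return datetime.strptime(MARKER.read_text().strip(), '%Y-%m-%d').date()
    return None


def write_last_import():
    """Record today's date after a successful import."""
    MARKER.write_text(date.today().isoformat())
```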

In any case, I'll try to take a look at your changes and run them on my machine today

```python
for product in db.products.find(
{
if options['usejsonl']:
products=self.products_jsonl(languages=list(languages.keys()),completeness=self.completeness)
```
Contributor

Is it just my GitHub, or is the alignment off here?

```python
import json
from gzip import GzipFile

import requests

off_url = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
```
Contributor

Should this be a default in the env file? That way, if the URL changes, people can update it with a hotfix to the env file instead of pulling in a new release.
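For example, a hedged sketch of reading the URL from the environment with the current value as a fallback; the variable name OFF_PRODUCTS_URL is hypothetical:

```python
import os

# Fall back to the current static URL when the variable is not set
off_url = os.environ.get(
    'OFF_PRODUCTS_URL',
    'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz',
)
```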

Member

I was thinking that only "upstream" wger regularly does this, and other instances sync from us, like with the exercises. That way we don't generate so much traffic for OFF (and don't need to respond as fast to such changes, if they ever occur).
