
Import Open Food Facts Ingredients without Mongo #1540

Open
wants to merge 3 commits into master

Conversation

strawpants

Proposed Changes

Currently, as I understand it, importing the ingredients from the Open Food Facts server requires a running Mongo server. This pull request avoids both Mongo and unpacking the database file: it uses the JSONL data dump format directly, as documented here, and loops over the JSON entries extracted from a gzip stream.
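For illustration, here is a minimal sketch of that streaming loop (not the PR's actual code); the URL is the one used in the diff below, and the fields read from each product are just examples:

```python
import json
from gzip import GzipFile

import requests

OFF_URL = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'

with requests.get(OFF_URL, stream=True) as response:
    response.raise_for_status()
    # GzipFile wraps the raw socket stream, so lines are decompressed
    # on the fly instead of the whole dump being loaded into memory
    with GzipFile(fileobj=response.raw, mode='rb') as gz:
        for i, line in enumerate(gz):
            product = json.loads(line)  # one JSON object per JSONL line
            print(product.get('code'), product.get('product_name'))
            if i >= 4:  # only peek at the first few entries in this sketch
                break
```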

The pull request adds an option to the management command import-off-products.py.

Furthermore, during the import I noticed that many entries for license_authors were too long for the varchar(600) field, so this pull request also changes that field to a TextField so it can accommodate longer strings. This means the pull request requires a database migration.
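For reference, a sketch of roughly what such a migration could look like; the app label, model name, dependency, and field options here are assumptions for illustration, not the PR's actual migration file:

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('nutrition', '0001_initial'),  # placeholder dependency
    ]

    operations = [
        # Widen license_authors from varchar(600) to unbounded text
        migrations.AlterField(
            model_name='ingredient',  # assumed model name
            name='license_authors',
            field=models.TextField(blank=True, null=True),  # options illustrative
        ),
    ]
```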

Please check that the PR fulfills these requirements

  • Tests for the changes have been added (for bug fixes / features)
  • Added yourself to AUTHORS.rst

(I did not do this yet, in light of soliciting your thoughts on this first)

Other questions

  • I have not yet tested whether the Mongo import still works as before (it is not trivial to test here, because wger itself runs inside a Docker ecosystem).
  • Do users need to run any commands in their local instances due to this PR
    (e.g. a database migration)?
    Yes: the license_authors field has changed from varchar(600) to TextField.

@rolandgeider
Member

Cool idea! How does this work performance/memory-wise? (Also, my first thought would be to download the dump somewhere very obvious, so that we don't forget to delete it afterwards.)

@strawpants
Author

My understanding is that, since the gzipped file is opened as a stream, it doesn't load the entire file into memory; the code just loops over the lines as they are decompressed on the fly. The download itself is also done in chunks, so memory-wise I don't expect big issues here. The gzipped archive is kept for now (and not re-downloaded on subsequent runs), but automatically removing it after the content has been loaded into the database could be an option.
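To illustrate the chunked download, a sketch under assumed values (the tmp/ path and chunk size are illustrative, not the PR's actual choices):

```python
import os

import requests

OFF_URL = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
target = 'tmp/openfoodfacts-products.jsonl.gz'

# Skip the download if the archive was kept from a previous run
if not os.path.exists(target):
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with requests.get(OFF_URL, stream=True) as response:
        response.raise_for_status()
        with open(target, 'wb') as f:
            # iter_content keeps memory usage bounded by the chunk size
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
```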

I haven't tried it yet, but ideally one would run regular (e.g. weekly) updates using the delta files that Open Food Facts provides. I suspect they use the same format, and since they are much smaller they can be applied incrementally with little effort. In that more advanced setup it would make sense to add a new database table, so that Open Food Facts import events are registered and can be used to decide which files to download.

@rolandgeider
Member

Using the delta files would indeed be a huge improvement; at the moment the import is run... very sporadically. I would suggest that we don't even need a new table: just writing the date of the last import into a text file in the script folder, or something similarly simple, would be enough.
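Something like this minimal sketch would do, assuming a marker file next to the import script (the filename is hypothetical):

```python
from datetime import date, datetime
from pathlib import Path

# Hypothetical marker file next to the import script
MARKER = Path(__file__).parent / 'last-off-import.txt'


def read_last_import():
    """Return the date of the last import, or None if the script never ran."""
    if MARKER.exists():
        return datetime.strptime(MARKER.read_text().strip(), '%Y-%m-%d').date()
    return None


def write_last_import():
    """Record today's date after a successful import."""
    MARKER.write_text(date.today().isoformat())
```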

In any case, I'll try to take a look at your changes and run them on my machine today

```python
for product in db.products.find(
{
if options['usejsonl']:
products=self.products_jsonl(languages=list(languages.keys()),completeness=self.completeness)
```
Contributor

Is it just my GitHub, or is the alignment off here?

```python
import json
from gzip import GzipFile

import requests

off_url = 'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz'
```
Contributor

Should this be a default in the env file? That way, if the URL changes, people can update it with a hotfix to the env file instead of pulling in a new release.
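For example, a hedged sketch of reading the URL from the environment with the current value as a fallback; the variable name OFF_PRODUCTS_URL is hypothetical:

```python
import os

# Fall back to the current static URL when the variable is not set
off_url = os.environ.get(
    'OFF_PRODUCTS_URL',
    'https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz',
)
```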

Member

I was thinking that only "upstream" wger regularly does this, and other instances sync from us, like with the exercises. That way we don't generate so much traffic for OFF (and don't need to respond as fast to such changes, if they ever occur).
