Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

cathalgarvey · 2018-03-30T16:38:53Z

This is intended as a way to allow a library consumer to specify their own JSON decoder function to use in extruct.

The idea is that some json decoders, like ujson, can offer large gains in performance in many or most cases.. but they're not reliable enough or well-supported enough to actually switch extruct over to.

So, a library consumer can now do:

import extruct.utils
extruct.utils.set_json_decoder(ujson.loads, (ValueError,))

...and extruct should commence using ujson instead of json. The second argument is the exceptions that the library raises on bad JSON input; the new hot-swapper will cast non-native exceptions to their native equivalents (ValueError in Python 2, json.JSONDecodeError in Python 3).

Speaking of which, this PR also contains a small fix for a subtle 2/3 bug in extruct where only ValueErrors were getting caught in one code-block.

The next step for this is to profile whether ujson et al actually add any performance gains; if they did, then we could add them as optional features enabled via pip or similar.

Fixes #49

…r in Python 2

@codinguncut

…tested!) Relevant to scrapinghub#37 (hey @codinguncut ;) ) Closes: scrapinghub#49

codecov · 2018-03-30T16:40:49Z

Codecov Report

Merging #69 into master will increase coverage by 0.06%.
The diff coverage is 92%.

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
+ Coverage   88.27%   88.34%   +0.06%     
==========================================
  Files           8        9       +1     
  Lines         307      326      +19     
  Branches       53       55       +2     
==========================================
+ Hits          271      288      +17     
- Misses         33       34       +1     
- Partials        3        4       +1

Impacted Files	Coverage Δ
extruct/rdfa.py	`100% <100%> (ø)`	⬆️
extruct/jsonld.py	`95.83% <100%> (ø)`	⬆️
extruct/utils.py	`89.47% <89.47%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c4d6dd...e469ec8. Read the comment docs.

lopuhin · 2018-03-30T16:51:20Z

The next step for this is to profile whether ujson et al actually add any performance gains; if they did, then we could add them as optional features enabled via pip or similar.

Do you intend to perform performance measurements in this PR? This is the primary motivation for the change, right?

cathalgarvey · 2018-03-30T16:54:01Z

Hey @lopuhin - right now, there should be no significant change, as it's still using the same decoder. It's just abstracted to permit people to use ujson (we've had a few requests for this feature). Probably, there's no value in actually switching extruct to a third-party decoder; json is stable, fast enough for most uses, and it'll be supported as well as Python itself is.

However, I do intend to then profile whether general uses of extruct get more performant with ujson or a better-supported "json, but faster" library. :)

kmike · 2018-04-03T17:09:55Z

Hey! Why is swapping required? If ujson is faster, then we can always use it when it can be imported, and fall back to stdlib if it can't. If ujson is not faster, then we can just leave it as-is.

To check how json-ld performs I think it makes sense to get a random sample from http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html (say, 1000 or 10000 records, 1 record per domain), then benchmark. This way we can also validate that there are no exceptions or segfaults.

cathalgarvey · 2018-04-03T18:43:53Z

Hey @kmike -
Some projects have done so, checking for ujson and using that if present. But I saw as many when I looked around for this, which had done this, and then decided it wasn't worth it.

Key grievances include:

ujson is essentially unmaintained.
Numbers too big for int don't decode to long automatically.
Lots of historical segfaults.
Raises ValueError rather than JSONDecodeError. This is consistent with Python 2 but not Python 3. So, much of the exception handler code I wrote for this PR would have been necessary for inter-version compatibility anyway.
May require a C compiler on some platforms, where Python's builtin JSON might be easier to use.
Automatic opt-in to ujson when it's present can lead to unexpected inconsistency; ujson may be installed with another package or dependency, and change the way that extruct works.

So, it's possible for callers to use ujson using this system, if they choose to, but we can otherwise avoid even having to verify that our JSON decoder is sane; the only officially supported decoder is Python's builtin one.

kmike · 2018-04-03T21:03:33Z

@cathalgarvey thanks for the pointers! It makes sense. ujson doesn't look well maintained indeed. My main concern is that we shouldn't be adding (and maintaining) feature just in case, without a real-world use case.

cathalgarvey · 2018-04-05T16:10:35Z

OK: I've added benchmarks comparing json, ujson, and yajl. The benchmarks use the existing JSON-LD-containing documents, as well as an additional 79 documents I pulled from the WDC sample for JSON-LD.

The results show that yajl does provide a useful improvement in speed. ujson is also marginally faster, but not as much. The StdDev / IQR of ujson and yajl are substantially better than the native decoder, so they also offer more predictable performance.

--------------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                               Min                 Max                Mean            StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bench_jsonld_schemaorg_yajl           208.9508 (1.0)      228.0395 (1.01)     215.0919 (1.0)      7.5262 (1.34)     212.4026 (1.0)       7.2622 (1.46)          1;0  4.6492 (1.0)           5           1
test_bench_jsonld_schemaorg_ujson          212.3884 (1.02)     226.2464 (1.0)      216.3982 (1.01)     5.6247 (1.0)      214.7260 (1.01)      4.9682 (1.0)           1;1  4.6211 (0.99)          5           1
test_bench_jsonld_schemaorg_nativejson     213.3618 (1.02)     233.4002 (1.03)     226.2365 (1.05)     9.0250 (1.60)     230.8768 (1.09)     14.8810 (3.00)          1;0  4.4202 (0.95)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(benchmarking provided by pytest_benchmark)

I'd push the commits directly, but let me first ask: I've compressed the 79 new html samples with Gzip. Is it OK to add those to the git tree? That's a lot of binary, though I don't think we'll ever be editing it so it's probably a one-time bump. I could move them into a new repository as a submodule, if we wanted to get fancy, or I could use git-lfs to store them?

The results above are on my own desktop and benchmarks like this may be prone to all sorts of externalisms, so I'd love to get some validation from others.

cathalgarvey · 2018-04-05T16:17:55Z

Just to clarify by the way, I only added benches for JSON-LD, but the other place JSON decoding happens in extruct right now is when collecting results in RDF-a.

The decoding of these RDF-a results might be a more homogenous task, so differences between libraries might be more obvious or less relevant. I don't have benchmarks for this yet, and ideally we'd find a way to not use JSON as an intermediary between the upstream library and extruct in the first place. :)

joaquingx · 2018-12-11T04:02:27Z

Hi 😄 , I just want to bump this PR, it will be merged?, what else is necessary to achieve merge?. I want to help here if it's needed 💪 .
Also want to reference that this PR's trying to solve #49.

Granitosaurus · 2019-01-25T07:06:16Z

I feel that this is over-engineering the issue a bit.

Why not give ability to return raw json and let user to convert to dict or whatever however they want? I've submitted a PR that does exactly that - just return raw json rather than a dict:
#103

cathalgarvey added 4 commits March 30, 2018 17:04

Added a JSON hot-swapping utility in utils.py, plus tests.

59e611e

Slight change to utils.json_loads to enable passthrough for ValueErro…

456e209

…r in Python 2

Enables JSON hot-swapping for jsonld and rdfa modules

34ac135

Added a test case exploring whether ujson seems to work (very poorly …

e469ec8

…tested!) Relevant to scrapinghub#37 (hey @codinguncut ;) ) Closes: scrapinghub#49

kmike mentioned this pull request Aug 22, 2018

Handle badly formatted JSON-LD data. #87

Open

joaquingx mentioned this pull request Dec 11, 2018

JSON-LD: Use UltraJSON if available #49

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

cathalgarvey commented Mar 30, 2018 •

edited by Gallaecio

codecov bot commented Mar 30, 2018 •

edited

lopuhin commented Mar 30, 2018

cathalgarvey commented Mar 30, 2018

kmike commented Apr 3, 2018 •

edited

cathalgarvey commented Apr 3, 2018

kmike commented Apr 3, 2018

cathalgarvey commented Apr 5, 2018 •

edited

cathalgarvey commented Apr 5, 2018 •

edited

joaquingx commented Dec 11, 2018

Granitosaurus commented Jan 25, 2019

Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

Are you sure you want to change the base?

Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

Conversation

cathalgarvey commented Mar 30, 2018 • edited by Gallaecio

codecov bot commented Mar 30, 2018 • edited

Codecov Report

lopuhin commented Mar 30, 2018

cathalgarvey commented Mar 30, 2018

kmike commented Apr 3, 2018 • edited

cathalgarvey commented Apr 3, 2018

kmike commented Apr 3, 2018

cathalgarvey commented Apr 5, 2018 • edited

cathalgarvey commented Apr 5, 2018 • edited

joaquingx commented Dec 11, 2018

Granitosaurus commented Jan 25, 2019

cathalgarvey commented Mar 30, 2018 •

edited by Gallaecio

codecov bot commented Mar 30, 2018 •

edited

kmike commented Apr 3, 2018 •

edited

cathalgarvey commented Apr 5, 2018 •

edited

cathalgarvey commented Apr 5, 2018 •

edited