Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added JSON Hot-Swapping Utility, Enabled for JSON-LD and RDFa #69

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

cathalgarvey
Copy link
Contributor

@cathalgarvey cathalgarvey commented Mar 30, 2018

This is intended as a way to allow a library consumer to specify their own JSON decoder function to use in extruct.

The idea is that some json decoders, like ujson, can offer large gains in performance in many or most cases.. but they're not reliable enough or well-supported enough to actually switch extruct over to.

So, a library consumer can now do:

import extruct.utils
extruct.utils.set_json_decoder(ujson.loads, (ValueError,))

...and extruct should commence using ujson instead of json. The second argument is the exceptions that the library raises on bad JSON input; the new hot-swapper will cast non-native exceptions to their native equivalents (ValueError in Python 2, json.JSONDecodeError in Python 3).

Speaking of which, this PR also contains a small fix for a subtle 2/3 bug in extruct where only ValueErrors were getting caught in one code-block.

The next step for this is to profile whether ujson et al actually add any performance gains; if they did, then we could add them as optional features enabled via pip or similar.

Fixes #49

@codecov
Copy link

codecov bot commented Mar 30, 2018

Codecov Report

Merging #69 into master will increase coverage by 0.06%.
The diff coverage is 92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
+ Coverage   88.27%   88.34%   +0.06%     
==========================================
  Files           8        9       +1     
  Lines         307      326      +19     
  Branches       53       55       +2     
==========================================
+ Hits          271      288      +17     
- Misses         33       34       +1     
- Partials        3        4       +1
Impacted Files Coverage Δ
extruct/rdfa.py 100% <100%> (ø) ⬆️
extruct/jsonld.py 95.83% <100%> (ø) ⬆️
extruct/utils.py 89.47% <89.47%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c4d6dd...e469ec8. Read the comment docs.

@lopuhin
Copy link
Member

lopuhin commented Mar 30, 2018

The next step for this is to profile whether ujson et al actually add any performance gains; if they did, then we could add them as optional features enabled via pip or similar.

Do you intend to perform performance measurements in this PR? This is the primary motivation for the change, right?

@cathalgarvey
Copy link
Contributor Author

Hey @lopuhin - right now, there should be no significant change, as it's still using the same decoder. It's just abstracted to permit people to use ujson (we've had a few requests for this feature). Probably, there's no value in actually switching extruct to a third-party decoder; json is stable, fast enough for most uses, and it'll be supported as well as Python itself is.

However, I do intend to then profile whether general uses of extruct get more performant with ujson or a better-supported "json, but faster" library. :)

@kmike
Copy link
Member

kmike commented Apr 3, 2018

Hey! Why is swapping required? If ujson is faster, then we can always use it when it can be imported, and fall back to stdlib if it can't. If ujson is not faster, then we can just leave it as-is.

To check how json-ld performs I think it makes sense to get a random sample from http://webdatacommons.org/structureddata/2017-12/stats/how_to_get_the_data.html (say, 1000 or 10000 records, 1 record per domain), then benchmark. This way we can also validate that there are no exceptions or segfaults.

@cathalgarvey
Copy link
Contributor Author

Hey @kmike -
Some projects have done so, checking for ujson and using that if present. But I saw as many when I looked around for this, which had done this, and then decided it wasn't worth it.

Key grievances include:

  • ujson is essentially unmaintained.
  • Numbers too big for int don't decode to long automatically.
  • Lots of historical segfaults.
  • Raises ValueError rather than JSONDecodeError. This is consistent with Python 2 but not Python 3. So, much of the exception handler code I wrote for this PR would have been necessary for inter-version compatibility anyway.
  • May require a C compiler on some platforms, where Python's builtin JSON might be easier to use.
  • Automatic opt-in to ujson when it's present can lead to unexpected inconsistency; ujson may be installed with another package or dependency, and change the way that extruct works.

So, it's possible for callers to use ujson using this system, if they choose to, but we can otherwise avoid even having to verify that our JSON decoder is sane; the only officially supported decoder is Python's builtin one.

@kmike
Copy link
Member

kmike commented Apr 3, 2018

@cathalgarvey thanks for the pointers! It makes sense. ujson doesn't look well maintained indeed. My main concern is that we shouldn't be adding (and maintaining) feature just in case, without a real-world use case.

@cathalgarvey
Copy link
Contributor Author

cathalgarvey commented Apr 5, 2018

OK: I've added benchmarks comparing json, ujson, and yajl. The benchmarks use the existing JSON-LD-containing documents, as well as an additional 79 documents I pulled from the WDC sample for JSON-LD.

The results show that yajl does provide a useful improvement in speed. ujson is also marginally faster, but not as much. The StdDev / IQR of ujson and yajl are substantially better than the native decoder, so they also offer more predictable performance.

--------------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------------
Name (time in ms)                               Min                 Max                Mean            StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_bench_jsonld_schemaorg_yajl           208.9508 (1.0)      228.0395 (1.01)     215.0919 (1.0)      7.5262 (1.34)     212.4026 (1.0)       7.2622 (1.46)          1;0  4.6492 (1.0)           5           1
test_bench_jsonld_schemaorg_ujson          212.3884 (1.02)     226.2464 (1.0)      216.3982 (1.01)     5.6247 (1.0)      214.7260 (1.01)      4.9682 (1.0)           1;1  4.6211 (0.99)          5           1
test_bench_jsonld_schemaorg_nativejson     213.3618 (1.02)     233.4002 (1.03)     226.2365 (1.05)     9.0250 (1.60)     230.8768 (1.09)     14.8810 (3.00)          1;0  4.4202 (0.95)          5           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(benchmarking provided by pytest_benchmark)

I'd push the commits directly, but let me first ask: I've compressed the 79 new html samples with Gzip. Is it OK to add those to the git tree? That's a lot of binary, though I don't think we'll ever be editing it so it's probably a one-time bump. I could move them into a new repository as a submodule, if we wanted to get fancy, or I could use git-lfs to store them?

The results above are on my own desktop and benchmarks like this may be prone to all sorts of externalisms, so I'd love to get some validation from others.

@cathalgarvey
Copy link
Contributor Author

cathalgarvey commented Apr 5, 2018

Just to clarify by the way, I only added benches for JSON-LD, but the other place JSON decoding happens in extruct right now is when collecting results in RDF-a.

The decoding of these RDF-a results might be a more homogenous task, so differences between libraries might be more obvious or less relevant. I don't have benchmarks for this yet, and ideally we'd find a way to not use JSON as an intermediary between the upstream library and extruct in the first place. :)

@joaquingx
Copy link
Contributor

Hi 😄 , I just want to bump this PR, it will be merged?, what else is necessary to achieve merge?. I want to help here if it's needed 💪 .
Also want to reference that this PR's trying to solve #49.

@Granitosaurus
Copy link

I feel that this is over-engineering the issue a bit.

Why not give ability to return raw json and let user to convert to dict or whatever however they want? I've submitted a PR that does exactly that - just return raw json rather than a dict:
#103

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON-LD: Use UltraJSON if available
5 participants