json std lib slow #51

ziodave · 2023-01-01T11:57:27Z

I ran a cProfile session and found that half of the time is spent within the JSON loads.

While a quick replacement of json with orjson brings improvements, if we could switch to msgspec we could seriously improve the indexing performance:

https://pythonspeed.com/articles/faster-python-json-parsing/

https://jcristharif.com/msgspec/

Only thing msgspec requires the json schema, as far as I could see these are the required fields:

field aliases
field claims
field descriptions
field id
field labels
field lastrevid
field sitelinks

Am I missing something?

The text was updated successfully, but these errors were encountered:

wetneb · 2023-01-02T07:42:00Z

Hi @ziodave, thanks for pointing this out!
I have no objection to optimizing this part of the pipeline - I did not think about it, but if it can make a difference for you, let's do it!

ziodave · 2023-01-02T07:44:13Z

For the time being, switching to orjson brought a great advantage. The profiling now shows that the SPARQL queries are slowing down the initial indexing. The queries are being cached in memory, probably storing them as file caches (and even pushing them to a GH repo) would help. I'll create a new issue for this ;-)

jcrist · 2023-01-03T18:19:41Z

Apologies for spying on this conversation (I have a github scraper setup so I can see how people are using msgspec).

Only thing msgspec requires the json schema

For the time being, switching to orjson brought a great advantage.

Just a quick clarifying point - msgspec doesn't require a schema, it can be used without one, the same as orjson.

In [1]: import msgspec

In [2]: msgspec.json.decode(b'{"hello": "world"}')
Out[2]: {'hello': 'world'}

A schema will speed things up further (and provide some guard rails for malformed data), but even without a schema msgspec should be significantly faster than the stdlib and mildly faster than orjson.

Either msgspec or orjson are a good option here if json performance is a limiting factor, choose whichever tool fits your requirements the best. Please don't feel any pressure to use msgspec based on my comment here.

ziodave · 2023-01-03T18:27:08Z

Thanks @jcrist, yes I started adding the schema as an improvement I am testing out on the languagemodel (along with reactive rxpy) and so far the profiler looks promising :-)

https://github.com/ziodave/opentapioca/blob/c227d07178bcd6475b22c942d1c4b2aee83f90b5/opentapioca/languagemodel.py#L179

ziodave · 2023-01-04T10:06:08Z

And now that I added some multiproc, we're back at that :-)

ziodave · 2023-01-04T10:51:52Z

@jcrist is there a way to ignore a typed value when the type is unknown?

Right now I get this error:

msgspec.ValidationError: Invalid value 'globecoordinate' - at `$.claims[...][0].mainsnak.datavalue.type`

But I don't care about globecoordinate values.

ziodave · 2023-01-04T14:10:15Z

I ended up defining it.

So far so good ...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json std lib slow #51

json std lib slow #51

ziodave commented Jan 1, 2023

wetneb commented Jan 2, 2023

ziodave commented Jan 2, 2023

jcrist commented Jan 3, 2023

ziodave commented Jan 3, 2023

ziodave commented Jan 4, 2023

ziodave commented Jan 4, 2023

ziodave commented Jan 4, 2023

json std lib slow #51

json std lib slow #51

Comments

ziodave commented Jan 1, 2023

wetneb commented Jan 2, 2023

ziodave commented Jan 2, 2023

jcrist commented Jan 3, 2023

ziodave commented Jan 3, 2023

ziodave commented Jan 4, 2023

ziodave commented Jan 4, 2023

ziodave commented Jan 4, 2023