Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json std lib slow #51

Open
ziodave opened this issue Jan 1, 2023 · 7 comments
Open

json std lib slow #51

ziodave opened this issue Jan 1, 2023 · 7 comments

Comments

@ziodave
Copy link
Contributor

ziodave commented Jan 1, 2023

I ran a cProfile session and found that half of the time is spent within the JSON loads.

While a quick replacement of json with orjson brings improvements, if we could switch to msgspec we could seriously improve the indexing performance:

https://pythonspeed.com/articles/faster-python-json-parsing/

https://jcristharif.com/msgspec/

Only thing msgspec requires the json schema, as far as I could see these are the required fields:

field aliases
field claims
field descriptions
field id
field labels
field lastrevid
field sitelinks

Am I missing something?

@wetneb
Copy link
Member

wetneb commented Jan 2, 2023

Hi @ziodave, thanks for pointing this out!
I have no objection to optimizing this part of the pipeline - I did not think about it, but if it can make a difference for you, let's do it!

@ziodave
Copy link
Contributor Author

ziodave commented Jan 2, 2023

For the time being, switching to orjson brought a great advantage. The profiling now shows that the SPARQL queries are slowing down the initial indexing. The queries are being cached in memory, probably storing them as file caches (and even pushing them to a GH repo) would help. I'll create a new issue for this ;-)

@jcrist
Copy link

jcrist commented Jan 3, 2023

Apologies for spying on this conversation (I have a github scraper setup so I can see how people are using msgspec).

Only thing msgspec requires the json schema

For the time being, switching to orjson brought a great advantage.

Just a quick clarifying point - msgspec doesn't require a schema, it can be used without one, the same as orjson.

In [1]: import msgspec

In [2]: msgspec.json.decode(b'{"hello": "world"}')
Out[2]: {'hello': 'world'}

A schema will speed things up further (and provide some guard rails for malformed data), but even without a schema msgspec should be significantly faster than the stdlib and mildly faster than orjson.

Either msgspec or orjson are a good option here if json performance is a limiting factor, choose whichever tool fits your requirements the best. Please don't feel any pressure to use msgspec based on my comment here.

@ziodave
Copy link
Contributor Author

ziodave commented Jan 3, 2023

Thanks @jcrist, yes I started adding the schema as an improvement I am testing out on the languagemodel (along with reactive rxpy) and so far the profiler looks promising :-)

https://github.com/ziodave/opentapioca/blob/c227d07178bcd6475b22c942d1c4b2aee83f90b5/opentapioca/languagemodel.py#L179

@ziodave
Copy link
Contributor Author

ziodave commented Jan 4, 2023

And now that I added some multiproc, we're back at that :-)

image

@ziodave
Copy link
Contributor Author

ziodave commented Jan 4, 2023

@jcrist is there a way to ignore a typed value when the type is unknown?

Right now I get this error:

msgspec.ValidationError: Invalid value 'globecoordinate' - at `$.claims[...][0].mainsnak.datavalue.type`

But I don't care about globecoordinate values.

@ziodave
Copy link
Contributor Author

ziodave commented Jan 4, 2023

I ended up defining it.

So far so good ...

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants