Very slow extraction for specific string #193

Open
Schwankenson opened this issue Apr 18, 2022 · 6 comments

Comments

@Schwankenson

For HTML strings from one particular site, extraction is really slow (~60 seconds). I just call extruct.extract with this string:

https://pastebin.com/QJbUdaA6

Other strings take around 1-5 seconds. Does anybody have an idea what's wrong with this string? Is there something I can do?

Thank you all for working on this great Python package!

@lopuhin
Member

lopuhin commented Apr 18, 2022

@Schwankenson I haven't checked the string yet, but what might help is restricting the supported dialects by passing a custom syntaxes argument to extruct.extract, if you can afford that. Depending on the data you deal with, you might find that some dialects are very rare but take a long time to process, so it can make sense to disable them by default. For example, in one project we only use syntaxes=['microdata', 'opengraph', 'json-ld'], as they cover most kinds of semantic markup and are fast.
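
A minimal sketch of that suggestion, assuming extruct is installed. The wrapper name `extract_restricted` and its `extract_fn` parameter are hypothetical; they exist only so the sketch can be exercised without extruct. The `syntaxes` argument and the three syntax names come from the comment above.

```python
# Syntaxes named in the comment above as covering most markup while being fast.
FAST_SYNTAXES = ("microdata", "opengraph", "json-ld")


def extract_restricted(html, base_url=None, syntaxes=FAST_SYNTAXES, extract_fn=None):
    """Run extraction for a restricted set of syntaxes only.

    extract_fn is a hypothetical hook for swapping in another extractor;
    by default the real extruct.extract is imported lazily and used.
    """
    if extract_fn is None:
        import extruct  # requires the extruct package

        extract_fn = extruct.extract
    # Passing `syntaxes` skips every dialect that is not listed.
    return extract_fn(html, base_url=base_url, syntaxes=list(syntaxes))
```

Called as `extract_restricted(html_string)`, this would skip the microformat and rdfa extractors that later comments in this thread report as slow.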

@Schwankenson
Author

@lopuhin Great, thank you! Limiting it to json-ld and microdata cuts the time to below one second!

@lopuhin
Member

lopuhin commented Apr 18, 2022

Glad it helped, and thanks for checking 👍
I'd rather keep the issue open to see if we can fix this, or update the defaults or the README.

lopuhin reopened this Apr 18, 2022

@sitems

sitems commented Jul 8, 2022

For one HTML string, I waited 10 hours. I finally found out that the problem is just in 'microformat'. After skipping that format, it takes just 1 second.

@azcarraga

Quoting sitems: "For one html string, I waited 10 hours. Finally found out that the problem is just in 'microformat'. After skipping that format, it takes just 1 second."

Super helpful, thank you. This was the case for me too.

@dgtlmoon

In my case it was the rdfa syntax that was slow. Excluding rdfa from the syntaxes list cut the processing time from 800ms to 160ms.
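
Since different commenters found different syntaxes to be the slow one, it can help to time each syntax separately on a problematic string. The following is a rough sketch, not part of extruct: the helper `time_each_syntax`, its `extract_fn` hook, and the `THREAD_SYNTAXES` tuple are hypothetical names, and the tuple only contains the syntaxes mentioned in this thread.

```python
import time

# Only the syntax names that appear in this thread; other extruct versions
# may support more or fewer.
THREAD_SYNTAXES = ("microdata", "opengraph", "json-ld", "microformat", "rdfa")


def time_each_syntax(html, base_url=None, syntaxes=THREAD_SYNTAXES, extract_fn=None):
    """Return {syntax: seconds}, slowest first, extracting one syntax at a time.

    extract_fn is a hypothetical hook; by default extruct.extract is used.
    """
    if extract_fn is None:
        import extruct  # requires the extruct package

        extract_fn = extruct.extract
    timings = {}
    for syntax in syntaxes:
        start = time.perf_counter()
        # Restrict the run to a single syntax so its cost is isolated.
        extract_fn(html, base_url=base_url, syntaxes=[syntax])
        timings[syntax] = time.perf_counter() - start
    # Sort so the most expensive syntax comes first.
    return dict(sorted(timings.items(), key=lambda item: item[1], reverse=True))
```

The first key of the returned dict is the best candidate to drop from the syntaxes list. Note that this sketch has no timeout, so a string that takes hours for one syntax will also block here; running each call in a separate process with a time limit would be one way around that.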
