new fallback for extracting json #144

Open · wants to merge 1 commit into master

Conversation


@Vitiell0 Vitiell0 commented Jul 27, 2020

Fixes #143

@codecov

codecov bot commented Jul 27, 2020

Codecov Report

Merging #144 into master will decrease coverage by 0.55%.
The diff coverage is 76.19%.


@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
- Coverage   89.15%   88.60%   -0.56%     
==========================================
  Files          12       12              
  Lines         535      553      +18     
  Branches      121      124       +3     
==========================================
+ Hits          477      490      +13     
- Misses         52       56       +4     
- Partials        6        7       +1     
Impacted Files       Coverage Δ
extruct/jsonld.py    87.80% <76.19%> (-12.20%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd49a5f...50221f3. Read the comment docs.

@Gallaecio
Member

Could you cover your changes with tests?

@Vitiell0
Author

@Gallaecio I believe all the tests are still passing? The new helper func _extract_json_objects doesn't have a test, but it's pretty straightforward.

@Gallaecio
Member

The point of tests is to avoid regressions. So, ideally, the changes should include a test that “proves” that your changes work.

It’s quite important for long-term maintenance, and I think adding tests makes a lot of sense specifically in this case. For instance, it makes sure that in the future we do not break support for the scenario that you are fixing here while trying to support a different scenario. And we are bound to see such change attempts; people can be very creative in the ways they break JSON 🙂

@advance512

lgtm

@kennylajara

kennylajara commented Dec 8, 2020

I don't know if you are willing to merge this pull request, as it seems to have been here for around 4 months now, but it solves the issue that the current HTML parser causes with ld+json scripts that use non-English characters. Example:

{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "We found an acute just here: á",
  "author": {
    "@type": "Person",
    "name": "Lastname with tilde: ñ"
  }
}

I am pretty new to Python, so I am not sure how to implement the test in this language yet. If you can create and merge this pull request, it would be awesome.
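[Editor's note: for what it's worth, the snippet above is itself valid JSON, which the stdlib confirms — so the accented characters alone should not break a compliant parser:]

```python
import json

# The JSON-LD payload from the comment above, verbatim.
payload = '''{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "We found an acute just here: á",
  "author": {
    "@type": "Person",
    "name": "Lastname with tilde: ñ"
  }
}'''

data = json.loads(payload)
print(data["author"]["name"])  # Lastname with tilde: ñ
```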

@lopuhin
Member

lopuhin commented Dec 8, 2020

Thanks for the feedback @kennylajara. Although the PR still needs more work IMO, it's great to know that it resolves problems for you, so we know it's important.

It looks like the problem you encountered is different from the original issue, could you please share the URL where you have seen such markup?

@kennylajara

kennylajara commented Dec 8, 2020

Sure @lopuhin . You can see an example here: https://www.diariolibre.com/actualidad/gobierno-comenzara-a-entregar-el-bono-navideno-de-rd-1500-desde-este-11-de-diciembre-NG23168270

EDIT: I just noticed that the problem with the markup of the previous URL is not the tilde or the acute (the current code seems to work OK with other URLs that contain similar markup).

The real problem is that the JSON-LD markup of that URL is broken due to unescaped quotes in the news article description. But, at least, this code allows the scraper to ignore the broken element and extract the valid markup instead of just crashing.

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.
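[Editor's note: an illustrative stdlib sketch of that failure mode, not the PR's code — with an unescaped quote in one payload, a strict json.loads of the script text fails outright, while a raw_decode scan still salvages the valid object:]

```python
import json

# Broken and valid JSON-LD payloads in the same script text: the first
# object has an unescaped inner quote, the second is fine.
blob = '{"desc": "he said "hi" to us"} {"@type": "Person", "name": "Ana"}'

try:
    json.loads(blob)
except json.JSONDecodeError:
    print("strict parse fails on the whole blob")

# A raw_decode scan recovers what it can, skipping the broken span.
decoder = json.JSONDecoder()
found, pos = [], 0
while True:
    start = blob.find('{', pos)
    if start == -1:
        break
    try:
        obj, end = decoder.raw_decode(blob, start)
        found.append(obj)
        pos = end
    except json.JSONDecodeError:
        pos = start + 1

print(found)  # [{'@type': 'Person', 'name': 'Ana'}]
```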

@lopuhin
Member

lopuhin commented Dec 8, 2020

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

@kennylajara

Got it, you are right. errors="ignore" is working for me now.

Is there any other documentation? I don't see those options in the README file.

I just want to add one more comment:

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

Yes, almost... errors="ignore" helps me keep the scraper running, but I still find this pull request useful, as it does not completely ignore the JSON-LD metadata: this code partially extracts the metadata, ignoring only the item containing the error.

@lopuhin
Member

lopuhin commented Dec 8, 2020

Is there any other documentation? I don't see those options in the README file.

Only here for now:

def extract(htmlstring,
            base_url=None,
            encoding="UTF-8",
            syntaxes=SYNTAXES,
            errors='strict',
            uniform=False,
            return_html_node=False,
            schema_context='http://schema.org',
            with_og_array=False,
            **kwargs):
    """htmlstring: string with valid html document;
    base_url: base url of the html document
    encoding: encoding of the html document
    syntaxes: list of syntaxes to extract, default SYNTAXES
    errors: set to 'log' to log the exceptions, 'ignore' to ignore them
        or 'strict'(default) to raise them
    uniform: if True uniform output format of all syntaxes to a list of dicts.
        Returned dicts structure:
        {'@context': 'http://example.com',
         '@type': 'example_type',
         /* All other the properties in keys here */
        }
    return_html_node: if True, it includes into the result a HTML node of
        respective embedded metadata under 'htmlNode' key.
        The feature is supported only by microdata syntax.
        Each node is of `lxml.etree.Element` type.
    schema_context: schema's context for current page"""

I still find this pull request useful, as it does not completely ignore the JSON-LD metadata: this code partially extracts the metadata, ignoring only the item containing the error.

Right 👍

Development

Successfully merging this pull request may close these issues.

JSONDecodeError: Extra data: line 21 column 1 (char 572) for URL https://lubelska.co.uk/
5 participants