new fallback for extracting json #144

Open · wants to merge 1 commit into master

Conversation


@Vitiell0 Vitiell0 commented Jul 27, 2020

Fixes #143

@codecov

codecov bot commented Jul 27, 2020

Codecov Report

Merging #144 into master will decrease coverage by 0.55%.
The diff coverage is 76.19%.


@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
- Coverage   89.15%   88.60%   -0.56%     
==========================================
  Files          12       12              
  Lines         535      553      +18     
  Branches      121      124       +3     
==========================================
+ Hits          477      490      +13     
- Misses         52       56       +4     
- Partials        6        7       +1     
Impacted Files       Coverage Δ
extruct/jsonld.py    87.80% <76.19%> (-12.20%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bd49a5f...50221f3. Read the comment docs.

@Gallaecio
Member

Could you cover your changes with tests?

@Vitiell0
Author

@Gallaecio I believe all the tests are still passing? The new helper func _extract_json_objects doesn't have a test, but it's pretty straightforward.

@Gallaecio
Member

The point of tests is to avoid regressions. So, ideally, the changes should include a test that “proves” that your changes work.

It’s quite important for long-term maintenance, and I think adding tests makes a lot of sense specifically in this case. For instance, it makes sure that in the future we do not break support for the scenario that you are fixing here while trying to support a different scenario. And we are bound to see such change attempts; people can be very creative in the ways they break JSON 🙂

@advance512

lgtm

@kennylajara

kennylajara commented Dec 8, 2020

I don't know if you are willing to merge this pull request, as it seems to have been here for around 4 months now, but it solves the issue that the current HTML parser causes with ld+json scripts that use non-English characters. Example:

{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "We found an acute just here: á",
  "author": {
    "@type": "Person",
    "name": "Lastname with tilde: ñ"
  }
}

I am pretty new to Python, so I am not sure how to implement the test in this language yet. If you can create and merge this pull request, it would be awesome.
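[Editor's note: for what it's worth, the snippet above is itself valid JSON, which the stdlib confirms — so the accented characters alone should not break a compliant parser:]

```python
import json

# The JSON-LD payload from the comment above, verbatim.
payload = '''{
  "@context": "http://schema.org",
  "@type": "NewsArticle",
  "headline": "We found an acute just here: á",
  "author": {
    "@type": "Person",
    "name": "Lastname with tilde: ñ"
  }
}'''

data = json.loads(payload)
print(data["author"]["name"])  # Lastname with tilde: ñ
```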

@lopuhin
Member

lopuhin commented Dec 8, 2020

Thanks for the feedback @kennylajara. Although the PR still needs more work IMO, it's great to know that it resolves problems for you, so we know it's important.

It looks like the problem you encountered is different from the original issue, could you please share the URL where you have seen such markup?

@kennylajara

kennylajara commented Dec 8, 2020

Sure @lopuhin . You can see an example here: https://www.diariolibre.com/actualidad/gobierno-comenzara-a-entregar-el-bono-navideno-de-rd-1500-desde-este-11-de-diciembre-NG23168270

EDIT: I just noticed that the problem with the markup of the previous URL is not the tilde or the acute (the current code seems to work OK with other URLs that contain similar markup).

The real problem is that the JSON-LD markup of that URL is broken due to unescaped quotes in the news article description. But, at least, this code allows the scraper to ignore the broken element and extract the valid markup instead of just crashing.

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.
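[Editor's note: an illustrative stdlib sketch of that failure mode, not the PR's code — with an unescaped quote in one payload, a strict json.loads of the script text fails outright, while a raw_decode scan still salvages the valid object:]

```python
import json

# Broken and valid JSON-LD payloads in the same script text: the first
# object has an unescaped inner quote, the second is fine.
blob = '{"desc": "he said "hi" to us"} {"@type": "Person", "name": "Ana"}'

try:
    json.loads(blob)
except json.JSONDecodeError:
    print("strict parse fails on the whole blob")

# A raw_decode scan recovers what it can, skipping the broken span.
decoder = json.JSONDecoder()
found, pos = [], 0
while True:
    start = blob.find('{', pos)
    if start == -1:
        break
    try:
        obj, end = decoder.raw_decode(blob, start)
        found.append(obj)
        pos = end
    except json.JSONDecodeError:
        pos = start + 1

print(found)  # [{'@type': 'Person', 'name': 'Ana'}]
```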

@lopuhin
Member

lopuhin commented Dec 8, 2020

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

@kennylajara

Got it, you are right. errors="ignore" is working for me now.

Is there any other documentation? I don't see those options in the README file.

I just want to add one more comment:

The alternative solution is to wrap extruct in a try/except, but that would ignore other important markup (like OpenGraph), while the code here extracts the OpenGraph data and the valid part of the JSON-LD markup.

Currently, this can be partly achieved with the errors="ignore" or errors="log" argument to extruct.extract - at least it would continue with the other formats even if one of them fails.

Yes, almost... errors="ignore" helps me keep the scraper running, but I still find this pull request useful, as it does not completely ignore the JSON-LD metadata: this code partially extracts the metadata, ignoring only the item containing the error.

@lopuhin
Member

lopuhin commented Dec 8, 2020

Is there any other documentation? I don't see those options in the README file.

Only here for now:

def extract(htmlstring,
            base_url=None,
            encoding="UTF-8",
            syntaxes=SYNTAXES,
            errors='strict',
            uniform=False,
            return_html_node=False,
            schema_context='http://schema.org',
            with_og_array=False,
            **kwargs):
    """htmlstring: string with valid html document;
    base_url: base url of the html document
    encoding: encoding of the html document
    syntaxes: list of syntaxes to extract, default SYNTAXES
    errors: set to 'log' to log the exceptions, 'ignore' to ignore them
        or 'strict'(default) to raise them
    uniform: if True uniform output format of all syntaxes to a list of dicts.
        Returned dicts structure:
        {'@context': 'http://example.com',
         '@type': 'example_type',
         /* All other the properties in keys here */
        }
    return_html_node: if True, it includes into the result a HTML node of
        respective embedded metadata under 'htmlNode' key.
        The feature is supported only by microdata syntax.
        Each node is of `lxml.etree.Element` type.
    schema_context: schema's context for current page"""

I still find this pull request useful, as it does not completely ignore the JSON-LD metadata: this code partially extracts the metadata, ignoring only the item containing the error.

Right 👍

Development

Successfully merging this pull request may close these issues.

JSONDecodeError: Extra data: line 21 column 1 (char 572) for URL https://lubelska.co.uk/
5 participants