How to correct "nasty" jsonl+ld #53

maugch · 2017-07-01T21:03:16Z

I've found at least a couple of malformed JSON-LD blocks that extruct can't read.

  File "/cygdrive/d/recipeWorkspace/python/parsers.py", line 25, in readJsonLd
    data = jslde.extract(html)
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 21, in extract
    return self.extract_items(lxmldoc)
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 25, in extract_items
    self._xp_jsonld(document))
  File "/usr/lib/python2.7/site-packages/extruct/jsonld.py", line 35, in _extract_items
    data = json.loads(HTML_OR_JS_COMMENTLINE.sub('', script))
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 20 column 778 (char 1342)

The cause is unescaped double quotes inside the text. For example:

"recipeInstructions": [
		"1. blablabla two "buttons".5. Dab  Snowmen!"		
	]

HTML allows this, but the JSON parser can't read it. Is there an easy way to correct similar issues automatically?


maugch · 2017-07-01T21:20:25Z

Here's an example:
www.browneyedbaker.com/nutter-butter-snowmen/

redapple · 2017-07-03T12:29:36Z

Hey @maugch , thanks for the report.
Correcting this kind of unescaped double quotes looks non-trivial.
demjson and ujson both choke on this input.
There might be a way with demjson's return_errors=True:

>>> demjson.decode(r'''"test"quotes""''', return_errors=True)
json_results(object='test', errors=[JSONDecodeError('Unexpected text after end of JSON value', position=position_marker(offset=6,line=1,column=6), severity='error')], stats=None)

and then checking which characters are around the reported offset.
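The offset idea can be sketched with the standard library alone, since Python 3's json.JSONDecodeError also exposes the error position as `pos`. This is only a rough sketch under assumptions: the helper name `loads_with_quote_repair` and the `max_repairs` limit are hypothetical, it only reacts to the "Expecting ',' delimiter" error, and it assumes the quote right before the error offset is the one that should have been escaped.

```python
import json


def loads_with_quote_repair(text, max_repairs=50):
    """Parse JSON, escaping one suspicious double quote per failed attempt."""
    for _ in range(max_repairs):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            # Only handle the error seen in this thread.
            if "Expecting ',' delimiter" not in err.msg:
                raise
            # The parser stopped right after a string it believed was closed,
            # so the quote that "closed" it is the last one before err.pos.
            quote = text.rfind('"', 0, err.pos)
            # Give up if there is no such quote, or if anything other than
            # whitespace sits between it and the error offset.
            if quote == -1 or text[quote + 1:err.pos].strip():
                raise
            text = text[:quote] + '\\"' + text[quote + 1:]
    raise ValueError("too many repairs attempted")


bad = '["1. blablabla two "buttons".5. Dab  Snowmen!"]'
print(loads_with_quote_repair(bad))
```

The heuristic guesses wrong when the inner quote happens to be followed by a comma or a closing bracket, so it is best treated as a last-resort fallback.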

maugch · 2017-07-03T12:43:04Z

I've had similar issues with other characters, but I'm not sure exactly which ones: every time I look at result[XX], where XX is the offset reported in the exception, I get either a blank space or a letter.
There must be a WordPress plugin that fails to escape some characters. So far I've had issues mostly with Recipe schemas.

I suppose the only possible solution is to look for the next square bracket, treat the double quote just before it as the closing one, and escape all the others. A further check is needed for "," since the value might be a list of strings. Actually, it might be enough to escape every " that is not followed by a comma (apart from the last one, which is followed by ]).
Beware of \n \t lying randomly everywhere..
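A rough sketch of that heuristic, generalized slightly: a quote is kept if the nearest non-whitespace character before it can open a string, or the nearest one after it can close a string, and every other quote is escaped. The function name `escape_inner_quotes` and the two character sets are assumptions for illustration, not anything extruct provides.

```python
import json

# A quote right after one of these (ignoring whitespace) opens a string.
OPENERS = {"", "[", "{", ",", ":"}
# A quote right before one of these (ignoring whitespace) closes a string.
CLOSERS = {"", "]", "}", ",", ":"}


def escape_inner_quotes(text):
    """Escape double quotes that look like they sit inside a string value."""
    pieces = []
    for i, ch in enumerate(text):
        if ch == '"' and text[i - 1:i] != "\\":
            # Skipping whitespace copes with stray \n and \t between tokens.
            prev_char = text[:i].rstrip()[-1:]
            next_char = text[i + 1:].lstrip()[:1]
            if prev_char not in OPENERS and next_char not in CLOSERS:
                ch = '\\"'
        pieces.append(ch)
    return "".join(pieces)


bad = '["1. blablabla two "buttons".5. Dab  Snowmen!"\t\n]'
print(json.loads(escape_inner_quotes(bad)))
```

As noted above, an inner quote that is directly followed by a comma is still mistaken for a closing quote, so this cannot fix every case.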

Granitosaurus · 2018-03-07T03:39:32Z

Might be unrelated but extruct json parser also chokes on \t characters in json.

Example case:

from url: https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes

There are some tabs in the JSON that break extruct. This can be worked around by stripping them:

from extruct.jsonld import JsonLdExtractor

class ExtendJsonLdExtractor(JsonLdExtractor):
    def _extract_items(self, node):
        script = node.xpath('string()')
        # Strip raw tab characters before the JSON is parsed.
        script = script.replace('\t', '')
        <...>

I think extruct should either:

Expose a hook for script preprocessing, so that cleanup logic can be monkey patched or injected:

  class JsonLdExtractor():
      def process_script(self, script):
          return script
  # then you can monkey patch your cleanup logic
  ext = JsonLdExtractor()
  ext.process_script = lambda script: script.replace('\t','')

Or implement some basic json cleanup in core code.
Preferably both :P
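For the tab case specifically, the standard library json module can already accept raw control characters inside strings when called with strict=False. The snippet below is a small illustration using a made-up payload; it makes no claim about what extruct itself does.

```python
import json

# Made-up payload: the \t below is a literal tab inside the JSON string value.
raw = '{"name": "Yellow\tTape"}'

try:
    json.loads(raw)
    strict_error = None
except json.JSONDecodeError as err:
    # The default strict mode rejects raw control characters in strings.
    strict_error = err.msg

# strict=False allows control characters such as tabs and newlines in strings.
lenient = json.loads(raw, strict=False)
print(strict_error)
print(lenient)
```

Unlike replacing every tab, this keeps the tab in the decoded value, and it also accepts other raw control characters such as newlines.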

Granitosaurus · 2018-03-07T05:37:22Z

Btw @maugch I can't replicate your issue on www.browneyedbaker.com/nutter-butter-snowmen/

$ scrapy shell http://www.browneyedbaker.com/nutter-butter-snowmen/
In [1]: from extruct.jsonld import JsonLdExtractor
In [2]: JsonLdExtractor().extract(response.body_as_unicode())
In [3]: len(_)
Out[3]: 3

It works correctly here

cathalgarvey · 2018-03-21T18:53:37Z

Hi @Granitosaurus - I downloaded the URL you supplied above and I was able to decode the JSON using json, and I was able to extract it using JsonLdExtractor. Can you provide example code of this failing in your case?

My code, approximately:

>>> import user_agent, requests, json, extruct
>>> from scrapy.http import HtmlResponse
>>> r = requests.get('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', headers={'User-Agent': user_agent.generate_user_agent()})
>>> response = HtmlResponse('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', body=r.content)
>>> data = response.css('script[type="application/ld+json"]::text').extract_first()
>>> json.loads(data)
{'@context': 'http://schema.org/',
 '@type': 'Product',
 'aggregateRating': {'@type': 'AggregateRating',
  'ratingValue': '4.1053',
  'reviewCount': '19'},
 'brand': {'@type': 'Thing', 'name': 'NoTubes'},
 'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes.       Détails :         Largeur : 25 mm.       Longueur : 9.144 m (10 Yards).       Un rouleau convient pour 5 jantes 26&#039;&#039; ou 4 jantes 29&#039;&#039;.       Compatibilités :         ZTR 355 (26&quot;, 650b, 29&quot;).       ZTR Crest.       ZTR Arch EX.       ZTR Flow EX.      #shortcode_video .row { display:block; }  #shortcode_video .col { padding:15px; }',
 'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
 'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
 'offers': {'@type': 'Offer',
  'availability': 'http://schema.org/InStock',
  'price': '14.99',
  'priceCurrency': 'EUR',
  'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}
>>> extruct.jsonld.JsonLdExtractor().extract(r.content)
[{'@context': 'http://schema.org/',
  '@type': 'Product',
  'aggregateRating': {'@type': 'AggregateRating',
   'ratingValue': '4.1053',
   'reviewCount': '19'},
  'brand': {'@type': 'Thing', 'name': 'NoTubes'},
  'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes.       Détails :         Largeur : 25 mm.       Longueur : 9.144 m (10 Yards).       Un rouleau convient pour 5 jantes 26&#039;&#039; ou 4 jantes 29&#039;&#039;.       Compatibilités :         ZTR 355 (26&quot;, 650b, 29&quot;).       ZTR Crest.       ZTR Arch EX.       ZTR Flow EX.      #shortcode_video .row { display:block; }  #shortcode_video .col { padding:15px; }',
  'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
  'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
  'offers': {'@type': 'Offer',
   'availability': 'http://schema.org/InStock',
   'price': '14.99',
   'priceCurrency': 'EUR',
   'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}]

maugch · 2018-03-21T21:14:57Z

I tried again just now and I no longer get an exception. I suppose they corrected it. According to my earlier comment, the text contained "buttons" in straight double quotes, which I don't see anymore. This is what I see in Firefox:
the two “buttons”

My code is simple (now even simplified for this comment):

results = response.css("script[type='application/ld+json']").extract()
jslde = JsonLdExtractor()
data = jslde.extract(results[1])

cathalgarvey · 2018-03-27T10:26:10Z

Hey @maugch - Glad to hear your problem has resolved. Pity we couldn't capture test cases before it disappeared, though. :)

@Granitosaurus - Any chance you can replicate, and if so can you capture failing HTML so we can use it to build a test case?

cathalgarvey · 2018-03-30T20:18:19Z

Hey folks, I'll close this for now, but if anyone can find us a failure case we can work with, we'll reopen. :)

akirmse · 2018-05-14T21:47:28Z

Here's a tragic example:

http://montalvoarts.org/events/summernights18_salsa/

They omit a closing brace in their "location" field in their ld+json in every event on their site. When parsing manually, I'm able to correct this and extract the events. I'm looking at moving to extruct and it would be great if this site kept working.

lopuhin · 2018-05-15T06:06:59Z

For reference, this is json-ld from the site:

[{
  "@context" : "http://schema.org",
  "@type" : "Event",
  "name" : "Salsa Night",
  "startDate" : "2018-06-27T18:00:00",
  "location" : {
    "@type" : "EventVenue",
    "name" : "Montalvo Arts Center",
    "address" : "15400 Montalvo Rd, Saratoga, CA"
  }]
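A missing closer like this one can sometimes be repaired mechanically. The sketch below tracks open brackets outside of strings and inserts whichever closers were skipped. The helper name `balance_brackets` is hypothetical, and the approach assumes that the brackets present are otherwise in the right order. It also drops a closing bracket that has no matching opener.

```python
import json

CLOSER_FOR = {"{": "}", "[": "]"}


def balance_brackets(text):
    """Insert missing closing brackets and drop stray ones, outside strings."""
    out, stack = [], []  # stack holds the closers that are still expected
    in_string = escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in CLOSER_FOR:
            stack.append(CLOSER_FOR[ch])
        elif ch in "}]":
            if ch not in stack:
                continue  # stray closer with no opener: drop it
            # Emit any closers that were skipped before this one.
            while stack[-1] != ch:
                out.append(stack.pop())
            stack.pop()
        out.append(ch)
    out.extend(reversed(stack))  # close whatever is still open at the end
    return "".join(out)


broken = '[{"name": "Salsa Night", "location": {"name": "Montalvo Arts Center"}]'
print(json.loads(balance_brackets(broken)))
```

The repair only guesses where the missing brace belongs, so the result is worth validating before it is trusted.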

lopuhin · 2018-08-08T15:48:34Z

Some (but not all) issues raised in this thread were fixed in #85

maugch · 2018-08-22T12:48:25Z

Here is yet another JSON-LD block with bad data, again from a recipe site. I suppose there is a WordPress plugin that isn't working correctly. There is a ] at the end that shouldn't be there.

Gallaecio added the enhancement label May 23, 2019

Gallaecio linked a pull request Jan 10, 2020 that will close this issue

Try to fix bad JSON due to unescaped double quotes #126

Open

Gallaecio mentioned this issue May 15, 2020

feat: add parser for JSON with JS comment #137

Merged
