Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with decoding quotation mark #2401

Open
1 task
HonzaKirchner opened this issue Apr 4, 2024 · 2 comments
Open
1 task

Issue with decoding quotation mark #2401

HonzaKirchner opened this issue Apr 4, 2024 · 2 comments
Labels
bug Something isn't working. t-console Issues with this label are in the ownership of the console team. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@HonzaKirchner
Copy link
Member

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

I am working with some TripAdvisor API endpoint, which returns a bunch of places in JSON format, near some coordinates. One of the places returned, in my case, is called Restaurace "Otevreno", with the double quotes.
TripAdvisor encodes the string as such: Restaurace "Otevreno".
When i take the response's body from the Cheerio Crawlee context, " is automatically decoded to ", thus breaking the JSON and making JSON.parse raise an error.

This is the API

Code sample

No response

Package version

latest

Node.js version

latest

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

Link to slack thread

@HonzaKirchner HonzaKirchner added the bug Something isn't working. label Apr 4, 2024
@mvolfik
Copy link
Contributor

mvolfik commented Apr 5, 2024

So, a little investigation writeup:

To construct the context object for the request handler, HttpCrawler._runRequestHandler() calls this._parseResponse(), which, in turn, calls this._parseHTML().

_parseHTML() is overriden in CheerioCrawler, and returns

{ 
  get body() {
    return isXml ? "..." : $.html({decodeEntities: true})
  },
  // ...
}

Apparently, this was added there to "save memory for highly parallel runs" (source). However, currently we don't get this effect anyways, since the result of _parseHtml() is immediately destructured here, so when requestHandler, context.body is already a string, and not the getter. (You can trivially verify this by putting a breakpoint/debugger;/console.log() into the getter, and checking at what moment (with what call stack) it is called.

Also, I don't think there's a way to have $.html() return what we want. When a website responds with ``, content-type: text/html:

/* consider a website that responds with
content-type: text/html
{"foo": "<p>bar &lt; &quot; baz</p>"}
*/
const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: false }));
    },
});
// --> {"foo": "<p>bar < " baz</p>"}

const crawler = new CheerioCrawler({
    requestHandler({ $ }) {
        console.log($.html({ decodeEntities: false }));
    },
});
// --> {&quot;foo&quot;: &quot;<p>bar &lt; &quot; baz</p>&quot;}

const crawler = new HttpCrawler({
    requestHandler({ body }) {
        console.log(body.toString("utf8"));
    },
});
// --> {"foo": "<p>bar &lt; &quot; baz</p>"}

First case is current behavior of ctx.body, and imo it's bad, because it breaks HTML. Also, it's really confusing that CheerioCrawlingContext.body and HttpCrawlingContext.body return different values (that's kind of what prompted the original report by @gullmar).

The second case is probably also unjustifiable, since it would change current behavior heavily (websites probably contain a lot of quotes, &s and idk what else cheerio decides to escape).


Therefore, I propose we remove body getter, and instead return the original body buffer .toString("utf8"), to have the same data like HttpCrawler, but also keep body as a string to avoid breaking Actors.

It is also ok with me to call this a won'tfix, since website returning JSON with content-type: text/html is just weird.

@barjin
Copy link
Contributor

barjin commented Apr 8, 2024

Possibly (quite remotely) related to #2317

@mtrunkat mtrunkat added t-tooling Issues with this label are in the ownership of the tooling team. t-console Issues with this label are in the ownership of the console team. t-c&c Team covering store and finance matters. labels Apr 10, 2024
@mhamas mhamas removed the t-c&c Team covering store and finance matters. label May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working. t-console Issues with this label are in the ownership of the console team. t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

No branches or pull requests

5 participants